KR20160122542A - Method and apparatus for measuring pronunciation similarity

Method and apparatus for measuring pronunciation similarity

Info

Publication number
KR20160122542A
Authority
KR
South Korea
Prior art keywords
data
speech
similarity
user
voice
Prior art date
Application number
KR1020150052579A
Other languages
Korean (ko)
Inventor
최재우
김현수
조경일
정요원
이강규
문대영
금명철
김기곤
변진영
윤재선
이항섭
Original Assignee
주식회사 셀바스에이아이
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 셀바스에이아이 filed Critical 주식회사 셀바스에이아이
Priority to KR1020150052579A
Publication of KR20160122542A

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems

Abstract

The similarity measuring method according to the present invention includes: receiving user speech data corresponding to reference speech data; processing the reference speech data with a speech recognition algorithm to generate first speech processing data, and processing the user speech data with the speech recognition algorithm to generate second speech processing data; comparing the first speech processing data and the second speech processing data with learning target word data to calculate a first similarity and a second similarity, each evaluating at least one of pronunciation, intonation, stress, and speed; and comparing the first similarity and the second similarity to measure the final similarity between the reference speech data and the user speech data. This has the advantageous effect of providing a pronunciation similarity measuring method and apparatus that enable efficient pronunciation learning and correction in listen-and-repeat English learning.

Description

METHOD AND APPARATUS FOR MEASURING PRONUNCIATION SIMILARITY

The present invention relates to a pronunciation similarity measuring method and apparatus, and more particularly, to a pronunciation similarity measuring method and apparatus capable of comparing the pronunciation of a user's voice with the pronunciation of a preceding reference voice to evaluate their similarity.

The importance of foreign language learning is emphasized by the trend of globalization, and interest in English education in particular is increasing. In modern society, interest in English ability centered on real-life communication is also growing, and research on more effective and accurate English learning methods and language programs is continuously being carried out.

Meanwhile, as speech recognition technology has become widespread, English learning methods using speech recognition have become popular. In the English learning methods using currently available speech recognition technology, however, the recognition target words are determined in advance: when the user utters one of the determined recognition target words, the system finds which of the previously registered words the user's input is closest to, outputs the result, and displays the outcome as a correct/incorrect indication or a score, from which it is judged whether the user's pronunciation is accurate.

This type of English learning method provides a pronunciation score only when the user reads aloud the text provided by a language program. It has the limitation that no pronunciation score is evaluated for the case where the user first listens to a native speaker's speech, or to content such as a movie, video, or song, and then repeats with his or her own voice what was heard.

In addition, in English learning methods using speech recognition technology, a person must manually register the recognition target words to create a candidate group, and there is no standard for judging which pronunciation is more appropriate. Moreover, when a sentence or word includes rhyme, intonation, stress, or rhythm, the user's pronunciation may not be recognized, or an incorrect evaluation result may be produced.

Accordingly, for a learning method in which the user listens to and repeats a preceding reference voice, without comparison against a predetermined candidate group, there is a growing need for a pronunciation similarity measuring method that measures similarity by directly comparing the user's voice with the preceding reference voice and that can also compare the rhyme, intonation, stress, and rhythm contained in the speech.

A problem to be solved by the present invention is to provide a pronunciation similarity measuring method and apparatus capable of measuring the pronunciation similarity between a preceding reference voice and a user voice by using speech recognition technology, for listen-and-repeat English learning.

Another problem to be solved by the present invention is to provide a pronunciation similarity measuring method and apparatus capable of measuring pronunciation similarity by evaluating the pronunciation, intonation, stress, and speed contained in the preceding reference voice and the user voice.

Another object of the present invention is to provide a pronunciation similarity measuring method and apparatus capable of efficient pronunciation learning and correction.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

According to an aspect of the present invention, there is provided a pronunciation similarity measuring method comprising: receiving user speech data corresponding to reference speech data; processing the reference speech data with a speech recognition algorithm to generate first speech processing data, and processing the user speech data with the speech recognition algorithm to generate second speech processing data; comparing the first speech processing data and the second speech processing data with learning target word data to calculate a first similarity and a second similarity, each evaluating at least one of pronunciation, intonation, stress, and speed; and comparing the first similarity and the second similarity to measure the final similarity between the reference speech data and the user speech data.

According to another feature of the present invention, the method further comprises receiving learning target word data and reference speech data corresponding to the learning target word data.

According to still another feature of the present invention, calculating the first similarity and the second similarity comprises dividing the first speech processing data and the second speech processing data into phoneme units and comparing them with the learning target word data.

According to yet another feature of the present invention, the method further comprises providing the final similarity to the user as a score.

According to another aspect of the present invention, there is provided a pronunciation similarity measuring apparatus comprising: a receiving unit for receiving user voice data corresponding to reference voice data; a voice recognition unit for processing the reference voice data with a speech recognition algorithm to generate first speech processing data, processing the user voice data with the speech recognition algorithm to generate second speech processing data, and comparing the first speech processing data and the second speech processing data with learning target word data to calculate a first similarity and a second similarity, each evaluating at least one of pronunciation, intonation, stress, and speed; and a processing unit for comparing the first similarity and the second similarity to measure the final similarity between the reference voice data and the user voice data.

According to another feature of the present invention, the receiving unit receives learning target word data and reference speech data corresponding to the learning target word data.

According to still another feature of the present invention, the voice recognition unit divides the first speech processing data and the second speech processing data into phoneme units and compares them with the learning target word data.

According to yet another feature of the present invention, the apparatus further comprises an output unit that provides the final similarity to the user as a score.

According to still another aspect of the present invention, there is provided a computer-readable recording medium storing instructions for performing a pronunciation similarity measuring method, the instructions causing a computer to: receive user voice data corresponding to reference voice data; process the reference voice data with a speech recognition algorithm to generate first speech processing data, and process the user voice data with the speech recognition algorithm to generate second speech processing data; compare the first speech processing data and the second speech processing data with learning target word data to calculate a first similarity and a second similarity, each evaluating at least one of pronunciation, intonation, stress, and speed; and compare the first similarity and the second similarity to measure the final similarity between the reference voice data and the user voice data.

The details of other embodiments are included in the detailed description and drawings.

The present invention has the effect of measuring the pronunciation similarity between a preceding reference voice and a user voice by using speech recognition technology, for listen-and-repeat English learning.

The present invention has the effect of measuring pronunciation similarity by evaluating the pronunciation, intonation, stress, and speed contained in the preceding reference voice and the user voice.

The present invention has the effect of providing a pronunciation similarity measuring method and apparatus enabling efficient pronunciation learning and correction in listen-and-repeat English learning.

The effects according to the present invention are not limited to those exemplified above, and further various effects are included in this specification.

FIG. 1 is a schematic configuration diagram of a pronunciation similarity measuring apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of a pronunciation similarity measurement method according to an embodiment of the present invention.
FIG. 3 exemplarily shows a method of measuring the first similarity, the second similarity, and the final similarity in the pronunciation similarity measurement method according to an embodiment of the present invention.
FIG. 4 illustrates an exemplary screen implemented by the pronunciation similarity measurement method according to an embodiment of the present invention.

The advantages and features of the present invention, and the manner of achieving them, will become apparent with reference to the embodiments described in detail below in conjunction with the accompanying drawings. The present invention may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art; the invention is defined only by the scope of the claims.

Like reference numerals refer to like elements throughout the specification.

It is to be understood that the features of the various embodiments of the present invention may be partially or entirely combined with one another and, as those skilled in the art will appreciate, various technical interworking and operation of the embodiments are possible; the embodiments may be practiced independently of one another or in combination with one another.

In the present specification, the reference voice data means data including a preceding reference voice, that is, the learning target that the user wants to listen to and repeat. The reference voice data can be input in various ways corresponding to the format in which the preceding reference voice is provided. For example, the reference voice data can be classified into direct reference voice data, provided by a speaker's voice such as a native speaker's (a standard pronunciation provider), and indirect reference voice data, provided through content such as a movie or a song. The direct reference voice data is received from a microphone or a voice recorder, and the indirect reference voice data can be input through a video application or a voice reproduction application.

In the present specification, the user voice data is data input by the user in correspondence with the reference voice data; the user learns the pronunciation and intonation of the preceding reference voice by listening to it and repeating it. The user voice data can be input through voice recognition in various applications and can be converted into text-type data through speech recognition.

In the present specification, the learning target word data is a word or sentence that the user wants to listen to and repeat; it is text-format data corresponding to the reference speech data and serves as the reference in calculating the similarities. The learning target word data may be provided before the reference speech data or the user speech data is received. For example, the learning target word data may be displayed through the display unit of the terminal used by the user. Alternatively, when the user listens to and repeats the preceding reference voice, it may not be directly displayed or transmitted to the user.

Various embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic configuration diagram of a pronunciation similarity measuring apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the pronunciation similarity measuring apparatus 100 includes a receiving unit 110, a voice recognizing unit 120, a processing unit 130, and a display unit 140.

The pronunciation similarity measuring apparatus 100 receives, through the receiving unit 110, user voice data corresponding to reference voice data, and processes both with a voice recognition algorithm to generate first speech processing data and second speech processing data, respectively. Thereafter, the pronunciation similarity measuring apparatus 100 compares the first speech processing data and the second speech processing data with the learning target word data to obtain a first similarity and a second similarity, each evaluating at least one of pronunciation, intonation, stress, and speed, and provides the final similarity between the reference voice data and the user voice data through comparison of the first and second similarities.

The pronunciation similarity measuring apparatus 100 may be used independently or in conjunction with various applications. Applications that can be associated with the pronunciation similarity measuring apparatus 100 include a sound recorder application, a sound reproduction application, a video playback application, and the like. Further, the pronunciation similarity measuring apparatus 100 can be embedded in or linked to an English learning application.

The receiving unit 110 receives the user voice data so that voice recognition can be performed in the pronunciation similarity measuring apparatus 100. The receiving unit 110 may be connected to an external input unit for receiving the user voice data. For example, the input unit may include a microphone for directly receiving the user's voice or a voice recorder for recording the user's voice. Accordingly, the user voice data is the user's voice received via the microphone or the user's recorded voice.

Meanwhile, the receiving unit 110 can further receive learning target word data and reference speech data corresponding to the learning target word data. The receiving unit 110 can receive learning target word data and reference speech data stored in a database in the pronunciation similarity measuring apparatus 100, or can be connected to a separate input unit to receive them from the outside. The reference voice to be learned can be provided to the user in various formats; for example, it may be provided by a speaker's utterance, such as a native speaker's, or through content such as a movie or a song. Therefore, when the reference voice data is input directly from a native speaker, a microphone or a voice recorder for recording and transmitting the native speaker's voice can be used. When the reference voice data is extracted separately from the voice in content such as a video or a song, a video editing application or a voice editing application can be used.

The speech recognition unit 120 generates the first speech processing data and the second speech processing data by applying a speech recognition algorithm to the reference speech data and the user speech data received through the receiving unit 110, respectively. The first speech processing data and the second speech processing data are obtained by converting the reference speech and the user speech into text, and the similarity between the reference speech data and the user speech data can be measured using them.

The speech recognition algorithm basically refers to a process by which an electronic device interprets the reference voice and the voice uttered by the user and recognizes the contents as text. Although not limited thereto, when the waveforms of the reference voice and the user voice are input to the electronic device, voice pattern information can be obtained by analyzing the voice waveforms with reference to an acoustic model or the like. The obtained voice pattern information is then compared with identification information, and the text with the highest probability of matching among the identification information can be recognized.
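The best-match step described above can be sketched in a few lines. This is a hypothetical illustration, not the patent's implementation: the `recognize` function, the toy feature vectors, and the use of Euclidean distance as a stand-in for matching probability are all assumptions made for the example.

```python
from math import dist

def recognize(pattern, identification_info):
    """Return the text whose stored feature vector best matches the
    input voice-pattern vector (smaller distance stands in for higher
    matching probability)."""
    best_text, best_score = None, float("inf")
    for text, features in identification_info.items():
        score = dist(pattern, features)  # Euclidean distance as a stand-in
        if score < best_score:
            best_text, best_score = text, score
    return best_text

# Toy "identification information": candidate text mapped to
# illustrative (made-up) voice-pattern features.
identification_info = {
    "back": [1.0, 0.2, 0.7],
    "vac":  [0.9, 0.8, 0.1],
}
print(recognize([1.0, 0.25, 0.65], identification_info))  # → back
```

In a real recognizer the scoring would come from an acoustic model rather than a fixed table, but the control flow (score every candidate, keep the best) is the same.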

Further, the speech recognition unit 120 compares the generated first speech processing data and second speech processing data with the learning target word data and calculates similarities in which one or more of pronunciation, intonation, stress, and speed are evaluated. For more accurate similarity measurement, the speech recognition unit 120 may divide the learning target word data, the first speech processing data, and the second speech processing data with one syllable as the minimum unit and compare them.

The processing unit 130 compares the first similarity calculated by the speech recognition unit 120 with the second similarity to measure the final similarity between the reference speech data and the user speech data. The measured final similarity can be stored in the form of numbers, letters, symbols, and the like.

The display unit 140 receives the final similarity between the reference voice data and the user voice data from the processing unit 130 and provides it to the user as a score. Also, the display unit 140 can display one or more of the learning target word data, the reference voice data, and the user voice data together with the final similarity.

Although the receiving unit 110, the voice recognition unit 120, the processing unit 130, and the display unit 140 are shown as separate components in FIG. 1, they may be combined in implementing the present invention. They may also be configured as part of various applications.

FIG. 2 is a flowchart illustrating a pronunciation similarity measurement method according to an embodiment of the present invention. For convenience of explanation, it is described with reference to the components of FIG. 1.

The pronunciation similarity measuring method according to the present invention is started by the receiving unit 110 receiving user voice data corresponding to reference voice data (S110).

The receiving unit 110 receives the user voice data corresponding to the reference voice data. The user voice data includes the user's voice and may have various forms according to the type of voice received by the receiving unit 110. Specifically, the user inputs the user voice data to the pronunciation similarity measuring apparatus 100 by listening to and repeating the reference voice.

Meanwhile, the method may further include the receiving unit 110 receiving the learning target word data and/or the reference speech data corresponding to the learning target word data. When the learning target word data and the reference speech data are stored in a database in the pronunciation similarity measuring apparatus 100, the step of separately receiving them may be omitted. However, when receiving them from another application, they may be received directly through the receiving unit 110.

For example, when the receiving unit 110 receives learning target word data from another application, it can receive the learning target word data stored in that application. Further, when the user listens to and repeats a voice in a content format such as a video or a song, the learning target word data may be received from the subtitles of the video or the lyrics of the song. At this time, the receiving unit 110 can directly receive the voice of a speaker, such as a native speaker, from a microphone or a sound recorder, and can receive the voice included in a video or a song through a separate application or algorithm.

The receiving unit 110 transmits the received user speech data, learning target word data, and reference speech data to the speech recognition unit 120.

The speech recognition unit 120 processes the reference speech data using a speech recognition algorithm to generate first speech processing data, and processes the user speech data with a speech recognition algorithm to generate second speech processing data (S120).

The speech recognition algorithm basically refers to a process by which the pronunciation similarity measuring apparatus 100 analyzes the speech received from a native speaker or the user, or extracted from various contents, and converts the contents into text. Taking the user voice as an example, the waveform of the voice uttered by the user is input to the receiving unit 110, and voice pattern information is obtained by analyzing the voice waveform with reference to an acoustic model or the like stored in the pronunciation similarity measuring apparatus 100. The obtained voice pattern information is then compared with identification information, and the text with the highest probability of matching among the identification information can be recognized. The identification information is information in which text corresponding to representative voices is stored according to the acoustic model stored in the pronunciation similarity measuring apparatus 100.

The first speech processing data and the second speech processing data may consist of sentences corresponding to the received reference speech data and user speech data, formed by combining the words matched by the speech recognition algorithm. The first speech processing data and the second speech processing data may be temporarily stored in the speech recognition unit 120 for comparison with the learning target word data.

The speech recognition unit 120 compares the first speech processing data and the second speech processing data with the learning target word data to calculate a first similarity and a second similarity in which one or more of pronunciation, intonation, stress, and speed are evaluated (S130).

The first similarity and the second similarity indicate how similar the first speech processing data and the second speech processing data, respectively, are to the learning target word data. The speech recognition unit 120 compares the first speech processing data and the second speech processing data with the learning target word data.

The first speech processing data, the second speech processing data, and the learning target word data are all data in text format, and the comparison between the speech processing data and the learning target word data can be performed directly at the text level. In calculating the similarity, one syllable is basically measured as the minimum unit, but the data may also be divided by a predetermined time interval according to the type of reference voice, or measured in units of words or phonemes.

In conventional speech recognition methods, when the reference voice or the user voice includes prosody such as stress, rhythm, or intonation, there may be no matching acoustic model or identification information, so the voice itself is not recognized or an incorrect evaluation result is produced. However, when the speech processing data and the learning target word data are divided and compared on a phoneme-by-phoneme basis, speech is recognized per phoneme unit regardless of whether a matching acoustic model or identification information exists for speech containing such prosody. Therefore, by comparing the texts in which the speech is recognized phoneme by phoneme, the reference speech and the user speech can be compared. In addition, certain regularities may exist, in units of phonemes, in the errors a user makes when pronouncing an English word; it is therefore more preferable to construct a database of words that are similar in phoneme units and to compare phonemes against that database.
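A minimal sketch of the phoneme-unit text comparison described above, under the simplifying assumption that one letter stands in for one phoneme (a real system would use a pronunciation dictionary). The function name `phoneme_similarity` and the 0-10 scoring are illustrative, not the patent's:

```python
def phoneme_similarity(recognized: str, target: str, max_score: int = 10) -> int:
    """Score two words on a 0..max_score scale by the fraction of
    positionally matching phoneme units (here: letters)."""
    if not recognized and not target:
        return max_score
    matches = sum(1 for a, b in zip(recognized, target) if a == b)
    return round(max_score * matches / max(len(recognized), len(target), 1))

print(phoneme_similarity("back", "back"))  # → 10 (identical phoneme sequence)
print(phoneme_similarity("vac", "back"))   # → 5 under this naive letter split
```

Note that this naive split does not reproduce the example scores given later in the description; it only shows the shape of a phoneme-by-phoneme text comparison.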

The first similarity and the second similarity are calculated by evaluating at least one of pronunciation, intonation, stress, and speed between the speech processing data and the learning target word data. At this time, pronunciation, intonation, stress, and speed can be compared by dividing the speech processing data into phoneme units; for each phoneme unit, the place of articulation, the manner of articulation, vocal fold vibration, the vowel triangle, and pitch can be analyzed and compared to calculate the first similarity and the second similarity.
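The per-phoneme articulatory comparison mentioned above might look like the following sketch. The feature table (place of articulation, manner of articulation, vocal fold vibration) and the equal-weight scoring are assumptions made for illustration; the patent does not specify a feature inventory.

```python
# Hypothetical articulatory features per phoneme: (place, manner, voicing).
PHONEME_FEATURES = {
    "b": ("bilabial",    "stop",      "voiced"),
    "v": ("labiodental", "fricative", "voiced"),
    "k": ("velar",       "stop",      "voiceless"),
}

def articulatory_similarity(p1: str, p2: str) -> float:
    """Fraction of articulatory features shared by two phonemes."""
    f1, f2 = PHONEME_FEATURES[p1], PHONEME_FEATURES[p2]
    return sum(a == b for a, b in zip(f1, f2)) / len(f1)

print(articulatory_similarity("b", "b"))  # → 1.0 (identical phoneme)
print(articulatory_similarity("b", "v"))  # 'b' and 'v' share only voicing
```

A comparison like this can rate 'b' versus 'v' as partially similar rather than simply wrong, which is what allows graded similarity scores rather than all-or-nothing recognition.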

The calculated first similarity and second similarity may be expressed numerically and optionally stored in the speech recognition unit 120.

The processing unit 130 compares the first similarity with the second similarity, and measures the final similarity between the reference voice data and the user voice data (S140).

The final similarity indicates the degree of similarity between the reference voice data and the user voice data, that is, how similar the pronunciation of the reference voice and the pronunciation of the user voice are. The final similarity is measured by comparing the first similarity and the second similarity calculated by the speech recognition unit 120; this means that the similarity between the reference voice data and the user voice data is evaluated indirectly, through the learning target word data. Specifically, the final similarity is inversely related to the difference between the first similarity and the second similarity.

The measured final similarity can be represented by numbers, letters, symbols, and the like. For example, the final similarity may be expressed as a number between '0' and '10'; the larger the number, the more similar the reference voice and the user voice. The final similarity can be obtained by subtracting the difference between the first similarity and the second similarity from '10' points, as shown in the following Equation (1).

[Equation 1]

Final similarity = 10 - (| second similarity - first similarity |)
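Equation (1) can be expressed directly in code. The input values below are example scores on the 0-10 scale used in the text, not outputs of a real recognizer:

```python
def final_similarity(first: float, second: float, max_score: float = 10) -> float:
    """Equation (1): final similarity = max_score - |second - first|."""
    return max_score - abs(second - first)

print(final_similarity(7, 10))   # → 7: the two similarities differ by 3
print(final_similarity(10, 10))  # → 10: reference and user matched equally well
```

The formula rewards the user voice for tracking the reference voice, even when both differ from the target text by the same amount.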

The method of measuring the final similarity, and the final similarity expressed in numbers, will be further described with reference to FIG. 3.

The pronunciation similarity measuring method of the present invention may further include the display unit 140 providing the final similarity to the user as a score. At this time, the display unit 140 may display one or more of the learning target word data, reference voice data, and user voice data in various forms together with the final similarity. In addition, the display unit 140 may analyze the learning results based on the first similarity, the second similarity, and the final similarity and provide feedback data to the user to improve learning efficiency.

FIG. 3 exemplarily shows a method of measuring the first similarity, the second similarity, and the final similarity in the pronunciation similarity measurement method according to an embodiment of the present invention.

Referring to FIG. 3, the learning target word data 210 received by the pronunciation similarity measuring apparatus is illustrated as "it's back up it's back up Oh". A native speaker's reference voice corresponding to the learning target word data 210 is received through the receiving unit and processed with a speech recognition algorithm to generate the first speech processing data 220, illustrated here as "eats vac it's bat up Oh". Meanwhile, the user listens to and repeats the native speaker's voice, and the user's voice is processed with the speech recognition algorithm to generate the second speech processing data 240, illustrated as "it's back up it's beg up O". The first speech processing data 220 and the second speech processing data 240 are compared with the learning target word data 210 at the text level to calculate the first similarity 230 and the second similarity 250, respectively. The first similarity 230 and the second similarity 250 are calculated per syllable unit and can be expressed as a number between 0 and 10; the larger the number, the more similar the speech processing data and the learning target word data. The final similarity 260 is measured by comparing the first similarity 230 and the second similarity 250; in FIG. 3 it is obtained by subtracting the difference between the first similarity 230 and the second similarity 250 from the '10' point scale.

Referring again to FIG. 3, in the case of 'back' in the learning target word data 210, the first speech processing data 220 recognizes it as 'vac', so the first phoneme differs ('b' versus 'v') and the first similarity 230 may be displayed as '7' on a 10-point scale. On the other hand, since the second speech processing data 240 recognizes it as 'back', identical to the learning target word data 210, the second similarity 250 can be displayed as '10'. In other words, the voice uttered by the user is closer to the target text than the reference voice uttered by the native speaker. However, since the main purpose of listen-and-repeat learning is to compare the pronunciation of the user voice with the pronunciation of the reference voice, what matters is the difference between the 'vac' of the first speech processing data 220 and the 'back' of the second speech processing data 240. Thus, the final similarity 260 can be expressed as '7', obtained by subtracting '3', the difference between the first similarity 230 and the second similarity 250, from '10'.

FIG. 4 illustrates an exemplary screen implemented by the pronunciation similarity measurement method according to an embodiment of the present invention.

Referring to FIG. 4, the pronunciation similarity measurement window 300 includes a learning word display unit 310, a reference voice display unit 320, a user voice display unit 330, a final similarity display unit 340, and a learning evaluation display unit 350.

The learning word display unit 310 displays the words or sentences that the user wants to listen to and repeat. The displayed words or sentences may be stored in a database, or may be received from external content.

The reference voice display unit 320 is an area for displaying the reference voice data, which can be rendered in various forms such as a graph or figure representing the voice. Likewise, the user voice display unit 330 is an area for displaying the user voice data, and the user voice data can be rendered in the same manner as in the reference voice display unit 320. In FIG. 4, pitch curves of the reference voice and the user voice are displayed.
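One plausible way to produce pitch curves like those shown in FIG. 4 is a frame-by-frame autocorrelation F0 estimate. The patent does not specify how the curves are computed, so everything below (frame size, F0 search range, peak picking) is an assumption for illustration.

```python
import numpy as np

def pitch_curve(signal, sr, frame_ms=30):
    """Estimate a per-frame pitch (F0) track by autocorrelation peak picking."""
    frame_len = int(sr * frame_ms / 1000)
    f0 = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        frame = frame - frame.mean()
        # one-sided autocorrelation: lags 0 .. frame_len-1
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        lo, hi = sr // 400, sr // 60          # search the 60-400 Hz range
        hi = min(hi, len(ac) - 1)
        lag = lo + int(np.argmax(ac[lo:hi]))  # lag of the strongest period
        f0.append(sr / lag)
    return f0

# Synthetic 200 Hz tone standing in for a voiced segment of speech
sr = 16000
t = np.arange(sr) / sr
curve = pitch_curve(np.sin(2 * np.pi * 200.0 * t), sr)
```

Overlaying the reference and user curves frame by frame would give the side-by-side display described for FIG. 4.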

The final similarity display unit 340 is an area displaying the final similarity measured by the pronunciation similarity measuring method of the present invention. For example, the final similarity display unit 340 may display the final similarity as a number between 0 and 10, or may display it with symbols such as 'A', 'B', and 'C'.
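A minimal sketch of the symbol display: mapping the 0-10 final similarity to 'A'/'B'/'C'. The thresholds are invented for illustration; the patent only says that such symbols may be displayed.

```python
def grade(final_similarity):
    """Map a 0-10 final similarity to a letter grade.
    The cutoffs (8 and 5) are assumed, not specified by the patent."""
    if final_similarity >= 8:
        return "A"
    if final_similarity >= 5:
        return "B"
    return "C"

# e.g. the final similarity of 7 from FIG. 3 would be shown as 'B'
```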

The learning evaluation display unit 350 is an area for providing feedback to the user based on the final similarity, and can provide details on items such as pronunciation, accentuation, stress, and speed according to the types stored in the database.
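Such feedback could be implemented as a simple lookup of stored messages keyed by evaluation type. The messages and keys below are hypothetical stand-ins for the database entries mentioned above.

```python
# Hypothetical feedback templates keyed by evaluation type -- illustrative
# stand-ins for the "types stored in the database" described in the text.
FEEDBACK = {
    "pronunciation": "Watch the initial consonant: 'b' versus 'v'.",
    "accentuation": "Try a rising pitch at the end of the phrase.",
    "stress": "Put more emphasis on the second syllable.",
    "speed": "Slow down slightly to match the reference pace.",
}

def feedback_for(weak_types):
    """Return the stored feedback lines for the evaluation types that scored low."""
    return [FEEDBACK[t] for t in weak_types if t in FEEDBACK]
```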

In this specification, each block or each step may represent a module, segment, or portion of code that includes one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative embodiments, the functions noted in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be executed substantially concurrently, or in the reverse order, depending upon the functionality involved.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral with the processor. The processor and the storage medium may reside within an application specific integrated circuit (ASIC). The ASIC may reside within a user terminal. Alternatively, the processor and the storage medium may reside as discrete components in a user terminal.

Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the present invention is not limited to those embodiments, and various changes and modifications may be made without departing from the scope of the present invention. Therefore, the embodiments disclosed herein are intended to illustrate rather than limit the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The above-described embodiments should be understood as illustrative in all aspects and not restrictive. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within the scope of equivalents thereof should be construed as falling within the scope of the present invention.

100 pronunciation similarity measuring apparatus
110 receiving unit
120 speech recognition unit
130 processor
140 display unit
210 learning target word data
220 first speech processing data
230 first similarity
240 second voice processing data
250 second similarity
260 final similarity
300 pronunciation similarity measurement window
310 learning word display unit
320 reference voice display unit
330 user voice display unit
340 final similarity display unit
350 learning evaluation display unit

Claims (9)

A pronunciation similarity measuring method comprising:
Receiving user voice data corresponding to reference voice data;
Processing the reference voice data with a speech recognition algorithm to generate first speech processing data, and processing the user voice data with the speech recognition algorithm to generate second speech processing data;
Comparing the first speech processing data and the second speech processing data with learning target word data to calculate a first similarity and a second similarity in which at least one of pronunciation, accentuation, stress, and speed is evaluated; and
Comparing the first similarity with the second similarity to measure a final similarity between the reference voice data and the user voice data.
The method according to claim 1,
Further comprising the step of receiving learning target word data and reference voice data corresponding to the learning target word data.
The method according to claim 1,
Wherein the step of calculating the first similarity and the second similarity comprises dividing the first speech processing data and the second speech processing data into phoneme units and comparing them with the learning target word data.
The method according to claim 1,
Further comprising the step of providing the final similarity to the user as a score.
A pronunciation similarity measuring apparatus comprising:
A receiving unit for receiving user voice data corresponding to reference voice data;
A speech recognition unit for processing the reference voice data with a speech recognition algorithm to generate first speech processing data, processing the user voice data with the speech recognition algorithm to generate second speech processing data, and comparing the first speech processing data and the second speech processing data with learning target word data to calculate a first similarity and a second similarity in which at least one of pronunciation, accentuation, stress, and speed is evaluated; and
A processor for comparing the first similarity with the second similarity to measure a final similarity between the reference voice data and the user voice data.
The apparatus according to claim 5,
Wherein the receiving unit further receives learning target word data and reference voice data corresponding to the learning target word data.
The apparatus according to claim 5,
Wherein the speech recognition unit divides the first speech processing data and the second speech processing data into phoneme units and compares them with the learning target word data.
The apparatus according to claim 5,
Further comprising an output unit for providing the final similarity to the user as a score.
Receiving user voice data corresponding to reference voice data,
Processing the reference voice data with a speech recognition algorithm to generate first speech processing data, and processing the user voice data with the speech recognition algorithm to generate second speech processing data,
Comparing the first speech processing data and the second speech processing data with learning target word data to calculate a first similarity and a second similarity in which at least one of pronunciation, accentuation, stress, and speed is evaluated, and
Comparing the first similarity with the second similarity to measure a final similarity between the reference voice data and the user voice data.
KR1020150052579A 2015-04-14 2015-04-14 Method and apparatus for measuring pronounciation similarity KR20160122542A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020150052579A KR20160122542A (en) 2015-04-14 2015-04-14 Method and apparatus for measuring pronounciation similarity


Publications (1)

Publication Number Publication Date
KR20160122542A true KR20160122542A (en) 2016-10-24

Family

ID=57256740

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020150052579A KR20160122542A (en) 2015-04-14 2015-04-14 Method and apparatus for measuring pronounciation similarity

Country Status (1)

Country Link
KR (1) KR20160122542A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886968A (en) * 2017-12-28 2018-04-06 广州讯飞易听说网络科技有限公司 Speech evaluating method and system
CN109036404A (en) * 2018-07-18 2018-12-18 北京小米移动软件有限公司 Voice interactive method and device
KR20190099988A (en) * 2018-02-19 2019-08-28 주식회사 셀바스에이아이 Device for voice recognition using end point detection and method thereof
KR20190139056A (en) * 2018-06-07 2019-12-17 오승종 System for providing language learning services based on speed listening
WO2020027394A1 (en) * 2018-08-02 2020-02-06 미디어젠 주식회사 Apparatus and method for evaluating accuracy of phoneme unit pronunciation
CN112466335A (en) * 2020-11-04 2021-03-09 吉林体育学院 English pronunciation quality evaluation method based on accent prominence
KR102261539B1 (en) * 2020-06-02 2021-06-07 주식회사 날다 System for providing artificial intelligence based korean culture platform service
KR20210079512A (en) * 2019-12-20 2021-06-30 주식회사 에듀템 Foreign language learning evaluation device
KR20210111503A (en) * 2020-03-03 2021-09-13 주식회사 셀바스에이아이 Method for pronunciation assessment and device for pronunciation assessment using the same
KR20220036239A (en) * 2020-09-15 2022-03-22 주식회사 퀄슨 Pronunciation evaluation system based on deep learning
KR20220054964A (en) * 2020-10-26 2022-05-03 주식회사 에듀템 Foreign language pronunciation training and evaluation system
KR102410644B1 (en) * 2022-02-16 2022-06-22 주식회사 알투스 Method, device and system for providing foreign language education contents service based on voice recognition using artificial intelligence
KR20230040507A (en) * 2021-09-16 2023-03-23 이동하 Method and device for processing voice information



Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E601 Decision to refuse application