WO2022149662A1

WO2022149662A1 - Method and apparatus for evaluating artificial-intelligence-based korean pronunciation by using lip shape

Info

Publication number: WO2022149662A1
Application number: PCT/KR2021/005233
Authority: WO
Inventors: 송진주
Original assignee: 주식회사 헤이스타즈
Priority date: 2021-01-11
Filing date: 2021-04-26
Publication date: 2022-07-14
Also published as: KR20220101493A

Abstract

The present invention relates to a pronunciation evaluation apparatus and method for evaluating the pronunciation of a user by using video and audio of the pronunciation process of the user, and enables feedback about the pronunciation of the user to be provided even without the intervention of an instructor, and, in particular, allows the point in time when the pronunciation was incorrect to be provided together with an evaluation result of the pronunciation, so that the user easily corrects the pronunciation.

Description

Method and apparatus for evaluating Korean pronunciation based on artificial intelligence using lip shape

The present invention relates to a pronunciation evaluation apparatus and method for evaluating a user's pronunciation using an image and voice of the user's pronunciation process.

In general, foreign language education has been conducted by having a learner select suitable content from among pre-generated language content.

For example, the user had to access the language education system and select and reserve the language content deemed suitable in consideration of the name of the language content, the lecture level, the instructor, and so on, or had to go through a consultation visit.

However, such an existing method has a problem in that the interests or levels of individual users are not taken into consideration, because it is premised on the operator of the language education system generating and uploading content by grade in advance.

In addition, such an existing method has a problem in that the user cannot receive appropriate feedback on his or her condition. For example, the user could not receive feedback on the accuracy of his or her pronunciation, and thus the learning achievement could not be improved.

The present invention is intended to solve the above-mentioned problems, and aims to provide a high level of feedback on the user's pronunciation.

A pronunciation evaluation method according to an embodiment of the present invention includes: generating one or more evaluation images from an input image including a pronunciation process of a user's first text and a voice according to the pronunciation process; generating a second text describing a sound corresponding to the one or more evaluation images; and generating a score based on the similarity between the first text and the second text.

The generating of the one or more evaluation images may include: grouping a plurality of frames constituting the input image into syllable units with reference to the voice; extracting only a representative frame by removing, from the frames grouped in units of syllables, a frame in which at least one of the positions of the user's lips, tongue, and teeth differs from an adjacent frame by less than a predetermined threshold difference; and adding the representative frame extracted for each syllable to the one or more evaluation images.

The generating of the second text may include generating the second text corresponding to the representative frame extracted for each individual syllable by using the learned first artificial neural network. In this case, the first artificial neural network may be a neural network trained to output a text representing a sound corresponding to the oral image according to an input of at least one oral image depicting the sound.

The pronunciation evaluation method according to an embodiment of the present invention may further include, before generating the one or more evaluation images, learning the first artificial neural network using learning data. At this time, the learning data may include at least one learning oral image describing the pronunciation process of the learning syllable, order information of the at least one learning oral image, and text corresponding to the learning syllable.

The pronunciation evaluation method according to an embodiment of the present invention may further include, after generating the second text, generating a third text from the voice according to the pronunciation process using a learned second artificial neural network.

The generating of the score may include: generating a first score based on a degree of similarity between the first text and the second text; generating a second score based on a degree of similarity between the first text and the third text; and calculating the score based on the first score and the second score.
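The two-score combination described above can be illustrated with a minimal sketch. The similarity measure (`difflib.SequenceMatcher`), the weighting parameter `image_weight`, and the 0 to 100 scale are assumptions made for illustration; the specification does not fix a particular similarity metric or combination rule.

```python
from difflib import SequenceMatcher


def similarity(reference: str, hypothesis: str) -> float:
    """Return a similarity ratio in [0, 1] between two texts (assumed metric)."""
    return SequenceMatcher(None, reference, hypothesis).ratio()


def pronunciation_score(first_text: str, second_text: str, third_text: str,
                        image_weight: float = 0.5) -> float:
    """Combine an image-based and a voice-based score into one score.

    first_text:  the text the user was asked to read
    second_text: text inferred from the evaluation images (lip shape)
    third_text:  text inferred from the voice
    image_weight is a hypothetical parameter; a weighted average is only
    one possible way to calculate the score from the two sub-scores.
    """
    first_score = similarity(first_text, second_text)   # image-based
    second_score = similarity(first_text, third_text)   # voice-based
    return 100.0 * (image_weight * first_score
                    + (1.0 - image_weight) * second_score)
```

For instance, `pronunciation_score("안녕하세요", "안녕하세요", "안녕하세요")` yields the maximum score because both inferred texts match the first text exactly.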

The pronunciation evaluation method according to an embodiment of the present invention may further include, after generating the score, providing evaluation content including the one or more evaluation images to the user terminal.

The providing of the evaluation content to the user terminal may include providing the first text and the one or more evaluation images in correspondence with each other, arranging and providing the one or more evaluation images in time series; arranging and providing one or more sample images corresponding to the first text in time series; and providing a time point in which a difference between the one or more evaluation images and the one or more sample images is equal to or greater than a predetermined threshold difference.
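As a rough sketch of how the time points of incorrect pronunciation might be located, the following compares evaluation images and sample images that have already been reduced to numeric feature vectors and paired in time order. The feature representation, the mean absolute difference, and the function name `mismatch_time_points` are all assumptions; the specification only requires that a difference be compared against a predetermined threshold.

```python
def mismatch_time_points(evaluation, sample, threshold):
    """Return the time points where the user's image differs from the sample.

    evaluation, sample: lists of (time_sec, feature_vector) in time series,
    paired by index (an assumed pre-alignment step).
    A time point is reported when the mean absolute feature difference
    is greater than or equal to the threshold.
    """
    points = []
    for (time_sec, eval_vec), (_, sample_vec) in zip(evaluation, sample):
        diff = sum(abs(a - b) for a, b in zip(eval_vec, sample_vec)) / len(eval_vec)
        if diff >= threshold:
            points.append(time_sec)
    return points
```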

A pronunciation evaluation apparatus according to an embodiment of the present invention includes a control unit, wherein the control unit may generate one or more evaluation images from an input image including a pronunciation process of a user's first text and a voice according to the pronunciation process, generate a second text describing a sound corresponding to the one or more evaluation images, and generate a score based on the similarity between the first text and the second text.

The control unit may group a plurality of frames constituting the input image in units of syllables with reference to the voice, extract only a representative frame by removing, from the frames grouped in units of syllables, a frame in which at least one of the positions of the user's lips, tongue, and teeth differs from an adjacent frame by less than a predetermined threshold difference, and add the representative frame extracted for each individual syllable to the one or more evaluation images.

The control unit may generate the second text corresponding to the representative frame extracted for each individual syllable by using the learned first artificial neural network, and the first artificial neural network may be a neural network trained to output, in response to an input of at least one oral image depicting a sound, text representing the sound corresponding to the oral image.

The control unit may train the first artificial neural network by using learning data, and the learning data may include at least one learning oral image depicting a pronunciation process of a learning syllable, order information of the at least one learning oral image, and text corresponding to the learning syllable.

The controller may generate the third text from the voice according to the pronunciation process by using the learned second artificial neural network.

The control unit may generate a first score based on the similarity between the first text and the second text, generate a second score based on the similarity between the first text and the third text, and calculate the score based on the first score and the second score.

The control unit may provide evaluation content including the one or more evaluation images to the user terminal.

The control unit may provide the first text and the one or more evaluation images in correspondence with each other, arrange and provide the one or more evaluation images in time series, arrange and provide one or more sample images corresponding to the first text in time series, and provide a time point at which a difference between the one or more evaluation images and the one or more sample images is greater than or equal to a predetermined threshold difference.

The present invention can provide feedback on the user's pronunciation without the intervention of the instructor.

In addition, the present invention determines the pronunciation accuracy of the user in consideration of both the accuracy of pronunciation based on the image acquired of the user's pronunciation process and the accuracy of pronunciation based on the user's pronunciation voice, so that more accurate pronunciation evaluation is performed.

In addition, the present invention allows the user to easily correct the pronunciation by providing, in detail, at which point in time the pronunciation is wrong, along with the evaluation result of the pronunciation as described above.

FIG. 1 is a diagram schematically illustrating the configuration of a pronunciation evaluation system according to an embodiment of the present invention.

FIG. 2 is a diagram schematically illustrating the configuration of a pronunciation evaluation device 110 provided in the server 100 according to an embodiment of the present invention.

FIGS. 3 and 4 are diagrams for explaining an exemplary structure of an artificial neural network learned by the pronunciation evaluation apparatus 110 of the present invention.

FIG. 5 is a diagram for explaining a method for the controller 112 to learn the first artificial neural network 520 using a plurality of learning data 510 according to an embodiment of the present invention.

FIG. 6 is a diagram for explaining a process in which the controller 112 outputs the second text 540 using the first artificial neural network 520 according to an embodiment of the present invention.

FIG. 7 is a diagram for explaining a method for the controller 112 to learn the second artificial neural network 560 using a plurality of learning data 550 according to an embodiment of the present invention.

FIG. 8 is a diagram for explaining a process in which the controller 112 outputs the third text 580 using the second artificial neural network 560 according to an embodiment of the present invention.

FIG. 9 is a diagram for explaining a process in which the controller 112 generates one or more evaluation images according to an embodiment of the present invention.

FIG. 10 is a diagram illustrating a series of processes in which the controller 112 generates a pronunciation score from an input image according to an embodiment of the present invention.

FIG. 11 is an example of a screen 700 on which evaluation content provided to the user terminal 200 is displayed.

FIG. 12 is a flowchart illustrating a pronunciation evaluation method performed by the controller 112 according to an embodiment of the present invention.

[Explanation of reference numerals]

100: server

110: pronunciation evaluation device

111: communication unit

112: control unit

113: memory

200: user terminal

300: communication network

Since the present invention can apply various transformations and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. Effects and features of the present invention, and a method for achieving them, will become apparent with reference to the embodiments described below in detail in conjunction with the drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various forms.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description with reference to the drawings, the same or corresponding components are given the same reference numerals, and overlapping descriptions thereof will be omitted.

In the following embodiments, terms such as first, second, etc. are used for the purpose of distinguishing one component from another, not in a limiting sense. In the following embodiments, the singular expression includes the plural expression unless the context clearly dictates otherwise. In the following embodiments, terms such as "include" or "have" mean that the features or components described in the specification are present, and do not exclude in advance the possibility that one or more other features or components may be added. In the drawings, the size of the components may be exaggerated or reduced for convenience of description. For example, since the size and shape of each component shown in the drawings are arbitrarily indicated for convenience of description, the present invention is not necessarily limited to what is illustrated.

The pronunciation evaluation system according to an embodiment of the present invention may evaluate the user's pronunciation by using the user's pronunciation process and an input image including a voice according to the pronunciation process.

In addition, when providing the user's pronunciation evaluation result, the pronunciation evaluation system according to an embodiment of the present invention may provide the correct mouth shape and the user's mouth shape together with the pronunciation evaluation score, to help the user correct the pronunciation.

In the present invention, 'voice' (音聲) refers to a human voice or speech sound, and may mean a specific and physical sound produced through a human pronunciation organ.

In the present invention, an 'artificial neural network', such as the first artificial neural network and the second artificial neural network, is a neural network trained to suit a service performed by the server 100, and may mean an artificial neural network trained by a machine learning or deep learning technique. The structure of such a neural network will be described later with reference to FIGS. 3 and 4.

A pronunciation evaluation system according to an embodiment of the present invention may include a server 100, a user terminal 200, and a communication network 300, as shown in FIG. 1.

The user terminal 200 according to an embodiment of the present invention may refer to various types of devices that mediate between the user and the server 100 so that the user can use various services provided by the server 100. In other words, the user terminal 200 according to an embodiment of the present invention may mean various devices for transmitting and receiving data to and from the server 100.

The user terminal 200 according to an embodiment of the present invention may acquire an input image including the user's pronunciation process and the corresponding voice, and transmit it to the server 100. Also, the user terminal 200 may receive the evaluation content from the server 100 and provide it to the user. As shown in FIG. 1, such a user terminal 200 may mean portable terminals 201, 202, and 203, or may mean a computer 204.

Meanwhile, the user terminal 200 may include a display means for displaying content and the like in order to perform the above-described function, and an input means for obtaining a user's input for such content. In this case, the input means and the display means may be configured in various ways. For example, the input means may include, but is not limited to, a keyboard, a mouse, a trackball, a microphone, a button, a touch panel, and the like.

The communication network 300 according to an embodiment of the present invention may mean a communication network that mediates data transmission and reception between components of the pronunciation evaluation system. For example, the communication network 300 may encompass wired networks such as LANs (Local Area Networks), WANs (Wide Area Networks), MANs (Metropolitan Area Networks), and ISDNs (Integrated Service Digital Networks), and wireless networks such as wireless LANs, CDMA, Bluetooth, and satellite communication, but the scope of the present invention is not limited thereto.

The server 100 according to an embodiment of the present invention may evaluate the user's pronunciation by using the user's pronunciation process and the input image including the voice according to the pronunciation process. In addition, in providing the pronunciation evaluation result to the user terminal 200, the server 100 may provide the correct mouth shape and the user's mouth shape together with the pronunciation evaluation score to help the user correct the pronunciation.

Referring to FIG. 2, the pronunciation evaluation apparatus 110 according to an embodiment of the present invention may include a communication unit 111, a control unit 112, and a memory 113. Also, although not shown in the drawings, the pronunciation evaluation apparatus 110 according to an embodiment of the present invention may further include an input/output unit, a program storage unit, and the like.

The communication unit 111 may be a device including the hardware and software necessary for the pronunciation evaluation device 110 to transmit and receive signals, such as control signals or data signals, through a wired/wireless connection with other network devices such as the user terminal 200.

The control unit 112 may include any type of device capable of processing data, such as a processor. Here, the 'processor' may refer to, for example, a data processing device embedded in hardware having a physically structured circuit to perform a function expressed as code or commands included in a program. Examples of such a data processing device embedded in hardware include a microprocessor, a central processing unit (CPU), a processor core, a multiprocessor, an ASIC (Application-Specific Integrated Circuit), and an FPGA (Field Programmable Gate Array), but the scope of the present invention is not limited thereto.

The memory 113 performs a function of temporarily or permanently storing data processed by the pronunciation evaluation device 110. The memory may include a magnetic storage medium or a flash storage medium, but the scope of the present invention is not limited thereto. For example, the memory 113 may temporarily and/or permanently store data (e.g., coefficients) constituting the artificial neural networks. Of course, the memory 113 may also store training data for training the artificial neural networks. However, this is an example, and the spirit of the present invention is not limited thereto.

FIGS. 3 and 4 are diagrams for explaining an exemplary structure of an artificial neural network trained by the pronunciation evaluation apparatus 110 of the present invention. Hereinafter, for convenience of description, the first artificial neural network and the second artificial neural network will be collectively referred to as an 'artificial neural network'.

The artificial neural network according to an embodiment of the present invention may be an artificial neural network according to a convolutional neural network (CNN) model as shown in FIG. 3. In this case, the CNN model may be a layered model used to extract features of input data by alternately applying a plurality of computational layers (convolutional layers and pooling layers).

The controller 112 according to an embodiment of the present invention may construct or train an artificial neural network model by processing the learning data according to a supervised learning technique. How the controller 112 trains the artificial neural network will be described in detail later.

The control unit 112 according to an embodiment of the present invention may train the artificial neural network by repeating, over a plurality of training data, a process of inputting any one item of input data into the artificial neural network and updating the weight of each layer and/or each node so that the generated output value approaches the value labeled in the corresponding training data.

In this case, the controller 112 according to an embodiment of the present invention may update the weight (or coefficient) of each layer and/or each node according to a back propagation algorithm.
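The supervised update loop described above can be sketched in a deliberately reduced form. The example below trains a single linear node by gradient descent on a squared error, which is the one-layer special case of back propagation; it is not the network of the specification, and the function name `train_linear`, the learning rate, and the epoch count are illustrative assumptions.

```python
def train_linear(samples, epochs=200, lr=0.1):
    """Train a single linear node so its output approaches each label.

    samples: list of (inputs, label) pairs, i.e. labeled training data.
    Returns the learned weights and bias.
    """
    n_inputs = len(samples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        for inputs, label in samples:
            # Forward pass: compute the output for one input.
            output = sum(w * x for w, x in zip(weights, inputs)) + bias
            # Gradient of 0.5 * (output - label)**2 with respect to output.
            error = output - label
            # Backward pass: move each weight against its gradient.
            weights = [w - lr * error * x for w, x in zip(weights, inputs)]
            bias -= lr * error
    return weights, bias
```

With labeled pairs drawn from y = 2x + 1, repeated updates drive the weight toward 2 and the bias toward 1, which mirrors the idea of bringing the output value close to the labeled value.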

The control unit 112 according to an embodiment of the present invention may generate a convolution layer for extracting feature values of input data, and a pooling layer that forms a feature map by combining the extracted feature values.

In addition, the control unit 112 according to an embodiment of the present invention may combine the generated feature maps to generate a fully connected layer that prepares to determine the probability that the input data corresponds to each of a plurality of items.

The controller 112 according to an embodiment of the present invention may calculate an output layer including an output corresponding to input data.

In the example shown in FIG. 3, input data is divided into 5X7 blocks, a 5X3 unit block is used to generate a convolution layer, and a 1X4 or 1X2 unit block is used to generate a pooling layer. However, this is exemplary and the spirit of the present invention is not limited thereto. Accordingly, the type of input data and/or the size of each block may be variously configured.
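To make the block sizes in the FIG. 3 example concrete, the following dependency-free sketch applies a 5X3 convolution block to a 5X7 input and then pools the result with a 1X2 or 1X4 block. Reading the sizes as rows by columns, using a "valid" convolution without padding, and using max pooling are assumptions; the specification does not state them.

```python
def convolve2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, as used in CNN convolution layers."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k_rows) for j in range(k_cols))
             for c in range(out_cols)]
            for r in range(out_rows)]


def max_pool(feature_map, pool_rows, pool_cols):
    """Non-overlapping max pooling; leftover rows/columns that do not
    fill a complete window are dropped."""
    rows = len(feature_map) // pool_rows
    cols = len(feature_map[0]) // pool_cols
    return [[max(feature_map[r * pool_rows + i][c * pool_cols + j]
                 for i in range(pool_rows) for j in range(pool_cols))
             for c in range(cols)]
            for r in range(rows)]


# 5 rows x 7 columns input whose value equals its column index.
image = [[float(col) for col in range(7)] for _ in range(5)]
kernel = [[1.0] * 3 for _ in range(5)]          # 5x3 block of ones
feature_map = convolve2d(image, kernel)          # 1x5 feature map
pooled = max_pool(feature_map, 1, 2)             # 1x2 pooled map
```

Under these assumptions a 5X7 input and a 5X3 block give a 1X5 feature map, which a 1X2 pooling block reduces to 1X2 and a 1X4 pooling block reduces to 1X1.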

Meanwhile, such an artificial neural network may be stored in the above-described memory 113 in the form of coefficients of at least one node constituting the artificial neural network, weights of the nodes, and coefficients of a function defining a relationship between a plurality of layers constituting the artificial neural network. Of course, the structure of the artificial neural network may also be stored in the memory 113 in the form of source code and/or a program.

The artificial neural network according to an embodiment of the present invention may be an artificial neural network according to a recurrent neural network (RNN) model as shown in FIG. 4.

Referring to FIG. 4, the artificial neural network according to the recurrent neural network (RNN) model may include an input layer L1 including at least one input node N1, a hidden layer L2 including a plurality of hidden nodes N2, and an output layer L3 including at least one output node N3. In this case, one or more evaluation images for generating the second text may be input to at least one input node N1 of the input layer L1.

The hidden layer L2 may include one or more fully connected layers as illustrated. When the hidden layer L2 includes a plurality of layers, the artificial neural network may include a function (not shown) defining a relationship between each hidden layer.

At least one output node N3 of the output layer L3 may include an output value generated from the input value of the input layer L1 by the artificial neural network under the control of the controller 112 . For example, the output layer L3 may include second text data describing a sound corresponding to one or more evaluation images.

Meanwhile, a value included in each node of each layer may be a vector. In addition, each node may include a weight corresponding to the importance of the node.

Meanwhile, the artificial neural network may include a first function F1 defining the relationship between the input layer L1 and the hidden layer L2, and a second function F2 defining the relationship between the hidden layer L2 and the output layer L3.

The first function F1 may define a connection relationship between the input node N1 included in the input layer L1 and the hidden node N2 included in the hidden layer L2. Similarly, the second function F2 may define a connection relationship between the hidden node N2 included in the hidden layer L2 and the output node N3 included in the output layer L3.

The first function F1, the second function F2, and the functions between the hidden layers may include a recurrent neural network model that outputs a result based on an input of a previous node.
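A minimal forward pass of a simple recurrent network may help picture the roles of F1 and F2. In this sketch the matrix `w_in` stands in for F1, `w_out` stands in for F2, and `w_rec` carries the previous hidden state forward; these names, the tanh activation, and the plain (Elman-style) recurrence are assumptions, since the specification leaves the concrete model open.

```python
import math


def rnn_forward(inputs, w_in, w_rec, w_out):
    """Run a simple recurrent network over a sequence of input vectors.

    inputs: list of input vectors in time order (input layer L1)
    w_in:   hidden x input matrix, playing the role of F1
    w_rec:  hidden x hidden matrix applied to the previous hidden state
    w_out:  output x hidden matrix, playing the role of F2
    Returns one output vector per time step (output layer L3).
    """
    hidden = [0.0] * len(w_rec)
    outputs = []
    for x in inputs:
        # The new hidden state depends on the current input and on the
        # hidden state produced at the previous step.
        hidden = [math.tanh(sum(w * xi for w, xi in zip(w_in[h], x))
                            + sum(w * hj for w, hj in zip(w_rec[h], hidden)))
                  for h in range(len(hidden))]
        outputs.append([sum(w * hj for w, hj in zip(row, hidden))
                        for row in w_out])
    return outputs
```

Because the hidden state is fed back, the output at a given step reflects earlier inputs in the sequence, which is why such a model suits ordered oral images.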

In the process of learning the artificial neural network by the controller 112 , the first function F1 and the second function F2 may be learned based on a plurality of learning data. Of course, in the process of learning the artificial neural network, functions between the plurality of hidden layers in addition to the above-described first function F1 and second function F2 may also be learned.

The artificial neural network according to an embodiment of the present invention may be trained in a supervised learning method based on labeled learning data.

The control unit 112 according to an embodiment of the present invention may train the artificial neural network by repeating, over a plurality of training data, a process of inputting any one item of input data into the artificial neural network and updating the above-described functions (F1, F2, functions between hidden layers, etc.) so that the generated output value approaches the value labeled in the corresponding training data.

In this case, the controller 112 according to an embodiment of the present invention may update the above-described functions (F1, F2, functions between hidden layers, etc.) according to a back propagation algorithm. However, this is an example, and the spirit of the present invention is not limited thereto.

Meanwhile, the types and/or structures of the artificial neural networks described in FIGS. 3 and 4 are exemplary and the spirit of the present invention is not limited thereto. Therefore, artificial neural networks of various types of models may correspond to the 'artificial neural networks' described throughout the specification.

Hereinafter, the process by which the control unit 112 of the pronunciation evaluation apparatus 110 trains the first artificial neural network and the second artificial neural network will be described first, and the method of evaluating the user's pronunciation using the trained first and second artificial neural networks will be described afterward.

The control unit 112 according to an embodiment of the present invention may learn the first artificial neural network and the second artificial neural network by using respective learning data.

FIG. 5 is a diagram for explaining a method for the controller 112 to learn the first artificial neural network 520 using a plurality of learning data 510 according to an embodiment of the present invention. FIG. 6 is a diagram for explaining a process in which the controller 112 outputs the second text 540 using the first artificial neural network 520 according to an embodiment of the present invention.

The first artificial neural network 520 according to an embodiment of the present invention may mean a neural network that has learned the correlation between at least one learning oral image depicting the pronunciation process of a learning syllable, included in each of the plurality of learning data 510, and the text corresponding to that image.

Therefore, as shown in FIG. 6, the first artificial neural network 520 according to an embodiment of the present invention may mean a neural network that has been trained to output, in response to an input of at least one oral image 530 depicting a sound, the second text 540 representing the sound (音) corresponding to the oral image 530.

In this case, each of the plurality of learning data 510 may include at least one learning oral image describing the pronunciation process of the learning syllable, order information of the at least one learning oral image, and text corresponding to the learning syllable.

For example, the first learning data 511 may include at least one learning oral image 511A depicting the pronunciation process of a learning syllable, order information 511B of the at least one learning oral image, and text 511C corresponding to the learning syllable. Similarly, the second learning data 512 and the third learning data 513 may each include at least one learning oral image depicting the pronunciation process of a learning syllable, order information of the at least one learning oral image, and text corresponding to the learning syllable.
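One way to picture a single item of learning data for the first artificial neural network is as a small record with three fields. The class name `OralImageSample`, the use of file names to stand for images, and the integer order values are hypothetical choices for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OralImageSample:
    """One item of learning data: oral images, their order, and the label text."""
    images: List[str]   # hypothetical: file names standing in for oral images
    order: List[int]    # order information, one position per image
    text: str           # text corresponding to the learning syllable

    def ordered_images(self) -> List[str]:
        """Return the images sorted by their order information."""
        return [image for _, image in sorted(zip(self.order, self.images))]


# Hypothetical example: two oral images stored out of order for one syllable.
sample = OralImageSample(images=["mouth_b.png", "mouth_a.png"],
                         order=[2, 1],
                         text="안")
```

Keeping the order information separate from the images lets a training routine restore the time sequence of the pronunciation process before feeding the images to the network.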

FIG. 7 is a diagram for explaining a method for the controller 112 to learn the second artificial neural network 560 using a plurality of learning data 550 according to an embodiment of the present invention. FIG. 8 is a diagram for explaining a process in which the controller 112 outputs the third text 580 using the second artificial neural network 560 according to an embodiment of the present invention.

The second artificial neural network 560 according to an embodiment of the present invention may mean a neural network that has learned the correlation between the learning voice included in each of the plurality of learning data 550 and the text corresponding to that voice.

Accordingly, as shown in FIG. 8, the second artificial neural network 560 according to an embodiment of the present invention may mean a neural network trained to output, in response to an input of the voice 570 according to the pronunciation process, the third text 580 corresponding to the voice 570.

In this case, each of the plurality of learning data 550 may include a learning voice and a text corresponding to the learning voice.

For example, the first training data 551 may include a training voice 551A and a corresponding text 551B. Similarly, the second training data 552 and the third training data 553 may each include a training voice and a text corresponding to the learning voice.

Hereinafter, it will be described on the assumption that the first artificial neural network 520 and the second artificial neural network 560 have been learned according to the process described with reference to FIGS. 5 to 8 .

The controller 112 according to an embodiment of the present invention may acquire an input image from the user terminal 200. In this case, the input image may include the process of the user pronouncing the first text and the user's voice generated according to that pronunciation process. For example, the controller 112 may provide the learning content including the first text “hello” to the user terminal 200 and request the user to read the first text aloud. In this case, the input image may be an image capturing how the user's oral organs change over time.

The controller 112 according to an embodiment of the present invention may generate one or more evaluation images from the input images obtained according to the above-described process.

The controller 112 according to an embodiment of the present invention may group a plurality of frames constituting the input image in units of syllables with reference to the voice included in the input image.

For example, as shown in FIG. 9, when the input image captures the process of the user pronouncing "hello" (a Korean greeting whose first syllables are rendered here as "an" and "nyeong"), the control unit 112 may group the frames capturing the user pronouncing "an" into a first group, and the frames capturing the user pronouncing "nyeong" into a second group. Of course, the control unit 112 may group the frames for the remaining syllables in the same way.
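The syllable-unit grouping can be sketched as assigning each frame to the syllable whose time span contains the frame's timestamp. How the syllable spans are obtained from the voice is not specified in this passage, so the `syllable_spans` input (for example, from some audio alignment step) and the function name are assumptions.

```python
def group_frames_by_syllable(frame_times, syllable_spans):
    """Group frame indices by the syllable being pronounced.

    frame_times:    list of frame timestamps in seconds
    syllable_spans: list of (syllable, start_sec, end_sec) tuples, assumed
                    to come from an alignment of the voice track
    Returns {syllable_index: [frame indices]}; frames outside every span
    are left ungrouped.
    """
    groups = {index: [] for index in range(len(syllable_spans))}
    for frame_index, time_sec in enumerate(frame_times):
        for syllable_index, (_, start, end) in enumerate(syllable_spans):
            if start <= time_sec < end:
                groups[syllable_index].append(frame_index)
                break
    return groups
```

Keying the groups by syllable index rather than by the syllable text avoids merging frames when the same syllable occurs twice in the first text.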

The control unit 112 according to an embodiment of the present invention selects a frame in which at least one of the user's lips, tongue position, and tooth position among the frames grouped in syllable units is less than a predetermined threshold difference from an adjacent frame in the grouped frame. By removing it, only the representative frame can be extracted (or only the representative frame is preserved). In this case, a reference 'adjacent frame' may be determined according to a predetermined rule. For example, the adjacent frame may be a frame at which a corresponding syllable starts, or may be an I-frame of an image. However, this is an example, and the spirit of the present invention is not limited thereto.

For example, in the example shown in FIG. 9, the controller 112 may extract (or preserve) the first frame 611 as a representative frame and remove the second frame 612 from the first group.

However, as described above, this method of determining the representative frame and the removed frame is exemplary. Any method that extracts only certain frames from among a plurality of frames and excludes the remaining frames may be used as the method of determining the representative frame and the removed frame in the present invention.
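One way to picture the representative-frame extraction is the sketch below. It rests on several assumptions that are not stated in this description: each frame is reduced to a numeric feature vector (for example, coordinates describing the lips, tongue, and teeth), the reference frame is the frame at which the syllable starts, and the difference between two frames is the largest absolute difference over their features. The function names are hypothetical.

```python
from typing import List, Sequence


def frame_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest absolute per-feature difference (an assumed measure)."""
    return max(abs(x - y) for x, y in zip(a, b))


def representative_frames(
    features: Sequence[Sequence[float]], threshold: float
) -> List[int]:
    """Return indices of frames to keep within one syllable group.

    The first frame of the group is used as the reference frame and is
    always kept. Any other frame that differs from the reference by less
    than `threshold` is removed.
    """
    if not features:
        return []
    reference = features[0]
    kept = [0]
    for i in range(1, len(features)):
        if frame_difference(features[i], reference) >= threshold:
            kept.append(i)
    return kept


# Four frames of one syllable group, two features per frame.
group = [[0.0, 0.0], [0.01, 0.0], [0.5, 0.1], [0.02, 0.01]]
kept = representative_frames(group, threshold=0.1)
```

With these values, only the reference frame and the third frame are kept; the other two differ from the reference by less than the threshold and are removed.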

The control unit 112 according to an embodiment of the present invention may add the representative frame extracted for each syllable through the above-described process to one or more evaluation images. One or more evaluation images may be used to evaluate the user's pronunciation, and details thereof will be described later.

The controller 112 according to an embodiment of the present invention may generate a second text describing a sound corresponding to one or more evaluation images.

The controller 112 according to an embodiment of the present invention may generate the second text 630 corresponding to the representative frame 620 extracted for each individual syllable by using the trained first artificial neural network 520. Here, the first artificial neural network 520 is a neural network trained according to the process described with reference to FIGS. 5 to 6, and may be a neural network trained to output, in response to an input of at least one oral image depicting a sound, text representing the sound corresponding to the oral image. For example, the controller 112 according to an embodiment of the present invention may input the representative frame for the syllable "an", extracted according to the process described with reference to FIG. 9, to the first artificial neural network 520 and obtain as its output the text corresponding to that representative frame (e.g., "an").

Meanwhile, the controller 112 according to an embodiment of the present invention may generate the third text 650 from the voice 640 included in the input image (that is, the voice according to the pronunciation process) by using the trained second artificial neural network 560. Here, the second artificial neural network 560 is a neural network trained according to the process described with reference to FIGS. 7 to 8, and may be a neural network trained (or learned) to output, in response to a voice input, a third text corresponding to the voice. For example, the controller 112 according to an embodiment of the present invention may input the voice corresponding to "an" of "hello" to the second artificial neural network 560 and obtain as its output the text corresponding to the voice (e.g., "an").

The controller 112 according to an embodiment of the present invention may calculate the user's pronunciation score based on the second text 630 and the third text 650 obtained according to the above-described process.

In an embodiment of the present invention, the controller 112 may generate a score based on the similarity between the first text 660 and the second text 630 . In this case, the first text 660 is a text included in the learning content provided in advance to the user, and may mean a text requested to be read by the user.

In another embodiment of the present invention, the controller 112 may generate a first score 670 based on the similarity between the first text 660 and the second text 630, and a second score 680 based on the similarity between the first text 660 and the third text 650. The controller 112 may then calculate the score 690 based on the first score 670 and the second score 680.
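The two-score calculation can be illustrated with the following sketch. The description does not fix a similarity measure or a rule for combining the scores, so the choices here are assumptions: similarity is taken from Python's `difflib.SequenceMatcher.ratio()`, scores are scaled to 0 to 100, and the final score is a weighted average controlled by the hypothetical parameter `image_weight`.

```python
from difflib import SequenceMatcher
from typing import Tuple


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two texts (assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()


def pronunciation_score(
    first_text: str,
    second_text: str,
    third_text: str,
    image_weight: float = 0.5,
) -> Tuple[float, float, float]:
    """Return (first score, second score, combined score).

    first_text: text the user was asked to read.
    second_text: text generated from the evaluation images.
    third_text: text generated from the voice.
    """
    first_score = 100.0 * similarity(first_text, second_text)
    second_score = 100.0 * similarity(first_text, third_text)
    # Weighted average of the image-based and voice-based scores.
    score = image_weight * first_score + (1.0 - image_weight) * second_score
    return first_score, second_score, score


# The lip shape matched both syllables, but the voice matched only one.
scores = pronunciation_score("안녕", "안녕", "안영")
```

In this example the image-based score is 100.0, the voice-based score is 50.0, and the combined score is 75.0. Other similarity measures, such as an edit distance over phonemes, would fit the same structure.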

As described above, the present invention determines the user's pronunciation accuracy by considering both the accuracy of pronunciation based on the image capturing the user's pronunciation process and the accuracy of pronunciation based on the user's voice, so that a more accurate pronunciation evaluation can be performed.

The controller 112 according to an embodiment of the present invention may provide the evaluation content including the score calculated according to the above-described process and one or more evaluation images to the user terminal.

Referring to FIG. 11, the screen may include an area 710 in which the user's score is displayed, an area 720 in which the user's evaluation images are displayed, an area 730 in which sample images are displayed, and an area 740 indicating the time points at which the difference between the evaluation image and the sample image is equal to or greater than a predetermined threshold difference.

The controller 112 according to an embodiment of the present invention may provide the evaluation content so that the score calculated according to the above-described process is displayed in the area 710 where the score is displayed.

In addition, the controller 112 according to an embodiment of the present invention may provide the evaluation content so that the evaluation images extracted according to the process described with reference to FIG. 9 are arranged and displayed in time series in the area 720 where the user's evaluation images are displayed.

In addition, the controller 112 according to an embodiment of the present invention may provide the evaluation content so that the sample images (or correct answer images) stored in the memory 113 are arranged and displayed in time series in the area 730 where the sample images are displayed.

In addition, the controller 112 according to an embodiment of the present invention may identify the time points at which the difference between the evaluation image and the sample image is equal to or greater than a predetermined threshold difference, and display the identified time points in the area 740.
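The identification of such time points might look like the sketch below. It assumes, beyond what is described here, that the evaluation images and sample images are already paired one to one, that each is represented by a numeric feature vector, and that the difference is the largest absolute per-feature difference. The name `mismatch_times` and its parameters are hypothetical.

```python
from typing import List, Sequence


def mismatch_times(
    times: Sequence[float],
    evaluation_features: Sequence[Sequence[float]],
    sample_features: Sequence[Sequence[float]],
    threshold: float,
) -> List[float]:
    """Return the time points at which the evaluation image differs from
    the paired sample image by at least `threshold`."""
    flagged: List[float] = []
    for t, evaluation, sample in zip(times, evaluation_features, sample_features):
        # Largest absolute per-feature difference (an assumed measure).
        difference = max(abs(e - s) for e, s in zip(evaluation, sample))
        if difference >= threshold:
            flagged.append(t)
    return flagged


times = [0.0, 0.3, 0.6]
evaluation = [[0.2, 0.1], [0.9, 0.4], [0.3, 0.3]]
sample = [[0.2, 0.1], [0.4, 0.4], [0.3, 0.2]]
flagged = mismatch_times(times, evaluation, sample, threshold=0.3)
```

Only the second time point (0.3 s) is flagged here, and it is this kind of time point that the area 740 would mark.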

For example, as shown in FIG. 11, the controller 112 may cause the above-described images and the first text ("an") to be displayed in correspondence with the passage of time, and may display, in the form of a bar 731, the time points at which the difference between the images is equal to or greater than the predetermined threshold difference.

However, the display form shown in FIG. 11 is exemplary and the spirit of the present invention is not limited thereto.

The present invention allows the user to easily correct his or her pronunciation by indicating, together with the evaluation result of the pronunciation, the point in time at which the pronunciation was incorrect.

FIG. 12 is a flowchart illustrating a pronunciation evaluation method performed by the controller 112 according to an embodiment of the present invention. Hereinafter, descriptions overlapping those given with reference to FIGS. 1 to 11 are omitted where possible, and the method is described with reference to FIGS. 1 to 11 together.

The controller 112 according to an embodiment of the present invention may train the first artificial neural network and the second artificial neural network by using their respective training data. (S1210)

Accordingly, the first artificial neural network 520 according to an embodiment of the present invention may be, as shown in FIG. 6, a neural network trained (or learned) to output, in response to an input of at least one oral image 530 depicting a sound, the second text 540 representing the sound (音) corresponding to the oral image 530.

The controller 112 according to an embodiment of the present invention may acquire an input image from the user terminal 200 . (S1220)

In this case, the input image may include the process of the user pronouncing a first text and the user's voice produced during that pronunciation process. For example, the controller 112 may provide learning content including the first text "hello" to the user terminal 200 and request the user to read the first text aloud. In this case, the input image may be an image capturing how the user's oral organs change over time.

The controller 112 according to an embodiment of the present invention may generate one or more evaluation images from the input image obtained according to the above-described process. (S1230)

For example, as in the example shown in FIG. 9, when the input image captures the user pronouncing "hello" (the Korean greeting beginning "an-nyeong"), the controller 112 may group the frames showing the user pronouncing "an" into a first group and the frames showing the user pronouncing "nyeong" into a second group. Of course, the controller 112 may group the frames for the remaining syllables in the same way.

The controller 112 according to an embodiment of the present invention may generate a second text describing a sound corresponding to one or more evaluation images. (S1240)

The controller 112 according to an embodiment of the present invention may generate the second text 630 corresponding to the representative frame 620 extracted for each individual syllable by using the trained first artificial neural network 520. Here, the first artificial neural network 520 is a neural network trained according to the process described with reference to FIGS. 5 to 6, and may be a neural network trained to output, in response to an input of at least one oral image depicting a sound, text representing the sound corresponding to the oral image. For example, the controller 112 according to an embodiment of the present invention may input the representative frame for the syllable "an", extracted according to the process described with reference to FIG. 9, to the first artificial neural network 520 and obtain as its output the text corresponding to that representative frame (e.g., "an").

Meanwhile, the controller 112 according to an embodiment of the present invention may generate the third text 650 from the voice 640 included in the input image (that is, the voice according to the pronunciation process) by using the trained second artificial neural network 560. (S1250)

In this case, the second artificial neural network 560 is a neural network trained according to the process described with reference to FIGS. 7 to 8, and may be a neural network trained (or learned) to output, in response to a voice input, a third text corresponding to the voice. For example, the controller 112 according to an embodiment of the present invention may input the voice corresponding to "an" of "hello" to the second artificial neural network 560 and obtain as its output the text corresponding to the voice (e.g., "an").

The controller 112 according to an embodiment of the present invention may calculate the user's pronunciation score based on the second text 630 and the third text 650 obtained according to the above-described process. (S1260)

The controller 112 according to an embodiment of the present invention may provide the evaluation content including the score calculated according to the above-described process and one or more evaluation images to the user terminal. (S1270)

Referring to FIG. 11, the screen may include an area 710 in which the user's score is displayed, an area 720 in which the user's evaluation images are displayed, an area 730 in which sample images are displayed, and an area 740 indicating the time points at which the difference between the evaluation image and the sample image is equal to or greater than a predetermined threshold difference.

In addition, the controller 112 according to an embodiment of the present invention may provide the evaluation content so that the sample images (or correct answer images) stored in the memory 113 are arranged and displayed in time series in the area 730 where the sample images are displayed.

For example, as shown in FIG. 11, the controller 112 may cause the above-described images and the first text ("an") to be displayed in correspondence with the passage of time, and may display, in the form of a bar 731, the time points at which the difference between the images is equal to or greater than the predetermined threshold difference.
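The order of steps S1230 to S1270 can be summarized in a short sketch. It is only an outline under stated assumptions: the two trained neural networks and the frame selection are passed in as plain callables (`select_frames`, `image_to_text`, `voice_to_text`, all hypothetical names), the per-syllable texts are simply concatenated, and the score is the mean of two `difflib.SequenceMatcher` similarity ratios scaled to 0 to 100.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Sequence


def evaluate_pronunciation(
    first_text: str,
    frames: Sequence[object],
    voice: object,
    select_frames: Callable[[Sequence[object], object], List[object]],
    image_to_text: Callable[[object], str],
    voice_to_text: Callable[[object], str],
) -> Dict[str, object]:
    # S1230: generate the evaluation images (representative frames).
    evaluation_images = select_frames(frames, voice)
    # S1240: second text from the images (stand-in for network 520).
    second_text = "".join(image_to_text(image) for image in evaluation_images)
    # S1250: third text from the voice (stand-in for network 560).
    third_text = voice_to_text(voice)
    # S1260: score; averaging the two similarities is an assumption.
    first_score = SequenceMatcher(None, first_text, second_text).ratio()
    second_score = SequenceMatcher(None, first_text, third_text).ratio()
    score = 100.0 * (first_score + second_score) / 2.0
    # S1270: evaluation content to be provided to the user terminal.
    return {
        "score": score,
        "evaluation_images": evaluation_images,
        "second_text": second_text,
        "third_text": third_text,
    }


# Toy stand-ins for the frame selection and the two networks.
result = evaluate_pronunciation(
    "안녕",
    ["f0", "f1", "f2"],
    "audio",
    select_frames=lambda frames, voice: [frames[0], frames[2]],
    image_to_text=lambda image: {"f0": "안", "f2": "녕"}[image],
    voice_to_text=lambda voice: "안녕",
)
```

Passing the networks in as callables keeps the sketch independent of any particular model or framework, neither of which is specified in this description.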

The embodiment according to the present invention described above may be implemented in the form of a computer program that can be executed through various components on a computer, and such a computer program may be recorded in a computer-readable medium. In this case, the medium may store a program executable by a computer. Examples of the medium include magnetic media such as a hard disk, a floppy disk, and a magnetic tape; optical recording media such as CD-ROM and DVD; magneto-optical media such as a floptical disk; and media configured to store program instructions, including ROM, RAM, flash memory, and the like.

Meanwhile, the computer program may be specially designed and configured for the present invention, or may be known and available to those skilled in the computer software field. Examples of the computer program may include not only machine code generated by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The specific implementations described in the present invention are only examples and do not limit the scope of the present invention in any way. For brevity of the specification, descriptions of conventional electronic components, control systems, software, and other functional aspects of the systems may be omitted. In addition, the connections or connecting members of the lines between the components shown in the drawings exemplarily represent functional connections and/or physical or circuit connections, and in an actual device they may be represented as various replaceable or additional functional connections, physical connections, or circuit connections. In addition, unless a component is specifically described with a term such as "essential" or "important", it may not be a necessary component for the application of the present invention.

Therefore, the spirit of the present invention should not be limited to the above-described embodiments, and not only the claims set forth below but also all scopes equivalent to, or equivalently modified from, those claims fall within the scope of the spirit of the present invention.

Claims

1. A pronunciation evaluation method, comprising:

generating one or more evaluation images from an input image including a process of a user pronouncing a first text and a voice according to the pronunciation process;

generating a second text describing a sound corresponding to the one or more evaluation images; and

generating a score based on a degree of similarity between the first text and the second text.
2. The method according to claim 1,

wherein the generating of the one or more evaluation images comprises:

grouping a plurality of frames constituting the input image in units of syllables with reference to the voice;

extracting only a representative frame by removing, from among the frames grouped in units of syllables, a frame in which at least one of the positions of the user's lips, tongue, and teeth differs from an adjacent frame in the grouped frames by less than a predetermined threshold difference; and

adding the representative frame extracted for each syllable to the one or more evaluation images.
3. The method according to claim 2,

wherein the generating of the second text comprises

generating the second text corresponding to the representative frame extracted for each individual syllable by using a trained first artificial neural network,

wherein the first artificial neural network is

a neural network trained to output, in response to an input of at least one oral image depicting a sound, text representing the sound corresponding to the oral image.
4. The method according to claim 1,

wherein the pronunciation evaluation method further comprises,

after the generating of the second text,

generating a third text from the voice according to the pronunciation process by using a trained second artificial neural network.
5. The method according to claim 4,

wherein the generating of the score comprises:

generating a first score based on a degree of similarity between the first text and the second text;

generating a second score based on a degree of similarity between the first text and the third text; and

calculating the score based on the first score and the second score.