CN111951629A - Pronunciation correction system, method, medium and computing device


Info

Publication number
CN111951629A
Authority
CN
China
Prior art keywords
pronunciation
user
pronunciation correction
correction
image
Prior art date
Legal status
Pending
Application number
CN201910408726.1A
Other languages
Chinese (zh)
Inventor
刘晨晨
沈欣尧
崔守首
胡太
孙怿
余津锐
刘阿猛
纪阳
Current Assignee
Shanghai Liulishuo Information Technology Co ltd
Original Assignee
Shanghai Liulishuo Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Liulishuo Information Technology Co ltd
Priority to CN201910408726.1A
Publication of CN111951629A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G09 - EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B - EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B7/00 - Electrically-operated teaching apparatus or devices working with questions and answers
    • G09B7/02 - Electrically-operated teaching apparatus or devices working with questions and answers of the type wherein the student is expected to construct an answer to the question which is presented or wherein the machine gives an answer to the question presented by a student
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 - Classification, e.g. identification
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

An embodiment of the invention provides a pronunciation correction system. The system comprises a test module configured to output test content determined from the user's learning data and to collect first test data that the user inputs in response to that content, the first test data comprising facial image data and voice data; and a pronunciation correction module configured to extract the user's pronunciation features from the first test data, match a corresponding pronunciation correction strategy, and feed pronunciation correction content back to the user based on that strategy, the correction content indicating the type of pronunciation difference and the corresponding correction strategy. Because a targeted pronunciation correction strategy is derived from the user's test data and matched with corresponding correction content, the system delivers targeted correction feedback, corrects the user's pronunciation problems, improves the correction effect, and improves the user's learning experience. The invention further provides a pronunciation correction method, a medium and a computing device.

Description

Pronunciation correction system, method, medium and computing device
Technical Field
Embodiments of the present invention relate to the field of software, and more particularly to a pronunciation correction system, method, medium, and computing device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Pronunciation is one of the key skills in language learning. In general, learners of any language can improve their pronunciation by reading aloud, follow-along reading and similar exercises. In most cases, however, a learner cannot tell whether his or her own pronunciation is accurate. Traditional in-person instruction can evaluate a learner's pronunciation, but the evaluation is limited by the instructor's ability and cannot fully and accurately reflect the learner's pronunciation problems.
Currently, the pronunciation evaluation/correction functions in existing language learning software or terminals record the user's voice and return an evaluation result that merely tells the user whether the pronunciation is accurate. The result is limited in content, typically a single score of pronunciation accuracy, and cannot correct specific pronunciation errors, so it lacks pertinence. In other words, existing pronunciation evaluation cannot expose the weak links in most users' pronunciation, existing correction functions cannot train users on those weak links, and users' pronunciation, reading ability and fluency are therefore difficult to improve.
Therefore, there is a need for an improved pronunciation correction scheme to solve the above-mentioned technical problems.
Disclosure of Invention
The pronunciation evaluation result produced by the evaluation/correction functions in current language learning software or tools is limited in content and cannot indicate the specific errors in a user's pronunciation, so it lacks pertinence; that is, current pronunciation evaluation cannot expose the weak links in most users' pronunciation, current correction functions cannot train users on those weak links, and users' pronunciation, reading ability and fluency are difficult to improve. An improved pronunciation correction solution is therefore highly needed to solve the above technical problems.
In this context, embodiments of the present invention are intended to provide a pronunciation correction system, method, medium, and computing device.
In a first aspect of embodiments of the present invention, there is provided a pronunciation correction system comprising: a test module configured to output test contents determined based on the user learning data; acquiring first test data input by a user according to test contents, wherein the first test data comprises facial image data and voice data;
a pronunciation correction module configured to extract user pronunciation characteristics from the first test data and match a corresponding pronunciation correction strategy; and feeding back pronunciation correction contents to the user based on the pronunciation correction strategy, wherein the pronunciation correction contents are used for indicating the pronunciation difference type and the corresponding pronunciation correction strategy.
In one embodiment of the invention, the test module is further configured to: after the pronunciation correction module feeds back pronunciation correction contents to the user based on the pronunciation correction strategy, collecting second test data input by the user according to the pronunciation correction contents, wherein the second test data comprises facial image data and voice data; and obtaining an exercise result based on the second test data and a preset exercise strategy.
In one embodiment of the invention, the test module is further configured to: before the first test data or the second test data is acquired, whether the image acquisition device identifies a face region of the user is determined.
When the test module collects the first test data or the second test data, the test module is specifically configured to: if the image acquisition equipment identifies the face area of the user, outputting a test starting instruction to the user; and/or triggering the image acquisition equipment to start image data acquisition on the face area of the user to obtain face image data, wherein the face image data comprises image data corresponding to the facial action generated by the user based on the test content or pronunciation correction content.
In one embodiment of the invention, the test module is further configured to: after judging whether the image acquisition equipment identifies the face area of the user, if the image acquisition equipment does not identify the face area of the user, outputting an identification result; and/or indicating to the user through the adjusting instruction to adjust the relative position between the face area of the user and the image acquisition equipment.
In one embodiment of the invention, the facial image data includes mouth shape data and/or facial key point data.
In one embodiment of the invention, the user articulation feature comprises an articulation image feature. When the pronunciation correction module extracts the user pronunciation characteristics from the first test data and matches the corresponding pronunciation correction strategy, the pronunciation correction module is specifically configured to: extracting pronunciation image features from at least one image frame corresponding to the facial image data, wherein the pronunciation image features comprise user pronunciation mouth shape features corresponding to preset phonetic symbols; comparing the user pronunciation mouth shape characteristics corresponding to the preset phonetic symbols with the standard mouth shape characteristics corresponding to the preset phonetic symbols and matching the corresponding mouth shape difference types; and setting the mouth shape difference correction strategy corresponding to the mouth shape difference type as a pronunciation correction strategy.
In one embodiment of the present invention, the pronunciation correction content includes a pronunciation correction image. When pushing pronunciation correction content to the user based on the pronunciation correction strategy, the pronunciation correction module is specifically configured to: simulate, based on the mouth shape difference correction strategy and the user facial features extracted from the facial image data, the mouth shape image the user would present when pronouncing the preset phonetic symbol correctly, and use it as a mouth shape correction guide image; and push the simulated guide image to the user as pronunciation correction content.
In one embodiment of the present invention, the pronunciation correction module is further configured to: before collecting pronunciation correction data input by a user according to pronunciation correction contents, judging whether the image collection equipment identifies a user face area; the pronunciation correction module is specifically configured to, when collecting pronunciation correction data input by a user according to pronunciation correction contents and determining a pronunciation correction result: if the image acquisition equipment identifies the face area of the user, outputting an image acquisition starting instruction to the user; and/or triggering image acquisition equipment to start image data acquisition on a user face area so as to take the obtained face image data as pronunciation correction data; and comparing the user pronunciation mouth shape features extracted from the pronunciation correction data with the standard mouth shape features and matching corresponding pronunciation correction results.
In a second aspect of the embodiments of the present invention, there is also provided a pronunciation correction method, including: outputting test contents determined based on the user learning data; collecting first test data input by a user according to test contents and determining a pronunciation correction strategy, wherein the first test data comprises facial image data and voice data; extracting user pronunciation characteristics from the first test data and matching corresponding pronunciation correction strategies; and feeding back pronunciation correction contents to the user based on the pronunciation correction strategy, wherein the pronunciation correction contents are used for indicating the pronunciation difference type and the corresponding pronunciation correction strategy.
In an embodiment of the present invention, after feeding back pronunciation correction content to the user based on the pronunciation correction strategy, the method further includes: collecting second test data input by a user according to pronunciation correction content, wherein the second test data comprises facial image data and voice data; and obtaining an exercise result based on the second test data and a preset exercise strategy.
In an embodiment of the present invention, before acquiring the first test data or the second test data, the method further includes: it is determined whether the image capture device identifies a user's facial region. Collecting first test data or second test data, comprising: if the image acquisition equipment identifies the face area of the user, outputting a test starting instruction to the user; and/or triggering the image acquisition equipment to start image data acquisition on the face area of the user to obtain face image data, wherein the face image data comprises image data corresponding to the facial action generated by the user based on the test content or pronunciation correction content.
In one embodiment of the present invention, after determining whether the image capturing device identifies the face area of the user, the method further includes: if the image acquisition equipment does not identify the face area of the user, outputting an identification result; and/or indicating to the user through the adjusting instruction to adjust the relative position between the face area of the user and the image acquisition equipment.
In one embodiment of the invention, the facial image data includes mouth shape data and/or facial key point data.
In one embodiment of the invention, the user articulation feature comprises an articulation image feature. Extracting user pronunciation characteristics from the first test data and matching corresponding pronunciation correction strategies, comprising: extracting pronunciation image features from at least one image frame corresponding to the facial image data, wherein the pronunciation image features comprise user pronunciation mouth shape features corresponding to preset phonetic symbols; comparing the user pronunciation mouth shape characteristics corresponding to the preset phonetic symbols with the standard mouth shape characteristics corresponding to the preset phonetic symbols and matching the corresponding mouth shape difference types; and setting the mouth shape difference correction strategy corresponding to the mouth shape difference type as a pronunciation correction strategy.
In one embodiment of the present invention, the pronunciation correction content includes a pronunciation correction image. Based on a mouth shape difference correction strategy and user facial features extracted from facial image data, simulating a mouth shape image when the user correctly pronounces a preset phonetic symbol as a mouth shape correction guide image; and pushing the simulated mouth shape correction guide image as pronunciation correction content to the user.
In an embodiment of the present invention, before collecting pronunciation correction data input by a user according to pronunciation correction contents, the method further includes: it is determined whether the image capture device identifies a user's facial region. Collecting pronunciation correction data input by a user according to pronunciation correction contents and determining a pronunciation correction result, comprising: if the image acquisition equipment identifies the face area of the user, outputting an image acquisition starting instruction to the user; and/or triggering image acquisition equipment to start image data acquisition on a user face area so as to take the obtained face image data as pronunciation correction data; and comparing the user pronunciation mouth shape features extracted from the pronunciation correction data with the standard mouth shape features and matching corresponding pronunciation correction results.
In a third aspect of embodiments of the present invention, there is provided a medium storing computer-executable instructions for causing a computer to perform the functions of any one of the module configurations of the pronunciation correction system as described in the first aspect, or to perform the method of any one of the embodiments of the second aspect.
In a fourth aspect of embodiments of the present invention, there is provided a computing device comprising a processing unit, a memory, and an input/output (I/O) interface; the memory stores programs or instructions for execution by the processing unit; the processing unit executes, according to the programs or instructions stored in the memory, the functions configured for any module of the pronunciation correction system of the first aspect, or the method of any embodiment of the second aspect; and the I/O interface receives or transmits data under control of the processing unit.
According to the technical solution provided by embodiments of the invention, a targeted pronunciation correction strategy is derived from the user's test data and matched with corresponding correction content, realizing targeted correction feedback: the user's pronunciation problems are corrected, the correction effect improves, and so does the user's learning experience. In addition, compared with current pronunciation evaluation schemes, the embodiments consume fewer computing resources and are better suited to mobile terminal devices.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
fig. 1 schematically shows a schematic structural diagram of an application scenario according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a structure of a pronunciation correction system according to an embodiment of the present invention;
FIG. 3A is a schematic diagram illustrating a structure of a user interface for presenting pronunciation tests according to an embodiment of the invention;
FIG. 3B is a schematic diagram illustrating another user interface for presenting pronunciation tests, according to an embodiment of the invention;
FIG. 4 is a flow chart of a pronunciation correction strategy acquisition method according to an embodiment of the invention;
FIG. 5 is a schematic diagram illustrating a structure of a user interface for presenting pronunciation correction content according to an embodiment of the present invention;
FIG. 6 is a flow chart illustrating a pronunciation correction method according to an embodiment of the present invention;
FIG. 7 schematically shows a schematic structural diagram of a medium according to an embodiment of the invention;
FIG. 8 schematically illustrates a structural diagram of a computing device in accordance with an embodiment of the present invention;
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the present invention, a pronunciation correction system, method, medium, and computing device are provided.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention. Moreover, any number of elements in the drawings is intended to be illustrative rather than restrictive, and any naming of elements in the drawings is for distinction only and carries no limiting sense.
Summary of The Invention
The inventor has found that the pronunciation evaluation result produced by the evaluation/correction functions in current language learning software or tools is limited in content, usually a bare evaluation of the user's pronunciation accuracy, and cannot correct the specific errors in the user's pronunciation, so the result lacks pertinence. That is, existing pronunciation evaluation cannot expose the weak links in most users' pronunciation, existing correction functions cannot train users on those weak links, and users' pronunciation, reading ability and fluency are therefore difficult to improve.
To overcome the problems of the prior art, the present invention proposes a pronunciation correction system, method, medium, and computing device. The system comprises: the testing module is configured to output testing content determined based on the user learning data, and collect first testing data input by a user according to the testing content, wherein the first testing data comprises face image data and voice data; the pronunciation correction module is configured to extract user pronunciation characteristics from the first test data and match a corresponding pronunciation correction strategy, and feed back pronunciation correction content to the user based on the pronunciation correction strategy, wherein the pronunciation correction content is used for indicating a pronunciation difference type and the corresponding pronunciation correction strategy.
With this pronunciation correction system, a targeted pronunciation correction strategy is derived from the user's test data and matched with corresponding correction content, realizing targeted correction feedback: the user's pronunciation problems are corrected, the correction effect improves, and so does the user's learning experience. In addition, compared with current pronunciation evaluation schemes, the embodiments of the invention consume fewer computing resources and are better suited to mobile terminal devices.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Application scene overview
Embodiments of the invention can be applied to pronunciation learning scenarios, in particular pronunciation learning or pronunciation correction in language learning, where the languages include, but are not limited to, foreign languages such as English, French, German and Japanese, and Chinese varieties such as Mandarin, Cantonese and Sichuanese. The language learning scenario of an embodiment may be, for example, a pronunciation evaluation scenario, a pronunciation correction scenario, or another language learning scenario in language learning software or a language learning terminal; embodiments of the invention are not limited in this respect.
Referring to fig. 1, fig. 1 is a schematic view of a pronunciation learning/correction application scenario of the present invention. In fig. 1, a user performs pronunciation learning through a terminal device A. The terminal device A may display the image content to be learned on its interface and may output audio content in voice form through an audio playback device such as a speaker. While the user practices pronunciation, the terminal device A may also collect the user's voice/audio data through a microphone (audio collection device) and video/image data through a camera (image collection device), to help determine whether the user has learned the correct pronunciation. It is understood that the voice/audio data and the video/image data may be downloaded by terminal A from a server, and the data collected by terminal A may be analyzed and processed by the server.
The above application scenario is only an example. In practice the server may have multiple tiers: a receiving server receives the video sent by the terminal device, a processing server processes the received video data according to the pronunciation learning method of the invention to determine where the user's pronunciation is correct or wrong, and the result is fed back to terminal A so that the user can correct the errors.
The devices carrying the user interface to which the embodiments of the invention apply include terminals and/or network devices. Terminals include, but are not limited to, the following electronic devices: smart phones, tablet computers, MP4/MP3 players, PCs, PDAs, wearable devices, head-mounted display devices, and the like. Network devices include, but are not limited to, a single network server, a server group composed of multiple network servers, or a cloud based on Cloud Computing and composed of a large number of computers or network servers; cloud computing is a form of distributed computing, a virtual super-computer composed of a collection of loosely coupled machines. A computer device may operate alone to implement the invention, or may join a network and implement the invention by interoperating with other computer devices in that network. The network in which the computer device sits includes, but is not limited to, the internet, wide area networks, metropolitan area networks, local area networks, VPN networks, and the like. It should be noted that the user equipment, network devices and networks above are only examples; other existing or future computer devices or networks, where applicable, are also within the scope of the present invention and are included by reference.
Exemplary System
A system for pronunciation correction according to an exemplary embodiment of the present invention is described below with reference to fig. 1 in conjunction with an application scenario. It should be noted that the above application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
An embodiment of the present invention provides a pronunciation correction system, as shown in fig. 2, the pronunciation correction system 200 at least includes:
a test module 201 configured to output test contents determined based on the user learning data; acquiring first test data input by a user according to test contents, wherein the first test data comprises facial image data and voice data;
a pronunciation correction module 202 configured to extract user pronunciation characteristics from the first test data and match a corresponding pronunciation correction strategy; pronunciation correction content is fed back to the user based on the pronunciation correction strategy, and the pronunciation correction content is used for indicating the pronunciation difference type and the corresponding pronunciation correction strategy.
The testing module 201 is further configured to collect second testing data input by the user according to the pronunciation correction content after the pronunciation correction module 202 feeds back the pronunciation correction content to the user based on the pronunciation correction strategy, and obtain an exercise result based on the second testing data and a preset exercise strategy. Wherein the second test data includes face image data and voice data.
The face image data according to the embodiment of the present invention includes image data corresponding to a face motion generated by a user based on test content or pronunciation correction content, for example, face video data when the user pronounces. The facial image data includes, but is not limited to, one or a combination of mouth shape data, facial key point data.
Existing pronunciation testing and learning techniques are implemented on audio data alone, and audio data locates problems in a user's pronunciation only imprecisely. The test module 201 therefore also collects the facial image data the user produces in response to the test content or pronunciation correction content, so that pronunciation problems can subsequently be located from the facial image data as well, improving the accuracy with which they are located.
The test module 201 is further configured to determine, before capturing the first test data or the second test data, whether the image capture device has recognized the user's face region. Specifically, in an optional implementation, face detection is performed with Histogram of Oriented Gradients (HOG) image features to obtain a bounding box for the user's face region: sliding windows are scanned over the image, and the image data inside each window is classified as belonging to the user's face region or not. The HOG feature is a descriptor used for object detection in computer vision and image processing; it is built by computing and accumulating histograms of local gradient orientations, weighted by gradient magnitude, over local regions of the image.
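The following sketch illustrates one way the detection step above could look in practice. The use of dlib (whose frontal face detector is exactly a HOG descriptor evaluated over sliding windows) and OpenCV is an assumption for illustration; the patent does not name a library.

```python
import cv2
import dlib

# dlib's frontal face detector is a HOG-based detector scanned over
# sliding windows at multiple scales, matching the approach described above.
hog_detector = dlib.get_frontal_face_detector()

def detect_face_bounding_box(frame_bgr):
    """Return (left, top, right, bottom) of the largest detected face, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    detections = hog_detector(gray, 1)  # upsample once to catch smaller faces
    if not detections:
        return None
    face = max(detections, key=lambda r: r.width() * r.height())
    return face.left(), face.top(), face.right(), face.bottom()
```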
Optionally, before determining whether the image capturing device identifies the face area of the user, an image capturing area is set on the interface of the terminal device and displayed to the user, where the image capturing area corresponds to an image capturing range of the image capturing device. In determining whether the image capturing device recognizes the user face area, it is determined whether the user face area is within an image capturing range of the image capturing device, that is, whether the user face area displayed on the terminal device interface is within the image capturing area. Taking the example that the image capture area is set as a quadrilateral dashed frame in the terminal device interface shown in fig. 3A, the test module 201 may further determine whether the user's face area is within the quadrilateral dashed frame displayed on the terminal device interface before capturing the face image data in the first test data or the second test data.
In one case, the test module 201 determines that the image capturing device identifies the facial region of the user, and outputs a test start instruction to the user, where the test start instruction is used to instruct the user to start the acquisition process of the first test content, that is, to remind the user that a pronunciation test can be performed according to the test content. In this case, the test module 201 may also trigger the image capturing device to start capturing image data of the face region of the user to obtain the face image data. For example, triggering an image pickup device (such as a camera or a camera mounted on a language learning terminal) to start image data acquisition of the face region of the user.
It should be noted that the form of the test start instruction includes, but is not limited to, one or a combination of a prompt tone, a prompt animation, and an image identifier. For example, the test module 201 may play a prompt tone and play a prompt animation to the user in the presentation interface, so as to guide the user to record video data (i.e., capture image data).
In another case, the test module 201 determines that the image capturing device does not recognize the face region of the user, and outputs a recognition result, where the recognition result is used to indicate to the user that the image capturing device fails to recognize the face region of the user; and/or indicating to the user through the adjusting instruction to adjust the relative position between the face area of the user and the image acquisition equipment. The adjustment instruction related to the embodiment of the present invention includes, but is not limited to, target direction indication information, target angle indication information, target distance indication information, or other indication information for prompting an adjustment skill, for example, the target direction indication information may be "move left into box", "move up into box".
Taking the terminal device interface shown in fig. 3A as an example, before capturing the facial image data in the first or second test data, the test module 201 sets the image capture area as the quadrilateral dashed frame in the interface and determines whether the user's face region lies within it. If, within a first preset time period, the overlap between the user's face region and the dashed frame does not exceed a threshold, the image capture device has not yet recognized the face region within that period; in this case the test module 201 continues recognizing and outputs the first-stage recognition result "face recognition" at the bottom of the interface. If the overlap still does not exceed the threshold within a second preset time period, the test module 201 outputs an adjustment instruction in the interface, such as "move the face into the frame" or "keep the face in the frame". Alternatively, in this case the test module 201 outputs the next test content according to the recognition result, jumping to the next test stage. It should be noted that parameters such as the starting time point and duration of the first and second preset time periods may be preset or configured; for example, the first preset time period may be configured with a duration of 30 seconds, and the second preset time period configured to start when the first ends, with a duration of 1 minute 30 seconds.
If the overlap between the user's face region and the dashed frame displayed on the terminal device interface exceeds the threshold, that is, if the test module 201 determines that the face region is within the frame, the image capture device has recognized the face region. In this case the test module 201 plays a "drop" sound (a prompt tone) and triggers the image capture device to begin automatically capturing the facial image data for the user's face region.
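A minimal sketch of the overlap test described above, assuming axis-aligned (left, top, right, bottom) boxes; the 0.7 threshold and the box format are illustrative assumptions, not values from the patent.

```python
def overlap_ratio(face_box, capture_box):
    """Fraction of the face box that lies inside the on-screen capture area."""
    fx1, fy1, fx2, fy2 = face_box
    cx1, cy1, cx2, cy2 = capture_box
    iw = max(0, min(fx2, cx2) - max(fx1, cx1))  # intersection width
    ih = max(0, min(fy2, cy2) - max(fy1, cy1))  # intersection height
    face_area = (fx2 - fx1) * (fy2 - fy1)
    return (iw * ih) / face_area if face_area > 0 else 0.0

def face_recognized(face_box, capture_box, threshold=0.7):
    """True once the face region sufficiently overlaps the dashed capture frame."""
    return face_box is not None and overlap_ratio(face_box, capture_box) > threshold
```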
In an embodiment of the present invention, the user pronunciation features include pronunciation image features including, but not limited to, user pronunciation mouth shape features corresponding to the preset phonetic symbols. There are various implementations of the pronunciation correction module 202 extracting the user pronunciation characteristics from the first test data and matching the corresponding pronunciation correction strategy, wherein one implementation includes the following steps:
S401, extracting pronunciation image features from at least one image frame corresponding to the face image data.
Specifically, a user utterance start frame and a user utterance end frame are detected in the facial image data, and the facial image data is clipped to at least one image frame based on the detected start and end frames. The start and end frames may be detected, for example, by z-score threshold matching, where the z-score of the current facial image data is its value minus the mean, divided by the standard deviation; the smaller the z-score magnitude, the smaller the fluctuation in the facial image data. A plurality of facial key points is then obtained for each image frame, and the distances between those key points and a preset center point are used as the pronunciation image features; the concrete form of the features includes, but is not limited to, a pronunciation image feature sequence. Furthermore, taking the first-order derivative of the pronunciation image feature sequence yields a derivative feature sequence, which helps improve the accuracy of the pronunciation image features and the recognition of poorly captured facial image data.
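A sketch of the two feature steps above: key-point distances as per-frame features, and z-score thresholding to find the utterance start and end frames. The mouth-motion signal, the 1.0 threshold and the epsilon guard are illustrative assumptions.

```python
import numpy as np

def keypoint_distance_features(landmarks, center):
    """Per-frame feature: distance from each facial key point to a preset center.

    landmarks: (num_points, 2) array; center: (2,) array.
    """
    return np.linalg.norm(landmarks - center, axis=1)

def utterance_bounds(motion_signal, z_threshold=1.0):
    """Detect utterance start/end frames by z-score threshold matching.

    z-score = (value - mean) / standard deviation; a small |z| means the
    face is near its resting state, a large |z| means articulatory motion.
    """
    s = np.asarray(motion_signal, dtype=float)
    z = (s - s.mean()) / (s.std() + 1e-8)  # epsilon avoids division by zero
    active = np.flatnonzero(np.abs(z) > z_threshold)
    if active.size == 0:
        return None  # no utterance detected
    return int(active[0]), int(active[-1])

def first_order_delta(feature_sequence):
    """First-order difference of the feature sequence, as described above."""
    return np.diff(np.asarray(feature_sequence, dtype=float), axis=0)
```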
S402, comparing the user pronunciation mouth shape characteristics corresponding to the preset phonetic symbols with the standard mouth shape characteristics corresponding to the preset phonetic symbols and matching the corresponding mouth shape difference types. Before S402, a user pronunciation mouth shape feature corresponding to a preset phonetic symbol is extracted from the pronunciation image feature.
Specifically, the pronunciation image feature sequence corresponding to the preset phonetic symbol is selected from the at least one sequence corresponding to the at least one image frame, and features are extracted from it to obtain the user's pronunciation mouth shape features for that symbol (including, but not limited to, mouth shape features). The user's pronunciation mouth shape features and the standard mouth shape features for the preset phonetic symbol are then classified by a mouth shape classifier. If they do not fall into the same category, the user's pronunciation mouth shape deviates from the standard, and the mouth shape difference type is obtained from the category to which the standard mouth shape features belong. If they fall into the same category, there is no difference between the user's pronunciation mouth shape and the standard.
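The classifier comparison could be sketched as below. The patent does not specify the classifier type; k-nearest-neighbours is an assumption, and train_features/train_labels stand in for a labelled set of mouth shape feature vectors.

```python
from sklearn.neighbors import KNeighborsClassifier

def build_mouth_shape_classifier(train_features, train_labels):
    """Fit a classifier over labelled mouth shape feature vectors (assumed k-NN)."""
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(train_features, train_labels)
    return clf

def match_difference_type(clf, user_feature, standard_feature):
    """Return a mouth shape difference type, or None when the shapes agree.

    Per the description above, the difference type is derived from the
    category to which the standard mouth shape feature belongs.
    """
    user_cls = clf.predict([user_feature])[0]
    standard_cls = clf.predict([standard_feature])[0]
    return None if user_cls == standard_cls else standard_cls
```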
It should be noted that the selection criteria for the pronunciation image feature sequence corresponding to a preset phonetic symbol may be set based on how the symbol is articulated, for example: for MAX_HEIGHT phonetic symbols, the sequence is the one corresponding to the image frame at which the mouth opening is at its maximum; for STANDSTILL phonetic symbols, i.e. phonetic symbols pronounced without opening the mouth (the example symbol appears as an inline image in the original), the sequence is the one corresponding to the image frames at a pause moment within a preset duration range; for MIN_MAX phonetic symbols, the sequence is the one corresponding to the image frames at the minimum and maximum mouth openings; for MIN_MIN phonetic symbols, such as certain vowels, the sequence is the one corresponding to the image frames at multiple mouth opening maxima. Besides these four criteria, the sequence corresponding to a preset phonetic symbol may be selected in other ways; the embodiment of the present invention is not limited in this respect. A sketch of the four criteria follows.
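The sketch below drives the four selection criteria with a per-frame mouth-opening signal; the criterion names follow the patent, while the signal definition, the stillness tolerance and the peak-picking rule are illustrative assumptions.

```python
import numpy as np

def select_frames(mouth_opening, criterion):
    """Pick the image frame indices that represent the preset phonetic symbol."""
    s = np.asarray(mouth_opening, dtype=float)
    if criterion == "MAX_HEIGHT":   # frame where the mouth opens widest
        return [int(s.argmax())]
    if criterion == "STANDSTILL":   # pause frames where the mouth barely moves
        delta = np.abs(np.diff(s))
        return [int(i) for i in np.flatnonzero(delta < 0.01)]
    if criterion == "MIN_MAX":      # frames at the minimum and maximum opening
        return [int(s.argmin()), int(s.argmax())]
    if criterion == "MIN_MIN":      # frames at multiple local opening maxima
        return [i for i in range(1, len(s) - 1) if s[i - 1] < s[i] >= s[i + 1]]
    raise ValueError(f"unknown criterion: {criterion}")
```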
S403, setting the mouth shape difference correction strategy corresponding to the mouth shape difference type as a pronunciation correction strategy.
More particularly, the mouth shape difference correction strategies corresponding to the mouth shape difference types in embodiments of the present invention include several types, such as: the user needs to open the mouth wider (ENLARGE), open the mouth a moderate amount (MIDDLE), open the mouth less (SMALL), round the lips (ROUND), or spread the lips apart a little more, stretching the mouth into a flat "one" shape (FLAT).
In an embodiment of the invention, the pronunciation correction content includes a pronunciation correction image and/or a pronunciation correction prompt. The pronunciation correction module 202 can push pronunciation correction content to the user based on the pronunciation correction strategy in several ways. One implementation generates a pronunciation correction prompt from the mouth shape difference correction strategy and pushes it to the user. Taking the terminal device interface shown in fig. 3B as an example, the prompt "mouth opens a little more" is generated based on the mouth shape difference correction strategy ENLARGE and displayed in the middle of the interface.
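One way to realize this strategy-to-prompt step is a plain lookup table, as sketched below; the English prompt strings paraphrase the patent's examples, and the table itself is an assumed data structure.

```python
# Mouth shape difference correction strategies mapped to prompt messages.
CORRECTION_PROMPTS = {
    "ENLARGE": "Open your mouth a little wider.",
    "MIDDLE": "Open your mouth a moderate amount.",
    "SMALL": "Open your mouth a little less.",
    "ROUND": "Round your lips.",
    "FLAT": "Spread your lips into a flat shape.",
}

def correction_prompt(difference_type):
    """Return the prompt to display for a matched difference type."""
    if difference_type is None:
        return "Your mouth shape matches the standard."
    return CORRECTION_PROMPTS.get(difference_type, "Keep practising this sound.")
```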
Another implementation simulates, based on the mouth shape difference correction strategy and the user facial features extracted from the facial image data, the mouth shape image the user would present when pronouncing the preset phonetic symbol correctly, uses it as a mouth shape correction guide image, and pushes the simulated guide image to the user as pronunciation correction content.
Still another implementation selects a mouth shape correction teaching image and/or teaching text from a pronunciation correction content library, based on the mouth shape difference correction strategy, and pushes it to the user as pronunciation correction content; for example, the instructional image and instructional text for the phonetic symbol /i:/ shown in the terminal device interface of fig. 5. Specifically, in one embodiment, a standard mouth shape video matching the preset phonetic symbol is selected, based on the mouth shape difference correction strategy, from pre-recorded standard mouth shape videos and pushed to the user.
In a further possible embodiment, after extracting the user's pronunciation features from the first test data, the pronunciation correction module 202 may indicate to the user that the pronunciation mouth shape is correct if it determines that the mouth shape features extracted from the facial image data are consistent with the standard features, or that their similarity exceeds a threshold; in this case the user's pronunciation, or at least the pronunciation mouth shape, is correct. Further, in this case no pronunciation correction content is fed back to the user.
After comparing the user's pronunciation mouth shape features extracted from the facial image data with the standard features and matching the corresponding mouth shape difference type, the pronunciation correction module 202 may set the pronunciation correction strategy to a retest strategy. In that case, when pushing pronunciation correction content based on the strategy, the module pushes the user either a retest instruction, directing the user to input test data for the test content again, or an end instruction, directing the user to end the pronunciation correction. This step is similar to the test module 201 outputting test content; the common parts are cross-referenced and not repeated here.
Before collecting the pronunciation correction data the user inputs in response to the pronunciation correction content, the pronunciation correction module 202 determines whether the image capture device has recognized the user's face region. The way the pronunciation correction module 202 makes this determination is similar to the way the test module 201 does; the common parts are cross-referenced and not repeated here.
There are several ways the pronunciation correction module 202 can collect the pronunciation correction data input by the user in response to the correction content and determine a pronunciation correction result; one of them is described here. If the image capture device is determined to have recognized the user's face region, an image capture start instruction is output to the user; and/or the image capture device is triggered to start capturing image data of the user's face region, with the resulting facial image data used as the pronunciation correction data. The user's pronunciation mouth shape features extracted from the correction data are then compared with the standard mouth shape features and matched to a corresponding correction result. The pronunciation correction data is processed like the image frames in S401-S402 above; the common parts are cross-referenced and not repeated here.
Specifically, the user's pronunciation mouth shape features are extracted from the correction data and classified, together with the standard mouth shape features, by the mouth shape classifier. If they do not fall into the same category, the user's pronunciation mouth shape deviates from the standard, and the correction result is derived from the category to which the standard features belong. If they fall into the same category, the user's pronunciation mouth shape does not differ from the standard, and the correction result indicates that the user's pronunciation mouth shape is correct.
Exemplary method
Having described the system of the exemplary embodiments of the present invention, an exemplary method is described next. The pronunciation correction method provided by the invention can perform the operations configured for any module of the system in the embodiment corresponding to fig. 2. Referring to fig. 6, the pronunciation correction method comprises at least:
S601, outputting test contents determined based on user learning data;
S602, collecting first test data input by a user according to the test contents and determining a pronunciation correction strategy, wherein the first test data comprises facial image data and voice data;
S603, extracting user pronunciation characteristics from the first test data and matching a corresponding pronunciation correction strategy;
S604, feeding back pronunciation correction contents to the user based on the pronunciation correction strategy, wherein the pronunciation correction contents are used for indicating the pronunciation difference type and the corresponding pronunciation correction strategy.
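Taken together, S601-S604 could be orchestrated as in the sketch below. Every helper is a hypothetical stub standing in for the pipeline detailed in the system embodiment; the names, return values and the example strategy are assumptions.

```python
def output_test_content(user_learning_data):
    # S601 (stub): choose test content from the user's learning data.
    return {"phonetic_symbol": "i:", "word": user_learning_data.get("weak_word", "see")}

def collect_first_test_data(test_content):
    # S602 (stub): capture facial image data and voice data from the user.
    return {"facial_frames": [], "audio": b""}

def match_correction_strategy(first_test_data):
    # S603 (stub): extract pronunciation features and match a correction strategy.
    return "ENLARGE"

def pronunciation_correction_flow(user_learning_data):
    test_content = output_test_content(user_learning_data)      # S601
    first_test_data = collect_first_test_data(test_content)     # S602
    strategy = match_correction_strategy(first_test_data)       # S603
    prompts = {"ENLARGE": "Open your mouth a little wider."}    # S604
    return strategy, prompts.get(strategy, "")
```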
In an embodiment of the present invention, the face image data includes mouth shape data and/or face key point data.
Optionally, after the step S604 of feeding back the pronunciation correction content to the user based on the pronunciation correction policy, the method further includes: collecting second test data input by a user according to pronunciation correction content, wherein the second test data comprises facial image data and voice data; and obtaining an exercise result based on the second test data and a preset exercise strategy.
Optionally, before the acquiring the first test data or the second test data in S602, the method further includes: it is determined whether the image capture device identifies a user's facial region. Collecting first test data or second test data, comprising: if the image acquisition equipment identifies the face area of the user, outputting a test starting instruction to the user; and/or triggering the image acquisition equipment to start image data acquisition on the face area of the user to obtain face image data, wherein the face image data comprises image data corresponding to the facial action generated by the user based on the test content or pronunciation correction content.
Further, after determining whether the image capturing device identifies the face area of the user, the method further includes: if the image acquisition equipment does not identify the face area of the user, outputting an identification result; and/or indicating to the user through the adjusting instruction to adjust the relative position between the face area of the user and the image acquisition equipment.
Optionally, the user articulation feature comprises an articulation image feature. Extracting user pronunciation characteristics from the first test data and matching corresponding pronunciation correction strategies, comprising: extracting pronunciation image features from at least one image frame corresponding to the facial image data, wherein the pronunciation image features comprise user pronunciation mouth shape features corresponding to preset phonetic symbols; comparing the user pronunciation mouth shape characteristics corresponding to the preset phonetic symbols with the standard mouth shape characteristics corresponding to the preset phonetic symbols and matching the corresponding mouth shape difference types; and setting the mouth shape difference correction strategy corresponding to the mouth shape difference type as a pronunciation correction strategy.
Optionally, the pronunciation correction content includes a pronunciation correction image. Based on a mouth shape difference correction strategy and user facial features extracted from facial image data, simulating a mouth shape image when the user correctly pronounces a preset phonetic symbol as a mouth shape correction guide image; and pushing the simulated mouth shape correction guide image as pronunciation correction content to the user.
Optionally, before collecting pronunciation correction data input by the user according to the pronunciation correction content, the method further includes: it is determined whether the image capture device identifies a user's facial region. Collecting pronunciation correction data input by a user according to pronunciation correction contents and determining a pronunciation correction result, comprising: if the image acquisition equipment identifies the face area of the user, outputting an image acquisition starting instruction to the user; and/or triggering image acquisition equipment to start image data acquisition on a user face area so as to take the obtained face image data as pronunciation correction data; and comparing the user pronunciation mouth shape features extracted from the pronunciation correction data with the standard mouth shape features and matching corresponding pronunciation correction results.
Exemplary Medium
Having described the method and system of the exemplary embodiments of this invention, reference is next made to FIG. 7. The present invention provides an exemplary medium storing computer-executable instructions; the instructions cause a computer to perform the method of the exemplary embodiment corresponding to fig. 6, or the functions of any module configuration of the exemplary embodiment corresponding to fig. 2.
Exemplary Computing Device
Having described the methods, media, and apparatus of the exemplary embodiments of the invention, reference is next made to FIG. 8, which illustrates an exemplary computing device 80. The computing device 80 includes a processing unit 801, a memory 802, a bus 803, an external device 804, an I/O interface 805, and a network adapter 806, where the memory 802 includes a random access memory (RAM) 8021, a cache memory 8022, a read-only memory (ROM) 8023, and a memory cell array 8025 composed of at least one memory cell 8024. The memory 802 stores the programs or instructions executed by the processing unit 801; the processing unit 801 executes, according to the programs or instructions stored in the memory 802, the method of any one of the exemplary embodiments corresponding to FIG. 6 or the functions of any one of the module configurations of the exemplary embodiments corresponding to FIG. 2; and the I/O interface 805 receives or transmits data under the control of the processing unit 801.
It should be noted that although several units/modules or sub-units/modules of the apparatus are mentioned in the detailed description above, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units/modules described above may be embodied in a single unit/module. Conversely, the features and functions of one unit/module described above may be further divided and embodied by a plurality of units/modules.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one, and/or one step may be broken down into multiple steps.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments; nor does the division into aspects imply that features in those aspects cannot be combined to advantage, this division being for convenience of presentation only. The invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A pronunciation correction system, comprising:
a test module configured to: output test content determined based on user learning data; and acquire first test data input by a user according to the test content, wherein the first test data comprises facial image data and voice data;
a pronunciation correction module configured to: extract user pronunciation features from the first test data and match a corresponding pronunciation correction strategy; and feed back pronunciation correction content to the user based on the pronunciation correction strategy, wherein the pronunciation correction content is used for indicating the pronunciation difference type and the corresponding pronunciation correction strategy.
2. The system of claim 1, wherein the test module is further configured to:
after the pronunciation correction module feeds back the pronunciation correction content to the user based on the pronunciation correction strategy, acquire second test data input by the user according to the pronunciation correction content, wherein the second test data comprises facial image data and voice data;
and obtain an exercise result based on the second test data and a preset exercise strategy.
3. The system of claim 1 or 2, wherein the test module is further configured to:
before the first test data or the second test data is collected, determine whether the image acquisition device has identified the user's face area;
when acquiring the first test data or the second test data, the test module is specifically configured to:
if the image acquisition device is determined to have identified the user's face area, output a test start instruction to the user; and/or,
trigger the image acquisition device to start image data acquisition on the user's face area to obtain the facial image data, wherein the facial image data comprises image data corresponding to the facial actions produced by the user based on the test content or the pronunciation correction content.
4. The system of claim 3, wherein the test module is further configured to:
after determining whether the image acquisition device has identified the user's face area, if the image acquisition device does not identify the user's face area, output the recognition result; and/or,
instruct the user, through an adjustment instruction, to adjust the relative position between the user's face area and the image acquisition device.
5. The system of any one of claims 2 to 4, wherein the facial image data comprises mouth shape data and/or facial key point data.
6. The system of claim 1, wherein the user pronunciation features include pronunciation image features;
when extracting the user pronunciation features from the first test data and matching the corresponding pronunciation correction strategy, the pronunciation correction module is specifically configured to:
extract the pronunciation image features from at least one image frame of the facial image data, wherein the pronunciation image features include the user pronunciation mouth shape features corresponding to a preset phonetic symbol;
compare the user pronunciation mouth shape features corresponding to the preset phonetic symbol with the standard mouth shape features corresponding to the preset phonetic symbol and match the corresponding mouth shape difference type;
and set the mouth shape difference correction strategy corresponding to the mouth shape difference type as the pronunciation correction strategy.
7. The system of claim 6, wherein the pronunciation correction content includes a pronunciation correction image;
when pushing the pronunciation correction content to the user based on the pronunciation correction strategy, the pronunciation correction module is specifically configured to:
simulate, based on the mouth shape difference correction strategy and the user facial features extracted from the facial image data, the mouth shape image of the user correctly pronouncing the preset phonetic symbol as a mouth shape correction guide image;
and push the simulated mouth shape correction guide image to the user as the pronunciation correction content.
8. A pronunciation correction method applied to the pronunciation correction system of any one of claims 1 to 7, comprising:
outputting test content determined based on user learning data;
collecting first test data input by a user according to the test content, wherein the first test data comprises facial image data and voice data;
extracting user pronunciation features from the first test data and matching a corresponding pronunciation correction strategy;
and feeding back pronunciation correction content to the user based on the pronunciation correction strategy, wherein the pronunciation correction content is used for indicating the pronunciation difference type and the corresponding pronunciation correction strategy.
9. A medium storing program code which, when executed by a processor, implements the functions of any one of the module configurations in the pronunciation correction system of any one of claims 1 to 7, or implements the method of claim 8.
10. A computing device comprising a processor and a storage medium storing program code which, when executed by the processor, implements the functions of any one of the module configurations in the pronunciation correction system of any one of claims 1 to 7, or implements the method of claim 8.
CN201910408726.1A 2019-05-16 2019-05-16 Pronunciation correction system, method, medium and computing device Pending CN111951629A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910408726.1A CN111951629A (en) 2019-05-16 2019-05-16 Pronunciation correction system, method, medium and computing device

Publications (1)

Publication Number Publication Date
CN111951629A (en) 2020-11-17

Family

ID=73336653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910408726.1A Pending CN111951629A (en) 2019-05-16 2019-05-16 Pronunciation correction system, method, medium and computing device

Country Status (1)

Country Link
CN (1) CN111951629A (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101292281A (en) * 2005-09-29 2008-10-22 独立行政法人产业技术综合研究所 Pronunciation diagnosis device, pronunciation diagnosis method, recording medium, and pronunciation diagnosis program
CN102023703A (en) * 2009-09-22 2011-04-20 现代自动车株式会社 Combined lip reading and voice recognition multimodal interface system
CN103220465A (en) * 2013-03-21 2013-07-24 广东欧珀移动通信有限公司 Method, device and mobile terminal for accurate positioning of facial image when mobile phone is used for photographing
KR20140133056A (en) * 2013-05-09 2014-11-19 중앙대학교기술지주 주식회사 Apparatus and method for providing auto lip-synch in animation
US20150056580A1 (en) * 2013-08-26 2015-02-26 Seli Innovations Inc. Pronunciation correction apparatus and method thereof
CN106157956A (en) * 2015-03-24 2016-11-23 中兴通讯股份有限公司 The method and device of speech recognition
CN105261246A (en) * 2015-12-02 2016-01-20 武汉慧人信息科技有限公司 Spoken English error correcting system based on big data mining technology
CN106506959A (en) * 2016-11-15 2017-03-15 上海传英信息技术有限公司 Photographic means and camera installation
CN106775238A (en) * 2016-12-14 2017-05-31 深圳市金立通信设备有限公司 A kind of photographic method and terminal
CN108806367A (en) * 2017-07-21 2018-11-13 河海大学 A kind of Oral English Practice voice correcting system
CN107424450A (en) * 2017-08-07 2017-12-01 英华达(南京)科技有限公司 Pronunciation correction system and method
CN108537702A (en) * 2018-04-09 2018-09-14 深圳市鹰硕技术有限公司 Foreign language teaching evaluation information generation method and device
CN108960166A (en) * 2018-07-11 2018-12-07 谢涛远 A kind of vision testing system, method, terminal and medium
CN109726663A (en) * 2018-12-24 2019-05-07 广东德诚科教有限公司 Online testing monitoring method, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
栾悉道 (Luan Xidao): 《多媒体情报处理技术》 [Multimedia Intelligence Processing Technology], 31 May 2016, 国防工业出版社 (National Defense Industry Press) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786151A (en) * 2020-12-28 2021-05-11 深圳市艾利特医疗科技有限公司 Language function training system and method
CN112949554A (en) * 2021-03-22 2021-06-11 湖南中凯智创科技有限公司 Intelligent children accompanying education robot
CN112949554B (en) * 2021-03-22 2022-02-08 湖南中凯智创科技有限公司 Intelligent children accompanying education robot
CN113297924A (en) * 2021-04-30 2021-08-24 北京有竹居网络技术有限公司 Method, device, storage medium and electronic equipment for correcting pronunciation
CN113257231A (en) * 2021-07-07 2021-08-13 广州思正电子股份有限公司 Language sound correcting system method and device
CN113257231B (en) * 2021-07-07 2021-11-26 广州思正电子股份有限公司 Language sound correcting system method and device

Similar Documents

Publication Publication Date Title
CN110782921B (en) Voice evaluation method and device, storage medium and electronic device
CN110085261B (en) Pronunciation correction method, device, equipment and computer readable storage medium
CN111951629A (en) Pronunciation correction system, method, medium and computing device
US10706738B1 (en) Systems and methods for providing a multi-modal evaluation of a presentation
CN111951825A (en) Pronunciation evaluation method, medium, device and computing equipment
WO2024000867A1 (en) Emotion recognition method and apparatus, device, and storage medium
CN112153397B (en) Video processing method, device, server and storage medium
CN111081080B (en) Voice detection method and learning device
CN113703579B (en) Data processing method, device, electronic equipment and storage medium
US20240064383A1 (en) Method and Apparatus for Generating Video Corpus, and Related Device
CN112614489A (en) User pronunciation accuracy evaluation method and device and electronic equipment
CN112232276B (en) Emotion detection method and device based on voice recognition and image recognition
CN113392273A (en) Video playing method and device, computer equipment and storage medium
CN111079501B (en) Character recognition method and electronic equipment
CN109065024B (en) Abnormal voice data detection method and device
CN111950327A (en) Mouth shape correcting method, mouth shape correcting device, mouth shape correcting medium and computing equipment
CN110874554A (en) Action recognition method, terminal device, server, system and storage medium
CN110046354B (en) Recitation guiding method, apparatus, device and storage medium
CN116645683A (en) Signature handwriting identification method, system and storage medium based on prompt learning
CN111079504A (en) Character recognition method and electronic equipment
CN111078992B (en) Dictation content generation method and electronic equipment
CN113051985B (en) Information prompting method, device, electronic equipment and storage medium
CN111031232B (en) Dictation real-time detection method and electronic equipment
CN111755026B (en) Voice recognition method and system
KR102385779B1 (en) Electronic apparatus and methoth for caption synchronization of contents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201117