CN114758647A - Language training method and system based on deep learning - Google Patents

Language training method and system based on deep learning

Info

Publication number
CN114758647A
CN114758647A (application CN202210380469.7A)
Authority
CN
China
Prior art keywords
pronunciation, preset, language, target learner, analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210380469.7A
Other languages
Chinese (zh)
Inventor
史献忠
雷虹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Lemon Technology Service Co ltd
Original Assignee
Wuxi Lemon Technology Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Lemon Technology Service Co ltd filed Critical Wuxi Lemon Technology Service Co ltd
Publication of CN114758647A
Legal status: Pending

Classifications

    • G — PHYSICS; G10 — MUSICAL INSTRUMENTS; ACOUSTICS; G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/01 — Assessment or evaluation of speech recognition systems
    • G10L 15/063 — Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/075 — Adaptation to the speaker, supervised, i.e. under machine guidance
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/25 — Speech recognition using non-acoustical features: position of the lips, movement of the lips or face analysis
    • G10L 2015/025 — Phonemes, fenemes or fenones being the recognition units
    • G10L 2015/225 — Feedback of the input speech
    • G10L 2015/226 — Procedures used during a speech recognition process using non-speech characteristics
    • G10L 2015/227 — Procedures used during a speech recognition process using non-speech characteristics of the speaker; human-factor methodology

Abstract

The invention relates to a language training method and system based on deep learning, belonging to the technical field of language learning for hearing-impaired people. Expression information, including speech, is acquired while a target learner reads preset language text information; based on an intelligent speech analysis technology, the speech is analyzed and scored at preset pronunciation levels to obtain an analysis result; the analysis result is displayed in a preset visualization manner so that the target learner can view it in real time and correct or improve pronunciation. The invention uses intelligent speech analysis to analyze learners' speech features at the phoneme, character, word and sentence levels and feeds back pronunciation errors in real time in a graphical form, helping trainees correct and improve pronunciation and thereby improving the effect of speech rehabilitation training.

Description

Language training method and system based on deep learning
Technical Field
The invention belongs to the technical field of language learning for hearing-impaired people, and particularly relates to a language training method and system based on deep learning.
Background
Because of their physiological characteristics, hearing-impaired people have an impaired auditory feedback path: they cannot obtain timely feedback on their own speech signals the way hearing people can, so their pronunciation is distorted to varying degrees.
In the related art, a speech rehabilitation training system is usually used to assist hearing-impaired people with speech rehabilitation training and thus address pronunciation distortion. Existing systems compare every learner's utterances against a single standard, generic model and give each learner generic evaluation and feedback relative to that model. However, each hearing-impaired learner has individual physiological characteristics and hearing deficits, so the generic evaluation and feedback deviate from individual reality to some extent, and the effect of speech rehabilitation training is limited.
Therefore, how to improve the effect of learners' speech rehabilitation training has become a technical problem that urgently needs to be solved in the prior art.
Disclosure of Invention
The invention provides a language training method and system based on deep learning, and aims to solve the technical problem in the prior art that learners' speech rehabilitation training is not effective.
The technical scheme provided by the invention is as follows:
in one aspect, a method for language training based on deep learning includes:
acquiring expression information of a target learner while the target learner reads preset language text information, wherein the expression information comprises speech;
performing feature analysis and statistics on the speech at preset pronunciation levels based on an intelligent speech analysis technology to obtain an analysis result; the preset pronunciation levels comprise at least one of phoneme, single-character pronunciation, phrase pronunciation and sentence pronunciation;
displaying the analysis result in a preset visualization mode so that the target learner can view the analysis result in real time to correct or improve pronunciation.
Optionally, the displaying the analysis result in a preset visualization manner includes: displaying pronunciation results of the preset text with different accuracy levels using different markers; the pronunciation results comprise: accurate, complete and to be improved; the different markers comprise different colors.
Optionally, the method further includes:
receiving a viewing instruction, and displaying a corresponding improvement strategy according to the viewing instruction;
if the viewing instruction is to view a pronunciation result marked as to be improved, the improvement strategy is: displaying the reference lip shape corresponding to the standard pronunciation of the character.
Optionally, the expression information further includes: mouth-shape spatial data; and the method further comprises:
comparing and evaluating teaching mouth-shape spatial data corresponding to the preset language text information with the mouth-shape spatial data of the target learner to generate an evaluation result;
the displaying the analysis result in a preset visualization manner further comprises: displaying the evaluation result in a preset graphical manner.
Optionally, the mouth-shape spatial data includes: a lip shape; and the acquiring of the expression information of the target learner while reading the preset language text information comprises:
extracting, based on a 3D face model, the lip shape of the target learner while pronouncing the preset language text information.
Optionally, the comparing and evaluating the teaching mouth-shape spatial data corresponding to the preset language text information with the mouth-shape spatial data of the target learner to generate an evaluation result comprises:
determining, frame by frame, reference lip-shape data in the teaching mouth-shape spatial data corresponding to the preset language text information;
determining lip-shape data of the target learner based on invariants under image transformation;
and matching the lip-shape data of the target learner against the reference lip-shape data frame by frame, with the reference lip-shape data as the benchmark, to generate an evaluation result.
Optionally, the acquiring of the expression information of the target learner while reading the preset language text information comprises:
acquiring, based on a camera device, a mouth image of the target learner while pronouncing the preset language text information, and displaying the mouth image on a display interface;
and acquiring, from the mouth image, mouth-shape spatial data of the target learner while pronouncing the preset language text information.
Optionally, the preset visualization manner includes: numerical scores and/or visual scores;
the evaluation rule for visual scores includes: displaying set shapes in different numbers or sizes to represent the evaluation result.
Optionally, the performing feature analysis and statistics on the speech at the preset pronunciation levels based on the intelligent speech analysis technology to obtain an analysis result includes:
comparing the speech with a preset hearing-impaired acoustic language model based on the intelligent speech analysis technology to obtain the analysis result.
In yet another aspect, a deep learning based language training system includes: the device comprises an acquisition module, an analysis module and a display module;
the acquisition module is used for acquiring the speech of the target learner while the target learner reads the preset language text information;
the analysis module is used for performing feature analysis and statistics on the speech at preset pronunciation levels based on an intelligent speech analysis technology to obtain an analysis result; the preset pronunciation levels comprise at least one of phoneme, single-character pronunciation, phrase pronunciation and sentence pronunciation;
the display module is used for displaying the analysis result in a preset visualization mode so that the target learner can view the analysis result in real time to correct or improve pronunciation.
The invention has the beneficial effects that:
according to the language training method and system based on deep learning, provided by the embodiment of the invention, the expression information of a target learner when reading the preset language character information is obtained, and the voice is subjected to characteristic analysis and statistics on a preset pronunciation level based on an intelligent voice analysis technology, so that an analysis result is obtained; displaying the analysis result in a preset visualization mode so that the target learner can view the analysis result in real time to correct or improve pronunciation. The invention utilizes the intelligent voice analysis technology to analyze and count the voice characteristics of learners on the levels of phonemes, characters, words and sentences, and feeds back various pronunciation errors in real time in an imaging mode to help practicers correct and improve pronunciation, thereby improving the effect of language rehabilitation training.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flowchart of a language training method based on deep learning according to an embodiment of the present invention;
FIG. 2 is a schematic view of a mouth-shaped template according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a lip match according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an embodiment of the present invention showing an analysis result in a preset visualization manner;
FIG. 5 is a schematic structural diagram of a deep learning-based language training system according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a language training device based on deep learning according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
According to a World Health Organization report, about 1.5 billion people worldwide live with some degree of hearing loss, of whom about 430 million need hearing rehabilitation services. China has about 27.8 million people with hearing disabilities, with more than 300,000 new cases of hearing impairment every year and about 34.8 million people affected by hearing impairment in total. Hearing loss affects many aspects of an individual's life, particularly verbal communication; the earlier a patient undergoes the necessary language learning and training, the more it helps restore language ability.
In the related art, a speech rehabilitation training system is usually used to assist hearing-impaired people with speech rehabilitation training and thereby address pronunciation distortion. Existing systems compare every learner's utterances against a single standard, generic model and give each learner generic evaluation and feedback relative to that model. However, each learner has individual characteristics, and the generic evaluation and feedback given by a generic model deviate from individual reality to some extent, so the effect of speech rehabilitation training is not obvious.
Based on the above, the embodiment of the invention provides a language training method based on deep learning.
Fig. 1 is a schematic flowchart of a language training method based on deep learning according to an embodiment of the present invention, and as shown in fig. 1, the method according to the embodiment of the present invention may include the following steps:
and S11, acquiring the expression information of the target learner when reading the preset language character information, wherein the expression information comprises voice.
In a specific implementation process, any hearing-impaired person needing deep learning-based language training can be defined as a target learner, and the target learner can perform pronunciation training by using the deep learning-based language training method provided by the application.
For example, after learning and training start, the target learner pronounces the preset language text information; if the preset text is a poem, the target learner reads the poem aloud. Once the target learner begins to speak, the expression information of the reading is acquired, and this expression information may be speech.
S12, performing feature analysis and statistics on the speech at preset pronunciation levels based on an intelligent speech analysis technology to obtain an analysis result.
The preset pronunciation levels include at least one of phoneme, single-character pronunciation, phrase pronunciation and sentence pronunciation.
For example, after the expression information such as the speech is acquired, the speech may be analyzed and scored at one or more of the phoneme, single-character, phrase and sentence levels based on the speech analysis technology, and an analysis result is obtained.
In some embodiments, optionally, performing feature analysis and statistics on the speech at the preset pronunciation levels based on the intelligent speech analysis technology to obtain an analysis result includes:
comparing the speech with a preset hearing-impaired acoustic language model based on the intelligent speech analysis technology to obtain the analysis result.
The analysis result may be a scoring result, and is not particularly limited herein.
For example, the intelligent speech analysis technology may be an intelligent speech analysis engine built on deep-learning models and algorithm frameworks such as RNNs and LSTMs.
The preset hearing-impaired acoustic language model may be trained after early-stage training data have been collected through a mobile-phone app in cooperation with schools for the deaf. The training process may follow existing model-training procedures and is not repeated here. During data collection, hearing-impaired speakers are not required to pronounce to a normal standard; speech that a human listener can recognize and understand is acceptable as sample data. After the learner's speech is obtained, it is compared with the preset hearing-impaired acoustic language model to obtain the analysis result, and the model can be further updated by learning from the newly collected speech. Alternatively, the collected speech data may be used to build a preset hearing-impaired acoustic language model that captures the speech characteristics of hearing-impaired users, and its feature parameters may be evaluated against an acoustic model and a language model trained on normal users' speech, so as to generate comparative analysis results of reference value at the phoneme, word and sentence levels and give the trainee timely and effective corrective guidance and prompts.
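As a rough illustration of this comparison step only (not the patent's actual engine), the following Python sketch assumes an LSTM acoustic model that outputs per-frame phoneme log-probabilities and a phoneme-to-frame alignment produced elsewhere (for example by forced alignment); it aggregates frame scores into phoneme-, word- and sentence-level scores. The class and variable names are hypothetical.

# Hypothetical sketch: scoring learner speech against an LSTM acoustic model.
# Assumes MFCC-like features and a non-empty phoneme-to-frame alignment exist.
import torch
import torch.nn as nn


class HearingImpairedAcousticModel(nn.Module):
    """Toy RNN/LSTM acoustic model: feature frames -> phoneme log-probabilities."""

    def __init__(self, n_features: int = 39, n_phonemes: int = 60, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, n_features) -> (batch, frames, n_phonemes) log-probs
        h, _ = self.lstm(feats)
        return torch.log_softmax(self.out(h), dim=-1)


def score_levels(log_probs, alignment):
    """Aggregate per-frame phoneme log-probabilities into phoneme/word/sentence scores.

    alignment: list of (phoneme_id, start_frame, end_frame, word_index) tuples,
    assumed to come from a separate forced-alignment step (not shown here).
    """
    phoneme_scores, word_scores = [], {}
    for phoneme_id, start, end, word_idx in alignment:
        # Mean log-probability of the expected phoneme over its aligned frames.
        s = log_probs[0, start:end, phoneme_id].mean().item()
        phoneme_scores.append(s)
        word_scores.setdefault(word_idx, []).append(s)
    word_scores = {w: sum(v) / len(v) for w, v in word_scores.items()}
    sentence_score = sum(phoneme_scores) / max(len(phoneme_scores), 1)
    return phoneme_scores, word_scores, sentence_score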
S13, displaying the analysis result in a preset visualization manner so that the target learner can view it in real time and correct or improve pronunciation.
In some embodiments, optionally, displaying the analysis result in a preset visualization manner includes: displaying pronunciation results of the preset text with different accuracy levels using different markers; the pronunciation results include: accurate, complete and to be improved; the different markers include different colors.
For example, the pronunciation accuracy of each character can be displayed in real time, with the judgment of pronunciation accuracy distinguished by color. Red may indicate that the pronunciation is to be improved (i.e. needs correction), light green that the pronunciation is complete (intelligible to a listener and not requiring correction), and green that it is accurate (clear); these colors are merely examples and are not limiting.
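A minimal sketch of this color-coding rule, assuming a per-character accuracy score in [0, 1]; the thresholds and color values are illustrative assumptions, not values specified by the patent.

# Hypothetical mapping from a per-character accuracy score to the three result
# categories (accurate / complete / to be improved) and their display colors.
RESULT_COLORS = {
    "accurate": "#00A000",        # green: clear and accurate (illustrative value)
    "complete": "#90EE90",        # light green: intelligible, no correction needed
    "to_be_improved": "#FF0000",  # red: needs correction
}


def classify_pronunciation(score: float) -> str:
    """Map a 0-1 accuracy score to a result category (cut-offs are illustrative)."""
    if score >= 0.9:
        return "accurate"
    if score >= 0.6:
        return "complete"
    return "to_be_improved"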
In some embodiments, optionally, the method may further include: receiving a viewing instruction, and displaying the corresponding improvement strategy according to the viewing instruction; if the viewing instruction is to view a pronunciation result marked as to be improved, the improvement strategy is: displaying the reference lip shape corresponding to the standard pronunciation of the character.
For example, the user may issue a viewing instruction based on the displayed color; the viewing instruction may be a click, and by clicking the user views the improvement strategy for the clicked single character, phrase or the like. For example, if the sound "o" is displayed in red, clicking on it displays the reference lip shape for "o".
In some embodiments, optionally, acquiring the expression information of the target learner while reading the preset language text information includes: acquiring, based on a camera device, a mouth image of the target learner while pronouncing the preset language text information, and displaying the mouth image on a display interface; and acquiring, from the mouth image, mouth-shape spatial data of the target learner while pronouncing the preset language text information.
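A minimal sketch of the camera-based acquisition step using OpenCV, assuming the default camera is the capture device; pressing 'q' ends the capture loop. This only illustrates the mouth-image acquisition described above.

# Sketch: acquiring and displaying the learner's mouth image with OpenCV.
import cv2

cap = cv2.VideoCapture(0)              # default camera device
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("mouth image", frame)   # show the frame on the display interface
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()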
In some embodiments, optionally, the mouth-shape spatial data includes: a lip shape; and acquiring the mouth-shape spatial data of the target learner while pronouncing the preset language text information includes:
extracting, based on a 3D face model, the lip shape of the target learner while pronouncing the preset language text information.
For example, the lip shape of the target learner while speaking can be obtained from the 3D face model.
The 3D face model may be a FaceMesh model. In this embodiment, an RGB camera may be used to track the shape of the lips, following the Google research team's work on MediaPipe FaceMesh in the lip-shape comparison and analysis module. FaceMesh is a lightweight package of only about 3 MB with very fast response, which makes it well suited to real-time inference on mobile devices and to augmented reality (AR) applications. FaceMesh estimates the positions of 468 feature points on each face, provides a 3D face model from this dense face-mesh topology, and can further produce a smooth face contour map using an associated subdivision-surface algorithm.
In this embodiment, the lip shape of the target learner while speaking can be obtained with FaceMesh: dense facial keypoints are estimated in real time by the FaceMesh model, which means inference can run locally on the device, and the lip region of the original image can be segmented using the keypoints around the mouth and lips.
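A sketch of lip-region extraction using the MediaPipe FaceMesh Python API; the helper below is one assumed way of collecting the lip landmarks, not the patent's implementation.

# Sketch: extracting lip landmarks from a video frame with MediaPipe FaceMesh.
import cv2
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh
# Landmark indices belonging to the lips, collected from the FACEMESH_LIPS edge set.
LIP_IDX = sorted({i for edge in mp_face_mesh.FACEMESH_LIPS for i in edge})
# Create the model once and reuse it for every frame of the real-time stream.
face_mesh = mp_face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1,
                                  refine_landmarks=True)


def extract_lip_points(frame_bgr):
    """Return the lip landmarks of one frame as (x, y) pixel coordinates, or None."""
    results = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        return None                        # no face detected in this frame
    h, w = frame_bgr.shape[:2]
    lm = results.multi_face_landmarks[0].landmark
    return [(int(lm[i].x * w), int(lm[i].y * h)) for i in LIP_IDX]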
In some embodiments, optionally, the expression information further includes mouth-shape spatial data, and the method further includes: comparing and evaluating the teaching mouth-shape spatial data corresponding to the preset language text information with the mouth-shape spatial data of the target learner to generate an evaluation result; displaying the analysis result in a preset visualization manner further includes: displaying the evaluation result in a preset graphical manner.
For example, language text information may be preset and a standard pronunciation of it recorded; the reference lip shapes corresponding to the standard pronunciation are compared with the learner's lip shapes for the same text, and an evaluation result is obtained from the degree of matching. The lip-shape comparison and evaluation module, based on the Google research team's work on MediaPipe FaceMesh, can track lip shapes with a single RGB camera and, through real-time lip-shape comparison, helps improve the contour and shape of the mouth during pronunciation; it can run on mobile devices (mobile phones, Raspberry Pi and the like), desktop and laptop computers, and even in the browser.
In some embodiments, optionally, comparing and evaluating the teaching mouth-shape spatial data corresponding to the preset language text information with the mouth-shape spatial data of the target learner to generate an evaluation result includes: determining, frame by frame, reference lip-shape data in the teaching mouth-shape spatial data corresponding to the preset language text information; determining lip-shape data of the target learner based on invariants under image transformation; and matching the lip-shape data of the target learner against the reference lip-shape data frame by frame, with the reference lip-shape data as the benchmark, to generate an evaluation result.
For example, reference lip-shape data corresponding to a reading of the preset language text information (e.g. a poem) may be captured frame by frame; the lip-shape data of the target learner reading the same text is then determined, and the two sets of lip-shape data are matched to generate an evaluation result. Various methods can be used to measure the mouth shape when acquiring the target learner's lip-shape data; this embodiment may use Hu moments, which are relatively robust to changes in shape, size and distance to the camera while the user moves the head. A set of numbers is computed from invariants of the image transformation (the central moments), after which the shapes can be compared. Such a measurement is invariant to translation, scaling and rotation of the target, so the user can turn the head freely without affecting detection of the mouth shape itself.
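A minimal sketch of this Hu-moment comparison using OpenCV's cv2.matchShapes, which internally compares Hu-moment invariants of the two contours; the helper names are illustrative.

# Sketch: comparing lip contours with Hu-moment-based shape matching in OpenCV.
import cv2
import numpy as np


def lip_contour(points):
    """Turn a list of (x, y) lip landmarks into an OpenCV contour array."""
    return np.array(points, dtype=np.int32).reshape(-1, 1, 2)


def lip_distance(reference_points, learner_points) -> float:
    """Smaller is better; cv2.matchShapes compares Hu-moment invariants, so the
    measure is insensitive to translation, scale and rotation of the mouth."""
    ref = lip_contour(reference_points)
    cur = lip_contour(learner_points)
    return cv2.matchShapes(ref, cur, cv2.CONTOURS_MATCH_I1, 0.0)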
When matching lip shapes, the real-time video stream from the camera can be used as the target learner's lip-shape data, and the degree of matching of the mouth contour can be determined with the mouth-shape measurement technique described above (for example, on a scale where 1000 points is a full score indicating a complete match).
In some embodiments, the mouth shape may be encoded in advance, and lip-shape matching is performed through this shape encoding. Fig. 2 is a schematic view of a mouth-shape template according to an embodiment of the present invention, and Fig. 3 is a schematic view of lip-shape matching according to an embodiment of the present invention.
Referring to Fig. 2, mouth-shape templates may be preset and encoded. Sixteen mouth-shape templates can be defined, corresponding to template codes MSEC-01 to MSEC-16, which are respectively assigned the mouth-shape codes B0000, B0001, B0010, B0011, B0100, B0101, B0110, B0111, B1000, B1001, B1010, B1011, B1100, B1101, B1110 and B1111, identified by binary codes. The prediction entry in the mouth-shape template corresponds to the predicted and measured mouth shape.
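Purely as an illustration of how the sixteen template codes could be represented and used, the sketch below builds the MSEC-01 to MSEC-16 / B0000 to B1111 table and picks the nearest template for a measured lip contour; which mouth shape maps to which code is not specified here, and the helper names are assumptions.

# Hypothetical encoding of the 16 mouth-shape templates as 4-bit binary codes,
# plus a nearest-template lookup using a shape-distance function (e.g. Hu-moment
# distance from the sketch above).
TEMPLATE_CODES = {f"MSEC-{i + 1:02d}": f"B{i:04b}" for i in range(16)}


def nearest_template(learner_points, template_contours, distance_fn):
    """template_contours: dict mapping a template code (e.g. 'MSEC-03') to its
    stored reference lip points; distance_fn(reference, learner) -> float."""
    best_code = min(template_contours,
                    key=lambda code: distance_fn(template_contours[code], learner_points))
    return best_code, TEMPLATE_CODES[best_code]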
Referring to Fig. 3, the predicted mouth shape (Prediction) may be matched against the reference lip shape (Baseline). The matching degree may range up to 1000 points; a completely mismatched mouth shape scores 0, and intermediate results such as 7 points, 995 points or 1000 points may occur, as shown in the figure, without specific limitation here.
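One possible way to turn a shape distance (0 for identical contours) into the 0-1000 matching score shown in Fig. 3; the exponential form and decay constant are assumptions for illustration.

# Illustrative conversion from a shape distance to the 0-1000 matching score.
import math


def match_score(distance: float, decay: float = 25.0) -> int:
    """1000 for a perfect match, approaching 0 as the contours diverge."""
    return int(round(1000.0 * math.exp(-decay * max(distance, 0.0))))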
In some embodiments, optionally, the preset visualization manner includes numerical scores and/or visual scores;
the evaluation rule for visual scores includes: displaying set shapes in different numbers or sizes to represent the evaluation result.
For example, in a specific implementation, the target learner can capture and observe his or her own lip shape through the camera of a mobile phone, computer or other device while reciting the preset language text information; a real-time score is produced from the degree of lip matching, and an accumulated score is given once the text is finished.
Fig. 4 is a schematic diagram showing an analysis result in a preset visualization manner according to an embodiment of the present invention. Referring to Fig. 4, the display can also focus on a visual representation of the user's lip shape and give a visual score in real time, for example with the number of stars representing the user's performance, which is not described in further detail here.
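A small sketch of the star-based visual score, assuming the accumulated score is mapped linearly onto a five-star scale; the scale and rounding rule are illustrative assumptions.

# Sketch: rendering the visual score as a number of stars.
def star_rating(accumulated_score: float, max_score: float, n_stars: int = 5) -> str:
    """Return e.g. '★★★☆☆' for a 3-out-of-5 performance."""
    filled = round(n_stars * accumulated_score / max_score) if max_score else 0
    filled = max(0, min(n_stars, filled))
    return "★" * filled + "☆" * (n_stars - filled)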
According to the language training method based on deep learning provided by the embodiments of the invention, expression information of the target learner is acquired while the preset language text information is read, and the speech is analyzed and scored at preset pronunciation levels based on an intelligent speech analysis technology to obtain an analysis result; the analysis result is displayed in a preset visualization manner so that the target learner can view it in real time and correct or improve pronunciation. The invention uses intelligent speech analysis to analyze and count learners' speech features at the phoneme, character, word and sentence levels, and feeds back pronunciation errors in real time in a graphical form, helping trainees correct and improve pronunciation and thereby improving the effect of speech rehabilitation training.
Based on the same general inventive concept, an embodiment of the invention also provides a language training system based on deep learning.
Fig. 5 is a schematic structural diagram of a language training system based on deep learning according to an embodiment of the present invention. Referring to Fig. 5, the system according to an embodiment of the present invention may include the following structures: an acquisition module 31, an analysis module 32 and a display module 33.
The acquisition module 31 is used for acquiring the speech of the target learner while the target learner reads the preset language text information;
the analysis module 32 is used for performing feature analysis and statistics on the speech at preset pronunciation levels based on an intelligent speech analysis technology to obtain an analysis result; the preset pronunciation levels include at least one of phoneme, single-character pronunciation, phrase pronunciation and sentence pronunciation;
and the display module 33 is used for displaying the analysis result in a preset visualization manner so that the target learner can view it in real time and correct or improve pronunciation.
Optionally, the display module 33 is configured to display pronunciation results of the preset text with different accuracy levels using different markers; the pronunciation results include: accurate, complete and to be improved; the different markers include different colors.
Optionally, the acquisition module 31 is further configured to receive a viewing instruction and display the corresponding improvement strategy according to the viewing instruction;
if the viewing instruction is to view a pronunciation result marked as to be improved, the improvement strategy is: displaying the reference lip shape corresponding to the standard pronunciation of the character.
Optionally, the analysis module 32 is further configured to compare and evaluate the teaching mouth-shape spatial data corresponding to the preset language text information with the mouth-shape spatial data of the target learner to generate an evaluation result; the display module 33 is further configured to display the evaluation result in a preset graphical manner.
Optionally, the analysis module 32 is configured to determine, frame by frame, reference lip-shape data in the teaching mouth-shape spatial data corresponding to the preset language text information;
determine lip-shape data of the target learner based on invariants under image transformation;
and match the lip-shape data of the target learner against the reference lip-shape data frame by frame, with the reference lip-shape data as the benchmark, to generate an evaluation result.
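For orientation only, the following structural sketch mirrors the three modules described above as Python classes; the class and method names are hypothetical and the bodies are left abstract.

# Structural sketch of the acquisition / analysis / display modules (illustrative).
class AcquisitionModule:
    def acquire(self, preset_text: str):
        """Capture the learner's speech (and optionally mouth images) while the
        preset language text is read aloud; returns raw expression information."""
        raise NotImplementedError


class AnalysisModule:
    def analyze(self, expression_info, levels=("phoneme", "character", "phrase", "sentence")):
        """Run intelligent speech analysis at the requested pronunciation levels
        and return per-level analysis results."""
        raise NotImplementedError


class DisplayModule:
    def show(self, analysis_result):
        """Render the analysis result in the preset visualization manner
        (colored markers, numerical or star scores) for real-time feedback."""
        raise NotImplementedError


class LanguageTrainingSystem:
    def __init__(self, acquisition: AcquisitionModule, analysis: AnalysisModule,
                 display: DisplayModule):
        self.acquisition, self.analysis, self.display = acquisition, analysis, display

    def run(self, preset_text: str):
        info = self.acquisition.acquire(preset_text)
        result = self.analysis.analyze(info)
        self.display.show(result)
        return result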
With regard to the system in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
According to the language training system based on deep learning provided by the embodiments of the invention, expression information of the target learner is acquired while the preset language text information is read, and the speech is analyzed and scored at preset pronunciation levels based on an intelligent speech analysis technology to obtain an analysis result; the analysis result is displayed in a preset visualization manner so that the target learner can view it in real time and correct or improve pronunciation. The invention uses intelligent speech analysis to analyze and count learners' speech features at the phoneme, character, word and sentence levels, and feeds back pronunciation errors in real time in a graphical form, further helping trainees correct and improve pronunciation and thereby improving the effect of speech rehabilitation training.
Based on the same general inventive concept, an embodiment of the present invention further provides a language training device based on deep learning.
Fig. 6 is a schematic structural diagram of a language training device based on deep learning according to an embodiment of the present invention. Referring to Fig. 6, the language training device based on deep learning according to an embodiment of the present invention includes: a processor 41, and a memory 42 connected to the processor.
The memory 42 is used for storing a computer program, the computer program being used at least to perform the deep learning-based language training method described in any of the above embodiments;
the processor 41 is used to invoke and execute the computer program in the memory.
Based on the same general inventive concept, embodiments of the present invention also provide a storage medium.
A storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described deep learning based language training method.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar contents in other embodiments may be referred to for the contents which are not described in detail in some embodiments.
It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present invention, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following technologies, which are well known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A language training method based on deep learning, comprising:
acquiring expression information of a target learner while the target learner reads preset language text information, wherein the expression information comprises speech;
performing feature analysis and statistics on the speech at preset pronunciation levels based on an intelligent speech analysis technology to obtain an analysis result; the preset pronunciation levels comprise at least one of phoneme, single-character pronunciation, phrase pronunciation and sentence pronunciation;
and displaying the analysis result in a preset visualization manner so that the target learner can view the analysis result in real time to correct or improve pronunciation.
2. The method of claim 1, wherein displaying the analysis result in a preset visualization manner comprises: displaying pronunciation results of the preset text with different accuracy levels using different markers; the pronunciation results comprise: accurate, complete and to be improved; the different markers comprise different colors.
3. The method of claim 2, further comprising:
receiving a viewing instruction, and displaying a corresponding improvement strategy according to the viewing instruction;
wherein, if the viewing instruction is to view a pronunciation result marked as to be improved, the improvement strategy is: displaying the reference lip shape corresponding to the standard pronunciation of the character.
4. The method of claim 1, wherein the expression information further comprises: mouth-shape spatial data; and the method further comprises:
comparing and evaluating teaching mouth-shape spatial data corresponding to the preset language text information with the mouth-shape spatial data of the target learner to generate an evaluation result;
wherein displaying the analysis result in a preset visualization manner further comprises: displaying the evaluation result in a preset graphical manner.
5. The method of claim 4, wherein the mouth-shape spatial data comprises: a lip shape; and acquiring the expression information of the target learner while reading the preset language text information comprises:
extracting, based on a 3D face model, the lip shape of the target learner while pronouncing the preset language text information.
6. The method of claim 5, wherein comparing and evaluating the teaching mouth-shape spatial data corresponding to the preset language text information with the mouth-shape spatial data of the target learner to generate an evaluation result comprises:
determining, frame by frame, reference lip-shape data in the teaching mouth-shape spatial data corresponding to the preset language text information;
determining lip-shape data of the target learner based on invariants under image transformation;
and matching the lip-shape data of the target learner against the reference lip-shape data frame by frame, with the reference lip-shape data as the benchmark, to generate an evaluation result.
7. The method of claim 4, wherein acquiring the expression information of the target learner while reading the preset language text information comprises:
acquiring, based on a camera device, a mouth image of the target learner while pronouncing the preset language text information, and displaying the mouth image on a display interface;
and acquiring, from the mouth image, mouth-shape spatial data of the target learner while pronouncing the preset language text information.
8. The method of claim 1, wherein the preset visualization manner comprises: numerical scores and/or visual scores;
and the evaluation rule for the visual scores comprises: displaying set shapes in different numbers or sizes to represent the evaluation result.
9. The method according to any one of claims 1 to 8, wherein performing feature analysis and statistics on the speech at the preset pronunciation levels based on the intelligent speech analysis technology to obtain an analysis result comprises:
comparing the speech with a preset hearing-impaired acoustic language model based on the intelligent speech analysis technology to obtain the analysis result.
10. A language training system based on deep learning, comprising: an acquisition module, an analysis module and a display module;
wherein the acquisition module is used for acquiring the speech of the target learner while the target learner reads preset language text information;
the analysis module is used for performing feature analysis and statistics on the speech at preset pronunciation levels based on an intelligent speech analysis technology to obtain an analysis result; the preset pronunciation levels comprise at least one of phoneme, single-character pronunciation, phrase pronunciation and sentence pronunciation;
and the display module is used for displaying the analysis result in a preset visualization manner so that the target learner can view the analysis result in real time to correct or improve pronunciation.
CN202210380469.7A — Language training method and system based on deep learning — priority date 2021-07-20, filed 2022-04-12 — Pending — published as CN114758647A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110820211 2021-07-20
CN202110820211X 2021-07-20

Publications (1)

Publication Number Publication Date
CN114758647A true CN114758647A (en) 2022-07-15

Family

ID=82328507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210380469.7A Pending CN114758647A (en) 2021-07-20 2022-04-12 Language training method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN114758647A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080004879A1 (en) * 2006-06-29 2008-01-03 Wen-Chen Huang Method for assessing learner's pronunciation through voice and image
JP2010185967A (en) * 2009-02-10 2010-08-26 Chubu Electric Power Co Inc Pronunciation training device
CN102169642A (en) * 2011-04-06 2011-08-31 李一波 Interactive virtual teacher system having intelligent error correction function
US20150056580A1 (en) * 2013-08-26 2015-02-26 Seli Innovations Inc. Pronunciation correction apparatus and method thereof
KR20190006348A (en) * 2017-07-10 2019-01-18 조희정 Method for training assessment and studying voice
CN109036464A (en) * 2018-09-17 2018-12-18 腾讯科技(深圳)有限公司 Pronounce error-detecting method, device, equipment and storage medium
CN109545189A (en) * 2018-12-14 2019-03-29 东华大学 A kind of spoken language pronunciation error detection and correcting system based on machine learning
CN110349565A (en) * 2019-07-02 2019-10-18 长春大学 A kind of auxiliary word pronunciation learning method and its system towards hearing-impaired people
US10997970B1 (en) * 2019-07-30 2021-05-04 Abbas Rafii Methods and systems implementing language-trainable computer-assisted hearing aids
CN111832412A (en) * 2020-06-09 2020-10-27 北方工业大学 Sound production training correction method and system
CN112614489A (en) * 2020-12-22 2021-04-06 作业帮教育科技(北京)有限公司 User pronunciation accuracy evaluation method and device and electronic equipment

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination