CN103218924A - Audio and video dual mode-based spoken language learning monitoring method - Google Patents

Audio and video dual mode-based spoken language learning monitoring method

Info

Publication number
CN103218924A
CN103218924A CN201310108831A
Authority
CN
China
Prior art keywords
video
user
pronunciation unit
information
monitoring method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201310108831
Other languages
Chinese (zh)
Inventor
许东星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI ZHONGSHI TECHNOLOGY DEVELOPMENT Co Ltd
Original Assignee
SHANGHAI ZHONGSHI TECHNOLOGY DEVELOPMENT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI ZHONGSHI TECHNOLOGY DEVELOPMENT Co Ltd filed Critical SHANGHAI ZHONGSHI TECHNOLOGY DEVELOPMENT Co Ltd
Priority to CN 201310108831 priority Critical patent/CN103218924A/en
Publication of CN103218924A publication Critical patent/CN103218924A/en
Pending legal-status Critical Current

Abstract

The invention discloses an audio and video dual mode-based spoken language learning monitoring method, which comprises the following steps: (a) establishing a sound information base and an image feature information base of all standard pronunciation units; (b) acquiring the user's sound and video information in real time during spoken language learning, compressing and encoding it, and transmitting it to a server end; (c) the server decoding and then segmenting the user's sound to obtain the sound information matching degree of each of the user's pronunciation units; and (d) the server extracting the image action feature information corresponding to each pronunciation unit from the simultaneously acquired video information and giving the matching degree between it and the image feature information of the standard pronunciation unit. According to the audio and video dual mode-based spoken language learning monitoring method provided by the invention, the sound and video information are acquired simultaneously, and the sound and the image feature information are segmented and compared respectively, so that the deficiencies of pronunciation and their causes can be found quickly and accurately, the dependence on teacher resources is reduced, and learning efficiency is greatly improved.

Description

Audio and video dual mode-based spoken language learning monitoring method
Technical field
The present invention relates to a method for monitoring a user's online learning, and in particular to an audio and video dual mode-based spoken language learning monitoring method.
Background technology
At present, under the general trend of globalization, spoken language education is becoming a huge industry worldwide. In China, the enthusiasm both of Chinese learners of foreign languages and of foreigners learning Chinese keeps rising. On the one hand, a foreign language (particularly English) is an indispensable tool in business communication, which has motivated working professionals in China to study foreign languages; according to incomplete statistics, in large cities such as Beijing and Shanghai about 1% of working professionals spend more than 10% of their income on foreign language learning. On the other hand, alongside the globalization-driven boom in English learning, new waves such as the craze for Chinese language and culture have also emerged.
However, the conventional mode of language education is increasingly unable to meet this demand. As the basis of mutual communication, modern language learning places more and more emphasis on pronunciation. Although the teacher serves as an effective feedback source in language teaching, some problems remain unsolved: language learning requires repeated training, and learners need to make effective use of fragmented time to practice anytime and anywhere, yet teacher resources are limited and one-on-one guidance cannot be given to every student at all times. Under the conventional teaching mode, many students gradually lose interest in language learning and end up with a 'mute' foreign language learned only for examinations.
Computer-assisted language learning (CALL) delivers classroom teaching and supplementary extracurricular practice according to a teaching plan and content arranged in advance. As early as around 1955, researchers began to explore how computers could be used in education. Today, computer-assisted language learning combined with telecommunication networks is widely applied to spoken language learning. Online spoken language learning platforms are increasingly favored by ordinary users because of their flexible schedules and low cost. For such a platform, however, if the number of users grows without a matching increase in qualified teachers, the teaching resources available to each user inevitably become insufficient; and with labor costs rising, how to supervise users' learning effectively has become a major issue facing spoken language learning platforms. It is therefore necessary to provide an audio and video dual mode-based spoken language learning monitoring method that can replace most teacher resources, automatically analyze and compare the pronunciation and mouth shape in a user's spoken language learning, find the deficiencies in the user's pronunciation and their causes, help the user correct them, and promote the user's language learning.
Summary of the invention
The technical problem to be solved by the present invention is to provide an audio and video dual mode-based spoken language learning monitoring method that can automatically analyze and compare the pronunciation and mouth shape in a user's spoken language learning, help the user find the deficiencies in pronunciation and their causes, reduce the dependence on teacher resources, and improve learning efficiency.
The technical solution adopted by the present invention to solve the above technical problem is to provide an audio and video dual mode-based spoken language learning monitoring method comprising the following steps: a) establishing a sound information base and an image feature information base for all standard pronunciation units; b) acquiring the user's voice and video information in real time during spoken language learning, compressing and encoding them, and transmitting them to a server end; c) after the server decodes the data uploaded by the user, segmenting the user's voice to obtain the sound information of each of the user's pronunciation units, and giving the matching degree between each of the user's pronunciation units and the sound information of the standard pronunciation unit; d) the server extracting, from the simultaneously acquired video information, the image action feature information corresponding to each pronunciation unit, and giving the matching degree between each of the user's pronunciation units and the image feature information of the standard pronunciation unit.
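By way of illustration only, the following minimal Python sketch shows one way a server might organize steps c) and d); the helper functions decode_upload, force_align, sound_degree and image_degree are hypothetical placeholders, not part of the patent.

```python
# Minimal server-side sketch of steps c) and d): segment the decoded
# speech, then score each pronunciation unit acoustically and visually.
# decode_upload, force_align, sound_degree and image_degree are
# hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class UnitScore:
    unit: str        # pronunciation unit label, e.g. "w"
    start: float     # start time in seconds
    end: float       # end time in seconds
    acoustic: float  # sound information matching degree (step c)
    visual: float    # image feature information matching degree (step d)

def evaluate_upload(payload):
    audio, video = decode_upload(payload)          # decode the uploaded data
    scores = []
    for unit, start, end in force_align(audio):    # segment the user's voice
        scores.append(UnitScore(
            unit, start, end,
            sound_degree(audio, unit, start, end),   # step c) matching degree
            image_degree(video, unit, start, end)))  # step d) matching degree
    return scores
```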
In the above audio and video dual mode-based spoken language learning monitoring method, the sound information matching degree is computed with a hidden Markov model, the features are Mel cepstral features, and the matching degree is the posterior probability output by the hidden Markov model.
In the above audio and video dual mode-based spoken language learning monitoring method, the image feature information comprises the positions of the lips, teeth and tongue corresponding to each pronunciation unit, and the image feature information matching degree is the deviation between the positions of the lips, teeth and tongue during the user's pronunciation and those of the corresponding standard pronunciation unit.
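As a sketch of the acoustic side under stated assumptions: one GaussianHMM per pronunciation unit over MFCC (Mel cepstral) features, with the length-normalized log-likelihood from hmmlearn's score() standing in for the posterior-probability matching degree the patent describes.

```python
# Acoustic matching degree sketch: Mel cepstral (MFCC) features scored
# against a per-unit hidden Markov model. hmmlearn's score() returns a
# log-likelihood; its per-frame value is used here as a simplified
# stand-in for the posterior probability.
import librosa
import numpy as np
from hmmlearn import hmm

def train_unit_hmm(wavs, sr):
    """Train one GaussianHMM on MFCC sequences of a single pronunciation unit."""
    feats = [librosa.feature.mfcc(y=w, sr=sr, n_mfcc=13).T for w in wavs]
    model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    model.fit(np.vstack(feats), lengths=[len(f) for f in feats])
    return model

def acoustic_match(model, segment, sr):
    """Per-frame log-likelihood of a user's segment under the unit's HMM."""
    feats = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13).T
    return model.score(feats) / len(feats)
```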
In the above audio and video dual mode-based spoken language learning monitoring method, segmenting the user's voice in step c) also gives the start and end time of each pronunciation unit, and step d) extracts the image action feature information corresponding to each pronunciation unit from the simultaneously acquired video information according to that unit's start and end time.
In the above audio and video dual mode-based spoken language learning monitoring method, step d) extracts N pictures from the simultaneously acquired video information according to the start and end time of each pronunciation unit, compares the image feature information matching degree of each picture against the standard pronunciation unit, and computes the average, N being a natural number.
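A minimal sketch of this N-picture sampling and averaging, assuming OpenCV timestamp seeking; frame_match is a hypothetical per-picture scorer against the standard unit.

```python
# Sample N frames evenly across a pronunciation unit's start/end times
# and average the per-frame matching degrees. frame_match() is a
# hypothetical scorer, not defined by the patent.
import cv2
import numpy as np

def sample_frames(video_path, start, end, n):
    """Yield up to n frames evenly spaced across [start, end] seconds."""
    cap = cv2.VideoCapture(video_path)
    for t in np.linspace(start, end, n):
        cap.set(cv2.CAP_PROP_POS_MSEC, float(t) * 1000.0)  # seek by timestamp
        ok, frame = cap.read()
        if ok:
            yield frame
    cap.release()

def visual_match(video_path, start, end, n=5):
    """Average the per-frame matching degrees over n sampled pictures."""
    degrees = [frame_match(f) for f in sample_frames(video_path, start, end, n)]
    return float(np.mean(degrees)) if degrees else 0.0
```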
In the above audio and video dual mode-based spoken language learning monitoring method, extracting the image action feature information corresponding to a pronunciation unit comprises the following process: first locating the human face in each picture, and then detecting the contour positions of the lips, tongue and teeth with an edge extraction algorithm based on the color gradient field.
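The following sketch illustrates this two-stage process under stated assumptions: a Haar cascade stands in for the face detection module, and a per-channel Sobel gradient magnitude approximates the color-gradient-field edge extraction; the patent does not specify either implementation.

```python
# Locate the face, crop the mouth region, and derive a binary edge map
# from the per-channel color gradient magnitude (an approximation of
# the color-gradient-field edge extraction named by the patent).
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_edges(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return None                                   # no face found; skip frame
    x, y, w, h = faces[0]
    mouth = frame[y + 2 * h // 3 : y + h, x : x + w]  # lower third of the face
    gx = cv2.Sobel(mouth.astype(np.float32), cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(mouth.astype(np.float32), cv2.CV_32F, 0, 1, ksize=3)
    mag = np.sqrt(gx ** 2 + gy ** 2).max(axis=2)      # strongest channel gradient
    return (mag > mag.mean() + 2 * mag.std()).astype(np.uint8)
```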
Compared with the prior art, the present invention has the following beneficial effect: the audio and video dual mode-based spoken language learning monitoring method provided by the invention acquires the voice and video information of the user's spoken language learning simultaneously, and segments and compares the voice and the image feature information of pronunciation respectively, so that the user can be helped to find the deficiencies of pronunciation and their causes quickly and accurately, the dependence on teacher resources is reduced, and learning efficiency is greatly improved.
Description of drawings
Fig. 1 is a schematic flowchart of the audio and video dual mode-based spoken language learning monitoring method of the present invention.
Embodiment
The invention is further described below with reference to the accompanying drawing and embodiments.
Fig. 1 is a schematic flowchart of the audio and video dual mode-based spoken language learning monitoring method of the present invention.
Referring to Fig. 1, the audio and video dual mode-based spoken language learning monitoring method provided by the invention comprises the following steps:
S101: establish a sound information base and an image feature information base for all standard pronunciation units. Chinese phoneme units, or even finer initial/final units, may serve as the standard pronunciation units. A standard pronunciation model is trained on a database that covers different age groups and genders, contains image information of the pronunciation of all standard pronunciation units, and carries standard pronunciation annotations. A hidden Markov model is selected as the model for the sound information base, and a support vector machine is preferably adopted for building the image feature information base.
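A sketch of building these two bases, under stated assumptions: one HMM per unit for the sound information base (train_unit_hmm as sketched earlier) and a scikit-learn SVM for the image feature information base; the feature layout is an assumption, as the patent does not specify it.

```python
# Build the two standard-pronunciation bases of S101.
import numpy as np
from sklearn.svm import SVC

def build_sound_base(unit_waveforms, sr):
    """One HMM per standard pronunciation unit; unit_waveforms maps a
    unit label to its list of training waveforms."""
    return {u: train_unit_hmm(wavs, sr) for u, wavs in unit_waveforms.items()}

def build_image_base(features, unit_labels):
    """SVM over lip/teeth/tongue position vectors (assumed layout)."""
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(features, unit_labels)
    return clf
```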
S102: as required by the language learning system, the user opens the microphone and camera and reads/speaks the content to be learned. The system then acquires the user's voice and mouth-shape audio/video information in real time, compresses and encodes it, and transmits it to the server end.
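As a client-side sketch only: recording microphone audio and webcam frames, then uploading both; the endpoint URL and payload layout are placeholders, and a real system would compress the streams (e.g. to AAC/H.264) before upload rather than sending raw samples and JPEG frames.

```python
# Record audio and video for a prompt and POST them to the server.
import cv2
import requests
import sounddevice as sd

def record_and_upload(seconds, url, sr=16000, fps=25):
    audio = sd.rec(int(seconds * sr), samplerate=sr, channels=1)  # non-blocking
    cap = cv2.VideoCapture(0)
    frames = []
    while len(frames) < int(seconds * fps):           # assume ~25 fps capture
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.imencode(".jpg", frame)[1].tobytes())
    cap.release()
    sd.wait()                        # block until the audio recording finishes
    requests.post(url, files={"audio": audio.tobytes(),
                              "video": b"".join(frames)})
```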
S103: after the server receives the data uploaded by the user, it decodes the voice and the mouth-shape (video) information for analysis. A speech segmentation module applies automatic speech recognition technology with an acoustic model trained on a large-scale dataset to segment the user's voice, yielding the timestamp corresponding to each elementary unit of the user's pronunciation. Taking Chinese as an example, the speech for the word for 'we' (我们) can be segmented into the four phoneme units 'w o m en' together with the start and end time of each phoneme unit, and can even be divided into finer initial/final units. An acoustic evaluation module then evaluates the user's speech as follows: after the audio is segmented, each segmented voice unit is matched against the model of the standard pronunciation unit; the sound information matching model is a hidden Markov model, the features are Mel cepstral features, and the matching degree is the posterior probability output by the hidden Markov model.
S104: since the audio and video are acquired simultaneously, the video is segmented using the phoneme segmentation information of the audio, and articulation feature information, such as the positions of the mouth, teeth and tongue, is extracted from the video. A video evaluation module then evaluates the user's articulation: the positions of the mouth, teeth and tongue are compared against the standard pronunciation (model) of the phoneme, with the position deviations of the mouth, teeth and tongue serving as the image feature information matching degree; if the matching degree is below the threshold corresponding to the phoneme, the user's pronunciation of the current phoneme may be problematic. The concrete matching process is as follows:
1. Mark the video according to the audio segmentation information to obtain the segmented video clips;
2. Take a picture from the segmented video;
3. Locate the human face in the picture with a face detection module, for example by a template matching method;
4. Use a target extraction module to detect the contour positions of the lips, tongue and teeth, and parameterize the detected contours, for example with a common edge extraction algorithm based on the color gradient field; note that the tongue and teeth are sometimes invisible, in which case their missing edges are ignored during matching;
5. According to the segmentation result, match the input parameters against the corresponding model to obtain the video evaluation result.
To improve matching precision, N pictures in total (N being a natural number) can be extracted from each segmented video clip according to its start and end time, repeating matching steps 1-5; the resulting N image feature information matching degrees are averaged to obtain the final video evaluation result, as sketched below. Finally, the audio and video evaluation results are analyzed together to determine where the user's pronunciation is non-standard and the likely causes of the errors, which are fed back to the user; at the same time, the collected non-standard pronunciation data is added to the database to accumulate user data.
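A sketch tying matching steps 1-5 together for one pronunciation unit, reusing the earlier sample_frames and mouth_edges sketches; contour_params and model_match are hypothetical placeholders for the contour parameterization and model comparison.

```python
# Steps 1-5 for one unit: cut by the audio segmentation, sample N
# pictures, detect the face and parameterize the mouth contours, match
# against the unit's model, and average the matching degrees.
import numpy as np

def evaluate_unit_video(video_path, unit, start, end, n=5):
    degrees = []
    for frame in sample_frames(video_path, start, end, n):        # steps 1-2
        edges = mouth_edges(frame)                                # steps 3-4
        if edges is None:
            continue                      # invisible articulators are skipped
        degrees.append(model_match(unit, contour_params(edges)))  # step 5
    return float(np.mean(degrees)) if degrees else 0.0
```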
In summary, the audio and video dual mode-based spoken language learning monitoring method provided by the invention acquires the voice and video information of the user's spoken language learning simultaneously, and segments and compares the voice and the image feature information of pronunciation respectively, so that the user can be helped to find the deficiencies of pronunciation and their causes quickly and accurately. The concrete advantages are as follows: 1) the audio-video dual mode locates the deficiencies in the user's pronunciation more accurately; 2) segmenting the video information by means of the audio information reduces the computational complexity of the video analysis; 3) analyzing the actions of the user's pronunciation process through the video information finds the causes of non-standard pronunciation more effectively; 4) the audio-video dual mode lets the user review the recording, hear/see the concrete shortcomings and, through comparison, carry out more effective targeted training.
Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Any person skilled in the art may make minor modifications and refinements without departing from the spirit and scope of the invention, and the protection scope of the invention shall therefore be defined by the appended claims.

Claims (6)

1. An audio and video dual mode-based spoken language learning monitoring method, characterized by comprising the following steps:
a) establishing a sound information base and an image feature information base for all standard pronunciation units;
b) acquiring the user's voice and video information in real time during spoken language learning, compressing and encoding them, and transmitting them to a server end;
c) after the server decodes the data uploaded by the user, segmenting the user's voice to obtain the sound information of each of the user's pronunciation units, and giving the matching degree between each of the user's pronunciation units and the sound information of the standard pronunciation unit;
d) the server extracting, from the simultaneously acquired video information, the image action feature information corresponding to each pronunciation unit, and giving the matching degree between each of the user's pronunciation units and the image feature information of the standard pronunciation unit.
2. The audio and video dual mode-based spoken language learning monitoring method of claim 1, characterized in that the sound information matching degree is computed with a hidden Markov model, the features are Mel cepstral features, and the matching degree is the posterior probability output by the hidden Markov model.
3. The audio and video dual mode-based spoken language learning monitoring method of claim 1, characterized in that the image feature information comprises the positions of the lips, teeth and tongue corresponding to each pronunciation unit, and the image feature information matching degree is the deviation between the positions of the lips, teeth and tongue during the user's pronunciation and those of the corresponding standard pronunciation unit.
4. The audio and video dual mode-based spoken language learning monitoring method of any one of claims 1 to 3, characterized in that segmenting the user's voice in step c) also gives the start and end time of each pronunciation unit, and step d) extracts the image action feature information corresponding to each pronunciation unit from the simultaneously acquired video information according to that unit's start and end time.
5. The audio and video dual mode-based spoken language learning monitoring method of claim 4, characterized in that step d) extracts N pictures from the simultaneously acquired video information according to the start and end time of each pronunciation unit, compares the image feature information matching degree of each picture against the standard pronunciation unit, and computes the average, N being a natural number.
6. The audio and video dual mode-based spoken language learning monitoring method of claim 5, characterized in that extracting the image action feature information corresponding to a pronunciation unit comprises the following process: first locating the human face in each picture, and then detecting the contour positions of the lips, tongue and teeth with an edge extraction algorithm based on the color gradient field.
CN 201310108831 2013-03-29 2013-03-29 Audio and video dual mode-based spoken language learning monitoring method Pending CN103218924A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201310108831 CN103218924A (en) 2013-03-29 2013-03-29 Audio and video dual mode-based spoken language learning monitoring method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201310108831 CN103218924A (en) 2013-03-29 2013-03-29 Audio and video dual mode-based spoken language learning monitoring method

Publications (1)

Publication Number Publication Date
CN103218924A true CN103218924A (en) 2013-07-24

Family

ID=48816664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201310108831 Pending CN103218924A (en) 2013-03-29 2013-03-29 Audio and video dual mode-based spoken language learning monitoring method

Country Status (1)

Country Link
CN (1) CN103218924A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103413467A (en) * 2013-08-01 2013-11-27 袁苗达 Controllable compelling guide type self-reliance study system
CN104157181A (en) * 2014-07-22 2014-11-19 雷青云 Language teaching method and system
CN104505089A (en) * 2014-12-17 2015-04-08 福建网龙计算机网络信息技术有限公司 Method and equipment for oral error correction
CN104537901A (en) * 2014-12-02 2015-04-22 渤海大学 Spoken English learning machine based on audios and videos
CN105070118A (en) * 2015-07-30 2015-11-18 广东小天才科技有限公司 Method of correcting pronunciation aiming at language class learning and device of correcting pronunciation aiming at language class learning
CN105096935A (en) * 2014-05-06 2015-11-25 阿里巴巴集团控股有限公司 Voice input method, device, and system
WO2017082447A1 (en) * 2015-11-11 2017-05-18 주식회사 엠글리쉬 Foreign language reading aloud and displaying device and method therefor, motor learning device and motor learning method based on foreign language rhythmic action detection sensor, using same, and electronic medium and studying material in which same is recorded
KR20170055146A (en) * 2015-11-11 2017-05-19 주식회사 엠글리쉬 Apparatus and method for displaying foreign language and mother language by using english phonetic symbol
CN106971647A (en) * 2017-02-07 2017-07-21 广东小天才科技有限公司 A kind of Oral Training method and system of combination body language
CN107578772A (en) * 2017-08-17 2018-01-12 天津快商通信息技术有限责任公司 Merge acoustic feature and the pronunciation evaluating method and system of pronunciation movement feature
CN108352126A (en) * 2015-11-11 2018-07-31 株式会社Mglish Foreign language pronunciation and labelling apparatus and its method, including the use of the motor learning device based on foreign language rhythm action sensor, motor learning method and the electronic medium recorded to it and study teaching material of its device and method
CN109920304A (en) * 2019-04-19 2019-06-21 沈阳凯森教育科技有限公司 A kind of pronounciation training system and pronounciation training method
CN110047511A (en) * 2019-04-23 2019-07-23 赵旭 A kind of speech training method, device, computer equipment and its storage medium
CN111047922A (en) * 2019-12-27 2020-04-21 浙江工业大学之江学院 Pronunciation teaching method, device, system, computer equipment and storage medium
WO2020125038A1 (en) * 2018-12-17 2020-06-25 南京人工智能高等研究院有限公司 Voice control method and device
CN113421467A (en) * 2021-06-15 2021-09-21 读书郎教育科技有限公司 System and method for assisting in learning pinyin spelling and reading
CN115966061A (en) * 2022-12-28 2023-04-14 上海帜讯信息技术股份有限公司 Disaster warning processing method, system and device based on 5G message

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103413467A (en) * 2013-08-01 2013-11-27 袁苗达 Controllable compelling guide type self-reliance study system
CN105096935B (en) * 2014-05-06 2019-08-09 阿里巴巴集团控股有限公司 A kind of pronunciation inputting method, device and system
CN105096935A (en) * 2014-05-06 2015-11-25 阿里巴巴集团控股有限公司 Voice input method, device, and system
CN104157181A (en) * 2014-07-22 2014-11-19 雷青云 Language teaching method and system
CN104537901A (en) * 2014-12-02 2015-04-22 渤海大学 Spoken English learning machine based on audios and videos
CN104505089A (en) * 2014-12-17 2015-04-08 福建网龙计算机网络信息技术有限公司 Method and equipment for oral error correction
CN105070118A (en) * 2015-07-30 2015-11-18 广东小天才科技有限公司 Method of correcting pronunciation aiming at language class learning and device of correcting pronunciation aiming at language class learning
KR20170055146A (en) * 2015-11-11 2017-05-19 주식회사 엠글리쉬 Apparatus and method for displaying foreign language and mother language by using english phonetic symbol
US10978045B2 (en) 2015-11-11 2021-04-13 Mglish Inc. Foreign language reading and displaying device and a method thereof, motion learning device based on foreign language rhythm detection sensor and motion learning method, electronic recording medium, and learning material
CN108352126A (en) * 2015-11-11 2018-07-31 株式会社Mglish Foreign language pronunciation and labelling apparatus and its method, including the use of the motor learning device based on foreign language rhythm action sensor, motor learning method and the electronic medium recorded to it and study teaching material of its device and method
WO2017082447A1 (en) * 2015-11-11 2017-05-18 주식회사 엠글리쉬 Foreign language reading aloud and displaying device and method therefor, motor learning device and motor learning method based on foreign language rhythmic action detection sensor, using same, and electronic medium and studying material in which same is recorded
KR101990021B1 (en) * 2015-11-11 2019-06-18 주식회사 엠글리쉬 Apparatus and method for displaying foreign language and mother language by using english phonetic symbol
CN106971647A (en) * 2017-02-07 2017-07-21 广东小天才科技有限公司 A kind of Oral Training method and system of combination body language
WO2019034184A1 (en) * 2017-08-17 2019-02-21 厦门快商通科技股份有限公司 Method and system for articulation evaluation by fusing acoustic features and articulatory movement features
CN107578772A (en) * 2017-08-17 2018-01-12 天津快商通信息技术有限责任公司 Merge acoustic feature and the pronunciation evaluating method and system of pronunciation movement feature
US11786171B2 (en) 2017-08-17 2023-10-17 Xiamen Kuaishangtong Tech. Corp., Ltd. Method and system for articulation evaluation by fusing acoustic features and articulatory movement features
WO2020125038A1 (en) * 2018-12-17 2020-06-25 南京人工智能高等研究院有限公司 Voice control method and device
CN109920304A (en) * 2019-04-19 2019-06-21 沈阳凯森教育科技有限公司 A kind of pronounciation training system and pronounciation training method
CN110047511A (en) * 2019-04-23 2019-07-23 赵旭 A kind of speech training method, device, computer equipment and its storage medium
CN111047922A (en) * 2019-12-27 2020-04-21 浙江工业大学之江学院 Pronunciation teaching method, device, system, computer equipment and storage medium
CN113421467A (en) * 2021-06-15 2021-09-21 读书郎教育科技有限公司 System and method for assisting in learning pinyin spelling and reading
CN115966061A (en) * 2022-12-28 2023-04-14 上海帜讯信息技术股份有限公司 Disaster warning processing method, system and device based on 5G message
CN115966061B (en) * 2022-12-28 2023-10-24 上海帜讯信息技术股份有限公司 Disaster early warning processing method, system and device based on 5G message

Similar Documents

Publication Publication Date Title
CN103218924A (en) Audio and video dual mode-based spoken language learning monitoring method
CN106331893B (en) Real-time caption presentation method and system
CN111341305B (en) Audio data labeling method, device and system
CN103559892B (en) Oral evaluation method and system
CN106782603B (en) Intelligent voice evaluation method and system
CN103594087B (en) Improve the method and system of oral evaluation performance
CN106205634A (en) A kind of spoken English in college level study and test system and method
CN110600033B (en) Learning condition evaluation method and device, storage medium and electronic equipment
CN103559894A (en) Method and system for evaluating spoken language
WO2019218467A1 (en) Method and apparatus for dialect recognition in voice and video calls, terminal device, and medium
WO2023050650A1 (en) Animation video generation method and apparatus, and device and storage medium
US11810471B2 (en) Computer implemented method and apparatus for recognition of speech patterns and feedback
CN109326162A (en) A kind of spoken language exercise method for automatically evaluating and device
CN105895080A (en) Voice recognition model training method, speaker type recognition method and device
CN112614489A (en) User pronunciation accuracy evaluation method and device and electronic equipment
CN109300469A (en) Simultaneous interpretation method and device based on machine learning
CN104347071B (en) Method and system for generating reference answers of spoken language test
CN110111778B (en) Voice processing method and device, storage medium and electronic equipment
CN109545196B (en) Speech recognition method, device and computer readable storage medium
CN104505089A (en) Method and equipment for oral error correction
CN115249479A (en) BRNN-based power grid dispatching complex speech recognition method, system and terminal
CN107886940B (en) Voice translation processing method and device
CN109961789A (en) One kind being based on video and interactive voice service equipment
CN113837907A (en) Man-machine interaction system and method for English teaching
WO2022178996A1 (en) Multi-language speech model generation method and apparatus, computer device, and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130724