TWI294107B - A pronunciation-scored method for the application of voice and image in the e-learning - Google Patents

A pronunciation-scored method for the application of voice and image in the e-learning

Info

Publication number
TWI294107B
TWI294107B TW95115175A
Authority
TW
Taiwan
Prior art keywords
image
lip
value
pronunciation
sound
Prior art date
Application number
TW95115175A
Other languages
Chinese (zh)
Other versions
TW200741605A (en)
Inventor
Wen Chen Huang
Original Assignee
Univ Nat Kaohsiung 1St Univ Sc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Kaohsiung 1St Univ Sc filed Critical Univ Nat Kaohsiung 1St Univ Sc
Priority to TW95115175A priority Critical patent/TWI294107B/en
Publication of TW200741605A publication Critical patent/TW200741605A/en
Application granted granted Critical
Publication of TWI294107B publication Critical patent/TWI294107B/en

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Description

IX. Description of the Invention

[Technical Field]

The present invention relates to a pronunciation evaluation method that applies voice and image to e-learning. In particular, it uses a visual approach: the changes in the learner's lip shape during pronunciation are detected and compared against the teacher's, so that the positions where the pronunciation may be wrong can be displayed, while the learner's voice is recorded for subsequent scoring of both the sound and the image.

[Prior Art]

Research on lip recognition has grown considerably in recent years, most of it concerning lip contour extraction. One approach works through lip models: different types of lips are first classified into a set of lip contour patterns (templates), the lips are described mathematically or geometrically to obtain general contour information, and a fuzzy clustering method is then used to decide which contour pattern a given lip maps to; the matched pattern is deformed onto the lip to extract its contour.

Meanwhile, e-learning is applied ever more widely and home computer use keeps rising, so digital learning has attracted growing attention. Many computer-assisted instruction (CAI) packages, for example, let learners study through a graphical interface. Much English-teaching software offers listening, speaking, reading, and writing lessons and exercises, but for the "speaking" part most packages still provide only instructional videos or native-speaker recordings; they cannot judge whether the learner's pronunciation is correct, which makes pronunciation practice a major challenge, and relying on native recordings alone yields half the result for twice the effort. For hearing-impaired people, such audio-based pronunciation teaching does not work at all, because they cannot learn the language from the teacher's voice.

Furthermore, much of today's asynchronous online instruction offers nothing but instructional videos streamed to the client through a multimedia server, which raises several problems when it is used for pronunciation teaching. First, learners cannot tell whether their own pronunciation is correct: they merely listen to the teacher's pronunciation or practice sentences, and they get no feedback when repeating them. Second, they cannot locate the erroneous part of an utterance; by listening repeatedly they can only make their pronunciation similar, without knowing which syllable's lip shape is wrong. Third, hearing-impaired people cannot learn pronunciation from such teaching sentences at all.

Moreover, among the listening, speaking, reading, and writing aspects of speech training, the evaluation of and tools for "speaking" have always been important but comparatively difficult. Some studies at home and abroad compare and score the acoustic characteristics of a practitioner's speech, but studies that exploit multimedia characteristics, such as changes in lip shape and their comparison, remain rare.

With template-based contour matching, some lips may fail to find a similar or matching contour among the templates, which can lead to misjudgment. In recent years the use of visual lip information to confirm a speaker's words has attracted many researchers: when the acoustic environment contains external noise the recognition rate drops sharply, but supplementary lip-shape information can effectively raise the accuracy. Because lip contours and colors vary widely, lip extraction becomes harder as the contrast between facial skin color and lip color weakens.

Current lip image segmentation methods fall mainly into two classes. The first obtains the lips directly through a color space: an RGB color transformation maps the image into another color system to strengthen the contrast between lips and skin. This is fast, but the transformation introduces considerable noise, so delineating the lip contour boundary becomes difficult when the color contrast is weak; a Markov random field technique can then be used to reduce the segmentation errors that noise causes after the color transformation. The second class segments the lips with fuzzy C-means clustering (FCM), a clustering method derived from the C-means algorithm and first proposed by Bezdek in 1973. It is a data clustering technique in which a fuzzy membership value expresses the degree to which each data point belongs to a given cluster center; the objective function is the sum, over all known points and cluster centers, of the distance from the point to the center weighted by the point's membership value, and the clusters are determined by minimizing this objective function. Since FCM is also affected by the strength of the color contrast, a newer fuzzy algorithm, fuzzy C-means with shape function (FCMS), adds spatial-distance terms and color-contrast enhancement. Because similarly colored lip pixels may be scattered, or the extracted lip region may be incomplete, an elliptic shape function can localize the lip region, and computing over this region yields a more accurate lip contour. With an automatic lip-finding function the method applies frame by frame to continuous images: a sequence of RGB images is segmented, lip feature parameters are extracted, and a Hidden Markov Model is used for recognition. (A minimal sketch of FCM clustering follows below.)
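For concreteness, the FCM objective described above can be minimized with the standard alternating update of memberships and centers. The following is a minimal Python sketch under assumptions not taken from the patent (two clusters, fuzzifier m = 2, Euclidean distance, synthetic color samples); it illustrates the plain clustering technique, not the FCMS variant with the shape function.

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, tol=1e-5):
    """Minimal FCM on X of shape (n_points, n_features)."""
    n = X.shape[0]
    U = np.random.dirichlet(np.ones(c), size=n)          # memberships, rows sum to 1
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)     # u_ik = 1 / sum_j (d_ik/d_ij)^(2/(m-1))
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

# Toy usage: separate "lip-like" from "skin-like" color samples.
rng = np.random.default_rng(0)
pixels = np.vstack([rng.normal([200, 80, 90], 10, (100, 3)),     # reddish cluster
                    rng.normal([190, 150, 130], 10, (100, 3))])  # skin-toned cluster
centers, U = fuzzy_c_means(pixels, c=2)
labels = U.argmax(axis=1)                                        # hard labels from fuzzy memberships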

Extracting the lip contour with fuzzy C-means with shape function (FCMS) can already detect the lip region correctly, but because the contour is delineated through fairly complex mathematical operations, it tends to cost considerable time.

On speech scoring, related domestic research includes:

(1) Using three acoustic feature parameters (the magnitude curve, the pitch contour, and the Mel-frequency cepstral coefficients) and scoring with two methods, dynamic time warping (DTW) and the Hidden Markov Model (HMM). The experiments showed that the Mel-cepstral coefficients carried the greatest weight, followed by the pitch contour, and finally the magnitude curve.

(2) "English speech-signal segmentation" provides a method of cutting a speech signal into phoneme-level time segments, using two pre-trained English acoustic pronunciation models as the comparison standard. Speech recognition techniques then supply a suitable acoustic model for each native language to cut out the correct pronunciation segments. Scoring compares the similarity between the standard speech and the speech being scored, using four feature parameters: magnitude, pitch contour, voicing and its gradual change, and the log-probability difference.

Domestic research on pronunciation training includes teaching content built around phonetics and prosodic rhythm, where the learner's main exercises are oral practice and recognizing regular pronunciation patterns.

Abroad, in bimodal audio-visual systems that fuse sound and image for speech recognition, lip-reading recognition has been receiving attention. The feature points of lip images are generally extracted in two ways: contour-based and image-based. Contour-based approaches obtain feature values from edge information, deformable templates, or active contours. Their greatest advantage is that the feature values remain invariant under spatial translation, rotation, scaling, and different illumination. However, contour-based methods lose much useful information, such as the appearance of the teeth and tongue.

Current automatic speech recognisers (ASR) can separate the speaker's speech at the front end, but because automatic speech recognition is easily disturbed by noise from outside the environment, the recognition rate drops.

[Summary of the Invention]

This invention proposes an innovative concept: presenting continuous lip-shape images in English teaching and scoring through the comparison of lip shape and sound, giving language learners a visual learning interface.

The invention uses visualization to present the changes of the learner's lip shape during pronunciation for comparison and scoring against the teacher. The design weighs three perspectives: the learner, the teacher, and the system.

(1) Learner: the simpler the better; with a few clicks the learner can check whether the pronunciation is correct. It also helps hearing-impaired people: during learning, the graphical presentation effectively points out the parts whose pronunciation needs improvement, adding an extra layer of assistance beneath live teaching.

(2) Teacher: the system records the teacher's pronunciation as original lip-image files and sound files, so the teacher can freely extend the set of English pronunciation practice sentences; recording a new lip-image source file is all that is needed to give learners new sentences to practice as the scoring baseline.

(3) System: the system captures the learner's continuous lip images and sound file and scores them against the teacher's pre-recorded voice and images, presenting the evaluated result visually and showing the positions that differ most from the teacher's pronunciation, so the learner can quickly see which parts need reinforcement.

Because e-learning is now used by many cram schools, schools, and similar institutions, its effective use can cut considerable personnel cost. For learners especially, providing sound together with visualization, namely continuous lip images that show where the pronunciation went wrong and recordings of the learner's English sentences compared against the standard pronunciation, corrects the learner's pronunciation practice more effectively. The visual approach not only marks the wrong positions simply but also gives hearing-impaired people a more convenient aid while learning English.

The features of the interactive English teaching proposed by this invention are as follows:

1. Pronunciation-correction capability: while the learner practices pronunciation, the system judges whether the lip shape matches the teacher's, so the learner knows clearly which syllables need more practice.

2. Helping hearing-impaired people learn a language: the greatest obstacle for hearing-impaired learners is that they cannot clearly hear the sounds from which to learn. Lip recognition helps them learn step by step by observing the continuous lip images, letting them know whether their lip shape matches the teacher's; people with hearing difficulties can learn through the lip shape and understand their accuracy through the scoring mechanism.

3. An innovative scoring mechanism, through which users learn roughly in which interval their score falls.
4. Combining sound and image to evaluate pronunciation accuracy: speech-recognition technology is by now fairly mature, but interference from surrounding sounds degrades its recognition ability badly. Using continuous lip images to assist the evaluation not only gives a visual presentation but also compensates for the loss when the sound is disturbed externally; combining sound and image raises the correctness of the scoring.

Most existing pronunciation-teaching software lets learners practice by repeatedly recording and imitating, yet learners can hardly tell whether their own pronunciation is correct. This invention therefore compares the size and contour of the human lips while speaking to determine the correctness of pronunciation, down to the syllable. The invention provides roughly the following functions. First, the teacher records each English sentence, lip images and sound, through the recording function, and key lip frames are extracted to find the important key lip images for later comparison with the learner. Second, the learner only needs to aim a WebCam at the lips and record; the system captures the recorded frames as images, the lip region is then selected, and comparison with the teacher's lips yields the correctness of the pronunciation and a score. Third, connected component analysis automatically detects the lip contour and effectively finds the lip feature range values. Fourth, the English sentence set is updated and extended by the teacher, and the continuous lip images are recorded with their feature parameters obtained. Fifth, for evaluation the invention combines visual and audio resources, which raises the scoring accuracy.

The system used by the invention comprises the following layers: the learner, the English teacher, the sentence database, and the scoring mechanism. The learner's pronunciation, continuous lip images together with voice, is captured through a WebCam, and the teacher's lips and voice are read from the sentence database. For the teacher, an English-teaching recording function lets the teacher record continuous lip images and sound into the database, sentence by sentence, re-recording as needed. The database mainly stores the English teaching sentences recorded so far together with the lip data, for later search and retrieval. The scoring mechanism mainly evaluates the accuracy and effect of the learner's pronunciation, using threshold values to decide into which score band a result falls; the invention evaluates sound and image against each other through the scoring mechanism.

[Embodiments]

To disclose the technical content, objectives, and achieved effects of the invention more completely and clearly, they are described in detail below; please refer to the disclosed figures and reference numerals.

The invention is a pronunciation evaluation method applying voice and image to e-learning, which visually presents the lip-shape changes detected during the learner's pronunciation for comparison and scoring against the teacher, marks the positions where the pronunciation may be wrong, and records the learner's voice for subsequent scoring of both sound and image. The process steps of the invention include the following (see the first figure):

(a) Add sentence (S1)

The teacher simply enters through this interface the English sentence to be added and the location where it is to be stored; the paths and sentences are then written into the database for convenient later reading (as shown in the second figure).

(b) Select sentence (S2)

The learner selects the sentence to practice (see the third figure) or finds it through the window's search function. Under the "all sentences" option the system automatically lists every previously stored sentence, sorted alphabetically.

(c) WebCam capture (S3)

The learner or teacher aims the lips at the WebCam and presses the capture button; the system automatically captures the frames of the video as individual lip images (see the fourth figure).

(d) Automatic lip-region selection (S4)

The invention obtains lip information through automatic lip detection and applies it to spoken-pronunciation teaching. The YCbCr color space is used first to segment out the facial skin, after which automatic lip detection is performed within the skin region through the RGB color space.
The fifth figure shows the flow of automatic lip detection. First, face-range detection is performed: to locate the lip block accurately and speed up processing, the face portion is first cut out through the YCbCr color space, and RGB and HSV color segmentation within the face range then determines the possible lip blocks. To obtain the lip block cleanly, morphology operations and a median filter handle the noise after the lip color segmentation; the processed binarized image is then labeled so that connected regions receive the same number, after which the largest block is found, and that block is the lip block.

(1) Face-range detection (S41)

To detect the lip-block range effectively, the invention first pre-processes the face, segmenting the skin color through YCbCr: after the YCbCr conversion, pixels with Cb between [77, 130] and Cr between [130, 173] are taken, producing the image shown in the sixth figure. From the segmented skin region the four boundaries of the face, upper (11), lower (12), left (13), and right (14), are found. Because the lip images captured through the WebCam cover the range below the nose, the leftmost boundary (13) and the rightmost boundary (14) must be inset: one eighth is trimmed from each side to serve as the critical lines of the leftmost and rightmost boundaries, and if the search range exceeds these critical edges, the one-eighth inset is taken as the boundary. Once the face-range boundaries (11), (12), (13), (14) are obtained, the required lip-range information can be searched within this range. A sketch of this skin segmentation follows below.
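The following is a minimal Python sketch of the S41 skin cut and boundary inset described above, using the stated Cb/Cr ranges; the BT.601 (JPEG) conversion formulas and the array layout are assumptions, not details given in the patent.

import numpy as np

def skin_mask_ycbcr(rgb):
    """Binary skin mask from the Cb/Cr ranges of step S41; rgb is (H, W, 3), 0-255."""
    r, g, b = [rgb[..., i].astype(np.float64) for i in range(3)]
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b       # assumed BT.601 YCbCr
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return (cb >= 77) & (cb <= 130) & (cr >= 130) & (cr <= 173)

def face_bounds_with_inset(mask):
    """Top/bottom/left/right of the skin region, with the left and right
    borders each inset by one eighth of the region width (per S41)."""
    ys, xs = np.nonzero(mask)
    top, bottom = ys.min(), ys.max()
    left, right = xs.min(), xs.max()
    inset = (right - left) // 8
    return top, bottom, left + inset, right - inset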

(2) Lip color segmentation (S42)

In this invention, automatic lip detection is performed so that the lip feature information can be obtained more accurately during scoring. For most people the dominant lip color is red, so lip colors altered by lipstick are not discussed here; this color information suffices to extract the rough lip-range block. For the color segmentation of the lip block, the invention cuts out the red block range through the color information of the R and G values of the RGB color system, and also cuts with the Cb and Cr of the YCbCr color space.

1. Segmentation through the RGB color system (see the seventh figure): since lip color tends toward red, the image can be segmented through the R and G values of the RGB color system and the relation between red and green, as follows:

r(x, y) = R(x, y) / G(x, y)   (1)

B(x, y) = 1, if L_lim < r(x, y) < U_lim; B(x, y) = 0, otherwise   (2)

Here R is the red value of the RGB color system, G the green value, L_lim the minimum threshold, and U_lim the maximum threshold. If the ratio between the R value and the G value lies within the interval, the pixel is set to 1, otherwise to 0; converting to a binarized image in this way reveals the lip block, whose extent yields the center coordinates of the lips and their width and height. In the tests so far, a range between 1.1 and 2.8 separates the lip block well.

2. Segmentation through the HSV color space: in the HSV color system, H denotes the hue, S the saturation, and V the value (brightness). HSV can be quantized into eight colors: red, yellow, green, cyan, blue, magenta, white, and black, where the hue ranges over [0, 359] and the saturation and brightness over [0.0, 1.0], so colors can be cut out cleanly through the HSV color system. To segment the lip color block, the original lip image is first converted to HSV; H is required to be greater than 300 or less than 60, S to lie in [0.16, 0.85], and V in [0.21, 0.85]. Each pixel (P) is checked against these ranges; if all values fall inside, the pixel is set to 1, otherwise to 0. The eighth figure shows the result found through HSV.

The lip blocks cut by the two different color systems each have their pros and cons; the RGB result is easily affected by the brightness of the light, so the invention combines the two color systems for lip-block extraction to raise accuracy (a sketch combining the two cuts follows below).
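A minimal Python sketch of the two color cuts and their combination is given below. The R/G band of 1.1 to 2.8 and the HSV ranges come from the text, while the intersection of the two masks and the helper names are assumptions: the patent says the two systems are combined but does not spell out the operator.

import numpy as np
import colorsys

L_LIM, U_LIM = 1.1, 2.8                      # R/G ratio band reported in the text

def lip_mask_rgb(rgb):
    """Equations (1)-(2): a pixel is a lip candidate when L_lim < R/G < U_lim."""
    r = rgb[..., 0].astype(np.float64)
    g = rgb[..., 1].astype(np.float64) + 1e-6    # avoid division by zero
    ratio = r / g
    return (ratio > L_LIM) & (ratio < U_LIM)

def lip_mask_hsv(rgb):
    """HSV thresholds from the text: H > 300 or H < 60, S in [0.16, 0.85], V in [0.21, 0.85]."""
    x = rgb.astype(np.float64) / 255.0
    h = np.empty(x.shape[:2]); s = np.empty(x.shape[:2]); v = np.empty(x.shape[:2])
    for i in range(x.shape[0]):                  # simple per-pixel loop for clarity
        for j in range(x.shape[1]):
            h[i, j], s[i, j], v[i, j] = colorsys.rgb_to_hsv(*x[i, j])
    h *= 360.0
    return ((h > 300) | (h < 60)) & (s >= 0.16) & (s <= 0.85) & (v >= 0.21) & (v <= 0.85)

def lip_mask(rgb):
    # Intersection is one plausible way to "combine" the two cuts so the
    # light-sensitive RGB result is tempered by the HSV result.
    return lip_mask_rgb(rgb) & lip_mask_hsv(rgb)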
(3) Noise removal (S43)

Because lip-image capture is constrained by lighting and by the WebCam hardware itself, the binary image produced by the lip color segmentation contains much noise, and because other colors may fall into the same value ranges, more than one lip-like block may be cut out. The invention therefore removes the unwanted block ranges through morphology operations and a median filter (a sketch follows below).

1. Morphology operations: these analyze a binarized image to obtain the images formed by different shapes. The common basic operators include erosion, dilation, opening, and closing. Applying opening and closing to the lip image segmented through the RGB color system removes stray specks and fills small gaps, making the lip contour more distinct.

2. Median filter: after the morphological opening and closing, the remaining noise is removed through a 3x3 median filter. If more than half of the pixels (P) in the 3x3 matrix are white, all nine pixels (P) of the mask are changed to white; this removes scattered noise and reduces the error rate of the lip-block extraction.

The ninth figure shows the result of noise removal through the median filter after the morphological opening and closing.
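Below is a minimal Python sketch of step S43 using SciPy's morphology and median-filter routines; the 3x3 structuring element for the opening and closing is an assumption, while the 3x3 median window follows the text.

import numpy as np
from scipy import ndimage

def clean_mask(mask):
    """Opening then closing, followed by a 3x3 median filter (S43)."""
    opened = ndimage.binary_opening(mask, structure=np.ones((3, 3)))    # drop stray specks
    closed = ndimage.binary_closing(opened, structure=np.ones((3, 3)))  # fill small gaps
    # 3x3 median: a pixel becomes white when more than half of its 3x3
    # neighbourhood is white, as described in the text.
    return ndimage.median_filter(closed.astype(np.uint8), size=3).astype(bool)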

(4) Lip-block extraction (S44)

Through the lip color segmentation and noise removal above, the blocks that may form the lip contour can be delimited. Because the continuous lip images are captured while speaking, the WebCam capture is restricted to the region near the lips: the lip-variation information must be obtained clearly for the score estimation, and if the WebCam captured the upper body it would be relatively hard to distinguish the size of the lip movements, so the invention captures only the region near the lips for scoring.

Connected component analysis is a method used to distinguish object boundaries and region composition in an image: the shapes of the different blocks are given numbers that identify which blocks make up the image. The invention performs this labeling per pixel (P), as follows. The neighbourhood of each pixel (P) can be taken as four neighbours or eight neighbours: with four neighbours, the four points 1, 2, 3, 4 adjoin the pixel; with eight neighbours, the eight pixels 1 through 8 adjoin it, as in parts (a) and (b) of the tenth figure.

This embodiment uses four neighbours: if a pixel's (P) neighbours all belong to the same block, the binary image that has passed through the median filter is judged point by point from the top-left to the bottom-right, recursively, with the search order up, left, down, right. When the recursion ends, one block has been fully labeled and the search moves to the next pixel (P); pixels already labeled are skipped, and so on until every pixel (P) has been judged, at which point all blocks have been labeled. The eleventh figure shows the original image and the twelfth figure the labeled image.

After every block has been numbered by the labeling algorithm, and since the segmentation through the RGB color system is based mainly on the lip color, the invention searches for, and takes as the lip block, the label whose block contains the most pixels. The search starts from the top-left point of the image; when the first point is found (i.e., a pixel (P) whose value is the label of the largest block), it represents the topmost position of the block and its coordinates are recorded, and the search then continues downward for the lowest point of the lip contour. While finding the upper and lower boundaries, the left and right boundaries of the lip contour are found at the same time. The algorithm is as follows:

For i = 0 to image height
    For j = 0 to image width
        If pixel(i, j) = label of the largest block Then
            top    = (i < top)    ? i : top
            bottom = (i > bottom) ? i : bottom
            left   = (j < left)   ? j : left
            right  = (j > right)  ? j : right
        End If
    Next j
Next i

Here top, bottom, left, and right denote the uppermost boundary (11), the lowermost boundary (12), the leftmost boundary (13), and the rightmost boundary (14) of the lip contour; the thirteenth figure illustrates the contour search. Once the uppermost (11), lowermost (12), leftmost (13), and rightmost (14) boundaries of the lip contour are obtained, the center position (O) of the lips and their width (W) and height (H) can be derived (as shown in the fourteenth figure), and these serve as simple lip information. A sketch of this labeling and boundary search follows below.
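A minimal Python sketch of the four-neighbour labeling and the boundary search is given below. It replaces the recursion described in the text with an equivalent explicit stack to avoid recursion-depth limits; the function names are illustrative.

import numpy as np

def label_4neighbour(mask):
    """Iterative 4-neighbour connected-component labelling (the text
    describes the same labelling done recursively, order up/left/down/right)."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for si, sj in zip(*np.nonzero(mask)):
        if labels[si, sj]:
            continue                       # already labelled: skip
        current += 1
        labels[si, sj] = current
        stack = [(si, sj)]
        while stack:
            i, j = stack.pop()
            for ni, nj in ((i - 1, j), (i, j - 1), (i + 1, j), (i, j + 1)):
                if (0 <= ni < mask.shape[0] and 0 <= nj < mask.shape[1]
                        and mask[ni, nj] and not labels[ni, nj]):
                    labels[ni, nj] = current
                    stack.append((ni, nj))
    return labels, current

def largest_block_bounds(labels, n_labels):
    """Bounding box (top, bottom, left, right) of the largest block,
    mirroring the For/Next boundary search above."""
    counts = np.bincount(labels.ravel())[1:n_labels + 1]
    biggest = counts.argmax() + 1
    ys, xs = np.nonzero(labels == biggest)
    return ys.min(), ys.max(), xs.min(), xs.max()

The center (O), width (W = right - left), and height (H = bottom - top) follow directly from the returned bounding box.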

(e) Scoring mechanism (S5)

The scoring mechanism of the invention works through two schemes: visual lip scoring, and lips combined with sound. In the lip scoring scheme, the shapes of the lip movements during human pronunciation are pattern-matched against the standard pronunciation lips to evaluate the differences between them. In the combined lip-and-sound scheme: sound recognition is by now fairly mature, but when the environment contains external sound interference the accuracy of its data drops sharply, so the lips and the sound assist each other, allowing correct scoring even under external sound interference.

1. Visual lip scoring: here the invention uses the dynamic time warping (DTW) algorithm together with pattern matching. Through the similarity or difference between the standard-pronunciation teacher's lip images and the learner's lip images, redundant images are filtered out for deletion or the important key frames are kept for computing the score; the higher the similarity, the higher the resulting score.

Dynamic time warping (DTW) is commonly used to solve speech-recognition problems: the same sentence takes a different length of time on each utterance, so to compare the similarity of the two, a nonlinear correspondence with minimal error is found between the sound of the test data and the sound of the reference data (as in the fifteenth figure). In the sixteenth figure, DTW reveals that the pronunciation durations of A and B may differ, so DTW puts the two pronunciations into correspondence: for the same sentence, two image sequences of different lengths (the teacher's continuous lip images and the learner's continuous lip images) are matched correctly (as in the seventeenth figure).

Suppose the teacher's lip video has m images and the learner has n images (see the eighteenth figure); let the teacher's images be t(1), t(2), ..., t(m) and the learner's images be s(1), s(2), ..., s(n). The aim of DTW is to find an optimal correspondence path on the plane formed by m and n, starting at (1, 1) and ending at (m, n). Let d(i, j) be the distance between t(i) and s(j); the optimal path is the one with the minimal cumulative distance D(i, j) from (1, 1) to (m, n). Writing each image as its pixel set,

t(i) = { t(i)(1,1), t(i)(1,2), ..., t(i)(q,p) }   (1)

s(j) = { s(j)(1,1), s(j)(1,2), ..., s(j)(q,p) }   (2)

size of t(i) = size of s(j) = q x p   (3)

the frame distance is the sum of the pixel-wise differences, d(i, j) = sum over u = 1..q, v = 1..p of | t(i)(u, v) - s(j)(u, v) |. After computing the starting value D(1, 1) = d(1, 1), the cumulative distances D(1, 2), D(1, 3), ... are found in order through the following formula:

D(i, j) = min { D(i, j-1) + d(i, j),  D(i-1, j-1) + 2 * d(i, j),  D(i-1, j) + d(i, j) }   (4)

After the cumulative distances are found, stepping back in order to the minimal cumulative distance of the previous step yields the optimal path. Suppose the optimal path is c(1), c(2), ..., c(p) with c(k) = (t(k), s(k)); then the traceback procedure for the optimal path is:

for m = k - 1 to 1
    i = c(m + 1)
    j = d(m + 1)
next

where m is the number of steps traced back, c(m) records the i position of step m, and d(m) records the j position of step m. At each step back, the procedure looks for whichever of the three directions D(i, j-1), D(i-1, j-1), D(i-1, j) has the smallest cumulative distance and takes it as the traceback path. In addition, some basic assumptions speed up the DTW computation:

I. Boundary condition: c(1) = (1, 1), c(p) = (m, n)   (5)
II. Monotonicity: t(k-1) <= t(k)   (6)
III. Continuity: s(k-1) <= s(k), with s(k) - s(k-1) <= 1   (7)
IV. Window constraint: | t(k) - s(k) | <= w, where w is the window size   (8)
V. Slope constraint: after moving x steps in the t direction, at least y steps must be taken in the s direction.

The sixteenth figure therefore shows the teacher's and learner's continuous images after DTW processing; because DTW was used, similar images are removed, which benefits the subsequent contour-ratio scoring. A sketch of the recurrence and traceback follows below.
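Below is a minimal Python sketch of the recurrence of equation (4) with traceback; treating each frame as a feature vector and using an absolute-difference frame distance are assumptions consistent with equations (1) to (3).

import numpy as np

def dtw(t_feats, s_feats, w=None):
    """DTW with the recurrence of equation (4); t_feats/s_feats are
    per-frame feature vectors (e.g. flattened lip images or W/H ratios)."""
    m, n = len(t_feats), len(s_feats)
    D = np.full((m + 1, n + 1), np.inf)
    D[1, 1] = np.abs(np.asarray(t_feats[0]) - np.asarray(s_feats[0])).sum()
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if i == 1 and j == 1:
                continue
            if w is not None and abs(i - j) > w:       # window constraint (8)
                continue
            d = np.abs(np.asarray(t_feats[i - 1]) - np.asarray(s_feats[j - 1])).sum()
            D[i, j] = min(D[i, j - 1] + d,             # equation (4)
                          D[i - 1, j - 1] + 2 * d,
                          D[i - 1, j] + d)
    # Traceback from (m, n) to (1, 1) along the minimal cumulative distance.
    path, i, j = [(m, n)], m, n
    while (i, j) != (1, 1):
        i, j = min([(i, j - 1), (i - 1, j - 1), (i - 1, j)], key=lambda p: D[p])
        path.append((i, j))
    return D[m, n], path[::-1]

The returned path pairs teacher frames with learner frames even when the two recordings have different lengths, which is exactly what the contour-ratio scoring below needs.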

The scoring scheme based on the contour ratio has several advantages. Because it computes only over the contour of a shape, the overall computation is smaller than scanning the entire image, so the error is correspondingly reduced and the accuracy of the pattern comparison improves. Moreover, because a ratio of the contour is used as the scoring measure, normalization is no longer a concern. For these considerations, the invention adopts the contour-ratio mechanism as the scoring standard.

The images after the spatial-temporal (time-space) correspondence are RGB full color (24 bits), with the R, G, B values ranging from 0 to 255. The contour ratios of the teacher's and the learner's images are differenced pairwise: for each pair of images, the difference between the teacher's and the learner's contour ratios (equation (9)) is summed and averaged (equation (10)), as follows:

Rate = W / H   (9)

E = (1/K) * sum over i = 1..K of | r_i - s_i |   (10)

Rate is how the width-to-height ratio of a teacher's or a learner's image is computed; r_i is the contour ratio of the teacher's i-th image, s_i the contour ratio of the learner's i-th image, K the total number of images, and E the averaged sum of the pairwise differences. The conversion to a score between 0 and 100 is currently done as follows:

MaxE = max(E_i), i = 1, ..., n   (11)

Score = 100 - 100 * (E / MaxE)   (12)

MaxE is the largest error value; 100 * (E / MaxE) gives the error score, and subtracting the error score from 100 yields the actual score. A sketch of this computation follows below.
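A minimal Python sketch of equations (9) to (12) follows. It assumes the two lip-image sequences have already been aligned to a common length K (for example through the DTW step above), and the toy width/height values and max_e are illustrative, not from the patent.

import numpy as np

def contour_ratio_score(teacher_wh, learner_wh, max_e):
    """Equations (9)-(12): per-frame W/H ratios are differenced pairwise,
    averaged into E, and mapped to a 0-100 score."""
    r = np.array([w / h for w, h in teacher_wh])   # Rate = W / H, eq. (9)
    s = np.array([w / h for w, h in learner_wh])
    K = len(r)
    E = np.abs(r - s).sum() / K                    # eq. (10)
    return 100 - 100 * (E / max_e)                 # eqs. (11)-(12)

# Toy usage with three aligned frames:
score = contour_ratio_score([(60, 30), (64, 28), (58, 32)],
                            [(59, 31), (66, 27), (60, 30)], max_e=0.5)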

2. Combined visual lip and sound scoring: because sound recognition is currently vulnerable to interference from external sounds, which makes a large difference to the recognition process, the lip contour and related cues serve as auxiliary scoring parameters, and comparing the pronunciation corpus with the standard corpus strengthens the accuracy of the scoring.

Through speech recognition technology and the speech scoring mechanism, mutually assisted by the visual lip scoring, the invention verifies that combining sound with image scores more accurately than using sound or lips alone. In speech-signal processing, to extract the speech features the spoken analog signal is first fed in through a speech input device and converted by the computer into a digital signal; the signal then undergoes speech pre-processing, after which the feature parameters are extracted to serve as the comparison parameters for speech scoring. The flow and methods of the speech processing are introduced below (see the nineteenth figure).

(1) Digital signal conversion and sampling

First the speech undergoes digital signal conversion (S51): the sound produced by a human is an analog signal, which the computer's speech input device converts into a digital signal. Digital signal sampling (S52) follows, and the samples are framed (S53): the sound is sampled at 16 kHz, the frame size is 512 points (about 32 ms), and adjacent frames overlap by 170 points, about one third of a frame.

(2) Endpoint detection (S54)

When a segment of speech is recorded there are usually several stretches of silence, most of them unneeded (as in the twentieth figure). To obtain the required speech data more effectively, endpoint detection removes the silent portions at the head and tail of the speech material. The invention judges silence through short-term energy and the zero-crossing rate, with a threshold.

Short-term energy: within a stretch of sound, the sound energy emitted differs from moment to moment, and the energy of the silent portions is lower than that of the voiced portions, so a single threshold suffices: when the volume exceeds it, the segment is voiced; when it does not, the segment may be silence. In this way one knows from which frame the required speech data begins.
The energy measure sums the magnitudes of all samples within a short time and takes the average:

E_k = (1/N) * sum over n = k0..k0+N-1 of | S(n) |   (13)

where E_k is the average energy of the k-th frame, N the number of points sampled in a frame, S(n) the n-th point in the k-th frame, and k0 the starting position of the frame.

Zero-crossing rate: the zero-crossing rate is the number of times the sound signal vibrates through the zero point, expressed mathematically as:

Z_k = (1/2) * sum over n = k0+1..k0+N-1 of | sgn(s(n)) - sgn(s(n-1)) |   (14)

sgn(s(n)) = 1 if s(n) > 0, and -1 otherwise   (15)

In these formulas Z_k denotes the zero-crossing rate of the k-th frame, and sgn(s(n)) is 1 when the n-th point is greater than zero and -1 otherwise. The zero-crossing rate of noise is greater than that of unvoiced (breath) sounds, and the zero-crossing rate of unvoiced sounds is in turn greater than that of voiced sounds; detecting unvoiced sounds therefore relies on the zero-crossing rate, judged against a threshold. Through energy detection together with the zero-crossing rate, both unvoiced consonants and voiced sounds can be detected. A sketch of this endpoint detection follows below.
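Below is a minimal Python sketch of the framing and endpoint detection of steps S52 to S54, using the stated 16 kHz rate, 512-point frames, and 170-point overlap; the energy and zero-crossing thresholds are tuning parameters, not values given in the patent.

import numpy as np

FS, FRAME, OVERLAP = 16000, 512, 170      # values given in steps S52-S53
HOP = FRAME - OVERLAP

def frames(signal):
    """Slice the sampled signal into overlapping frames (S53)."""
    return [signal[i:i + FRAME] for i in range(0, len(signal) - FRAME + 1, HOP)]

def short_time_energy(frame):
    return np.abs(frame).mean()                       # eq. (13)

def zero_crossing_rate(frame):
    sgn = np.where(frame > 0, 1, -1)                  # eq. (15)
    return 0.5 * np.abs(np.diff(sgn)).sum()           # eq. (14)

def endpoint_trim(signal, e_thresh, z_thresh):
    """Keep frames judged non-silent by energy or zero-crossing rate (S54);
    both thresholds are assumed tuning parameters."""
    keep = [f for f in frames(signal)
            if short_time_energy(f) > e_thresh or zero_crossing_rate(f) > z_thresh]
    return np.concatenate(keep) if keep else np.array([])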

(3) Feature parameter extraction (S55)

Each person's speech has its own particular character, varying with gender, age, region, and other factors; even the same person produces differing speech signals under different psychological or physiological states. Comparing raw speech waveforms directly therefore not only demands heavy data processing but also yields a very limited recognition rate. Hence, in speech-signal processing, suitable speech-signal features must first be derived and assembled into a so-called feature vector; after processing, the original speech signal is replaced by this feature vector as the basis for system recognition. This step is speech-signal feature extraction. In this step, the linear predictive coding (LPC) algorithm and cepstral parameters can be used to obtain the feature parameters.

Linear predictive coding (LPC): LPC is among the most representative speech-parameter analyses and is also efficient to compute. Its theory assumes that a speech sample can be predicted by a linear combination of the preceding samples; the predictor that minimizes the error between the actual and predicted samples is optimal, and its coefficients are called the LPC coefficients:

s~(n) = sum over k = 1..M of a_k * s(n - k)

A sketch of estimating these coefficients follows below.
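A minimal Python sketch of LPC coefficient estimation follows, using the autocorrelation method with the Levinson-Durbin recursion; the recursion itself and the order M = 10 are standard choices, not details given in the patent.

import numpy as np

def lpc_coefficients(frame, order=10):
    """LPC coefficients a_1..a_M minimizing the prediction error of
    s~(n) = sum_k a_k * s(n-k), via autocorrelation + Levinson-Durbin."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12                       # guard against an all-zero frame
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                       # reflection coefficient
        a_new = a.copy()
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        err *= (1 - k * k)
    return -a[1:]    # a_k such that s~(n) = sum_k a_k * s(n-k)

# Usage: one 512-sample frame yields a 10-dimensional feature vector.
frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)
feats = lpc_coefficients(frame)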
Cepstral features: the cepstral coefficients describe the average spectral coefficients of the speech signal's energy in each frequency band. Cepstral features processed on the scale of the human auditory system seem closer to human hearing, and the results of current research show that such features usually yield higher speech recognition rates, so cepstral features have become the most commonly used speech features.

(4) Pattern-matching scoring (S56)

Pattern matching compares the scored speech and the standard speech data phoneme by phoneme to find the degree of difference between them, and the scored speech is graded accordingly. During pattern matching, the parameters are adjusted for the magnitude curve, the pitch contour, and the Mel cepstrum; after adjustment, the minimal dynamic time warping (DTW) average error of the feature is computed, and the higher the similarity between the scored speech and the standard speech, the higher the resulting score.

In addition, the acoustic model used in speech recognition is trained on the basis of the Hidden Markov Model (HMM). Previous research has found that the HMM is essentially a doubly stochastic process. It is called hidden because one of its random processes is hidden and invisible; in speech, likewise, the changing states of the human articulatory organs during phonation are invisible: changes of the throat, tongue, and mouth cannot be seen from the observable speech-signal sequence. The other random process is the observation sequence, described by the state observation probability, which gives the probability distribution of the various speech feature parameters observed in each state. The characteristics of the HMM suit the description of speech: each state can be viewed as the vocal tract being in a particular articulatory configuration, while the state observation probability describes the likelihood of hearing the various sounds in a given phonation state.

From the above description of the embodiments, the invention has the advantages listed below:

1. The invention automatically judges, by computer, the correctness of the lip shape during pronunciation and corrects the wrongly shaped positions, making the invention an effective way of scoring English.

2. Because the invention teaches through interactive pronunciation correction, learners can use its interactive teaching system to practice pronunciation on their own at home and, through the continuous lip images, reinforce the syllables that need improvement so as to correct the lip shape of their pronunciation.

3. The invention can raise learners' interest through its interface; the system's automatic scoring lets learners know the accuracy of their lip shape, effectively achieving the goal of pronunciation correction.
Disabled: Review = Learn the correct language pronunciation. h ° faster 5. The present invention uses the correlation technique of lip alignment to evaluate the correctness of the pronunciation using the change of the pronunciation time profile, and makes the lip scoring mechanism = accurate estimation of the lip contour change when the driver is pronounced Increased. 6. The invention can check whether the pronunciation of the learner is correct through the interactive teaching, and can input the lip shape of the positive pronunciation, so that the learner can recognize that he is learning the language ^ It is lacking, so it can help the learner's effect in learning, and then the difficulty of learning. In summary, the embodiment of the present invention can achieve the expected use efficiency' and the specific structure disclosed therein, not only has not been seen in the same kind: in, and has not been disclosed before the application, Cheng has fully complied with the special According to the requirements of the Regulations 2, the application for invention patents is filed according to law, and the application for the patent is granted. 29 1294107 [Simple description of the diagram] The first diagram: the flow chart of the system of the present invention, the second diagram: the schematic diagram of the new English teaching sentence of the present invention, the third diagram: the learner's practice sentence selection diagram of the present invention Figure 4: Schematic diagram of the WebCam display of the present invention. FIG. 5 is a flow chart of the lip detection of the present invention. FIG. 6 is a schematic diagram showing the range of the detected lip block formed by the indentation of one-eighth of each of the left and right sides of the present invention. Figure 7: Lip image of the present invention after being cut by the RGB color system. FIG. 8 is a schematic view of the lip image of the present invention after being cut by the HSV color system. FIG. 9 is a result of the lip image of the present invention passing through the intermediate value filter. FIG. 11 is a schematic diagram of the four adjacent and eight neighboring blocks used in the numbering algorithm of the present invention. FIG. 11 is a view showing the original image of the present invention before the numbering algorithm. Invention of the image numbered after the numbering algorithm is not intended to be the thirteenth picture: the invention of the lip boundary detection diagram of the fourteenth: the capture of the present invention Lips Information Schematic Figure 15: Schematic diagram of continuous image of the instructor and learner after DWT processing of the present invention. Figure 16: Schematic diagram of the position corresponding to the present invention after passing through the DTW. Figure 17: After the invention is transmitted through the DTW Position corresponding to the mouth image 30 1294107 Schematic FIG. 18: Schematic diagram of the DTW matrix of the present invention FIG. 19: Flow chart of the speech processing of the present invention FIG. 10: Schematic diagram of the sound wave of the present invention after being detected by the endpoint [Main component symbol description]

<The present invention>
(S1) Add sentence
(S2) Select sentence
(S3) WebCam capture
(S4) Automated lip region selection
(S41) Face range detection
(S42) Lip color segmentation
(S43) Noise removal
(S44) Lip block extraction
(11) Boundary
(12) Boundary
(13) Boundary
(14) Boundary
(S5) Scoring mechanism
(S51) Digital signal conversion
(S52) Digital signal sampling
(S53) Framing
(S54) Endpoint detection
(S55) Feature parameter extraction
(S56) Pattern matching scoring
(O) Center position
(W) Width
(H) Height
(P) Pixel

Claims (1)

X. Claims:

1. A pronunciation assessment method applying voice and image to e-learning, the process steps of which comprise:
(1) adding a sentence: the instructor inputs the sentence to be added through an interface, a WebCam captures the instructor's lip images, the images on the screen are extracted as individual lip pictures, and the sentences together with the captured lip pictures are stored in a database;
(2) selecting a sentence: the learner selects from the database the sentence to be practiced;
(3) WebCam capture: a WebCam captures the learner's lip images, and the images on the screen are extracted as individual lip pictures;
(4) automated lip region selection: the skin color of the face is segmented through a color space conversion, noise is removed, and the image is processed numerically by an algorithm so as to define the face range as the search area for the lip boundary;
(5) scoring mechanism: the captured lip contour pictures and the pronunciation recordings are compared with the standard lip pictures and the standard corpus established in the database, and a score is given.

2. The pronunciation assessment method applying voice and image to e-learning as described in claim 1, wherein the system used with the method can list all the sentences stored in the database, so that the learner can conveniently select the sentence to be practiced.

3. The pronunciation assessment method applying voice and image to e-learning as described in claim 2, wherein the sentences stored in the database are classified in alphabetical order.

4. The pronunciation assessment method applying voice and image to e-learning as described in claim 1, wherein, in the automated lip region selection step, the color conversion space used for face range detection is the YCbCr system.

5. The pronunciation assessment method applying voice and image to e-learning as described in claim 4, wherein, after the face image has been converted to YCbCr, pixels whose Cb value lies in [77, 130] and whose Cr value lies in [130, 173] are taken, so as to segment the skin region.
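A minimal sketch of the skin segmentation of claims 4 and 5, assuming OpenCV for the color conversion (the claims do not name a library, and the function name is ours; note that OpenCV orders the channels Y, Cr, Cb):

    import numpy as np
    import cv2

    def skin_mask(bgr_image):
        """Binary skin mask from the Cb/Cr ranges given in claim 5."""
        ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
        cr = ycrcb[:, :, 1].astype(np.int32)
        cb = ycrcb[:, :, 2].astype(np.int32)
        mask = (cb >= 77) & (cb <= 130) & (cr >= 130) & (cr <= 173)
        return mask.astype(np.uint8)   # 1 = skin candidate, 0 = background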
6. The pronunciation assessment method applying voice and image to e-learning as described in claim 1, 4 or 5, wherein the top, bottom, left and right boundaries of the face are found from the segmented skin region, and the leftmost and rightmost boundaries are indented, that is, indented by one-eighth on each of the left and right sides, to serve as the critical lines of the leftmost and rightmost boundaries; if the search range exceeds these critical lines, the one-eighth indentation is taken as the boundary.

7. The pronunciation assessment method applying voice and image to e-learning as described in claim 5, wherein the color system used for lip color segmentation is the RGB color system, and the image of the face range is segmented using the R value, the G value and the relationship between red and green of the RGB color system, according to the relation

    pixel(i, j) = 1 if Llim ≤ R/G ≤ Ulim, and 0 otherwise

where R represents the red value of the RGB color system, G represents the green value, Llim is the minimum threshold and Ulim is the maximum upper limit of the threshold; if the ratio between the R value and the G value lies between Llim and Ulim the pixel is set to 1, otherwise to 0, forming a binarized image.

8. The pronunciation assessment method applying voice and image to e-learning as described in claim 7, wherein the processed binarized image is further passed through a labeling operation that gives the same label to connected blocks, after which the largest block is found and taken as the lip block.

9. The pronunciation assessment method applying voice and image to e-learning as described in claim 8, wherein the ratio of the lip block lies between 1.1 and 2.8.

10. The pronunciation assessment method applying voice and image to e-learning as described in claim 5, wherein the color system used for lip color segmentation is the HSV color system, in which H denotes the hue value (Hue), S the saturation (Saturation) and V the brightness (Value); when the original lip image is converted to HSV, range values for H, S and V are set, and when the value of a pixel lies within the set ranges the pixel is set to 1, otherwise to 0, forming a binarized image.

11. The pronunciation assessment method applying voice and image to e-learning as described in claim 10, wherein the set range of H is greater than 300 or less than 60 (the hue range wrapping through 0), the range of S lies in [0.16, 0.85], and the value of V lies in [0.21, 0.85].
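A minimal sketch of the binarizations of claims 7, 10 and 11: the R/G thresholds, written llim and ulim below, are left as parameters because the claims do not give their numeric values, and H is assumed to be in degrees in [0, 360) with S and V in [0, 1].

    import numpy as np

    def rgb_ratio_mask(rgb, llim, ulim):
        """Claim 7: pixel -> 1 when Llim <= R/G <= Ulim, else 0."""
        r = rgb[:, :, 0].astype(np.float64)
        g = rgb[:, :, 1].astype(np.float64) + 1e-9  # avoid division by zero
        ratio = r / g
        return ((ratio >= llim) & (ratio <= ulim)).astype(np.uint8)

    def hsv_mask(hsv):
        """Claims 10-11: H > 300 or H < 60, S in [0.16, 0.85],
        V in [0.21, 0.85]."""
        h, s, v = hsv[:, :, 0], hsv[:, :, 1], hsv[:, :, 2]
        hue_ok = (h > 300) | (h < 60)          # red hues wrap around 0 degrees
        sat_ok = (s >= 0.16) & (s <= 0.85)
        val_ok = (v >= 0.21) & (v <= 0.85)
        return (hue_ok & sat_ok & val_ok).astype(np.uint8)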
12. The pronunciation assessment method applying voice and image to e-learning as described in claim 10, wherein the HSV values can be quantized into eight colors, namely red, yellow, green, cyan, blue, magenta, white and black.

13. The pronunciation assessment method applying voice and image to e-learning as described in claim 5, wherein lip color segmentation combines the RGB color system and the HSV color system to extract the lip block; in the RGB color system, the image of the face range is segmented using the R value, the G value and the relationship between red and green, according to the relation pixel(i, j) = 1 if Llim ≤ R/G ≤ Ulim and 0 otherwise, where R represents the red value, G the green value, Llim the minimum threshold and Ulim the maximum upper limit of the threshold, so that a pixel is set to 1 when the ratio of the R value to the G value lies between Llim and Ulim and to 0 otherwise; in the HSV color system, H denotes the hue value (Hue), S the saturation (Saturation) and V the brightness (Value), and when the original lip image is converted to HSV, range values for H, S and V are set, a pixel being set to 1 when its value lies within the set ranges and to 0 otherwise, forming a binarized image.

14. The pronunciation assessment method applying voice and image to e-learning as described in claim 13, wherein the processed binarized image is passed through a labeling operation that gives the same label to connected blocks, after which the largest block is found and taken as the lip block.

15. The pronunciation assessment method applying voice and image to e-learning as described in claim 14, wherein the ratio of the lip block lies between 1.1 and 2.8.

16. The pronunciation assessment method applying voice and image to e-learning as described in claim 13, wherein the set range of H is greater than 300 or less than 60, the range of S lies in [0.16, 0.85], and the value of V lies in [0.21, 0.85].

17. The pronunciation assessment method applying voice and image to e-learning as described in claim 13, wherein … the numerical operation is carried out with the four neighboring pixels as one block, judging each point one by one by recursion; the neighboring pixels of the same block are visited recursively in the order up, left, down and right, and when the recursion ends, one block has been completely labeled and the search proceeds to the next block and pixel.
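A minimal sketch of the four-neighbor labeling described above, visiting neighbors in the order up, left, down, right as the claim states; the explicit stack in place of recursion is an implementation choice of ours, not something the claims specify:

    import numpy as np

    def label_blocks(binary):
        """Connected-component labeling with 4-neighbors."""
        h, w = binary.shape
        labels = np.zeros((h, w), dtype=np.int32)
        next_label = 0
        for i in range(h):
            for j in range(w):
                if binary[i, j] == 1 and labels[i, j] == 0:
                    next_label += 1
                    labels[i, j] = next_label
                    stack = [(i, j)]
                    while stack:
                        y, x = stack.pop()
                        # neighbor order: up, left, down, right
                        for dy, dx in ((-1, 0), (0, -1), (1, 0), (0, 1)):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and binary[ny, nx] == 1
                                    and labels[ny, nx] == 0):
                                labels[ny, nx] = next_label
                                stack.append((ny, nx))
        return labels, next_label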
23. The pronunciation assessment method applying voice and image to e-learning as described in claim 22, wherein, when searching the blocks and pixels, the search is carried out with the label that has the largest block count, which represents the lip block; the search starts from the top-left point of the image, and the first point found represents the topmost position of the block, whose coordinates are recorded; the search then continues downward for the lowest boundary of the lip contour, and when the top and bottom boundaries are found, the left and right boundaries of the lip contour are found at the same time, thereby obtaining the center position, width and height of the lip; the algorithm is:

    For i = 0 to image height
        For j = 0 to image width
            If pixel(i, j) = label with the largest block count Then
                top    = i < top    ? i : top
                bottom = i > bottom ? i : bottom
                left   = j < left   ? j : left
                right  = j > right  ? j : right
            End If
        Next j
    Next i

where top, bottom, left and right represent the uppermost, lowermost, leftmost and rightmost boundaries of the lip contour, respectively.

24. The pronunciation assessment method applying voice and image to e-learning as described in claim 1, wherein, in the scoring mechanism step, a dynamic time warping (DTW) algorithm is used together with pattern matching, so that, through the similarity or difference between the lip pictures of the standard-pronunciation instructor and the lip pictures of the learner, redundant pictures are filtered out or the important key frames are retained for the score computation; the higher the similarity, the higher the resulting score.
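The scan of claim 23 can be written compactly with NumPy; nonzero() replaces the explicit double loop but computes the same four boundaries, and the symbols in the comments refer to the component list (O), (W), (H):

    import numpy as np

    def lip_bounding_box(labels):
        """Take the label with the most pixels as the lip block and
        find its top, bottom, left and right boundaries."""
        counts = np.bincount(labels.ravel())
        counts[0] = 0                      # label 0 is the background
        lip_label = counts.argmax()
        ys, xs = np.nonzero(labels == lip_label)
        top, bottom = ys.min(), ys.max()
        left, right = xs.min(), xs.max()
        center = ((left + right) / 2.0, (top + bottom) / 2.0)   # (O)
        width = right - left + 1                                # (W)
        height = bottom - top + 1                               # (H)
        return center, width, height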
25. The pronunciation assessment method applying voice and image to e-learning as described in claim 24, wherein, through the dynamic time warping computation, a nonlinear correspondence with minimum error is obtained between the sound of the test data and the sound of the reference data; assuming that the lip video of the reference sound has m images and the lip video of the test sound has n images, let the lip video images of the reference sound be t(1), t(2), …, t(m) and the lip images of the test sound be s(1), s(2), …, s(n); dynamic time warping (DTW) finds an optimal correspondence path on the plane formed by m × n, the path starting at (1, 1) and ending at (m, n); let d(i, j) be the distance between t(i) and s(j), and let D(i, j) be the minimum cumulative distance from (1, 1) to (i, j); after the starting position D(1, 1) = d(1, 1) has been computed, the following recurrence is used to find D(1, 2), D(1, 3), … in order:

    D(i, j) = min{ D(i−1, j) + d(i, j), D(i−1, j−1) + 2·d(i, j), D(i, j−1) + d(i, j) }

after the cumulative distances have been found, tracing the minimum cumulative distance back in order yields the optimal path; assuming the optimal path is C(1), C(2), …, C(K) with C(k) = (t(i), s(j)), the traceback algorithm for the optimal path is:

    for m = K−1 to 1
        i = c(m+1)
        j = d(m+1)
        [c(m), d(m)] = argmin{ D(i−1, j), D(i−1, j−1), D(i, j−1) }
    next

where m represents the number of steps traced back, c(m) records the position of i at step m, and d(m) records the position of j at step m; at each step back, whichever of the three directions D(i−1, j), D(i−1, j−1) and D(i, j−1) has the smallest cumulative distance is taken as the traceback path.

26. The pronunciation assessment method applying voice and image to e-learning as described in claim 25, wherein the dynamic time warping computation can be accelerated by conditions such as:
boundary conditions: i(1) = 1, j(1) = 1 and i(K) = m, j(K) = n;
monotonicity conditions: i(k−1) ≤ i(k) and j(k−1) ≤ j(k);
continuity conditions: i(k) − i(k−1) ≤ 1 and j(k) − j(k−1) ≤ 1;
window constraint: |i(k) − j(k)| ≤ w, where w is the window size;
slope constraint: after moving x steps in the t direction, the path must move at least y steps in the s direction.
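A minimal sketch of the recurrence and traceback of claims 25 and 26; the starting value D(1, 1) = d(1, 1) and the window of claim 26 follow the claim text, while the Euclidean frame distance and the function names are assumptions of ours:

    import numpy as np

    def frame_dist(a, b):
        """Distance between two feature vectors (Euclidean, an assumption)."""
        return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

    def dtw(ref, test, w=None):
        """Cumulative-distance table D with the weighted diagonal step,
        optionally restricted to the window |i - j| <= w."""
        m, n = len(ref), len(test)
        D = np.full((m, n), np.inf)
        D[0, 0] = frame_dist(ref[0], test[0])   # starting position d(1,1)
        for i in range(m):
            for j in range(n):
                if (i == 0 and j == 0) or (w is not None and abs(i - j) > w):
                    continue
                d = frame_dist(ref[i], test[j])
                best = np.inf
                if i > 0:
                    best = min(best, D[i - 1, j] + d)
                if i > 0 and j > 0:
                    best = min(best, D[i - 1, j - 1] + 2 * d)
                if j > 0:
                    best = min(best, D[i, j - 1] + d)
                D[i, j] = best
        return D    # D[m-1, n-1] is the minimum cumulative distance

    def backtrack(D):
        """Trace the optimal path from (m, n) back to (1, 1) by always moving
        to the predecessor with the smallest cumulative distance."""
        i, j = D.shape[0] - 1, D.shape[1] - 1
        path = [(i, j)]
        while i > 0 or j > 0:
            candidates = []
            if i > 0:
                candidates.append((D[i - 1, j], i - 1, j))
            if i > 0 and j > 0:
                candidates.append((D[i - 1, j - 1], i - 1, j - 1))
            if j > 0:
                candidates.append((D[i, j - 1], i, j - 1))
            _, i, j = min(candidates)
            path.append((i, j))
        return path[::-1]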
27. The pronunciation assessment method applying voice and image to e-learning as described in claim 1, wherein the scoring mechanism step uses contour-ratio scoring of the pictures: the contour ratios of the reference pictures and of the test pictures are subtracted pairwise, and the differences between the contour ratio of each reference picture and that of the corresponding test picture are summed as

    E = (1/K) · Σ(i=1..K) | Ti − Xi |

where the ratio is computed as the width-to-height ratio of the reference and test pictures, Ti is the contour ratio of the i-th reference picture, Xi is the contour ratio of the i-th test picture, K is the total number of pictures, and E is the averaged sum of the pairwise differences; the conversion to a score from 0 to 100 is then:

    MaxE = max(Ei), i = 1, …, n
    Score = 100 − 100 × (E / MaxE)

where MaxE is the largest of the error values; 100 × (E / MaxE) gives the error score, which is then subtracted from 100 to obtain the actual score.

28. The pronunciation assessment method applying voice and image to e-learning as described in claim 1, wherein, for the speech signal processing in the scoring mechanism step, before the feature parameters of the speech are extracted, the spoken analog signal is input to the computer through a voice input device and converted into a digital signal, this signal is given speech pre-processing, and the feature parameters are then extracted to serve as the comparison parameters for speech scoring.
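A minimal sketch of the contour-ratio scoring of claim 27; how MaxE is collected across the compared recordings is not fixed by the claim, so it is passed in as a parameter, and the function name is ours:

    import numpy as np

    def contour_ratio_score(ref_ratios, test_ratios, max_e):
        """E is the mean absolute difference of the frame-wise width/height
        contour ratios; the score is 100 - 100 * (E / MaxE)."""
        t = np.asarray(ref_ratios, dtype=float)
        x = np.asarray(test_ratios, dtype=float)
        e = np.abs(t - x).sum() / len(t)     # E = (1/K) * sum_i |T_i - X_i|
        return 100.0 - 100.0 * (e / max_e)   # score in [0, 100]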
TW95115175A 2006-04-28 2006-04-28 A pronunciation-scored method for the application of voice and image in the e-learning TWI294107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW95115175A TWI294107B (en) 2006-04-28 2006-04-28 A pronunciation-scored method for the application of voice and image in the e-learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW95115175A TWI294107B (en) 2006-04-28 2006-04-28 A pronunciation-scored method for the application of voice and image in the e-learning

Publications (2)

Publication Number Publication Date
TW200741605A TW200741605A (en) 2007-11-01
TWI294107B true TWI294107B (en) 2008-03-01

Family

ID=45068063

Family Applications (1)

Application Number Title Priority Date Filing Date
TW95115175A TWI294107B (en) 2006-04-28 2006-04-28 A pronunciation-scored method for the application of voice and image in the e-learning

Country Status (1)

Country Link
TW (1) TWI294107B (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI401963B (en) * 2009-06-25 2013-07-11 Pixart Imaging Inc Dynamic image compression method for face detection
GB201718627D0 (en) * 2017-11-10 2017-12-27 Henkel Ltd Systems and methods for selecting specialty chemicals
CN111951828B (en) * 2019-05-16 2024-06-25 上海流利说信息技术有限公司 Pronunciation assessment method, device, system, medium and computing equipment
US11227601B2 (en) * 2019-09-21 2022-01-18 Merry Electronics(Shenzhen) Co., Ltd. Computer-implement voice command authentication method and electronic device

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9043216B2 (en) 2008-07-11 2015-05-26 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Audio signal decoder, time warp contour data provider, method and computer program
US9293149B2 (en) 2008-07-11 2016-03-22 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Time warp activation signal provider, audio signal encoder, method for providing a time warp activation signal, method for encoding an audio signal and computer programs
TWI453732B (en) * 2008-07-11 2014-09-21 Fraunhofer Ges Forschung Time warp contour calculator, audio signal encoder, encoded audio signal representation, methods and computer program
TWI463484B (en) * 2008-07-11 2014-12-01 Fraunhofer Ges Forschung Time warp activation signal provider, audio signal encoder, method for providing a time warp activation signal, method for encoding an audio signal and computer programs
US9015041B2 (en) 2008-07-11 2015-04-21 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Time warp activation signal provider, audio signal encoder, method for providing a time warp activation signal, method for encoding an audio signal and computer programs
US9025777B2 (en) 2008-07-11 2015-05-05 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Audio signal decoder, audio signal encoder, encoded multi-channel audio signal representation, methods and computer program
US9646632B2 (en) 2008-07-11 2017-05-09 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Time warp activation signal provider, audio signal encoder, method for providing a time warp activation signal, method for encoding an audio signal and computer programs
US9263057B2 (en) 2008-07-11 2016-02-16 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Time warp activation signal provider, audio signal encoder, method for providing a time warp activation signal, method for encoding an audio signal and computer programs
US9502049B2 (en) 2008-07-11 2016-11-22 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Time warp activation signal provider, audio signal encoder, method for providing a time warp activation signal, method for encoding an audio signal and computer programs
US9299363B2 (en) 2008-07-11 2016-03-29 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Time warp contour calculator, audio signal encoder, encoded audio signal representation, methods and computer program
US9431026B2 (en) 2008-07-11 2016-08-30 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Time warp activation signal provider, audio signal encoder, method for providing a time warp activation signal, method for encoding an audio signal and computer programs
US9466313B2 (en) 2008-07-11 2016-10-11 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Time warp activation signal provider, audio signal encoder, method for providing a time warp activation signal, method for encoding an audio signal and computer programs
TWI407322B (en) * 2009-06-19 2013-09-01 Ipeer Multimedia Internat Ltd Multimedia identification system and method, and the application
TWI384433B (en) * 2009-09-01 2013-02-01
TWI835011B (en) * 2021-10-14 2024-03-11 友達光電股份有限公司 A method and apparatus for simulating the acting track of an object

Also Published As

Publication number Publication date
TW200741605A (en) 2007-11-01

Similar Documents

Publication Publication Date Title
TWI294107B (en) A pronunciation-scored method for the application of voice and image in the e-learning
CN106503646B (en) Multi-mode emotion recognition system and method
Luettin Visual speech and speaker recognition
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
Frutos et al. Computer game to learn and enhance speech problems for children with autism
US20080004879A1 (en) Method for assessing learner's pronunciation through voice and image
Valliappan et al. An improved air tissue boundary segmentation technique for real time magnetic resonance imaging video using segnet
Garg et al. Computer-vision analysis reveals facial movements made during Mandarin tone production align with pitch trajectories
Ivanko et al. Automatic lip-reading of hearing impaired people
CN113658584A (en) Intelligent pronunciation correction method and system
JP2021086274A (en) Lip reading device and lip reading method
CN110598718A (en) Image feature extraction method based on attention mechanism and convolutional neural network
JPH0612483A (en) Method and device for speech input
JP4775961B2 (en) Pronunciation estimation method using video
Chiţu¹ et al. Automatic visual speech recognition
Yu Computer-aided english pronunciation accuracy detection based on lip action recognition algorithm
CN115985310A (en) Dysarthria voice recognition method based on multi-stage audio-visual fusion
Duan et al. Efficient learning of articulatory models based on multi-label training and label correction for pronunciation learning
Huet et al. Shape retrieval by inexact graph matching
WO2012152290A1 (en) A mobile device for literacy teaching
TWI269246B (en) Visual and interactive pronunciation-scored system for learning language digitally
CN111832412A (en) Sound production training correction method and system
Jain et al. Lip contour detection for estimation of mouth opening area
Ibrahim A novel lip geometry approach for audio-visual speech recognition
Belete College of Natural Sciences

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees