TW419643B

TW419643B - A method of continuous language recognition

Info

Publication number: TW419643B
Application number: TW87105920A
Authority: TW
Inventors: Yen Lu Chow
Original assignee: Kent Ridge Digital Labs; Apple Computer
Priority date: 1998-04-17
Filing date: 1998-04-17
Publication date: 2001-01-21

Abstract

A method of continuous speech recognition is disclosed which includes the steps of dividing a speech signal to be recognised into a plurality of frames, analysing the content of each frame with reference to a plurality of syllable models to identify probable syllable candidates in the speech signal, storing the syllable candidates in a lattice, deriving word candidates from contiguous syllable combinations using a syllable pronunciation lexicon, and applying a contextual language model using a stack decoder. The method is also applicable to character recognition.

Description

419643. V. Description of the Invention. Printed by the Employees' Consumer Cooperative of the Central Bureau of Standards, Ministry of Economic Affairs. Paper size conforms to Chinese National Standard (CNS) A4 (210 × 297 mm).

The present invention relates to a method of continuous language recognition.

Most recently proposed automatic language recognition systems are software-based and operate by generating a series of word or sentence hypotheses from an input language signal. In speech recognition the most common approach is statistical: the speech waveform is sampled over a continuous period of time, spectral processing produces a representation of the speech signal, and that representation is divided into a sequence of parameter vectors at equal time intervals. These parameter vectors are compared with a set of word pronunciation models and a statistical language model to estimate the likelihood that the analysed portion of the input waveform corresponds to a particular speech event, and so to determine the most likely word string.

The key to the general technique described above is the use of a stack decoder. In this approach the speech signal to be processed is first divided into a plurality of frames, each represented by a parameter vector. The vector of the first frame is compared with a number of hypotheses that are consistent with the expected speech. According to how well the input parameter vector matches every possible word extension of each hypothesis, the extended hypotheses are ranked and stored in order of confidence and likelihood. The stored hypotheses, ordered by likelihood, form a stack corresponding to each frame. Each hypothesis is then extended further to provide additional hypotheses for comparison with the next frame. When the analysis reaches the end of the utterance there are a number of final hypotheses corresponding to the last frame; each corresponds to the whole utterance to be recognised and each has its own confidence.

Two factors determine the likelihood that a frame of input speech matches a particular hypothesis: the acoustic confidence and the contextual confidence. The acoustic confidence is obtained by comparing the input speech vector with the hidden Markov model (HMM) of the expected hypothesis. The contextual confidence is obtained by evaluating the content of the input frame from the standpoint of context, that is, by using sentence structure or statistical rules to analyse the overall context of the language content. The two are combined by adding their log probabilities to give the confidence of each hypothesis.

The drawback of this technique is that far too many hypotheses have to be compared with each frame. Even if the hypotheses for any given frame are pruned so that only the most likely few remain, the technique still requires a large amount of computation.

The object of the present invention is to provide an improved method of continuous speech recognition.

According to a first aspect of the invention, there is provided a method of continuous language recognition comprising the steps of: dividing a language signal into a plurality of components; analysing the components with reference to a plurality of syllable models to identify possible syllables in the language signal; storing the identified syllables in a syllable lattice; and analysing the context of contiguous syllable combinations.

The invention is a two-stage technique: in the first stage a syllable lattice is built on the basis of syllable recognition, and in the second stage the context is analysed.
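The background above says that a stack decoder combines acoustic and contextual confidence by adding their log probabilities, ranks the hypotheses, and prunes each stack to the most likely few. The following is a minimal Python sketch of that ranking step only. The names `Hypothesis` and `rank_and_prune` and all probability values are illustrative assumptions, not taken from the patent.

```python
import math
from typing import NamedTuple


class Hypothesis(NamedTuple):
    words: tuple          # hypothesised word sequence so far
    acoustic_logp: float  # log probability from the acoustic models (assumed given)
    language_logp: float  # log probability from the language model (assumed given)

    @property
    def score(self) -> float:
        # Multiplying two probabilities is the same as adding their logarithms.
        return self.acoustic_logp + self.language_logp


def rank_and_prune(hypotheses, max_size):
    """Order hypotheses best first and keep only the top max_size of them."""
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)[:max_size]


# Toy stack for one frame; the numbers are made up for illustration.
stack = rank_and_prune(
    [
        Hypothesis(("ping", "guo"), math.log(0.20), math.log(0.30)),
        Hypothesis(("bing", "guo"), math.log(0.25), math.log(0.01)),
        Hypothesis(("ping", "gu"), math.log(0.10), math.log(0.05)),
    ],
    max_size=2,
)
best = stack[0]
```

Note that the second hypothesis has the highest acoustic probability but is pruned, because its combined score is the lowest of the three.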

In a first preferred embodiment of the invention, the language signal is an audible speech signal, the speech signal is divided into a plurality of frames, and a stack decoder is used to analyse the context.

In the first preferred embodiment, a two-stage procedure replaces the simultaneous, holistic analysis of acoustic and contextual confidence in the utterance. In the first stage, the vectors representing the content of the speech frames that make up the utterance are analysed syllable by syllable, and only the acoustic confidence is determined. The resulting recognised syllable candidates are stored in a syllable lattice, each entry holding four attributes: the syllable, its start time, its end time and its acoustic confidence. The speech to be recognised is then analysed by examining contiguous combinations of syllables in the lattice and comparing them with context-based hypotheses. To do this, the lattice is checked against a lexicon, plausible syllables are combined, and their likelihoods or confidence rankings are combined to give the acoustic confidence. A stack decoder stores the hypotheses in order of confidence, so that the hypotheses retained at each point are contextually valid. Once all combinations have been analysed, the topmost hypothesis in the final stack has the highest confidence; that is, at the end it is the hypothesis that matches the utterance.

Preferably, the lattice-generation part of the method uses hidden Markov models (HMMs), each syllable being compared with a model that has a plurality of states. The correspondence between the model states and each frame is analysed probabilistically using the Viterbi algorithm, which finds the most likely temporal and spectral alignment between the HMM and the input speech frames. In principle this information is entered in the syllable lattice when a syllable ends. The contextual confidence of syllable combinations is then analysed backwards in time, working through the lattice from the end of the speech to be recognised towards its beginning.

An advantage of embodiments of the invention is that the recognition time needed for a result of comparable accuracy can be reduced.
When the method of this embodiment is used in real-time recognition applications, fewer hypotheses are pruned than with conventional methods, which improves accuracy. Embodiments of the invention are particularly suited to Chinese speech recognition, because Chinese is made up of monosyllabic characters and has only about 1,600 distinct syllables in total, tones (the four tones) included.

In a second preferred embodiment of the invention, the components of the language signal are not acoustic but visual and may be expressed in phonetic form, for example syllables entered from a keyboard in Hanyu Pinyin.

In a third preferred embodiment, the components are characters, as when Chinese characters are fed to a character recognition program.

The invention is described in more detail below by way of example and with reference to the following drawings, in which:

Figure 1 shows a first preferred embodiment of the method; and
Figure 2 shows the use of the Viterbi algorithm to find the best path through the trellis of HMM states against speech frames.

Three preferred embodiments of the invention, which can be used to recognise different language signals, are described here.

The first preferred embodiment recognises an acoustic language signal and comprises four main steps:

1. dividing the speech signal into a plurality of frames and representing the content of each frame as a vector, which serves as a feature vector;
2. analysing the frame content against syllable models;
3. placing the recognised syllable candidates that have high confidence in a syllable lattice; and
4. analysing the context of contiguous syllable combinations using a lexicon, a language model and a stack decoder.

These steps are illustrated in Figure 1 with a simple example.

In step 1A, the speech in this example is the two-syllable Chinese word 蘋果 ("apple"). After sampling, the samples are divided into frames F1-F6 of equal duration, and the speech signal in each frame is represented by a vector, V1-V6 respectively. The vector parameters are chosen to define the spectral characteristics of the frame signal.

The resulting speech vectors V1-V6 are then compared with the HMM syllable models using a recursive algorithm, such as the Viterbi algorithm that is well known in this field. Figure 2 shows that the algorithm can be pictured as finding the best path through a trellis in which the vertical axis represents the HMM states and the horizontal axis represents the speech frames F1-F6, given by the speech vectors V1-V6. Each dot in the figure represents the log probability of observing that frame at that time, and each line between dots represents a log transition probability. The log probability of any path is the sum of the log transition probabilities and the log output probabilities along it. For each path, the Viterbi algorithm computes the likelihood of every branch that extends it from the previous frame, and at the final frame it identifies the most likely alignment between the HMM and the speech frames.
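The path computation just described, in which log transition and log output probabilities are summed along each path and the best path is kept, can be sketched as a log-domain Viterbi routine. This is a minimal Python sketch under stated assumptions: the function name `viterbi`, the table layout and the two-state toy model are illustrative and do not come from the patent.

```python
import math

NEG_INF = float("-inf")  # log of probability zero


def viterbi(log_init, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for one HMM.

    log_init[s]     : log probability of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_emit[t][s]  : log probability of observing frame t in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor for arriving in state s at frame t.
            r_best = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            new_score.append(score[r_best] + log_trans[r_best][s] + log_emit[t][s])
            ptr.append(r_best)
        score = new_score
        backptr.append(ptr)
    # Trace back from the best state at the final frame.
    state = max(range(n_states), key=lambda s: score[s])
    best = score[state]
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return best, path


# Toy left-to-right model with two states and three frames (made-up numbers).
log = math.log
log_init = [0.0, NEG_INF]
log_trans = [[log(0.5), log(0.5)], [NEG_INF, 0.0]]
log_emit = [[log(0.9), log(0.1)], [log(0.2), log(0.8)], [log(0.1), log(0.9)]]
best_logp, path = viterbi(log_init, log_trans, log_emit)
```

In a recogniser of the kind described, a routine like this would be run for each syllable model, and the frames aligned with the first and last states would supply the start and end times stored in the lattice.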
Chinese has about 1,600 distinct syllable pronunciations, so the same number of HMM syllable models is required. The input frames are compared with all 1,600 syllable models by trellis-based dynamic programming, which yields a large number of syllable candidates. Each candidate is stored with an end time, defined as the time of the frame aligned with the final state of the particular syllable model; a hypothesised start time, namely the time of the frame aligned with the first state of the syllable model; the identity of the syllable model; and its likelihood.

The information defining each syllable is placed in the syllable lattice shown in Figure 1B, which contains fourteen such syllables, S1-S14. The lattice contains many syllables with differing start and end times. The utterance, however, consists of a number of consecutive syllables, so the lattice can be analysed with a syllable pronunciation tree lexicon while considering only those syllable combinations that are contiguous in time, as shown in Figure 1C.

The lattice is processed by a stack decoder, which works backwards through the lattice from time T = tf towards T = te. A syllable pronunciation lexicon is used to derive possible word candidates, and all plausible syllable sequences are found in order to build up stacks of hypotheses about the word sequence. The possible word candidates extend the hypotheses, and applying the language model then yields a ranking for each hypothesis.

Referring to Figure 1D, stacks SK1-SK6 correspond to frames F1-F6 respectively.
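The lattice entry (syllable, start time, end time, likelihood) and the restriction to syllable combinations that are contiguous in time can be sketched as follows. This is an illustrative Python sketch: `SyllableArc`, `contiguous_sequences`, the frame indices and the scores are assumptions, and the toy lattice is not the S1-S14 lattice of Figure 1B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyllableArc:
    syllable: str  # syllable identity, written here as pinyin plus a tone digit
    start: int     # index of the first frame the syllable covers
    end: int       # index of the last frame the syllable covers
    logp: float    # acoustic log likelihood (made-up values below)


def contiguous_sequences(lattice, first_frame, last_frame):
    """Yield every sequence of arcs that covers first_frame..last_frame
    with no gap and no overlap between neighbouring arcs."""
    by_start = {}
    for arc in lattice:
        by_start.setdefault(arc.start, []).append(arc)

    def extend(frame, prefix):
        if frame == last_frame + 1:
            yield tuple(prefix)
            return
        for arc in by_start.get(frame, []):
            yield from extend(arc.end + 1, prefix + [arc])

    yield from extend(first_frame, [])


# Toy lattice over six frames.
lattice = [
    SyllableArc("ping2", 1, 3, -4.0),
    SyllableArc("bing1", 1, 3, -4.5),
    SyllableArc("pin1", 1, 2, -5.0),
    SyllableArc("guo3", 4, 6, -3.5),
    SyllableArc("gu3", 4, 6, -4.2),
    SyllableArc("wo3", 5, 6, -6.0),
]
sequences = list(contiguous_sequences(lattice, 1, 6))
```

Here `pin1` and `wo3` drop out because no other arc starts or ends where they would need a neighbour, which is the effect of considering only time-contiguous combinations.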
For a given frame, each hypothesis in its stack represents a syllable sequence that can be derived from the syllable lattice starting at the end of the utterance. Each hypothesis has a number of possible backward syllable extensions, which are matched against the syllables in the lattice using the following algorithm.

1. At the end of the utterance, time tf, a stack is initialised with an empty hypothesis. All possible single syllables that end at time tf are analysed: the lattice is traversed backwards and every syllable arriving at the end node at that time is collected. Each is compared with the hypotheses to determine the contextual confidence of the syllable, and the resulting hypotheses are placed in the stack of the frame at which the syllable starts, ordered by confidence.
2. The same procedure is carried out for the time interval represented by frame F6, and the hypotheses are grown by every possible syllable extension in the lattice until the beginning of the utterance is reached, at which point there is a set of hypotheses, ordered by confidence, that represents the utterance.

For example, at time tf in Figure 1 there are four possible syllables, S14, S13, S11 and S12 (see Figures 1B and 1C). These are compared with the hypothesis at time tf and placed in the stacks of the frames at which the syllables start, SK6, SK5 and SK4. Likewise, for the stack at frame F6 there are two possible syllables, S10 and S7, that extend syllable S14; they are compared with the hypotheses and placed in stacks SK2 and SK3 respectively.
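The backward extension procedure above can be sketched as a small stack decoder. This is a simplified Python sketch, not the patent's implementation: it scores context with a hypothetical table of syllable-pair log probabilities (`pair_logp`) instead of a lexicon plus language model, and the names `Arc`, `Hyp`, `decode_backward`, the `"</s>"` end marker, the `floor` penalty and all numbers are assumptions.

```python
from collections import defaultdict, namedtuple

Arc = namedtuple("Arc", "syllable start end logp")  # one lattice entry
Hyp = namedtuple("Hyp", "syllables score")          # one partial hypothesis


def decode_backward(lattice, first_frame, last_frame, pair_logp, beam=5, floor=-10.0):
    """Extend hypotheses from the end of the utterance back to its start."""
    ends_at = defaultdict(list)
    for arc in lattice:
        ends_at[arc.end].append(arc)

    # stacks[t] holds hypotheses that account for frames t..last_frame.
    stacks = defaultdict(list)
    stacks[last_frame + 1] = [Hyp((), 0.0)]  # empty hypothesis at the end

    for t in range(last_frame + 1, first_frame, -1):
        # Order the stack by score and prune it before extending.
        stacks[t] = sorted(stacks[t], key=lambda h: h.score, reverse=True)[:beam]
        for hyp in stacks[t]:
            following = hyp.syllables[0] if hyp.syllables else "</s>"
            for arc in ends_at[t - 1]:
                # Context score of placing this syllable before what follows it.
                context = pair_logp.get((arc.syllable, following), floor)
                # The new hypothesis goes into the stack of the frame
                # at which the added syllable starts.
                stacks[arc.start].append(
                    Hyp((arc.syllable,) + hyp.syllables, hyp.score + arc.logp + context)
                )
    return sorted(stacks[first_frame], key=lambda h: h.score, reverse=True)[:beam]


# Toy lattice over six frames and a toy context table (made-up numbers).
lattice = [
    Arc("ping2", 1, 3, -4.0), Arc("bing1", 1, 3, -4.5), Arc("pin1", 1, 2, -5.0),
    Arc("guo3", 4, 6, -3.5), Arc("gu3", 4, 6, -4.2), Arc("wo3", 5, 6, -6.0),
]
pair_logp = {("ping2", "guo3"): -0.5, ("guo3", "</s>"): -0.2}
results = decode_backward(lattice, 1, 6, pair_logp)
```

Each stack is complete by the time it is processed, because a syllable always starts before the frame boundary at which it is attached; that is what allows one backward pass over the frames.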
Decoding continues with the stack for frame F5. Each syllable extension is compared with the hypotheses to determine the likelihood that the syllable combination is correct. This likelihood is expressed as a logarithm and is added to the acoustic log probability associated with each syllable in the lattice; the combined probability determines the position of the hypothesis in its stack.

As shown by the final stack SK1, which corresponds to frame F1, once the lattice has been traversed to the beginning of the utterance, final hypotheses corresponding to all possible paths are obtained. They are arranged in the stack in order of combined probability. The hypothesis most likely to represent the complete utterance is S3-S11: PING2 GUO3 (ㄆㄧㄥˊ ㄍㄨㄛˇ).

To reduce the amount of data to be processed, the syllable lattice and the stacks are pruned as in the prior art, so that hypotheses below a specified confidence are discarded.

The preferred embodiment of the continuous speech recognition method described here should not be construed as limiting the invention. For example, the content of a speech frame need not be represented by a vector; it could be represented by a function, such as a curve fitted by an algorithm. In addition, syllables are preferably recognised by using the Viterbi algorithm to compare them with hidden Markov models, but other algorithms may be used with other kinds of model. The stack decoder works backwards through the lattice, but the lattice can equally be traversed forwards, extending syllables from the hypothesised start time at each frame. A stack decoder is the preferred choice, but other options are available.

The first preferred embodiment described above uses an audible speech signal as the input from which the syllable lattice is derived. In the second preferred embodiment of the invention the input signal is visual, for example a Hanyu Pinyin representation of each syllable typed at a keyboard, each keystroke supplying a single component. In the second preferred embodiment, the lattice of Figure 1B is made up of all the possible (or similar) syllables corresponding to each typed syllable; for example, the lattice between the start and end times of a typed syllable may contain syllables with different tones or similar pronunciations. Because the typed syllables already have defined start and end times, the lattice is somewhat simplified. The resulting lattice is analysed with a stack decoder as before.

In the third preferred embodiment the language signal is again visual, as in the second preferred embodiment, but takes the form of Chinese characters, one per syllable. A Chinese character can be analysed by its strokes or as an image that is broken down into a number of segments; each stroke or segment represents a component to be analysed by a character recognition program. The program produces a number of candidates for the character, which are stored in order of likelihood. The recognised candidates for each character are stored in a character lattice, which has the same attributes as the syllable lattice except that syllable recognition is replaced by character recognition. The context of the character lattice is then analysed using a lexicon, as in the first preferred embodiment.
Strokes can be entered with an input device such as a mouse or an optical pen and pad. For image recognition, the image can be scanned and then stored in a conventional device.

The invention is applicable to language recognition and can be used for language presented in spoken or written form; it is particularly suited to Chinese-language systems. Persons skilled in the art may make various modifications to the invention without departing from the scope of protection sought in the appended claims.

Claims

1. A method of continuous language recognition, comprising the steps of: dividing a language signal to be recognised into a plurality of components; analysing the components with reference to a plurality of syllable models to identify possible syllables in the signal; storing the identified syllables in a syllable lattice; and analysing the context of contiguous syllable combinations.

2. The method of claim 1, wherein the language signal is a speech signal and the components comprise a plurality of sampled frames of the speech signal.

3. The method of claim 1, wherein the language signal is a visual signal.

4. The method of claim 3, wherein the visual signal is represented in pinyin.

5. The method of claim 1, wherein the context of the contiguous syllable combinations is analysed with a stack decoder.

6. The method of claim 5, wherein the stack decoder traverses the syllable lattice, finds all plausible syllable sequences and builds up stacks of word-string hypotheses, using a syllable pronunciation lexicon to derive possible word candidates and then using the possible word candidates and a language model to extend the hypotheses and derive a ranking for each hypothesis.

7. The method of claim 1, wherein the content of each frame is represented by a vector.

8. The method of claim 1, wherein each syllable model is a hidden Markov model (HMM).

9. The method of claim 4, wherein the content of the frames is analysed using a Viterbi algorithm.

10. The method of claim 1, wherein the attributes of the identified syllables stored in the syllable lattice include start time, end time, syllable and likelihood.

11. The method of claim 5, wherein the stack decoder works backwards through the syllable lattice.

12. The method of claim 1, wherein the syllable models include Chinese syllable models.

13. A method of continuous character recognition, comprising the steps of: dividing a character input signal to be recognised into a plurality of components; analysing the components with reference to a plurality of character models to identify possible characters in the signal; storing the identified characters in a character lattice; and analysing the context of contiguous character combinations.

14. The method of claim 13, wherein the context of the contiguous character combinations is analysed with a stack decoder.

15. The method of claim 14, wherein the stack decoder traverses the character lattice, finds all plausible character sequences and builds up stacks of word-string hypotheses, using a character lexicon to derive possible word candidates and then using the possible word candidates and a language model to extend the hypotheses and derive a ranking for each hypothesis.

16. The method of claim 1, wherein the attributes of the identified characters stored in the character lattice include start time, end time, character and likelihood.

17. The method of claim 14, wherein the stack decoder works forwards through the character lattice.

18. The method of claim 13, wherein the character models include Chinese character models.

19. A device for performing the method of claim 1.

20. The device of claim 19, used for Chinese recognition.

21. The method of claim 1, used for Chinese recognition.