TW200411627A - Robotic vision-audition system - Google Patents

Robotic vision-audition system

Info

Publication number
TW200411627A
TW200411627A TW092103187A
Authority
TW
Taiwan
Prior art keywords
module
sound
event
speaker
speech recognition
Prior art date
Application number
TW092103187A
Other languages
Chinese (zh)
Other versions
TWI222622B (en)
Inventor
Kazuhiro Nakadai
Hiroshi Okuno
Hiroaki Kitano
Original Assignee
Japan Science & Tech Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Japan Science & Tech Corp
Publication of TW200411627A
Application granted
Publication of TWI222622B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/228Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context

Abstract

A robotic vision-audition system includes an audition module (20), a face module (30), a stereo module (37), a motor control module (40), and an association module (50) that controls the other modules. The audition module (20) includes an active direction-pass filter (23a) whose pass range varies according to accurate sound-source direction information from the association module (50), being narrowest in the frontal direction and widening as the angle to the left or right increases. Sound-source separation is performed by collecting the sub-bands whose interaural phase difference (IPD) or interaural intensity difference (IID) falls within the pass range and reconstructing the waveforms of the sounds from each source. The audition module (20) further recognizes the separated sounds by applying a plurality of acoustic models (27d) to the separated acoustic signals, consolidating the per-model recognition results with a selector (27e), and determining the most reliable recognition result among them.

Description

[Technical Field]
The present invention relates to a vision-audition system suitable for robots, in particular humanoid or animal-shaped robots.

[Prior Art]
In recent years, humanoid and animal-shaped robots have become not only subjects of AI (artificial intelligence) research but are also expected to act as "partners" of human beings. For a robot to engage in intelligent social interaction with people, perception such as vision and audition is indispensable, and among these, audition in particular is an important function for realizing social interaction with humans. For this reason, so-called active perception in vision and in audition has been attracting growing attention.

Active perception here means keeping the perceptual devices responsible for robot vision or robot hearing directed at their target, for example by a drive mechanism that controls the posture of the head supporting those devices so that it follows the target. In active vision, the camera serving as the perceptual device keeps its optical axis pointed at the target under the control of the drive mechanism and automatically performs focusing, zooming in, zooming out and the like, so that the target can be imaged even while it moves; various studies of such active vision have already been carried out.

In active hearing, on the other hand, at least the microphone serving as the perceptual device keeps its directivity toward the target under the control of the drive mechanism and collects the sound emitted by the target. A drawback of active hearing is that, while the drive mechanism is operating, the microphone picks up the operating noise of the mechanism, so large noise is mixed into the target sound and the sound emitted by the target may become unrecognizable. To overcome this drawback, methods have been adopted in which visual information is used to trace the sound source so that the sound emitted by the target can be recognized correctly.

Active hearing requires (a) localization of the sound sources from the sound collected by the microphones, (b) separation of the sound emitted by each source, and (c) recognition of the sound from each source. Concerning (a) sound source localization and (b) sound source separation, various studies on localization, tracking and separation in real time and in real environments have been carried out. For example, as disclosed in International Publication No. WO 01/95314, it is well known to perform sound source localization using the interaural phase difference (IPD) and the interaural intensity difference (IID) obtained from HRTFs (head-related transfer functions). The same publication also describes a so-called direction-pass filter, which separates the sound of each source by selecting the sub-bands whose IPD is the same as the IPD of a specific direction.

As for (c), recognition of the sounds separated from the individual sources, various studies on speech recognition that is robust against noise such as missing data have been carried out, as shown in the following documents:

J. Baker, M. Cooke, and P. Green, "Robust ASR based on clean speech models: An evaluation of missing data techniques for connected digit recognition in noise," Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech 2001), Vol. 1, pp. 213-216, 2001.

Philippe Renevey, Rolf Vetter, and Jens Kraus, "Robust speech recognition using missing feature theory and vector quantization," Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech 2001), Vol. 12, pp. 1107-1110, 2001.
However, in the studies disclosed in these two documents, speech recognition cannot be performed effectively when the S/N ratio is small, and speech recognition in real time and in real environments has not yet been studied.

[Summary of the Invention]
The present invention was developed in view of the above problems, and its object is to provide a robot vision-audition system capable of recognizing the sounds separated from the individual sound sources.

To achieve this object, a first configuration of the robot vision-audition system of the present invention comprises: a plurality of acoustic models, each combining the vocabulary uttered by a speaker with that speaker's direction; a speech recognition engine that, using these acoustic models, performs speech recognition processes on the acoustic signals that have undergone sound source separation; and a selector that consolidates the plurality of per-model recognition results obtained by the recognition processes and selects one of them. The system is characterized in that it can separately recognize words uttered simultaneously by the speakers. The selector may be configured to select among the recognition results by majority decision, and a dialogue unit may be provided that outputs the recognition result selected by the selector to the outside.

According to the first configuration, speech recognition processes are carried out separately, with the plurality of acoustic models, on the acoustic signals that have already undergone sound source localization and sound source separation.
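As a rough illustration of this first configuration, the following minimal Python sketch, which is not part of the original disclosure, keeps one recognizer per speaker and direction and consolidates their hypotheses by majority decision; the model keys, the dummy recognizers and the returned word are assumptions made only for the example.

```python
from collections import Counter

def recognize_with_all_models(models, separated_signal):
    """Run one separated signal through every (speaker, direction) model.

    `models` maps (speaker, direction) -> a callable returning a word
    hypothesis; the callables stand in for a real recognition engine.
    """
    return {key: model(separated_signal) for key, model in models.items()}

def majority_decision(hypotheses):
    """Consolidate the per-model hypotheses and return the most frequent word."""
    word, _ = Counter(hypotheses.values()).most_common(1)[0]
    return word

# Illustrative use with dummy recognizers for 3 speakers x 3 directions.
models = {(spk, d): (lambda sig: "one") for spk in "ABC" for d in (-60, 0, 60)}
print(majority_decision(recognize_with_all_models(models, separated_signal=None)))
```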
Then, through the selector, the recognition results obtained from the individual acoustic models are consolidated so that the most reliable recognition result can be determined.

Further, to achieve the above object, a second configuration of the robot vision-audition system of the present invention comprises: an audition module which has at least a pair of microphones for collecting external sounds and which, from the acoustic signals of the microphones, performs sound source separation and localization by pitch extraction and grouping based on harmonic structure, determines the direction of each speaker, and extracts auditory events; a face module which has a camera imaging the area in front of the robot and which identifies and localizes each speaker's face in the camera images, thereby identifying the speakers and extracting face events; a motor control module which has drive motors that rotate the robot in the horizontal direction and which extracts motor events from the rotational positions of the drive motors; an association module which, from the auditory, face and motor events, determines the direction of each speaker using the direction information given by the sound source localization of the auditory events and the face localization of the face events, connects these events in the time direction by means of a Kalman filter to generate auditory streams and face streams, and associates these streams to generate association streams; and an attention control module which performs attention control on the basis of these streams and performs drive-motor control according to the action plan produced by the attention control. The audition module uses an active direction-pass filter whose pass range, based on the accurate sound-source direction information from the association module, is narrowest in the frontal direction and widens as the left or right angle increases; it collects the sub-bands whose interaural phase difference (IPD) or interaural intensity difference (IID) lies within a predetermined range and reconstructs the waveform of each sound source, thereby separating the sources, and at the same time it recognizes the separated acoustic signals with a plurality of acoustic models, the selector consolidating the per-model recognition results and determining the most reliable one.

According to the second configuration, the audition module extracts the pitch of the sound collected by the microphones using its harmonic structure, obtains the direction of each sound source so as to identify the individual speakers, and extracts their auditory events. The face module identifies and localizes each speaker's face by pattern recognition in the images captured by the camera and extracts a face event for each speaker. The motor control module detects the orientation of the robot from the rotational positions of the drive motors that rotate it horizontally and thereby extracts motor events.

Here, an "event" denotes a sound or a face detected at a given instant, or the state of a drive motor in rotation, and a "stream" denotes events connected in time, for example by a Kalman filter, while error correction is applied.

The association module generates an auditory stream and a face stream for each speaker from the auditory, face and motor events extracted in this way and associates them to generate association streams; the attention control module performs attention control on the basis of these streams and thereby plans the control of the drive motors of the motor control module. An association stream is a concept that includes auditory streams and face streams, and attention control means that the robot directs auditory and/or visual "attention" to a target speaker, the attention control module changing the robot's orientation so that it watches that speaker. The attention control module then controls the drive motors according to this plan and turns the robot toward the target speaker. With the robot facing the speaker, the microphones of the audition module can collect and localize the speaker's voice accurately in the highly sensitive frontal direction, while the camera of the face module can capture a good image of the speaker.

Through the cooperation of the audition module, the face module, the motor control module, the association module and the attention control module, the robot's hearing and vision compensate for each other's ambiguities, which improves the so-called robustness, so that each of several simultaneous speakers can be perceived. Moreover, even if, for example, either the auditory event or the face event is missing, the association module can still perceive the target speaker from the remaining event, so the motor control module can be controlled in real time.

Furthermore, as described above, the audition module performs speech recognition on the source-localized and source-separated acoustic signals using a plurality of acoustic models, and the selector consolidates the per-model results to determine the most reliable recognition result. Compared with conventional speech recognition, the use of a plurality of acoustic models allows correct recognition in real time and in real environments, and the consolidation by the selector yields a more accurate recognition result.

Further, to achieve the above object, a third configuration of the robot vision-audition system of the present invention comprises: an audition module which has at least a pair of microphones for collecting external sounds and which, from their acoustic signals, performs sound source separation and localization by pitch extraction and grouping based on harmonic structure, determines the direction of at least one speaker, and extracts auditory events; a face module which has a camera imaging the area in front of the robot and which identifies each speaker from the face identified and localized in the camera images and extracts face events; a stereo module which extracts disparity from the images of a stereo camera, extracts and localizes longitudinally long objects, and extracts stereo events; a motor control module which has drive motors rotating the robot horizontally and which extracts motor events from their rotational positions; an association module which, from the auditory, face, stereo and motor events, determines the direction of each speaker from the direction information of the sound source localization in the auditory events and the face localization in the face events, connects the events in the time direction with a Kalman filter to generate auditory streams, face streams and stereo-vision streams, and associates them to generate association streams; and an attention control module which performs attention control based on these streams and performs drive-motor control according to the resulting action plan. As in the second configuration, the audition module uses the active direction-pass filter, narrowest in front and widening with the left or right angle according to the accurate direction information from the association module, to collect the sub-bands whose IPD or IID lies within the predetermined range and to reconstruct the waveform of each source, and it recognizes the separated signals with a plurality of acoustic models, the selector consolidating the per-model results and determining the most reliable one.

According to the third configuration, the audition module, the face module and the motor control module operate as in the second configuration, and in addition the stereo module extracts and localizes longitudinally long objects from the disparity of the stereo-camera images and extracts stereo events. The meanings of "event" and "stream" are the same as above. The association module determines the direction of each speaker from the sound source localization of the auditory events and the face localization of the face events, generates an auditory stream, a face stream and a stereo stream for each speaker, and associates them into association streams, which here include auditory, face and stereo-vision streams. The attention control module performs attention control on the basis of these streams, performs drive-motor control according to the resulting action plan, and turns the robot toward the target speaker. With the robot facing the target speaker, the audition module can use the microphones to collect and localize the speaker's voice accurately in the highly sensitive frontal direction, while the face module obtains a good image of the speaker through the camera.
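The event and stream bookkeeping described in these configurations can be pictured with the following simplified Python sketch; it replaces the Kalman filter used in the invention with a simple nearest-direction gate, and the class names and the 10-degree gate width are illustrative assumptions rather than part of the patent.

```python
from dataclasses import dataclass, field

@dataclass
class Event:                  # an observation at one instant
    time: float
    direction: float          # horizontal angle in degrees
    kind: str                 # "auditory", "face" or "stereo"

@dataclass
class Stream:                 # events of one speaker connected in time
    events: list = field(default_factory=list)

    @property
    def direction(self):
        return self.events[-1].direction

def connect(streams, event, gate_deg=10.0):
    """Attach an event to the stream whose latest direction is closest,
    or start a new stream; the invention performs this temporal connection
    with a Kalman filter, which this gate only approximates."""
    best = min(streams, key=lambda s: abs(s.direction - event.direction), default=None)
    if best is not None and abs(best.direction - event.direction) <= gate_deg:
        best.events.append(event)
    else:
        streams.append(Stream(events=[event]))
    return streams
```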

Through the cooperation of the audition module, the face module, the stereo module, the motor control module, the association module and the attention control module, based on speaker localization by the auditory streams and by the face and stereo-vision data streams, the robot's hearing and vision again compensate for each other's ambiguities, improving robustness so that each of several speakers is perceived reliably. Moreover, even if any one of the auditory, face or stereo-vision streams is missing, the attention control module can keep tracking the target speaker from the remaining streams, so the direction of the target is grasped correctly and the motor control module can still be controlled.

Here the audition module refers to the association stream from the association module, that is, it also takes the face stream of the face module and the stereo-vision stream of the stereo module into account when performing sound source localization, and thereby obtains more accurate source positions. In accordance with the auditory characteristics, the pass range, in other words the sensitivity, of the active direction-pass filter supplied with the accurate direction information of the association module is narrowest in the frontal direction and widens as the left or right angle increases; the sub-bands whose IPD or IID lies within this range are collected and the source waveform is reconstructed, so that the difference in sensitivity caused by direction is taken into account and more accurate source separation is achieved. The audition module then recognizes the separated acoustic signals with the plurality of acoustic models, and the selector consolidates the per-model results and determines the most reliable recognition result, which again makes recognition more accurate than conventional speech recognition in real time and in real environments.

In the second and third configurations, when the audition module cannot perform speech recognition, the attention control module turns the microphones and the camera toward the sound source on the basis of the acoustic signal, sound is collected again from the microphones, and speech recognition is performed again by the audition module on the sound that has been localized and separated anew. Since the microphones of the audition module and the camera of the face module are then facing the speaker, speech recognition can be carried out reliably.

It is preferable that the audition module refer to the face events obtained by the face module when performing speech recognition, and a dialogue unit may be provided that outputs the recognition result judged by the audition module. It is also preferable that the pass range of the active direction-pass filter be controllable for each frequency. When recognizing speech, the audition module refers to the association stream from the association module, that is, to the face stream of the face module, and performs recognition on the acoustic signal of the sound source (speaker) localized by means of the face events, so that more accurate recognition is possible; and when the pass range is controlled for each frequency, the accuracy of the collected sub-bands, and hence of the separation, is raised, which further improves the recognition.

[Embodiments]
The present invention will now be described on the basis of the embodiments shown in the drawings.

Figs. 1 and 2 show the overall structure of an experimental humanoid robot, having only an upper body, which is provided with a robot vision-audition system according to one embodiment of the invention. In Fig. 1, the humanoid robot 10 is configured as a robot with 4 DOF (degrees of freedom) and comprises a base 11, a body 12 supported on the base 11 so as to be rotatable about a vertical axis, and a head 13 supported on the body 12 so as to be able to swing about three axes, namely a vertical axis, a horizontal axis in the left-right direction and a horizontal axis in the front-rear direction. The base 11 may be fixed in place, legs may be provided instead, or the base 11 may be mounted on a movable carriage or the like. As indicated by arrow A in Fig. 1, the body 12 is supported so as to be rotatable about the vertical axis with respect to the base 11 and is rotated by a drive mechanism (not shown); in the figure it is shown covered with a sound-insulating outer covering.

The head 13 is supported on the body 12 through a connecting member 13a; as indicated by arrow B in the figure it can swing with respect to the connecting member 13a about the horizontal axis in the front-rear direction, as indicated by arrow C in Fig. 2 it can swing about the horizontal axis in the left-right direction, and, as indicated by arrow D, the connecting member 13a can itself swing with respect to the body 12. Each of the rotations A, B, C and D is driven by its own drive mechanism (not shown). As shown in Fig. 3, the whole head 13 is covered with a sound-insulating outer covering 14 and is provided, at the front, with a camera 15 as the device responsible for robot vision and, on both sides, with a pair of microphones 16 (16a, 16b) as the devices responsible for robot hearing. The microphones 16 need not be limited to the two sides of the head 13; they may also be placed at other positions on the head 13 or on the body 12.

The outer covering 14 is made of a sound-absorbing synthetic resin such as urethane resin and closes the interior of the head 13 almost completely so as to insulate it acoustically; the covering of the body 12 is likewise made of a sound-absorbing synthetic resin. The camera 15 is a known, commercially available camera with 3 DOF, namely pan, tilt and zoom, and is designed to transmit stereo images synchronously. The microphones 16 are attached to the left and right side faces of the head 13 so as to have directivity toward the front. As shown in Figs. 1 and 2, the left and right microphones 16a, 16b are mounted on the inner sides of step portions 14a, 14b provided on both sides of the covering 14, collect sound through through-holes formed in the step portions so as to face forward from the inside, and are shielded by appropriate means so that they do not pick up sound from inside the covering. The microphones 16a, 16b thus constitute a so-called binaural microphone pair, and the covering around their mounting positions may be shaped like a human outer ear. A further pair of microphones may be arranged inside the covering 14 so that noise generated inside the robot 10 can be cancelled on the basis of the internal sound they collect.

The robot vision-audition system 17, which includes the camera 15 and the microphones 16, consists of an audition module 20, a face module 30, a stereo module 37, a motor control module 40 and an association module 50.
The association module 50 is configured as a server that carries out processing in response to requests from clients, the clients being the other modules, namely the audition module 20, the face module 30, the stereo module 37 and the motor control module 40, and the server and the clients operate in cooperation with one another. The server and each client are implemented on separate personal computers, which are interconnected as a LAN (local area network) under a communication environment such as TCP/IP (transmission control protocol / internet protocol). For exchanging events and streams with large data volumes it is preferable to apply a high-speed network capable of gigabit data exchange to the robot vision-audition system 17, while a medium-speed network is preferably used for control communication. By transferring large amounts of data between the personal computers at high speed in this way, the real-time performance and the scalability of the robot as a whole can be improved.

Each of the modules 20, 30, 37, 40 and 50 has a hierarchically distributed structure consisting, from the bottom up, of a device layer, a process layer, a characteristics (feature) layer and an event layer.

The audition module 20 comprises the microphones 16 as its device layer; a peak extraction section 21, a sound source localization section 22, a sound source separation section 23 and an active direction-pass filter 23a as its process layer; pitch 24 and sound-source horizontal direction 25 as its feature layer (data); an auditory event generation section 26 as its event layer; and, further, a speech recognition section 27 and a dialogue section 28 as part of its process layer.

The audition module 20 operates as shown in Fig. 5. The acoustic signal from the microphones 16, sampled for example at 48 kHz and 16 bits, is frequency-analyzed by FFT (fast Fourier transform), as indicated by symbol X1, and a spectrum is generated for each of the left and right channels, as indicated by symbol X2.
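The front end just described (symbols X1 and X2) can be sketched as follows; the frame length and window are assumed values, since the embodiment specifies only the 48 kHz, 16-bit sampling.

```python
import numpy as np

FS = 48_000          # sampling rate used in the embodiment (48 kHz, 16 bit)
FRAME = 1024         # analysis frame length; an assumed value, not from the patent

def spectra(frame_left, frame_right):
    """Frequency-analyse one frame of the left/right microphone signals (X1)
    and return one complex spectrum per channel (X2)."""
    window = np.hanning(FRAME)
    sp_l = np.fft.rfft(frame_left * window)
    sp_r = np.fft.rfft(frame_right * window)
    freqs = np.fft.rfftfreq(FRAME, d=1.0 / FS)
    return freqs, sp_l, sp_r
```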
Next, the peak extraction section 21 of the audition module 20 extracts a series of peaks for each of the left and right channels and pairs identical or similar peaks between the two channels. Peak extraction is performed with a band filter that passes only data satisfying the conditions that (α) the power exceeds a threshold, (β) the point is a local peak, and (γ) the frequency lies between about 90 Hz and 3 kHz, in order to cut low-frequency noise and the weak high-frequency band. The threshold is defined by measuring the surrounding background noise and adding a sensitivity parameter, for example 10 dB.

The audition module 20 then separates the sound sources by exploiting the fact that each set of peaks has a harmonic structure. Specifically, the sound source separation section 23 extracts, in order of increasing frequency, the local peaks that form a harmonic structure and regards each extracted set of peaks as one sound; in this way the acoustic signal of each source is separated from the mixed sound. During source separation, the sound source localization section 22, as indicated by symbol X3, selects for each separated source the signals of the same frequency in the left and right channels and computes the IPD (interaural phase difference) and the IID (interaural intensity difference); this computation is carried out, for example, every 5 degrees. The results are output to the active direction-pass filter 23a.

The active direction-pass filter 23a, on the basis of the direction θs of the association stream 59 computed by the association module 50, generates the theoretical IPD (= Δφ'(θ)) and, at the same time, the theoretical IID (= Δρ'(θ)), as indicated by symbol X4. The direction θs is the result computed by the real-time tracking in the association module 50 (symbol X3') from the face localization (face event 39), the stereo vision (stereo event 39a) and the sound source localization (auditory event 29).

The theoretical IPD and IID are calculated by the auditory epipolar geometry explained below; concretely, with the front of the robot 10 defined as 0 degrees, they are computed over a range of ±90 degrees. This makes it possible to obtain the direction information of a sound source without using HRTFs. In stereo-vision research, epipolar geometry is one of the most common localization methods; auditory epipolar geometry is its auditory counterpart and, since it obtains direction information from geometric relations, HRTFs are not required.

In the auditory epipolar geometry, assuming that the sound source is infinitely far away and denoting the IPD, the sound source direction, the frequency and the speed of sound by Δφ, θ, f and v respectively, and taking r as the radius when the head of the robot is regarded as a sphere, the IPD is expressed by

Δφ(θ) = (2πf / v) · r (θ + sin θ)   (1)

On the other hand, from the spectra obtained by the FFT (fast Fourier transform), the IPD Δφ and the IID Δρ of each sub-band are computed by

Δφ = arctan( Im[Sp_l] / Re[Sp_l] ) − arctan( Im[Sp_r] / Re[Sp_r] )   (2)

Δρ = 20 log10 ( |Sp_l| / |Sp_r| )   (3)

where Sp_l and Sp_r are the spectra obtained at a given instant from the left and right microphones 16a, 16b.
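A minimal sketch of equations (1) to (3); the head radius and the speed of sound are assumed values, since the patent text does not state them explicitly.

```python
import numpy as np

V_SOUND = 340.0      # speed of sound v [m/s]; assumed value
HEAD_R = 0.09        # head radius r [m]; assumed value

def ipd_epipolar(theta_rad, freq_hz):
    """Theoretical IPD of equation (1) from the auditory epipolar geometry."""
    return 2.0 * np.pi * freq_hz / V_SOUND * HEAD_R * (theta_rad + np.sin(theta_rad))

def measured_ipd_iid(sp_l, sp_r):
    """Measured IPD (equation 2) and IID (equation 3) for every sub-band of
    the left/right spectra Sp_l and Sp_r."""
    ipd = np.arctan2(sp_l.imag, sp_l.real) - np.arctan2(sp_r.imag, sp_r.real)
    iid = 20.0 * np.log10((np.abs(sp_l) + 1e-12) / (np.abs(sp_r) + 1e-12))
    return ipd, iid
```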

Furthermore, the active direction-pass filter 23a selects, from the stream direction θs and according to the pass-band function indicated by symbol X7, the pass band δ(θs) corresponding to θs. As shown at X7 in Fig. 5, the pass-band function takes its minimum in the frontal direction of the robot (θ = 0 degrees) and becomes larger toward the sides; this reflects the auditory property that localization sensitivity is highest in the frontal direction and decreases as the left or right angle increases. The frontal region of maximum localization sensitivity corresponds to the fovea visible in the structure of the mammalian eye and is therefore called the auditory fovea; in humans, frontal localization is the most accurate, while near ±90 degrees the accuracy is only about ±8 degrees.

The active direction-pass filter 23a uses the selected pass band to extract the acoustic signal lying in the range from θL to θH, where θH = θs + δ(θs) and θL = θs − δ(θs). As indicated by symbol X5, the filter uses the stream direction in the head-related transfer function (HRTF) to estimate the theoretical IPD and IID at θL and θH, that is, the bounds of the direction to be extracted. Then, as indicated by symbol X6, the filter compares the IPD (= Δφ_E(θ)) and IID (= Δρ_E(θ)) computed for each sub-band from the auditory epipolar geometry, and the values (= Δφ_H(θ)) and (= Δρ_H(θ)) obtained from the HRTF, with the measured values, and collects, within the pass band δ(θ) from the angle θL to the angle θH, the sub-bands whose extracted IPD (Δφ) and IID (Δρ) satisfy the following condition:

Δφ(θL) ≤ Δφ ≤ Δφ(θH)  for f < f_th,   Δρ(θL) ≤ Δρ ≤ Δρ(θH)  for f ≥ f_th   (4)

Here the IPD and the IID are used as the criteria for filtering, and f_th is the threshold frequency that decides which of the two is used; it represents the upper limit of the frequencies for which localization by the IPD is effective. The frequency f_th depends on the distance between the microphones of the robot 10 and is, in the present embodiment, about 1500 Hz.
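A sketch of the pass-band selection and of collection condition (4); the concrete widths used for the pass-band function are illustrative assumptions, and only the 1500 Hz threshold is taken from the embodiment.

```python
import numpy as np

F_TH = 1500.0        # threshold frequency f_th of the embodiment (about 1.5 kHz)

def pass_band(theta_s_deg, delta_front=10.0, delta_side=30.0):
    """Pass-band function delta(theta_s): narrowest at the front, wider toward
    the sides; the two widths are assumed values, not taken from the patent."""
    return delta_front + (delta_side - delta_front) * abs(theta_s_deg) / 90.0

def collect_subbands(freqs, ipd, iid, ipd_lo, ipd_hi, iid_lo, iid_hi):
    """Condition (4): keep a sub-band if its IPD lies between the theoretical
    IPDs at theta_L and theta_H (f < f_th), or if its IID lies between the
    theoretical IIDs at theta_L and theta_H (f >= f_th)."""
    low = (freqs < F_TH) & (ipd >= ipd_lo) & (ipd <= ipd_hi)
    high = (freqs >= F_TH) & (iid >= iid_lo) & (iid <= iid_hi)
    return low | high
```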

This means that a sub-band is collected when, for frequencies below the predetermined frequency f_th, its IPD lies within the range of the theoretical IPDs corresponding to the pass band δ(θ), and when, for frequencies at or above f_th, its IID lies within the corresponding range of theoretical IIDs. In general the influence of the IPD is dominant in the low-frequency band and that of the IID in the high-frequency band, and the threshold frequency between them depends on the distance between the microphones.

The active direction-pass filter 23a then resynthesizes the acoustic signal from the collected sub-bands and reconstructs it as a waveform: as indicated by symbol X8 it builds the set of pass sub-bands, as indicated by symbol X9 it filters each sub-band, and, as indicated by symbol X10, it applies the inverse FFT (IFFT, inverse fast Fourier transform) and, as indicated by symbol X11, extracts the separated sound (acoustic signal) of each source lying in that range.
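The resynthesis steps X8 to X11 can be approximated by masking the spectra and applying the inverse FFT, as in the following sketch; this is a simplification of the per-sub-band filtering described above.

```python
import numpy as np

def resynthesize(sp_l, sp_r, keep_mask):
    """Rebuild the waveform of one source: keep only the collected sub-bands,
    zero the rest, and apply the inverse FFT to each channel."""
    wave_l = np.fft.irfft(np.where(keep_mask, sp_l, 0.0))
    wave_r = np.fft.irfft(np.where(keep_mask, sp_r, 0.0))
    return wave_l, wave_r
```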
As shown in Fig. 5, the speech recognition section 27 consists of a self-voice control section 27a and an automatic recognition section 27b. The self-voice control section 27a removes, from the acoustic signals that have been localized and separated by the audition module 20, the sound emitted by the loudspeaker of the dialogue section 28 described later, so that only the acoustic signals coming from outside are passed on. As shown in Fig. 6, the automatic recognition section 27b consists of a speech recognition engine 27c, acoustic models 27d and a selector 27e; as the speech recognition engine 27c, for example, the speech recognition engine "Julian" developed at Kyoto University can be used, whereby the vocabulary uttered by each speaker can be recognized.

In Fig. 6, the automatic recognition section 27b is configured to recognize three speakers, for example two males (speakers A and C) and one female (speaker B). Accordingly, an acoustic model 27d is prepared for each speaker and for each direction; in the case of Fig. 6 the acoustic models 27d combine the speech uttered by each of the three speakers A, B and C with its direction, and there are a plurality of them, in this case nine.

The speech recognition engine 27c executes nine speech recognition processes in parallel, using the nine acoustic models 27d; that is, it applies the nine models to the acoustic signals input in parallel and outputs the recognition results to the selector 27e. The selector 27e consolidates the recognition results from all the acoustic models 27d, judges the most reliable recognition result, for example by majority decision, and outputs it.

Here, the word recognition rate of the acoustic model 27d of a specific speaker was examined as follows. First, in a 3 m × 3 m room, three loudspeakers were placed at positions 1 m from the robot 10, in the directions of 0 degrees and ±60 degrees. Next, 150 words such as colors, numbers and foods, uttered by two males and one female, were output from the loudspeakers and recorded with the microphones 16a, 16b of the robot 10 as speech data for the acoustic models. Each word was recorded in three patterns: sound emitted from one loudspeaker only, from two loudspeakers simultaneously, and from three loudspeakers simultaneously. The recorded speech signals were then separated by the active direction-pass filter 23a, the individual speech data were extracted and sorted by speaker and direction, and training sets for the acoustic models were prepared.
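The sorting of the filtered recordings into per-speaker, per-direction training sets can be written, for illustration, as follows; the tuple layout of the input is an assumption made only for the example.

```python
from collections import defaultdict

def build_training_sets(recordings):
    """Sort direction-pass-filtered recordings into one training set per
    (speaker, direction) pair, ready for acoustic-model training.

    `recordings` is an iterable of (speaker, direction, waveform) tuples.
    """
    sets = defaultdict(list)
    for speaker, direction, waveform in recordings:
        sets[(speaker, direction)].append(waveform)
    return sets   # nine sets for 3 speakers x 3 directions
```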
Next, in each acoustic model 27d, a triphone is used, and a HTK (Hidden Marcov Mode and Hidden Marcov Model) tool kit 27f is used for each training set to In each direction, a total of 9 types of speech recognition materials are produced. Using the voice data for the acoustic model obtained in the above manner, the vocabulary recognition rate of the acoustic model 27d of a specific speaker is checked through experiments'. The results are shown in FIG. 7. Figure 7 shows the direction on the horizontal axis and the vertical axis 26 314412 200411627: a diagram showing the recognition rate of the capsule, the symbol P represents the sound of the person (speaker A), and the symbol Q represents the other person (speaker B, c) the sound of. In the acoustic model of Talking A, when Speaker A is located on the front side of the robot 10 (pictured), "80% or more vocabulary recognition rate" can be formed on the front side (0 degrees), and Speaker A is located 60 degrees to the right or At 60 degrees on the left, as shown in Figure B or Figure 7 (C), respectively, it can be seen that compared with the speaker's @@, the difference in direction caused by the difference in direction is less, especially Recognition rate of vocabulary above 80%. In considering the result, when the speech recognition is performed, the direction of the sound source is a known point B, and the selectors 27e Α, 隹 一 p 人 甲 / e are used to integrate the cost function V provided by the following formula (5) (pJ. Coffee) Tun 4, Yang ^ +? Like). Like) _like)) Coffee) ^ d) = \ l if K ^ d) ^ Kcs (PeM ,, 1. if ^ s (piCi) ^ Kes ^ Peidj (5) Here, 'V (p, d), Res (p, phantom are defined as the vocabulary recognition rate when using the acoustic model of speaker p and direction d and the recognition of the input voice,'. Fruit, And set de as the sound source direction of real time tracking, and set pe as the speaker to be evaluated. The above Pv (pe, de) is the probability generated by the face recognition module, and the face cannot be recognized It is generally set to 10. Next, the selector 27e rotates the speaker Pe with the largest cost function v (Pe) and the recognition result Res (p, d). At this time, since the selector 27e refers to From the facial module 30, the method of facial event 39 obtained by face recognition 'can be used to specify the speaker, so the robustness of speech recognition can be improved. 3) 44) 2 27 20041 1627 Among them, when the second largest value of the cost function v (Pe) is determined by the speech recognition value being less than 1.0 or close to the candidate, it can be determined that it cannot be defeated or cannot be collected in a pair. Ding Yiyin recognized it, and typed its content &lt; &lt; ^ 诂 诂. 卩 28. The above-mentioned pair of masters, and the speech synthesis unit 2 flutter and cry 28 ° 28 by the dialogue control unit 2Sa… by the later described Combining with heart 6 (^). The speech control result T system of the speech recognition unit 27 of the dialogue control unit is based on the language

Res(p, d) ^ rff) Μ 4 ^ &amp; '、即矾話者Pe及辨認結果 VP? }而產生作為對象之砖每土 至語音合成部28b。上述狂立^之語音資料,且輸出 控制部叫之語音資料來驅;;揚㈣話 資料相對應之語音。茲此方彳Z 8e而發出與語音 音辨認部27之二辨話部28係根據來自語 ^ 士 口曰辨^忍結果,例如說話者A說中「】 作為吾歡的數字時,機器人 」 ..^ 對者该說話者A之妝 4下,可對該說話者A發出「 甘士 八名^兄了 1』」之聲音。 ,五立1對話部28在當輸出無法從語音辨認部27辨贫 °°曰内各時,機器人10會在正對該說話者A之狀離下, 對該說話者A質問「你『是2?是 心, 的回答再度進行語音辨認。此時,由於機器人10二: 况话2 A,因此可更進一步提高語音辨認的精確度。者 。藉此方 &lt;,聽覺模組20根據來自麥克風工6的音響传 況话者且抽出其聽覺事件,透過網際網路對处人 、#&quot;、、, v。。棋組 5〇 進仃迗讯,而且,對各說話者進行語音辨認,由對話部 根據語音而對說話者確認語音辨認結果。 314412 28 200411627 在此,實際上,由於音源向為時 因此為了持續姑山4士 — ☆ 才間t的函數, 性,不、,如 源,必須考慮時間方向之連續 ^ 所述,係根據即時追蹤所得 而得到音源方向。藉此方十 士 芝貝枓抓0 s, 稱事件為資料… 即%追蹤表達了考慮所有 冉爭件為貝抖流之時間性流動之表現, 數個音源’或即使在音源或機器人本身移動广t!複 公-個_貝料流的方式,可以連續 :: 向資訊。再者,次# &amp; + 目粉疋音源的方 冉者,貝4流亦使用於統合視 可藉由參考臉部事件,且見之事件,因此 方式,來提古立馮a 〜見 以進行音源定位的 木杈阿音源定位的精確度。 、。上述臉部模組30係由作為裝置層之卜Res (p, d) ^ rff) M 4 ^ &amp; ', that is, the speaker Pe and the recognition result VP?}, And the target bricks are generated to the speech synthesis unit 28b. The voice data of the above-mentioned mad stand ^ is output and driven by the voice data called by the control department; the voice corresponding to the Yangli dialect data. Hereby, Fang Z 8e is sent out and the speech recognition unit 27 bis. The speech unit 28 is based on the result of speech recognition. For example, the speaker A says "[] as the number of my favorite robot." .. ^ If the speaker A has 4 make-ups, the speaker A can make a sound of "Ganshi ^^ Brother 1". When the output of the Wuli 1 dialogue unit 28 cannot be identified from the speech recognition unit 27, the robot 10 will leave the speaker A and ask "Are you" yes "to the speaker A? 2? Yes, the answer is again speech recognition. At this time, because the robot 10: 2 and 2 A, the accuracy of speech recognition can be further improved. By this way, the auditory module 20 according to The microphone worker 6 listens to the speaker and extracts his auditory events, and uses the Internet to deal with people, # &quot; ,,, .... The chess group 50 enters the message, and each speaker is identified by voice The dialogue department confirms the speech recognition result to the speaker based on the speech. 314412 28 200411627 Here, in fact, because the sound source is lasting, in order to continue Gushan 4 shi — ☆ function of Caima t, sex, not, such as The source must be considered in the continuous direction of time. As mentioned above, the source direction is obtained based on the real-time tracking. In this way, Fang Shishizhi captures 0 s and calls the event data ... That is,% tracking expresses the consideration of all Ran content as Temporal flow The performance, several sound sources' or even the sound source or the robot itself can be moved in a wide way! Fu Gong-a _ shell material stream can be continuous :: Information. Moreover, times # &amp; + Fang Ran of the source In addition, Bei 4 stream is also used in the integrated vision can refer to the facial events, and see the events, so the way to mention Gu Li Feng a ~ See the sound source positioning accuracy of the wooden branch A.. The above facial module 30 is used as a device layer.
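Formula (5) is not fully legible in this copy, so the following sketch is only a hedged paraphrase of the selection rule described around it: every (speaker, direction) recognition process returns a word hypothesis; a candidate speaker is scored by the face-module probability Pv times the summed word recognition rates R(p, d) of the models that agree with that candidate's hypothesis; the highest score wins; and a weak or tied winner is handed back to the dialogue part for a confirming question.

```python
def select_result(results, rates, face_prob, d_e, threshold=1.0):
    """Integrate per-model recognition results (hedged reading of formula (5)).

    results   : {(speaker, direction): recognized word}
    rates     : {(speaker, direction): word recognition rate R(p, d)}
    face_prob : {speaker: Pv(speaker, d_e)}, 1.0 when no face was recognized
    d_e       : sound-source direction from real-time tracking
    """
    scores = {}
    for p_e in {p for p, _ in results}:
        hypothesis = results.get((p_e, d_e))
        if hypothesis is None:
            continue
        # sum the recognition rates of all models whose result agrees with p_e's
        agreement = sum(r for (p, d), r in rates.items()
                        if results.get((p, d)) == hypothesis)
        scores[p_e] = face_prob.get(p_e, 1.0) * agreement
    if not scores:
        return None                      # nothing usable: the dialogue part re-asks
    best = max(scores, key=scores.get)
    ranked = sorted(scores.values(), reverse=True)
    ambiguous = scores[best] < threshold or (
        len(ranked) > 1 and ranked[0] - ranked[1] < 1e-6)
    if ambiguous:
        return None                      # ambiguous: confirm by dialogue instead
    return best, results[(best, d_e)]
```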

Ml'臉部識別部32、臉 作為4寸徵層(資料)之臉部ID34、臉部 事件層之臉部事件產生冑36所構成。藉此3方5式以^乍為 組30係根據來自攝影機 ;’釦部換 根據例如膚色抽出而對各說話者二臉部發現部” 別部32根據事先登記的臉部資 =測’在臉部識 同的臉部時,即決定其臉部_且、隹進=檢索,當有相 時,由臉部定位部33決定 、仃硪別該臉部的同 决疋(疋位)該臉部方向35。 在此,臉部模組30係在臉部發現 找到複數㈣料,㈣各 R據畫像信號 定位及追縱。此時,由臉部發現部;二, 方向及明亮度往往會發生變化,因此部大小、 行臉部區域檢測,根據由膚色抽出及相二ί; 31將進 關肩异所得之型樣 3J4432 29 200411627 匹配(pattern matching)之組合,在2〇〇m秒之内可正確檢 測出複數個臉部。 、恥&quot;卩疋位部3 3係將二次元畫像平面之臉部位置變換 為二次元空間,再以方位角0、高度0及距離!·之組合而 '寻到一 _人元空間之臉部位置。然後,臉部模組3 0係由臉 邛事件產生部3 6對每個臉部從臉部ID(姓名)34及臉部方 向35來產生臉部事件39 ,且透過網際網路而對結合模組 5 0發出訊息。 、 '、 …叩由丨F句衣罝之攝影機1 5, 、為過耘層之視差晝像產生部3 7 a、目標抽出部3 7 b,作 為特徵層(資料)之目標方向37e ’以及作為事件層之立體 :件產生# 37d所構成。藉此方式’立體模組37係根據 :自攝㈣15之晝像信號,而利用視差晝像產生部W 由雙方之攝影们5之晝像信號來產生視差書像。接著, 由目標抽出部37b對視差晝像進行區域分m士果導 致,當發現縱向長形物體,則目栌 則Η I抽出部37b即將其抽出 =為候補人物,且決定(定位)其目標方向37c。 產生部爾根據目標方向37c產生立體事件I且透 過網際網路而對結合模組5 〇發出訊幸 上述電動機控制模組40係由 及電位計(potenti⑽eter)42,作為過程衣/之電動機41 width modulation,脈寬調變)控制電路^、之刚咖1^ digita卜類比數位)轉換電路44及 (nalog t〇^ 為資料之特徵層之機器人方❾46$力機控制部45,作為 ’以及作為事件層之電 3]4412 30 200411627 動機事件產生部47 ^ -組40中,Φ 成。藉此方式,於電動;d r Y书動機控制部4 电動機控制模 -容後述)之指令,透過 / Κ據來自注意控制模組57(請 行驅動控制。而且,由控制電路43對電動機4】進 置。該檢树結果係透過二二檢測電動機4】的旋轉位 制部45。接著,電動機㈣ 44傳送給電動機控 接收到的信號而抽出機 :豕由仙轉換電路44 係根據機器人方向:二向46。電動機事件產生部47 電動機事件48, 1由電動機方向資訊所構成之 息。 …網際網路而對結合模組5。發出訊 上述結合模組50係對於已於 部模組30、立體模组37干、述之聽覺模組20、臉 版棋組37、電動機控制模 位於上位,而構成作為各模M 2G、3()、、37、4f)層式定 之上位的資料流層。具 之事件層 使來自聽覺模&quot;、臉部模組3:、二= 機控制模'组40之非 “…7以及電動 &amp;臉部事件39、立_事#39 /步,亦即使聽覺事件 產生辦覺資 :+ &amp;及電動機事件心同步而 二 臉部資料流54、立體視覺資料流55 之絕對座標轉換部52 ;盥夂資料冷 生結合資料流59,或是將I: I:54、55相關連產 飞疋將與上述貪料流53、54、55之關 加以解除之關連# 56;注意控制模組η以及觀察器 (viewei,)58o 上述絕對座標轉換部52係將來自聽覺模組2〇之聽覺 事件29、來自臉部模組3〇之臉部事件39、來自立體模组 200411627 37之立體事件39a、以及來自 機事件48加以同步,同時 機控制模組40之電動 39以及立體事件39a,藉由已同:見事件29、臉部事件 用將其座標系轉換為絕對練/步之電動機事件48,利 臉部資料流…立體視覺:而^生聽覺資料流53、 座標轉換_ 52係藉由連㈣同」I此時,上述絕對 臉部資料流以及立體視覺資料流^式^覺資料流、 流53、臉部資料流54以及立體視覺_ =生聽覺資料 而且,關連部56係根據聽覺資 54以及立體視覺資料流55,考廣上'二:4 3、臉部資料流 的時門丨脂皮十t 心上迷貝料流53、54、55 的日守間順序來與資料流產 55 產生結合資料&amp; μ 或疋將該關連解除,以 ϋ貝抖流59,同時,相反地, 之聽覺資料a q ^ 構成結合資料流59 見貝枓* 53、臉部資料流54以 之連結關俜错明 B1 立視覺貧料流55 即仏父弱,則將關係予以解除。 為目標的說話者移 猎此方式,即便成 ^ 力守’亦可預測該說話去Μ 4 成為其移動範圍之角* 私動,若為 53、54、55 :方:則可藉由產生上述資料流 而追蹤。 ’而對6亥5兄活者之移動進行預測並進 同日寸,左意控制模組57係為了查 模組40之驅動電 旦進仃笔動機控制 以結合資料二 制而進行注意控制者,此時,係 ' L 聽覺資料流53、臉部f _ 視覺資料&amp; 55之順序样券夫去十卩貝枓W 54及立體 控制模《且57户柄缺、&amp;先'考,來進行注意控制。注音 …且57係根據聽覺資料流53、 - 立體視覺資料、、亡v 鉍邛貝枓流54以及 &quot;L之狀態以及結合模組59之存在與否, 314412 32 200411627 末進為人】0之動作計 .4】,則透過網際網路㈣有而要動作驅動電動機 指令之電動機事件ST動機控制模組4。傳送作為動作 注意控制係根據連續 且57之 戶、f生及觸發性(iri 孩 保持相同狀態,藉由 )糟由連績性來 妒,1广 對最具興趣的對象進行、έ 如,延擇應注意之資料流,以進行追縱。藉^進仃追 控制模組5 7進行&amp; I 式,&gt;主意 動進行電動機㈣ :=枝4 1之控制計晝⑻_ing),根 機指令64a,Η、#、JEL y —座玍包動 40。夢此方气才際網路而傳送至電動機控制模組 二 電動機控制模組4。中,根據該電動機 :二3 ’由電動機控制部45進行PWM控制,使驅動電 動機^行旋轉驅動,而使機器人1G朝向預定方向^ 硯察器58係指將藉此方式產生之各資料流Μ、”、 55、57於伺服器之晝面上顯示者,具體而言,雷達圖(⑽r chart)58a係顯示其瞬間之資料流的狀態,且可更詳細地 顯不攝影機之視野角及音源方向,資料流圖⑽則係顯示 結合資料流(粗線圖示)與聽覺資料流、臉部資料流及立體 視覺資料流(細線圖示)。 ^ 本發明實施形態之人型機器人1〇之構成如上所述, 而其動作則如下所述。 首先’距離機器人10的前方im處,分別於左斜方(Θ 460度)、正面(Θ4度)、及右斜方(θι6〇度)的方向排 列說話者,機器人1 0藉由對話部28對三位說話者提出問 題,由各說話者同時回答問題。機器人1〇由麥克風16收Ml 'face recognition section 32, the face is composed of the face ID34 of the 4-inch sign layer (data), and the face event generation face 36 of the face event layer. In this way, the 3 formulas and 5 styles are taken from the camera according to the group 30; 'the buckle part is changed according to, for example, the skin color extraction, and the two face detection parts of each speaker are obtained.' The other part 32 is based on the previously registered face information = test. When the face is recognized by the face, its face is determined. And, 隹 Jin = Search. 
When there is a photo, it is determined by the face positioning unit 33. Face direction 35. Here, the face module 30 finds a plurality of materials on the face, and each R locates and tracks according to the image signal. At this time, the face is found by the face; two, the direction and brightness are often Changes will occur, so the size of the part and the detection of the face area will be based on the combination of skin color extraction and phase difference; 31 will be different from the shape of the shoulder 3J4432 29 200411627 pattern matching in 200m seconds Multiple faces can be correctly detected within. The shame &quot; bit position 3 3 transforms the face position of the two-dimensional image plane into a two-dimensional space, and then uses azimuth 0, height 0, and distance! Combination to 'find a face position in one_renyuan space. Then, the face module 3 0 is composed of faces The event generation unit 36 generates a face event 39 for each face from the face ID (name) 34 and the face direction 35, and sends a message to the combination module 50 through the Internet., ', ... 叩From F 丨 Yiyi ’s camera 15, the parallax day image generation section 37a, which is the over-layer, the target extraction section 37b, the target direction 37e 'as the feature layer (data), and the stereo as the event layer : Piece generation # 37d. In this way, the “stereoscopic module 37” is based on: daylight image signal of self-shot ㈣15, and the parallax daylight image generation unit W is used to generate parallax book image from the daylight image signals of both photographers 5 Next, the target extraction unit 37b performs regional division of the parallax day image. When a longitudinally elongated object is found, the target is Η I extraction unit 37b is about to extract it = as a candidate, and determines (locates) it. The target direction 37c. The generating unit generates a three-dimensional event I according to the target direction 37c and sends a message to the combination module 5 through the Internet. Fortunately, the above-mentioned motor control module 40 is composed of a potentiometer (potenti 作为 eter) 42 as a process garment / Motor 41 width modulation, pulse Modulation) control circuit ^, ganga 1 ^ digita analogy digits) conversion circuit 44 and (nalog t〇 ^ for the robot's feature layer of data ❾ 46 $ force machine control section 45, as' and as the event layer of electricity 3] 4412 30 200411627 Motivation event generating unit 47 ^-Group 40, Φ Cheng. In this way, in the electric; dr Y book motive control unit 4 motor control module-described later) instructions, through / K data from the attention Control module 57 (Please drive control. The control circuit 43 supplies the motor 4]. This tree inspection result is transmitted through the rotation control unit 45 of the second detection motor 4]. Next, the motor ㈣ 44 transmits the signal received by the motor control to extract the machine: 豕 by the cent conversion circuit 44 is based on the robot direction: two-way 46. Motor event generation unit 47 Motor event 48, 1 is composed of motor direction information. … Internet and the combination module 5. The above-mentioned combination module 50 is for the module 30, the three-dimensional module 37, the auditory module 20, the face chess group 37, and the motor control module located at the upper position, and constitutes each module M 2G, 3 () ,, 37, 4f) layered data stream layer. 
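A small sketch of the two-dimensional to three-dimensional conversion performed by the face localization part: the detected face centre and size in the image are mapped to azimuth, elevation and distance under an assumed pinhole camera. The field-of-view and face-width numbers are placeholders, not values from the disclosure.

```python
import math

def localize_face(cx, cy, face_px, img_w, img_h,
                  hfov_deg=60.0, face_width_m=0.15):
    """Convert an image-plane face detection to (azimuth, elevation, distance).

    cx, cy       : face centre in pixels
    face_px      : detected face width in pixels
    hfov_deg     : assumed horizontal field of view of the camera
    face_width_m : assumed physical face width (placeholder value)
    """
    f_px = (img_w / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)  # focal length in pixels
    azimuth = math.degrees(math.atan2(cx - img_w / 2.0, f_px))
    elevation = math.degrees(math.atan2(img_h / 2.0 - cy, f_px))
    distance = face_width_m * f_px / max(face_px, 1)               # similar triangles
    return azimuth, elevation, distance
```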
The event layer makes the non- "7 from the auditory module", face module 3 :, two = machine control module 'group 40, and the electric &amp; facial event 39, 立 _ 事 # 39 / step, even if Auditory event generating resources: + &amp; and motor event heart synchronization and absolute face conversion section 52 of the two-face data stream 54 and the stereo vision data stream 55; the toilet data is combined with the data stream 59, or I: I: 54, 55 related production Feiyu will release the relationship with the above-mentioned material flow 53, 54, 55 # 56; Note the control module η and the viewer (viewei,) 58o The above-mentioned absolute coordinate conversion unit 52 Synchronize the auditory event 29 from the hearing module 20, the facial event 39 from the face module 30, the stereo event 39a from the stereo module 200411627 37, and the machine event 48 to synchronize, and the machine control module 40 The electric 39 and the stereo event 39a are the same as: See event 29, the facial event is used to convert its coordinate system to an absolute exercise / step motor event 48, which facilitates the facial data stream ... stereo vision: and generates auditory data Stream 53, coordinate conversion _ 52 is the same by flirting "At this time, the above absolute Data stream and stereo vision data stream ^ sensation data stream, stream 53, facial data stream 54 and stereo vision _ = raw auditory data, and the related department 56 is based on auditory resources 54 and stereo vision data stream 55.上 '二: 4 3. The time gate of the facial data flow 丨 Fat skin ten t The day-to-day sequence of the convulsive material flow 53,54,55 on the heart to generate the combined data with the data abortion 55 & or The association is released, and the trembling stream 59 is connected. At the same time, the auditory data aq ^ constitutes the combined data stream 59. See 枓 * 53, the facial data stream 54 is connected to the connection error B1, and the visual lean stream 55 is established. That is, if the father is weak, the relationship will be lifted. The target speaker moves this way, and even if it is ^ LiShou ', it can predict that the speech will go to M 4 to become the corner of its range of movement * Private movement, if it is 53, 54, 55: Fang: then you can generate the above Data stream. 'And to predict the movement of the brothers and sisters in the same day, the left-hand control module 57 is a controller that pays attention to the control of the driving force of the module 40 and combines the two systems of data. At the time, the sequence of 'L auditory data stream 53, face f_visual data &amp; 55's order sample couple went to Shibabei W54 and three-dimensional control mode "and 57 household handle missing, &amp; first' test Pay attention to control. Phonetic… And 57 is based on the auditory data stream 53,-stereo vision data, the state of the dysprosium bismuth stream 54 and the "L" and the presence or absence of the combination module 59, 314412 32 200411627 the last person] [Motion meter.4], the motor event ST motive control module 4 is required to drive the motor command through the Internet. Teleportation as action attention control is based on continuous and 57 households, births, and triggers (iri children remain in the same state, by), jealousy of continuous performance, and Guangxi carries out, for example, delays the most interested objects. Select the data stream that you should pay attention to for tracking. 
Born into the chase control module 5 7 to perform &amp; type I, &gt; the idea to move the motor ㈣: = the control plan of the branch 4 1 ⑻ ing), the root machine instruction 64a, Η, #, JEL y — seat bag Move 40. The dream is sent to the motor control module 2 and the motor control module 4 via the Internet. According to the motor: 2 '3' PWM control is performed by the motor control unit 45, the driving motor ^ is rotated and driven, and the robot 1G is directed to a predetermined direction ^ The inspector 58 refers to each data stream M to be generated in this way. "", 55, 57 are displayed on the daytime surface of the server. Specifically, chartr chart 58a shows the status of its instantaneous data stream, and it can display the camera's field of view and sound source in more detail. Orientation, the data flow diagram is a combination of data flow (thick line icon) and auditory data flow, facial data flow, and stereo vision data flow (thin line icon). ^ The humanoid robot of the embodiment of the present invention 10 The structure is as described above, and its operation is as follows. First, the distance from the front im of the robot 10 to the left oblique (Θ 460 degrees), the front (Θ 4 degrees), and the right oblique (θι60 degrees) are respectively. The speakers are arranged in a direction, and the robot 10 asks three speakers through the dialogue unit 28, and each speaker answers the questions at the same time. The robot 10 is received by the microphone 16.
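A compressed sketch of the association and attention layer just described, under the assumption of simple interfaces: each stream's azimuth is smoothed and predicted with a bare constant-velocity Kalman filter, attention control picks the stream to follow in the stated priority order (association stream, then auditory, face and stereo streams), and the difference between the chosen azimuth and the current robot direction becomes the motor command handed to the PWM loop.

```python
class AzimuthKalman:
    """1-D constant-velocity Kalman filter over a stream's azimuth (illustrative)."""
    def __init__(self, az, q=1.0, r=4.0):
        self.x = [az, 0.0]                      # azimuth (deg), azimuth rate (deg/s)
        self.p = [[10.0, 0.0], [0.0, 10.0]]     # state covariance
        self.q, self.r = q, r                   # process / measurement noise

    def step(self, dt, z=None):
        az, v = self.x
        az = az + v * dt                                           # predict
        p = [[self.p[0][0] + dt * (self.p[1][0] + self.p[0][1]) + dt * dt * self.p[1][1] + self.q,
              self.p[0][1] + dt * self.p[1][1]],
             [self.p[1][0] + dt * self.p[1][1], self.p[1][1] + self.q]]
        if z is not None:                                          # update with a new event
            k0 = p[0][0] / (p[0][0] + self.r)
            k1 = p[1][0] / (p[0][0] + self.r)
            az, v = az + k0 * (z - az), v + k1 * (z - az)
            p = [[(1 - k0) * p[0][0], (1 - k0) * p[0][1]],
                 [p[1][0] - k1 * p[0][0], p[1][1] - k1 * p[0][1]]]
        self.x, self.p = [az, v], p
        return az

PRIORITY = ["association", "auditory", "face", "stereo"]

def attention_target(streams):
    """Pick the stream to attend to; streams = {kind: predicted azimuth or None}."""
    for kind in PRIORITY:
        if streams.get(kind) is not None:
            return streams[kind]
    return None

def motor_command(target_az, robot_az):
    """Relative turn (deg) passed on to the motor control module's PWM control."""
    return ((target_az - robot_az + 180.0) % 360.0) - 180.0
```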

JJ 314412 200411627 集該說話者的声&amp;立 ^ ^ , 耳曰,而由聽覺模組2 0產生伴隨音源方 之聽覺搴彳φ。π ° 干29,且透過網際網路傳送至結合模組5〇。获 此方式,社人# ζ 一 曰 。拉組50係根據該聽覺事件29而產生聽覺資 败。F模組30讀取藉由攝影機15而得之疖兮本 之臉部金伤 η兄居者 旦象,以產生臉部事件39,藉由臉部資料JJ 314412 200411627 gathers the speaker's voice &amp; ^ ^, ear, and the auditory module 20 generates auditory 搴 彳 φ accompanying the sound source. π ° dry 29, and transmitted to the combination module 50 through the Internet. To get this way, sheren # ζ one said. The pull group 50 generates auditory failure based on the auditory event 29. The F module 30 reads the golden face of the face obtained by the camera 15 and the resident's image, to generate a facial event 39, and uses the facial data
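The wire format of the events exchanged with the association module is not given in the text, so the following dataclasses are only a hypothetical shape for the auditory and face events (a timestamp for synchronization plus a direction and an identity or pitch payload), together with the trivial time-window grouping the association module needs before forming streams.

```python
from dataclasses import dataclass

@dataclass
class AuditoryEvent:          # produced by the audition module (event 29)
    t: float                  # capture time, used for synchronization
    azimuth: float            # sound-source direction (deg, robot frame)
    pitch_hz: float           # fundamental of the extracted harmonic structure

@dataclass
class FaceEvent:              # produced by the face module (event 39)
    t: float
    azimuth: float
    face_id: str              # name looked up in the face database, "" if unknown

def synchronize(events, window_s=0.1):
    """Group events whose timestamps fall in the same window (illustrative)."""
    groups = {}
    for ev in events:
        groups.setdefault(round(ev.t / window_s), []).append(ev)
    return [groups[k] for k in sorted(groups)]
```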

该說話者臉部;隹/一 4 、, ' 订仏索並進行臉部識別,同時,透過 網路7 0將作兔μ、+ 示 # 作為上述結果之臉部ID24及晝像傳送至处人 杈組 50。J: φ,木 # _ 、、° 口 ”中s该祝話者之臉部並未登記在臉部資料 寸部杈組3 0會透過網際網路將其内容士 合模組50。因此,钍八#。 此、合杈組50係根據上述之聽覺事件29、 知部事件3 9、立舻宣 立脰事件39a,而產生結合資料流π。 A,在利此用二覺模組2〇係藉由主動方向通過型渡波器 者?用由聽覺核線幾何所得之IPD來進行各音源_ 接著#^ 位及分離’以取出分離音(音響信號)。 要者’聽覺板組20係葬由甘π &amp; 引擎27 由” 一曰辨認部29使用語音辨認 C,辨涊各說話者X、Y、 五立, 輪出至對話部28。夢 、曰 將其結果 八^ 方式,對話部28係在機器人1〇 刀別正對說話者之狀態下, .^ ^出在語音辨認部27所冶分 之語音辨認所得之前述 τ , 〇 其中,語音辨認部27益氺 正確進行語音辨認時,機哭 7無法 蚀—人10在正對該說話者The speaker's face; 隹 / 一 4,, 'Order and perform face recognition, and at the same time, send the rabbit μ, + 示 # as the result of the face ID24 and the day image to the Internet through 70 People fork group 50. J : φ , 木 # _ ,, ° 口 ”s The face of the speaker is not registered in the face data. The branch group 30 will integrate its content into the module 50 through the Internet. Therefore,钍 八 #. Therefore, the combination group 50 is based on the above-mentioned auditory event 29, the Ministry of Knowledge event 39, and the stand-up announcement event 39a to generate a combined data stream π. A, use the two-sense module 2 here. 〇Which pass through the active direction through the type of wave waver? Use the IPD obtained from the geometry of the auditory epipolar line to perform each sound source_ Then # ^ position and separation 'to take out the separation sound (sound signal). The funeral engine π &amp; engine 27 is identified by the "one" recognition unit 29 using voice to identify C, each speaker X, Y, and five standings, and turns to the dialogue unit 28. Dream, say the results in eight ways. The dialogue unit 28 is in the state where the robot 10 is facing the speaker,. ^ ^ Is the aforementioned τ obtained by the voice recognition by the voice recognition unit 27. When the speech recognition unit 27 is performing speech recognition correctly, the machine crying 7 cannot be eclipsed—person 10 is facing the speaker

下,再度重覆提出問題,根櫨复 ^ 考之狀L 根據其回答再度進行語音辨認。 稭此方式,根據本發明者 舻妯丄士 戶、知形態之人型機器人10, 根據由聽覺模組20進行音 仃曰源疋位·音源分離之分離音(音 314412 34 200411627 :11 \由、4音辨認部27使用與各說話者及方向相對應 曰響杈型進行語音辨認的方式,可對同時 &quot; 說話者的語音進行語音辨認。地—之複數位 價。來對語音辨認部27的動作進行評 〜户, ,” 8圖所示,在距離機器人10前 處,分別於左斜方(θ=+60声)、下而Μ 引 右斜方(0=-60声)沾士 Α 又)正面(θ=0度)、及 在實驗中,分別^ 說話者Χ、Υ、Ζ。其中, ;其前面㈣說話相照片。在此',=/者’同時’ 音響模型時相同的 b揚尸耳裔係使用與產生 片中的說話者的聲音。 毛出的聲音視為照 丧:妾著’根據以下的對話版本(Scen 义汽驗。 )進仃k音辨認 2 機器10對三位說話者χ 位說話去 Y 〇〇 γ、z同時回答問題 三仞從〜,… z &amp;出問題。 3 ·撫 …一 y⑼7¾。 ^ 〇σ 1 〇根據三位說話者X、Υ、ζ的、、日八二 曰源定位.立、及八% 、叱5語音,進; 4 4丨 原分離’再者關於各分離立、仓/ .機器人1 0名护皮 刀離q進行語音辨切 . ◦在依序正對各說話者Χ、γ、7 I心 碴說話者的回答。 2的狀態下回: 5·當機器人〗〇判斷 ^ 与”、、法正確進彳干纟五立由立二 垓說話者再度重覆 °。曰^忍時,即正受 音辨認。 U合再度進行菩 圖 將由上述版本戶斤 所传之貫驗結果之第—例表示於第‘ 3】44]2 35 200411627 丄·由機裔人1〇詢問「喜歡的數字是什麼?」。(參考第9(a) 圖) 2·由作為各說話者X、Υ、Ζ之揚聲器,㈤日夺從!至10的 數字之中,流放出朗讀任意數字之聲音。例如如第9(b) 圖所示,說話者X說「2」,說話者γ說「丨」,然後說 話者Z說「3」。 3·機器人10在聽覺模組2〇根據利用其麥克風16進行集 音之音響信號,藉由主動方向通過型濾波器進行音 :原疋位·音源分離,且將分離音抽出。然後,根據與各 ,、者 Y z相對應之分離音,按各說話者別,由 語音辨認部27使用9個音響模型同時執行語音辨認過 程’且進行其語音辨認。 4·此時,語音辨認部27的選擇器27e假設正面為說話者 γ而對語音辨認進行評價(第9(e)圖),接著假設正面為 S兄話者X而對語音辨認進行評價(第9⑷圖),最後假設 正面為說話者Z而對語音辨認進行評價(第9(e)圖)。 5 ·接著’選擇器27e將語音 — 〒0果進仃統合,如第9(f) 圖所示,關於機器人正面(Θ =〇戶、.^ η .^ 1 度),決定最適合之說話 者名稱(Υ)及語音辨認結果(「i ), — ;立季別出至對話部2 8。 錯此方式,如第9(g)圖所示,機器 Y的狀態下會回答「Y君是『i』/ 正對說話者 6·接著,關於左斜方(θ= + 60度)的方&amp; n ^ . m 〕方向,進行與上述相 同之處理,如第9(h)圖所示,機哭 y ^ °°人10在正對說話者 X的狀悲下會回答「χ君是『2 丹考,關於右斜方(&lt;9 3144J2 36 200411627 = -60度)的方向亦進行相同之處理,如第9⑴ 機器人10在正對説話者z的狀態下會回答 圖所 示 B疋 此時,機器人1〇可以對於各說話者 V X、乙的回欠 進行完全正喊的語音辨認。因此,即使在同時發:: 下,於使用機器人1〇之麥克風16的機器人視聽覺系:二 中’亦表示出音源定位.音源分離.語音辨認之有效性 其中,如第9⑴圖所示,當機器人1〇並未正對各十、 話者時,則如「γ君是『1』。X君是『2』。z君是『3口 ° 兄 合計是『6』。」所示般,使其回答出各說話者=、 回答的數字之合計亦可。 第10圖係表示由上述版本所得之實驗結果之第二 例0 1 ·與第9圖所示之第一例相同地,由機器人工〇詢問「直 歡的數字是什麼?」(參考第10(a)圖),、由作為各說^ 者X、、Υ、Ζ之揚聲器,如第l〇(b)圖所示,流放出說話 者X為「2」、說話者γ為「1」、然後說話者z為「3 的聲音。 」 2·同樣地,機器人10在聽覺模組20,根據利用其麥克風 進行集音之音響信號,藉由主動方向通過型濾波器 23a進行音源定位·音源分離,且將分離音抽出,根據 兵各成話者X、γ、z相對應之分離音,按各說話者別, 由…9辨認部27使用9個音響模型同時執行語音辨認 王且進行其語音辨認。此時,語音辨認部2 7的選 37 314412 200411627 二27:係如第1〇(c)圖所示’關於正面的說話者Y可 對b S辨認進行正確評價。 3·相對於此,關於位於+ 60度 ,..1Λ/ 之况5舌者Χ ’選擇器27e荇 弟10⑷圖所示,無法決定是「2」或「4」。 ^ 4.=此,機器人10係如第叫6)圖所示,會正對著位於 度之說話者乂問「是『2』?是『4』?」/者位方、+ 6〇 5 ·相對於此,如第! 〇 m岡 — 灰哭推山「 會從作為說話者X之揚 ::10 2」的回答。此時’由於說話者x係位於機 的正面,所以聽覺模組20會關於說,舌者 回答進行音源定位.音源分離,而由語音辨認部27 確進打語音辨認,且將說話者名稱X及語立 正 「2」輸出至對話部28。藉此方曰j °心結果 圖所示,會對說話者X回答「χϋ〇係如第1〇⑷ 6·接著,關於說話纟ζ係、進行相^理,^ 1 0在正對著况活者ζ的狀態下回答「Ζ君是『3 話者精X此;式ζ :益人1〇藉由再度詢問,使得對於各說 葬^… 全正確進行語音辨認。因此呈現出 ^ 1G正對著在側方的說話者再度詢問的方心 =由於:方之聽覺中央小窩之影響所造成之分 度降低的浯音辨認曖昧性,可接古 提高語音辨認精,確度。了㈣精確度’且可 進=,如!10⑴圖所示,機器人10對各說話者正確 進仃b音辨3忍俊,如「γ君是『] 疋x君是『2』。z君是 314412 38 X、 200411627 『3』。合钟1θ 『a Μ疋b』。」所示般,使其回答出各說話者 Y、Z回答的數字之合計亦可。 乐11圖係表示由上述對話版本所得之實驗仕果 三例。 、。 10 o(t 1 ·本,:與第9圖所示之第一例相同地,由機器人 問吾欲的數字是什麼?」(參考第10(a)圖),如第 圖所不,由作為各說話者χ、γ、z之揚聲器流放出. 洁者X為「8」、說話者丫為”」、然後說話者/為「· 的聲音。 •同‘地’機态人:! 
0在聽覺模組Μ,根據利用其麥克 16進行集音之音響信號,參考由即時追縱(參考χ3,) 產生之資料流方向0及各說話者之臉部事件’藉由主: 方向通過型濾波器23a進行音源定位.音源分離,且: 出分離音,根據|久% 士 — 豕一各δ兄活者X、γ、z相對應之分離音 對每°兄5舌者5吾音辨認部27使用9個音響模型,同丨 進行語音辨認過程,且進行其語音辨認。此時’語音] 認部 2 7的選摆哭 9 7 θ ΒΒ 擇°° 27e,關於正面的說話者Υ,由基; 月汉口[3事件判斷為說話者γ❸機率較高乙點,在對各· 響模型之語音辨認結果進行統合之際,如g i◦⑷圖) 不,將考慮該情形。藉此方式,可進行更為正確之語_ 辨5忍。因此,如第闰 乐11(d)圖所示,機器人1〇會對說話- X回答「X君是『7』」。 3·相對於&amp;,關於位於+ 60度之說話者X,機器人10ί 變方向而呈正面對時,關於此時呈正面的說話者Χ,^ 314412 39 200411627 據臉部事件為說話者x的機率較高,因此,相同地, 如第11(e)圖所示,選擇器27e會考慮該情形。因此, 如第11(f)圖所示,機器人10會對說話者x回答Γγ君 是『8』」。 4.接著’如第11(g)圖所示,選擇器27e會對說話者ζ進 行相同的處王里,且對說話者z答覆其語音辨認結果。亦 即,如第u(h)圖所示,機器人1〇在正對說話者z的狀 態下回答「Z君是『9』」。 藉此方式,機器、人10係與每_說話者正對面,一面 參考其臉部事件’―面根據說話者之臉部辨認,對各說話 者X、Y、Z之所有回答進行正確的語音辨認。如上所述, 藉由臉部辨認,可以指定說話者是誰,因此顯示出可以進 行精喊度更高的語音辨認。尤其是在特定環境下使用之前 提下,根據臉部辨認可得約近1〇〇%的臉部辨認精確度時, 可以使用臉部辨認資訊作為可靠性較高的資訊,藉2在語 音辨認部27之語音辨認引擎27c所使用的音響^型2二 的數目得以㈣’因此可進行更為高速、且精確度高的語 音辨認。 : 又问口口 第12圖係表示由上述對話版本所得之實驗結果之第 四例。 i由機器人10詢問「喜歡的水果是什麼?」(參考第12(a) 圖),如第12(b)圖所示,由作為各說話者χ、γ、Z之 揚聲器說出說話者X為「梨子」、說話者γ為「西瓜」、 然後說話者Z為「哈密瓜」。 」 314412 40 200411627 2·::人10在聽覺模組20,根據利用其麥克風16進行 立曰之S k 5虎’藉由主動方向通過型遽波器23a進行 曰源疋位.音源分離,且抽出分離音。然後,根據盥各 說話者χ、γ、.ζ相對應之分離音,對每一說話者語音 辨遇部27使用9個音響模型,同時進行語音辨認過程, 且進行其語音辨認。 3. =時,語音辨認# 27的選擇器27e假設正面是說話者 I且進行語音辨認的評價(第叫)圖),接著,假設正 面是說話者X,且進行語音辨認的評價(第12⑷圖), 取後,假設正面是說話者Z,且進行語音辨認的評價(第 12(e)圖)。 、 4. 然後’如第⑽圖所示,由選擇器27e對語音辨認結 果進行統合,關於機器人正面(θ=0度)方向,決定最為 適合之說話者名稱(Υ)及語音辨認結果「西瓜」,且輸出 至對話部28。藉此方式,如第12(g)圖所示,機哭人1〇 在正對說話者Y的狀態下,會答 5·接著,對各說話者X、Z進行相君疋西瓜』」 進仃相问的處理,且對各説話 Z回答其語音辨認結果。亦即,如第l2(h)圖所 不’機器人10在正對說話者χ的狀態下,會艾出「χ 君是『梨子』」’再者,如第12⑴圖所示,機器人10在 正對“U的狀態下,會答出「之君是『哈密瓜』」。 此時,機器人10可以對各說話者χ、γ、Μ所有回 合進行正確的語音辨認。因此,語音辨認引擎We所登記 的字囊並非限於數字,只要是事先登記的字彙,即可以進 314412 41 200411627 行語音辨認。在此,實驗中使用的語音辨認弓丨擎2k中, 雖然登記了約150個字彙,但是字彙音節數越多語音 辨認率會略為降低。 ϋ 9Next, ask the question again and again, and then repeat the test ^ L according to the answer to voice recognition again. In this way, according to the inventor's humanoid robot 10 and the known humanoid robot 10, according to the separation sound of the sound source position and sound source separation by the hearing module 20 (sound 314412 34 200411627: 11 \ 由The four-tone recognition unit 27 uses a ringing type corresponding to each speaker and direction to perform voice recognition, which can recognize voices of simultaneous &quot; speakers' voices. The multi-digit price of the ground. To the voice recognition unit The action of 27 is evaluated by the user, "As shown in Figure 8, at the distance from the front of the robot 10, the left oblique (θ = +60 sounds), the down and M lead right oblique (0 = -60 sounds)士 Α)) Frontal (θ = 0 degrees), and in the experiment, respectively ^ Speakers X, Ζ, and Z. Among them,; in front of ㈣ talking photo. Here, ', = / 者' at the same time when the acoustic model The same b Yang corpus lineage uses and produces the voices of the speakers in the film. Mao's voice is regarded as a mourning: 妾 妾 'according to the following dialogue version (Scen Yiqi Yan.) Enter the 音 sound recognition 2 machine 10 pairs of three speakers χ talk to Y 〇〇γ, z answer the question at the same time three 仞 from ~, z &amp; Problems. 3 · Stroke ... a y⑼7¾. ^ 〇σ 1 〇 According to the three speakers X, Υ, ζ, 八, 二, 源, 源, 八, 5%, 叱 5 voice, enter; 4 4 丨 Original separation ', and about the separation of each stand, warehouse /. Robot 10 skin-knife away from q for speech recognition. ◦ The speakers X, γ, 7 I were facing each other in sequence. Answer: 2 in the state of returning: 5. When the robot 〖〇 Judging ^ and ", and the method is correct, and the five stand-ups are repeated by the speaker. 
When you say ^ tolerance, you are being recognized by the sound. The U-Heavy again carried out the Botu diagram. The first example of the results passed by the above version of Hu Jin is shown in ‘3] 44] 2 35 200411627. 由 Ask the person“ What is your favorite number? ” (Refer to Figure 9 (a)) 2. The speakers X, Υ, and Z are the speakers, and the next day will win! Among the numbers up to 10, the sound of reading any number is exiled. For example, as shown in Fig. 9 (b), speaker X says "2", speaker γ says "丨", and speaker Z says "3". 3. The robot 10 uses the microphone 16 to collect sound from the acoustic module 20, and uses the active-direction pass-through filter to perform sound: original position and sound source separation, and extracts the separated sound. Then, based on the separated sounds corresponding to each of the speakers Y z, the speech recognition unit 27 simultaneously executes the speech recognition process using nine acoustic models and performs speech recognition for each speaker. 4. At this time, the selector 27e of the speech recognition unit 27 evaluates the speech recognition assuming that the front side is the speaker γ (Fig. 9 (e)), and then evaluates the speech recognition assuming that the front side is the S-speaker X ( (Figure 9⑷), and finally assume that speaker Z is the front to evaluate speech recognition (Figure 9 (e)). 5 · Then 'selector 27e integrates voice-—0 results into 仃, as shown in Figure 9 (f), regarding the front of the robot (Θ = 〇 户,. ^ Η. ^ 1 degree), decides the most suitable speech The name of the person (Υ) and the result of speech recognition ("i",-; Li Ji is unique to the dialogue department 2 8. In this way, as shown in Figure 9 (g), the machine Y will answer "Y-Kun Is "i" / facing the speaker 6 · Next, regarding the direction of the left oblique (θ = + 60 degrees) &amp; n ^. M] direction, perform the same processing as above, as shown in Figure 9 (h) It shows that the machine cries y ^ °° In the face of the speaker X, the person 10 will answer "χ 君 是『 2 丹 考, regarding the direction of the right oblique (&lt; 9 3144J2 36 200411627 = -60 degrees). The same process is performed, as in the ninth case, the robot 10 will answer the figure B in the state facing the speaker z. At this time, the robot 10 can completely recognize the arrears of each of the speakers VX and B by voice recognition. Therefore, even in the case of simultaneous sending ::, the robot's audiovisual system using the microphone 16 of the robot 10: two middle 'also indicates the sound source localization, sound source separation, and speech recognition. Among them, as shown in Fig. 9 (a), when the robot 10 is not directly facing each speaker, it is like "γjun is" 1 ". Xjun is" 2 ". Zjun is" 3 "口 ° The total number of brothers is "6". "As shown in the figure, it is possible to make the total number of answers =, each speaker =. Fig. 10 shows the second example of the experimental results obtained from the above version. 0 1 · Same as the first example shown in Fig. 9, the robot worker asks "what is the number of straight joy?" (Refer to Fig. 10) (a)). As the speakers X,, Υ, and Z are speakers, as shown in Fig. 10 (b), the speaker X is "2" and the speaker γ is "1". Then, the speaker z is "3 voices." 2. Similarly, the robot 10 performs sound source localization and sound source in the hearing module 20 based on the acoustic signals collected by its microphone through the active direction pass filter 23a. 
Separate and extract the separation sound, according to the separation sounds corresponding to each of the speakers X, γ, and z, according to each speaker, the 9 recognition unit 27 uses 9 acoustic models to simultaneously execute the voice recognition king and perform its Speech recognition. At this time, the selection of the speech recognition unit 27 37 314412 200411627 2:27: As shown in Fig. 10 (c) ', regarding the positive speaker Y, the b S recognition can be correctly evaluated. 3. Contrary to this, regarding the position of + 60 °, ..1Λ /, the 5 tongue person X 'selector 27e 荇, as shown in the figure 10, cannot be determined to be "2" or "4". ^ 4. = Then, as shown in figure 6), the robot 10 will directly ask the speaker who is located at the degree "Yes" 2 "? Is it" 4 "? / Person position, + 6〇5 · In contrast, as the first! 〇 mgang — Grey Cry Pushshan's answer "Will start from speaker X Zhiyang :: 10 2". At this time, 'speaker x is located on the front of the machine, so the hearing module 20 will answer the speaker's answer to locate the sound source. The sound source is separated, and the speech recognition unit 27 does the speech recognition, and the speaker name X The Japanese language "2" is output to the dialogue unit 28. In this way, as shown in the j ° result chart, the speaker X will be answered "χϋ〇 系 为 第 10⑷ 6 · Next, regarding the speech 纟 ζ system, the reasoning is performed, ^ 1 0 is facing the situation In the state of the living zeta, the answer "Zjun is" 3 Talkers are fine X this; Formula z: Yiren 1〇 By re-inquiring, the speech recognition is correctly performed for each of the sacrifices ^ ... So it shows ^ 1G positive For the side speaker who asked again, the heart of the square = due to: the effect of Fang Zhi's auditory central fossa, the degree of ambiguous cymbal recognition is reduced, which can improve the accuracy of speech recognition and accuracy. Degree 'and can enter =, as shown in the figure 10, the robot 10 correctly enters the sound of each speaker and recognizes 3 ninjas, such as "γjun is"] 疋 xjun is "2". Z 君 is 314412 38 X, 200411627 "3". The bell 1θ "a Μ 疋 b". "As shown in the figure, it is possible to make the total number of the digits Y and Z answered by each speaker. The Le 11 picture shows three examples of experimental results obtained from the above dialogue version. . 10 o (t 1 · Ben: As in the first example shown in Fig. 9, what number is the robot asking me? "(Refer to Fig. 10 (a)). Release as speakers for each speaker χ, γ, and z. Cleaner X is "8", speaker is "", and then the speaker / is the voice of "." Same person as "ground": 0 In the auditory module M, according to the acoustic signal collected by using its microphone 16, reference is made to the data stream direction 0 generated by real-time tracking (refer to χ3,) and the face events of each speaker. The filter 23a performs sound source localization. The sound sources are separated, and the sounds are separated, according to the | | %% — the sounds corresponding to each of the δ brothers X, γ, and z are identified to 5 voices per 5 brothers. Department 27 uses 9 acoustic models to perform the speech recognition process and perform its speech recognition. At this time, 'Voice' recognizes the sound of the selection of the component 2 7 9 θ Β Β °° 27e. 
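The re-asking behaviour in this example amounts to a small control loop: when the selector cannot settle on one hypothesis for a speaker, the robot turns to that speaker so that they fall in the frontal direction where separation and recognition are most reliable, asks a confirming question, and recognizes the answer again. In the sketch below the turn_to, ask and recognize functions are placeholders for the modules described earlier.

```python
def confirm_by_dialogue(speaker, candidates, turn_to, ask, recognize, max_tries=2):
    """Resolve an ambiguous recognition result by facing the speaker and re-asking.

    candidates : e.g. ["2", "4"] when the selector could not decide
    turn_to    : turn_to(speaker) -> None        (attention / motor control)
    ask        : ask(question_text) -> None      (dialogue part, speech synthesis)
    recognize  : recognize() -> str or None      (frontal, single-speaker recognition)
    """
    for _ in range(max_tries):
        turn_to(speaker)                              # frontal direction: best separation
        ask("Is it " + "? Is it ".join(candidates) + "?")
        answer = recognize()
        if answer in candidates:
            return answer
    return None                                       # give up; keep the ambiguity
```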
Regarding the front speaker Υ, Youji; Yuehankou [3 events judged that the speaker's probability of γ❸ is higher than that of the second point, and when the speech recognition results of various sound models are integrated, (Such as gi◦⑷) No, the situation will be considered. In this way, a more correct language can be performed_ discriminate 5 tolerance. Therefore, as shown in Figure 11 (d), the robot 10 will speak to -X replied "X Jun is" 7 "". 3. Relative to &amp; about speaker X located at +60 degrees, when robot 10 changes direction and faces head-on, about speaker X who is facing front at this time, ^ 412 412 39 200411627 according to facial events for speaker x The probability is high, so similarly, as shown in Fig. 11 (e), the selector 27e will consider the situation. Therefore, as shown in FIG. 11 (f), the robot 10 responds to the speaker x by Γ [gamma] is "8" ". 4. Next, as shown in FIG. 11 (g), the selector 27e performs the same processing on the speaker ζ, and answers the speech recognition result of the speaker z. That is, as shown in Fig. U (h), the robot 10 answers "Z Jun is" 9 "" while facing the speaker z. In this way, the machine and the human 10 are directly opposite each speaker, while referring to their facial events'-identifying according to the speaker's face, and correctly speaking all the answers of each speaker X, Y, and Z. identify. As described above, by identifying the face, it is possible to specify who the speaker is, and thus it is shown that speech recognition with higher precision can be performed. Especially before using it in a specific environment, when face recognition accuracy of approximately 100% is recognized based on face recognition, you can use face recognition information as highly reliable information, and use 2 for voice recognition The number of the audio-visual type 22 used by the speech recognition engine 27c of the unit 27 can be increased, so that higher-speed and high-accuracy speech recognition can be performed. : Ask again. Figure 12 shows the fourth example of the experimental results obtained from the above dialogue version. i The robot 10 asks "What is the favorite fruit?" (refer to Fig. 12 (a)). As shown in Fig. 12 (b), the speaker X speaks the speaker X as the speaker χ, γ, and Z. Is "pear", speaker γ is "watermelon", and speaker Z is "melon". "314412 40 200411627 2 · :: The person 10 is in the auditory module 20, and uses the microphone 16 to perform the SK 5 tiger's source position by the active direction through the type wave filter 23a. The sound source is separated, and Extract the separation sound. Then, based on the separated sounds corresponding to each of the speakers χ, γ, .ζ, each speaker's speech recognition unit 27 uses 9 acoustic models, and simultaneously performs a speech recognition process, and performs its speech recognition. 3. When ==, the selector 27e of the speech recognition # 27 assumes that the front is speaker I and evaluates the speech recognition (calling diagram), and then assumes that the front is the speaker X and evaluates the speech recognition (section 12⑷) Image), after taking it, assume that speaker Z is the front, and evaluate the speech recognition (Figure 12 (e)). 4. Then, as shown in the second figure, the selector 27e integrates the speech recognition results. With regard to the direction of the robot front (θ = 0 degrees), the most suitable speaker name (Υ) and the speech recognition result "Watermelon "And output to the dialogue unit 28. 
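When face recognition is reliable, as the text notes for constrained settings, the speaker identity carried by the face event can be used to shrink the set of acoustic models the engine has to run, which is the speed-up mentioned above. A hedged sketch of that pruning:

```python
def models_to_run(all_models, face_id=None, direction=None):
    """Select which (speaker, direction) acoustic models to evaluate.

    all_models : iterable of (speaker, direction) pairs, e.g. 3 speakers x 3 directions
    face_id    : speaker name from a reliable face event, or None
    direction  : tracked source direction, or None
    """
    selected = [(p, d) for (p, d) in all_models
                if (face_id is None or p == face_id)
                and (direction is None or d == direction)]
    return selected or list(all_models)   # fall back to every model when unsure
```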
In this way, as shown in Figure 12 (g), the machine crying person 10 will answer 5 in the state facing the speaker Y. Then, each of the speakers X and Z will be phased together. " The processing of the phase question, and answering the speech recognition result for each speech Z. That is, as shown in Fig. L2 (h), 'the robot 10 will face out to the speaker χ, and it will show "χ 君 是「 梨子 」"'. Furthermore, as shown in Fig. 12, the robot 10 Facing "U, I will answer" The Prince is "Hami Melon" ". At this time, the robot 10 can perform correct speech recognition for all rounds of each of the speakers χ, γ, and M. Therefore, the word bag registered by the speech recognition engine We is not limited to numbers. As long as it is a registered word, it can be recognized by 314412 41 200411627. Here, although about 150 vocabularies are registered in the speech recognition bow 2k used in the experiment, the speech recognition rate decreases slightly as the number of syllable syllables increases. ϋ 9
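The word recognition rates that weight the selection are obtained empirically, as in the Figure 7 experiment; tabulating them from labelled trials is a one-line count per (speaker, direction) pair, sketched below.

```python
from collections import defaultdict

def recognition_rate_table(trials):
    """trials: iterable of (speaker, direction, recognized_word, true_word)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for speaker, direction, hyp, ref in trials:
        totals[(speaker, direction)] += 1
        hits[(speaker, direction)] += int(hyp == ref)
    return {key: hits[key] / totals[key] for key in totals}
```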

於上述之實施形態中,機器人1〇之上半身係呈有4 賺(自由度)所構成,但並非侷限於此,亦可將本發明之 視聽覺系統組入於以可進行任意動作之方式所構成之機哭 人中。而且,於上述之實施形態中,係就將本發明之機: 人視聽覺系統組入人型機器人10之情形加以說明,作I 非限於此,顯然亦可組入犬型等各種動物型機器人或=他 形式之機器人中。 ^此外,在上述說明中,…圖所示,已就機器人視 聽覺糸統1 7具備有立體模組3 7之構成例加以說明,不過, 未具備有該構件亦可。此時,結合模组5〇係根據聽覺事 臉部事件%及電動機模組48,而產生各說話者之 聽覺資料&amp; 53 Α臉部資㈣54,然後與 # ”臉部資料…關連產生結合模組59而=: 左意控制模組50中,則是根據上述資料流來進行注音控 制而構成。 〜 再者,於上述說明中,主動方向通過型濾波器係 、、 °對通過頻帶範圍(pass range)進行控制,設定 =頒贡靶圍為一定,而與所處理的聲音頻率無關。在此, 導出通過頻帶 6,使用 100Hz、200Hz、500Hz、1 000Hz、 2〇〇〇H2 Μ ς , 、 固1 〇〇Ηζ的諧波構造音(]iarrnonics)的純音及 ~j固兮比、、士 、, 白〆 對1個音源進行調查音源抽出率的實驗。其中, 42 314412 200411627 使音源從機态人正面之〇度至機 又主械杰人的左方或方90产 之範圍内,以每10度移動位置一次。 一 又 第13圖至第15圖係表示 夕夂从麥l #班 仕攸0度到90度之範圍内 之各位置上設置音源時之音源插 W ^千圖不,如該實驗結果 所不,按妝頻率對通過頻帶範 一相安々n 铯仃$工制,藉此可提高特 疋頻率之聲音之抽出率, τ担一在立她 分離精確度。進而,亦 可提咼s。音辨認率。因此,於 .17 φ 方'上述况明之機器人視聽覺系 、洗17中’主動方向通過型淚哭 ^ # 41 ^ 23a之通過頻帶範圍可 依母一頻率進行控制而構成為宜。 【產業上利用可能性】 如以上所述,根據本發明,盥 π 益;处m 一 知之浯音辨認相較之 下,錯由使用複數個音響模 ^ if it T ^ ^ x ^ 扪方式,可在真實時間·真 貝% i兄下進仃正確的語音 塑槿刑夕1立 辨^ 而且,藉由選擇器將各音 音杈型之,吾音辨認結 語音辨認結果,藉此相’以判斷出可靠性最高之 正確之語音辨認較於習知之語音辨認,可進行更為 【圖式簡單說明】 弟Ϊ圖係表示備右士 π Ψ ^ Ψ s月之機器人視聽覺裝置之第一 …恶之人型機器人之外觀之正視圖。 !2圖係為第1圖之人型機器人之側視圖。 弟3圖‘為弟1圖之人 如 圖。 支械裔人頭部構造之概略放大 弟4圖係表示第】圓之 系統之| ^铖扣人宁之機态人視聽覺 包子構造例之方塊圖。 334412 43 200411627 第5圖係表矛楚 覺模組的作用示意圖。 兄L見糸統中之聽 第6圖係表矛 組的語音辨認部所圖之機器人視聽覺系統中之聽覺模 圖。 用之語音辨認引擎構造例之概略斜視 左右二: = 第6圖之語音辨認引擎所得之正面及 示出⑷為者發仏聲音之_㈣示,顯 然後⑹為右斜方⑽度之說話者之情形。^者 立!:圖係表示第4圖所示之機器人視聽覺系統中之語 日辨為戶、驗之概略斜視圖。 第圖⑷至⑴係依序表示帛4圖之機器人視聽覺系 統之:音辨認實驗之第-例之結果之示意圖。 第® (a)至⑴係依序表示帛4圖之機器人視聽覺系 統之:音辨認實驗之第二例之結果之示意圖。 第11圖(a)至(h)係依序表示第4圖之機器人視聽覺系 統之2音辨認實驗之第三例之結果之示意圖。 第12圖(a)至⑴係依序表示第4圖之機器人視聽覺系 統之語音辨認實驗之第四例之結果之示意圖。 第1 〇圖係表不對於與本發明實施形態相關之主動方 向通過j濾波器之通過頻帶範圍已進行控制之情形下的抽 出率圖示,表示⑷在0度,(b)在10度,⑷在2P度,⑷ 在j 〇度的方向具有音源之情形。 第14圖係表示對於與本發明實施形態相關之主動方 44 314412 2004110^/ 向通過型濾波器之 &lt;通過頻帶範圚 出率圖示,表示(a)在匕進行控制之情形下的抽 方向具有音源之情形。又(b)在50度,(c)在60度的 苐15圖係表示 向通過型濾波器之通 出率圖示,表示(a)在 方向具有音源之情形 勢' 於與本發明實施形態相關之主動方 $頻▼範圍已進行控制之情形下的抽 70度’(b)在80度,(c)在90度的 10 人型機器人 11 12 胴體部 13 13a 連結構件 14 14a 、14b 段部 15 16、 16a、16b 麥克風 17 20 聽覺模組 21 22 音源定位部 23 23a 主動方向通過 型濾波 器 24 間距 25 26 聽覺事件產生 部 2 7 27a 自聲控制部 27b 2 7c 語音辨認引擎 27d 27e 選擇器 27f 28 對話部 28a 28b 吾音合成部 28c 29 聽覺事件 30 基座 頭部 外包裝 攝影機 機器人視聽覺系統 峰值抽出部 音源分離部 音源水平方向 語音辨認部 自動辨認部 音響模型 HTK工具套件 對話控制部 揚聲器 臉部模組 3]44]2 45 200411627 3 1 臉部發現部 33 臉部定位部 35 臉部方向 37 立體模組 37b 目標抽出部 37d 立體事件產生部 38a 立體事件 ® 40 電動機控制模組 42 電位計 44 AD轉換電路 46 機器人方向 48 電動機事件 52 絕對座標轉換部 54 臉部資料流 56 關連部 Φ 58 觀察器 58b 資料流圖 64a 電動機指令 32 臉部識別部 34 臉部ID 36 臉部事件產生部 37a 視差晝像產生部 37c 目標方向 38 臉部資料庫 39 臉部事件 41 電動機 43 PWM控制電路 45 電動機控制部 47 電動機事件產生部 50 結合模組 53 聽覺資料流 55 立體視覺資料流 57 注意控制模組 58a 雷達圖 59 結合資料流 70 網際網路 46 314412In the above embodiment, the upper body of the robot 10 is composed of 4 earning (degrees of freedom), but it is not limited to this, and the audiovisual system of the present invention may be incorporated in a manner capable of performing arbitrary actions. Posing machine crying. Moreover, in the above-mentioned embodiment, the case where the machine of the present invention is incorporated into the human-type robot 10 is described. I is not limited to this, and various animal-type robots such as a dog-type can obviously be incorporated. Or = other forms of robots. ^ In the above description, the example of the configuration in which the robotic audiovisual system 17 is equipped with the stereo module 37 is shown in the figure. However, the component may not be provided. At this time, the combination module 50 generates the auditory information of each speaker based on the auditory facial event% and the motor module 48, and then generates a combination with # ”facial information ... 
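Controlling the pass range per frequency, as described here and claimed later, simply replaces the constant delta with a function of frequency; the break points and widths below are made-up placeholders chosen only to show the shape (tight where the localization cue is reliable, wider elsewhere), not the values behind Figures 13 to 15.

```python
def pass_range(freq_hz, theta_deg):
    """Direction- and frequency-dependent pass range (illustrative values only)."""
    base = 10.0 + 40.0 * abs(theta_deg) / 90.0      # narrow in front, wide to the side
    if freq_hz < 500.0:        # low band: IPD is reliable, keep the range tight
        scale = 1.0
    elif freq_hz < 1500.0:     # transition band around the IPD/IID threshold
        scale = 1.5
    else:                      # high band: cues are coarser, widen the range
        scale = 2.0
    return base * scale
```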
Module 59 and =: Zuoyi control module 50 is configured to perform Zhuyin control based on the above data stream. ~ Furthermore, in the above description, the active direction pass type filter system, °, and the pass band range (Pass range) control, setting = given target range is constant, regardless of the frequency of the sound being processed. Here, the passband 6 is derived, using 100Hz, 200Hz, 500Hz, 1000Hz, 2000H2 Μ ς The pure sound of the harmonic structural sound () iarrnonics of , 1 〇〇Ηζ and ~ j 固 xi, shi, shi, and ba yi conducted experiments on the sound source extraction rate of one sound source. Among them, 42 314412 200411627 made the sound source From the position of 0 degrees on the front of the human body to the left or 90 degrees of the main weapon, move the position once every 10 degrees. Figures 13 to 15 show Xi Xi Cong Mai 1 # 班 仕奕 0 ° to 90 ° each position When the sound source is set, the sound source is not inserted. As shown in the experimental results, according to the frequency of the makeup frequency band, the pass band frequency is phase-locked, n cesium, and the system is used to improve the extraction rate of sound with special frequencies. Τ The accuracy of the separation of Li Yili. In addition, you can also improve the recognition rate of the sound. Therefore, at .17 φ Fang 'the above-mentioned state of the robot's audiovisual system, washing 17' active direction through type tears cry # 41 ^ The pass band range of 23a can be controlled according to the mother-frequency. [Industrial Utilization Possibility] As described above, according to the present invention, the benefits are better; The wrong reason is to use multiple audio modes ^ if it T ^ ^ x ^ 扪, you can enter the correct voice in real time and real time%, and you can distinguish the sound of the hibiscus and the sound by the selector. In the voice-fork type, my voice recognizes and recognizes the voice recognition results, so as to determine the correct voice recognition with the highest reliability than the conventional voice recognition. [Schematic description] Bi Youshi π Ψ ^ Ψ s Moon's vision Put it first ... The front view of the appearance of the evil humanoid robot.! 2 is a side view of the humanoid robot shown in Figure 1. Brother 3 'is the figure of Brother 1's figure. Figure 4 is a block diagram showing the structure of the system of the circle] ^ 铖 宁 态 态 宁 宁 宁 宁 宁 宁 宁 宁 宁 视 Human hearing and hearing buns structure example block diagram. 334412 43 200411627 Figure 5 shows the role of the spear module Figure 6: Brother L Seeing Listening in the System Figure 6 is a model of the auditory vision in the robot's audio-visual system shown by the voice recognition unit of the watch spear group. Outline of the construction example of the speech recognition engine. Squint left and right two: = Figure 6 shows the front of the speech recognition engine and the _ sign showing the sound of the person who made the sound, and then the speaker who is right oblique. Situation. ^ Lee! : The diagram is a schematic oblique view showing the language in the robot's audiovisual system shown in Figure 4 as a household and an inspection. 
Figures ⑷ to ⑴ show the results of the first-example of the sound recognition experiment of the robotic audiovisual system of Figure 4 in order: Figures (a) to (i) are diagrams showing the results of the second example of the sound recognition experiment of the robotic audiovisual system of Figure 4 in order: Figures 11 (a) to (h) are diagrams showing the results of the third example of the 2-tone recognition experiment of the robotic audiovisual system of Figure 4 in order. Figures 12 (a) to ⑴ show the results of the fourth example of the speech recognition experiment of the robotic audiovisual system of Figure 4 in order. FIG. 10 is a drawing showing the extraction rate in the case where the active frequency band passing through the j filter has been controlled in accordance with the embodiment of the present invention, and ⑷ is at 0 degrees, (b) is at 10 degrees, When 具有 is at 2P degrees and ⑷ has a sound source in the direction of j 〇 degrees. FIG. 14 is a diagram showing the pass band frequency output rate of the active side 44 314412 2004110 ^ / pass filter related to the embodiment of the present invention, showing (a) When the direction has a sound source. (B) The 5015 graph at 50 degrees and (c) at 60 degrees is a graph showing the output rate of the pass filter, and shows (a) the situation where there is a sound source in the direction. Relevant active side $ frequency ▼ 70 degrees when the range has been controlled '(b) at 80 degrees, (c) at 90 degrees 10 humanoid robot 11 12 carcass 13 13a link member 14 14a, 14b Part 15 16, 16a, 16b Microphone 17 20 Hearing module 21 22 Source localization part 23 23a Active direction pass filter 24 Pitch 25 26 Hearing event generation part 2 7 27a Acoustic control part 27b 2 7c Speech recognition engine 27d 27e Selection 27f 28 dialogue unit 28a 28b vowel synthesis unit 28c 29 auditory event 30 base head outer packaging camera robot audiovisual system peak extraction unit sound source separation unit sound source horizontal direction voice recognition unit automatic recognition unit acoustic model HTK tool suite dialogue control unit Speaker face module 3] 44] 2 45 200411627 3 1 Face detection section 33 Face positioning section 35 Face direction 37 Stereo module 37b Target extraction section 37d Stereo event Part generation part 38a Stereo event® 40 Motor control module 42 Potentiometer 44 AD conversion circuit 46 Robot direction 48 Motor event 52 Absolute coordinate conversion part 54 Face data stream 56 Related part Φ 58 Observer 58b Data flow diagram 64a Motor command 32 Face recognition section 34 Face ID 36 Face event generation section 37a Parallax day image generation section 37c Target direction 38 Face database 39 Face event 41 Motor 43 PWM control circuit 45 Motor control section 47 Motor event generation section 50 Combination mode Group 53 audio data stream 55 stereo vision data stream 57 attention control module 58a radar chart 59 combined data stream 70 internet 46 314412

Claims (1)

Scope of the patent application:
1. A robot audio-visual system, comprising: a plurality of acoustic models, each formed by combining the vocabulary uttered by a speaker with the direction of the utterance; speech recognition means which, using these acoustic models, executes speech recognition processes on acoustic signals that have undergone sound source localization and separation; and a selector which integrates the plurality of speech recognition results obtained with the respective acoustic models and selects one of them; whereby the vocabulary spoken simultaneously by the speakers is recognized separately for each speaker.
2. The robot audio-visual system of claim 1, wherein the selector is configured to select among the speech recognition results by majority decision.
3. The robot audio-visual system of claim 1 or claim 2, further comprising a dialogue part which outputs the speech recognition result selected by the selector to the outside.
4. A robot audio-visual system, comprising: an audition module which has at least one pair of microphones for collecting external sounds and which, from the acoustic signals of the microphones, determines the direction of at least one speaker and extracts an auditory event by sound source separation and localization based on pitch extraction and sets of harmonic structure; a face module which has a camera imaging the area in front of the robot and which identifies each speaker and extracts a face event by recognizing and localizing each speaker's face in the images taken by the camera; a motor control module which has a drive motor for turning the robot in the horizontal direction and which extracts a motor event from the rotational position of the drive motor; an association module which determines the direction of each speaker from the auditory, face and motor events on the basis of the direction information given by the sound source localization of the auditory events and the face localization of the face events, connects the events in the time direction using a Kalman filter to generate an auditory stream and a face stream for that determination, and associates these streams with one another to generate an association stream; and an attention control module which performs attention control on the basis of these streams and performs drive control of the motor in accordance with the action plan produced by the attention control; wherein the audition module separates the sound sources by collecting, with an active direction-pass filter whose pass range, in accordance with the accurate sound source direction information from the association module, is minimum in the frontal direction and becomes larger as the angle to the left or right increases, the subbands whose interaural phase difference (IPD) or interaural intensity difference (IID) lies within a predetermined width, and by reconstructing the waveform of each source, and at the same time performs speech recognition of the separated acoustic signals using a plurality of acoustic models, the speech recognition results obtained with the respective acoustic models being integrated by a selector so that the most reliable speech recognition result among them is determined.
5. A robot audio-visual system, comprising: an audition module which has at least one pair of microphones for collecting external sounds and which, from the acoustic signals of the microphones, determines the direction of at least one speaker and extracts an auditory event by sound source separation and localization based on pitch extraction and sets of harmonic structure; a face module which has a camera imaging the area in front of the robot and which identifies each speaker and extracts a face event by recognizing and localizing each speaker's face in the images taken by the camera; a stereo module which, from the disparity extracted from the images taken by a stereo camera, extracts and localizes longitudinally elongated objects and extracts a stereo event; a motor control module which has a drive motor for turning the robot in the horizontal direction and which extracts a motor event from the rotational position of the drive motor; an association module which determines the direction of each speaker from the auditory, face, stereo and motor events on the basis of the direction information given by the sound source localization of the auditory events and the face localization of the face events, connects the events in the time direction using a Kalman filter to generate an auditory stream, a face stream and a stereo-vision stream for that determination, and associates these streams with one another to generate an association stream; and an attention control module which performs attention control on the basis of these streams and performs drive control of the motor in accordance with the action plan produced by the attention control; wherein the audition module separates the sound sources by collecting, with an active direction-pass filter whose pass range, in accordance with the accurate sound source direction information from the association module, is minimum in the frontal direction and becomes larger as the angle to the left or right increases, the subbands whose interaural phase difference (IPD) or interaural intensity difference (IID) lies within a predetermined width, and by reconstructing the waveform of each source, and at the same time performs speech recognition of the separated acoustic signals using a plurality of acoustic models, the speech recognition results obtained with the respective acoustic models being integrated by a selector so that the most reliable speech recognition result among them is determined.
6. The robot audio-visual system of claim 4 or claim 5, wherein, when the audition module performs speech recognition, the attention control module directs the microphones and the camera toward the speaker on the basis of the acoustic signal, the microphones then collect the speech, and speech recognition is performed on the sound localized and separated by the audition module.
7. The robot audio-visual system of claim 4 or claim 5, wherein the audition module refers to the face events obtained by the face module when performing speech recognition.
8. The robot audio-visual system of claim 5, wherein the audition module refers to the stereo events obtained by the stereo module when performing speech recognition.
9. The robot audio-visual system of claim 5, wherein the audition module refers to the face events obtained by the face module and the stereo events obtained by the stereo module when performing speech recognition.
10. The robot audio-visual system of claim 4 or claim 5, further comprising a dialogue part which outputs the speech recognition result determined by the audition module to the outside.
11. The robot audio-visual system of claim 4 or claim 5, wherein the pass range of the active direction-pass filter can be controlled for each frequency.
TW092103187A 2002-12-17 2003-02-17 Robotic vision-audition system TWI222622B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2002365764A JP3632099B2 (en) 2002-12-17 2002-12-17 Robot audio-visual system

Publications (2)

Publication Number Publication Date
TW200411627A (en) 2004-07-01
TWI222622B (en) 2004-10-21

Family

ID=32763223

Family Applications (1)

Application Number Title Priority Date Filing Date
TW092103187A TWI222622B (en) 2002-12-17 2003-02-17 Robotic vision-audition system

Country Status (2)

Country Link
JP (1) JP3632099B2 (en)
TW (1) TWI222622B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4784366B2 (en) * 2006-03-28 2011-10-05 パナソニック電工株式会社 Voice control device
US8155331B2 (en) 2006-05-10 2012-04-10 Honda Motor Co., Ltd. Sound source tracking system, method and robot
JP4877112B2 (en) * 2007-07-12 2012-02-15 ヤマハ株式会社 Voice processing apparatus and program
US8364481B2 (en) * 2008-07-02 2013-01-29 Google Inc. Speech recognition with parallel recognition tasks
JP5538762B2 (en) * 2009-07-06 2014-07-02 キヤノン株式会社 Information processing apparatus and information processing method
FR2962048A1 (en) 2010-07-02 2012-01-06 Aldebaran Robotics S A HUMANOID ROBOT PLAYER, METHOD AND SYSTEM FOR USING THE SAME
JP5328744B2 (en) * 2010-10-15 2013-10-30 本田技研工業株式会社 Speech recognition apparatus and speech recognition method
KR101394168B1 (en) 2013-02-20 2014-05-15 경희대학교 산학협력단 A face information providing system and the face information providing service method for a hearing-impaired person
JP6443419B2 (en) 2016-10-04 2018-12-26 トヨタ自動車株式会社 Spoken dialogue apparatus and control method thereof
JP6515899B2 (en) 2016-10-04 2019-05-22 トヨタ自動車株式会社 Voice interactive apparatus and control method thereof
JP6708154B2 (en) 2017-03-28 2020-06-10 カシオ計算機株式会社 Object detection device, object detection method, and program
CN108831474B (en) * 2018-05-04 2021-05-25 广东美的制冷设备有限公司 Voice recognition equipment and voice signal capturing method, device and storage medium thereof
CN110364166B (en) * 2018-06-28 2022-10-28 腾讯科技(深圳)有限公司 Electronic equipment for realizing speech signal recognition
CN114683260A (en) * 2020-12-29 2022-07-01 大连理工江苏研究院有限公司 Audio-visual interactive intelligent robot and control system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108198569A (en) * 2017-12-28 2018-06-22 北京搜狗科技发展有限公司 A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN110509292A (en) * 2019-09-05 2019-11-29 南京法法法信息科技有限公司 A kind of public place mobile law popularization robot
CN115662437A (en) * 2022-12-28 2023-01-31 广州市保伦电子有限公司 Voice transcription method under scene of simultaneous use of multiple microphones
CN115662437B (en) * 2022-12-28 2023-04-18 广东保伦电子股份有限公司 Voice transcription method under scene of simultaneous use of multiple microphones

Also Published As

Publication number Publication date
JP2004198656A (en) 2004-07-15
JP3632099B2 (en) 2005-03-23
TWI222622B (en) 2004-10-21

Similar Documents

Publication Publication Date Title
Yoshioka et al. Advances in online audio-visual meeting transcription
JP7337699B2 (en) Systems and methods for correlating mouth images with input commands
TW200411627A (en) Robottic vision-audition system
JP6464449B2 (en) Sound source separation apparatus and sound source separation method
CN102447697B (en) Method and system of semi-private communication in open environments
WO2018018906A1 (en) Voice access control and quiet environment monitoring method and system
CN114556972A (en) System and method for assisting selective hearing
JP2007221300A (en) Robot and control method of robot
CN111833899B (en) Voice detection method based on polyphonic regions, related device and storage medium
Donley et al. Easycom: An augmented reality dataset to support algorithms for easy communication in noisy environments
CN102903362A (en) Integrated local and cloud based speech recognition
JP2003255993A (en) System, method, and program for speech recognition, and system, method, and program for speech synthesis
CN113906503A (en) Processing overlapping speech from distributed devices
Cech et al. Active-speaker detection and localization with microphones and cameras embedded into a robotic head
WO2021230180A1 (en) Information processing device, display device, presentation method, and program
KR101976937B1 (en) Apparatus for automatic conference notetaking using mems microphone array
JP7400364B2 (en) Speech recognition system and information processing method
Okuno et al. Robot audition: Missing feature theory approach and active audition
EP4131256A1 (en) Voice recognition system and method using accelerometers for sensing bone conduction
WO2019150708A1 (en) Information processing device, information processing system, information processing method, and program
Srinivas et al. A new method for recognition and obstacle detection for visually challenged using smart glasses powered with Raspberry Pi
Freitas et al. Multimodal silent speech interface based on video, depth, surface electromyography and ultrasonic doppler: Data collection and first recognition results
Korchagin et al. Just-in-time multimodal association and fusion from home entertainment
Nguyen et al. Selection of the closest sound source for robot auditory attention in multi-source scenarios
Inoue et al. Speaker diarization using eye-gaze information in multi-party conversations

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees