TWI770867B

TWI770867B - voice recognition device

Info

Publication number: TWI770867B
Application number: TW110108542A
Authority: TW
Inventors: 王毓翔; 梁智能
Original assignee: 財團法人車輛研究測試中心
Priority date: 2021-03-10
Filing date: 2021-03-10
Publication date: 2022-07-11
Also published as: TW202236260A

Abstract

The present invention discloses a speech recognition device comprising at least one position acquisition device, a directional sound pickup device, a noise suppressor, and a speech recognition processor. The position acquisition device, the directional sound pickup device, the noise suppressor, and the speech recognition processor are coupled in sequence. The position acquisition device obtains the physical voice position of a sound source and outputs it to the directional sound pickup device, which receives the voice signal generated by the sound source according to that position. The noise suppressor removes noise from the voice signal according to the noise model corresponding to the physical voice position, producing a speech recognition signal. The speech recognition processor receives the speech recognition signal and generates an operation signal accordingly, thereby improving the accuracy of speech recognition.

Description

voice recognition device

The present invention relates to a recognition device, and more particularly to a speech recognition device.

As speech recognition matures, many multimedia devices use a speech recognition device as an input device, for example smart assistants on mobile phones, in-vehicle voice control devices, and smart home appliances. This brings a new mode of interaction to everyday technology: users can operate equipment directly, without pressing buttons or touching the device.

At present, speech recognition systems are mostly used in personal devices. Such a device can use a directional microphone, or restrict the pickup range and usage scenario, to achieve good sound pickup and recognition. In more complex environments, or where the microphone is far from the speaker, as in a vehicle cabin, the system is easily affected by noise or acoustic feedback. Shared equipment also suffers from operation interference. For example, when a first operator needs to interact continuously with shared equipment, a second operator who intentionally or unintentionally produces a voice signal competes for the operation right, which degrades the first operator's human-machine interaction. In addition, in a noisy environment where the pickup range cannot be restricted and the user cannot be moved, the speech recognition rate is poor and the system is difficult to operate.

Some of the speech recognition functions commonly found in vehicles use the Android Auto system, which starts receiving voice commands when the user says "OK Google" or presses and holds the voice command button on the steering wheel. In-vehicle voice functions mostly target the driver's needs, such as making phone calls, navigating, controlling music playback, or controlling the climate control system. Functionally, each of these is a one-way request and does not involve prolonged continuous operation. The central control systems of most commercially available vehicles use omnidirectional microphones directly, so pickup is easily affected by loudspeaker feedback and noise interference. If a commercially available directional microphone is used instead, passengers in positions other than the driver's seat find the system difficult to operate.

Existing conference pickup products use 360-degree omnidirectional pickup, mostly with high-sensitivity microphones, so that the speech of every participant in the meeting room is captured accurately. These devices focus on noise filtering to keep the sound clear. After receiving sound, most of them apply dynamic noise reduction (Digital Noise Reduction, DNR), gain control, or other related methods to increase voice strength and pickup capability. However, because they aim to capture the voices of all participants, they have little need for directivity: they do not pick up sound toward an individual operator, nor do they specifically suppress other voices.

Therefore, to address the difficulties described above, the present invention proposes a speech recognition device that solves the problems of the prior art.

The present invention provides a speech recognition device that, when voice signals are used to control shared equipment, reduces how often control is contested and improves the operability of the shared equipment, and that, in a complex and enclosed environment, improves pickup quality, pickup directivity, and noise reduction, thereby improving the accuracy of speech recognition.

In one embodiment of the present invention, a speech recognition device is provided that comprises at least one position acquisition device, a directional sound pickup device, a noise suppressor, and a speech recognition processor. The position acquisition device corresponds to at least one trigger condition. When a sound source satisfies the trigger condition, the position acquisition device obtains the physical voice position of the sound source and outputs it. The directional sound pickup device is coupled to the position acquisition device; it receives the physical voice position and, according to it, receives the voice signal generated by the sound source. The noise suppressor is coupled to the position acquisition device and the directional sound pickup device. The noise suppressor stores noise models corresponding respectively to a plurality of speech generating positions, and these speech generating positions include the physical voice position. The noise suppressor receives the voice signal and the physical voice position, and removes noise from the voice signal according to the noise model corresponding to the physical voice position, producing a speech recognition signal. The speech recognition processor is coupled to the noise suppressor; it receives the speech recognition signal and generates an operation signal accordingly.

In one embodiment of the present invention, the speech recognition device further comprises a coordinate converter coupled to the position acquisition device, the noise suppressor, and the directional sound pickup device. The coordinate converter receives the physical voice position, converts its coordinate system into the coordinate system used by the noise suppressor and the directional sound pickup device, and then transmits the converted physical voice position to the noise suppressor and the directional sound pickup device.

In one embodiment of the present invention, the at least one position acquisition device comprises a plurality of position acquisition devices, the at least one trigger condition comprises a plurality of trigger conditions, and the trigger conditions correspond respectively to the position acquisition devices. When the sound source satisfies the trigger conditions in sequence, the position acquisition device corresponding to the earliest-satisfied trigger condition obtains and outputs the physical voice position.

In one embodiment of the present invention, the position acquisition device is an image positioning module. When the image positioning module captures an image showing a user's raised-hand gesture, the trigger condition is satisfied, the user is treated as the sound source, and the user's physical position is used as the physical voice position.

In one embodiment of the present invention, the position acquisition device is a voice positioning module. When the voice positioning module receives, at different positions, a trigger speech generated by the sound source, the trigger condition is satisfied. The voice positioning module obtains the different reception time points of the trigger speech at those positions and derives the physical voice position from them.

In one embodiment of the present invention, the position acquisition device comprises a touch display panel and an application processor. The touch display panel displays the operation interface of an application, and the operation interface contains an image corresponding to the physical voice position. The application processor is coupled to the touch display panel, the noise suppressor, and the directional sound pickup device, and has the application installed. When the touch display panel is pressed at the location of that image, the trigger condition is satisfied, and the application processor obtains and outputs the physical voice position.

In one embodiment of the present invention, the directional sound pickup device comprises a microphone array and an audio processor. The microphone array receives the voice signal at different positions. The audio processor is coupled to the microphone array, the position acquisition device, and the noise suppressor, and stores a plurality of sets of offset periods corresponding respectively to the speech generating positions. The audio processor receives the physical voice position and, according to the set of offset periods corresponding to that position, shifts the waveforms of the voice signals received at the different positions to the same time point and sums them at that time point, producing an enhanced voice signal. The audio processor transmits the enhanced voice signal to the noise suppressor.
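The shift-and-add step described above resembles what is commonly called delay-and-sum beamforming. The following is a minimal sketch under stated assumptions, not the patent's implementation: offsets are expressed as whole numbers of samples, and the table `OFFSET_TABLE`, its position names, and the function name `delay_and_sum` are hypothetical.

```python
def delay_and_sum(channels, offsets):
    """Align each microphone channel by its sample offset and sum them.

    channels: list of equal-length sample lists, one per microphone.
    offsets:  non-negative integer delays in samples, one per microphone,
              looked up for the selected speech generating position.
    """
    if len(channels) != len(offsets):
        raise ValueError("one offset per channel is required")
    length = len(channels[0])
    out = [0.0] * length
    for samples, delay in zip(channels, offsets):
        # Advance the channel by `delay` samples so every copy of the
        # speech lines up at the same time index; the tail is left at zero.
        for n in range(length - delay):
            out[n] += samples[n + delay]
    return out


# Hypothetical lookup table: speech generating position -> per-microphone offsets.
OFFSET_TABLE = {
    "driver": [0, 1, 2],
    "passenger": [2, 1, 0],
}

# The same pulse reaches microphones 0, 1, 2 with delays of 0, 1, 2 samples,
# as it would for a talker at the assumed "driver" position.
mics = [
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
]
enhanced = delay_and_sum(mics, OFFSET_TABLE["driver"])       # pulses add up
mismatched = delay_and_sum(mics, OFFSET_TABLE["passenger"])  # pulses stay spread out
```

With the offsets that match the talker's position the three pulses coincide and reinforce each other, while offsets for a different position leave them spread out, which is the directional effect the paragraph describes.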

In one embodiment of the present invention, the directional sound pickup device comprises a directional microphone and an automatic rotating platform. The directional microphone is coupled to the noise suppressor, and the automatic rotating platform is coupled to the position acquisition device and supports the directional microphone. The automatic rotating platform receives the physical voice position and points the pickup direction of the directional microphone toward it. The directional microphone receives the voice signal and transmits it to the noise suppressor.
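As an illustration of how a rotating platform might turn a physical voice position into a pointing direction, the sketch below computes a pan angle in the horizontal plane. The planar geometry, the angle convention, and the function name `pan_angle_deg` are assumptions made here; the patent does not specify them.

```python
import math


def pan_angle_deg(platform_xy, source_xy):
    """Azimuth in degrees, counter-clockwise from the +x axis, from platform to source."""
    dx = source_xy[0] - platform_xy[0]
    dy = source_xy[1] - platform_xy[1]
    return math.degrees(math.atan2(dy, dx))


# Platform at the origin, talker diagonally ahead and to the left.
angle_front_left = pan_angle_deg((0.0, 0.0), (1.0, 1.0))
# Talker directly behind the platform along the -x axis.
angle_behind = pan_angle_deg((0.0, 0.0), (-1.0, 0.0))
```

Using `math.atan2` rather than a plain arctangent keeps the angle correct in all four quadrants, which matters when talkers can sit on any side of the platform.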

In one embodiment of the present invention, the speech recognition processor is coupled to the position acquisition device and the directional sound pickup device. When the speech recognition processor has not received the speech recognition signal for a preset period, it controls the position acquisition device to stop obtaining the physical voice position, controls the directional sound pickup device to stop receiving the physical voice position and generating the voice signal, and places the position acquisition device and the directional sound pickup device in a standby state.

In one embodiment of the present invention, a speech recognition device is provided that comprises a plurality of voice receivers, an audio processor, a noise suppressor, and a speech recognition processor. The voice receivers receive, at different positions, the voice signal generated by a sound source. The audio processor is coupled to all the voice receivers and stores a plurality of sets of offset periods corresponding respectively to a plurality of speech generating positions. The audio processor obtains the different reception time points of the voice signal at the different positions and derives the physical voice position of the sound source from them. The speech generating positions include the physical voice position. According to the set of offset periods corresponding to the physical voice position, the audio processor shifts the waveforms of the voice signals received at the different positions to the same time point and sums them at that time point, producing an enhanced voice signal. The noise suppressor is coupled to the audio processor and stores noise models corresponding respectively to the speech generating positions. The noise suppressor receives the enhanced voice signal and the physical voice position, and removes noise from the enhanced voice signal according to the noise model corresponding to the physical voice position, producing a speech recognition signal. The speech recognition processor is coupled to the noise suppressor; it receives the speech recognition signal and generates an operation signal accordingly.

In one embodiment of the present invention, the speech recognition processor is coupled to the audio processor. When the speech recognition processor has not received the speech recognition signal for a preset period, it controls the audio processor to stop obtaining the physical voice position and to stop generating the enhanced voice signal, and places the audio processor in a standby state.

Based on the above, the speech recognition device first obtains the physical voice position of the sound source and outputs it to the directional sound pickup device, so that the directional sound pickup device receives the voice signal generated by the sound source according to that position. As a result, when voice signals control shared equipment, control is contested less often and the equipment is easier to operate, and in a complex and enclosed environment pickup quality, pickup directivity, and noise reduction are improved, which raises the accuracy of speech recognition.

To give the examiners a fuller understanding of the structural features of the present invention and the effects it achieves, preferred embodiments are described in detail below with reference to the drawings:

Embodiments of the present invention are further explained below with reference to the related drawings. Wherever possible, the same reference numerals in the drawings and the description denote the same or similar components. In the drawings, shapes and thicknesses may be exaggerated for simplicity and ease of labeling. Elements not specifically shown in the drawings or described in the specification take forms known to those of ordinary skill in the art. Those of ordinary skill in the art can make various changes and modifications based on the content of the present invention.

The disclosure is described with the following examples, which are illustrative only, since those skilled in the art can make various changes and modifications without departing from the spirit and scope of this disclosure. The scope of protection of this disclosure is therefore defined by the appended claims. Throughout the specification and claims, unless the content clearly dictates otherwise, "a" and "the" mean "one or at least one" of the element or component concerned. In addition, as used in this disclosure, a singular article also covers a plurality of elements or components unless the plural is clearly excluded by the specific context. Likewise, as used in this description and throughout the claims below, "in" may mean both "in" and "on" unless the content clearly dictates otherwise. Terms used throughout the specification and claims generally have the ordinary meaning of each term as used in the art, in the content of this disclosure, and in the specific context, unless otherwise noted. Certain terms used to describe this disclosure are discussed below or elsewhere in this specification to give practitioners additional guidance regarding the description. Examples anywhere in this specification, including examples of any terms discussed herein, are illustrative only and do not limit the scope and meaning of the disclosure or of any exemplified term. Likewise, the disclosure is not limited to the various embodiments set forth in this specification.

Furthermore, the terms "electrically coupled" and "electrically connected" as used herein include any direct or indirect means of electrical connection. For example, if a first device is described as electrically coupled to a second device, the first device may be connected to the second device directly, or indirectly through other devices or connecting means. In addition, where the transmission or provision of an electrical signal is described, those skilled in the art will understand that the signal may undergo attenuation or other non-ideal changes along the way; unless otherwise stated, however, the signal at its source and the signal at the receiving end should be regarded as substantially the same signal. For example, if an electrical signal S is transmitted (or provided) from terminal A of an electronic circuit to terminal B, it may pass through the source and drain of a transistor switch and/or stray capacitance and incur a voltage drop. Unless the design deliberately exploits the attenuation or other non-ideal changes arising during transmission (or provision) to achieve a specific technical effect, the electrical signal S at terminal A and at terminal B should be regarded as substantially the same signal.

References below to "one embodiment" or "an embodiment" refer to a particular element, structure, or feature associated with at least one embodiment. The various occurrences of "one embodiment" or "an embodiment" below therefore do not necessarily refer to the same embodiment. Moreover, the particular components, structures, and features of one or more embodiments may be combined in any suitable manner.

Unless otherwise specified, conditional words such as "can", "could", "might", or "may" are generally intended to express that an embodiment has a given feature, element, or step, but may also be read as indicating that the feature, element, or step might not be required. In other embodiments, these features, elements, or steps may not be required.

FIG. 1 is a circuit block diagram of the speech recognition device according to the first embodiment of the present invention. Referring to FIG. 1, the first embodiment is described below. The speech recognition device 1 comprises at least one position acquisition device 10, a directional sound pickup device 11, a noise suppressor 12, and a speech recognition processor 13, all of which are hardware components. The directional sound pickup device 11 is coupled to the position acquisition device 10, the noise suppressor 12 is coupled to the position acquisition device 10 and the directional sound pickup device 11, and the speech recognition processor 13 is coupled to the noise suppressor 12. The at least one position acquisition device 10 corresponds to at least one trigger condition. For clarity and convenience, the number of position acquisition devices 10 and the number of trigger conditions are each taken to be one in this example. In addition, the position acquisition device 10, the directional sound pickup device 11, and the noise suppressor 12 may use the same coordinate system.

The operation of the first embodiment is described below. When a sound source 2 satisfies the trigger condition, the position acquisition device 10 obtains the physical voice position P of the sound source 2 and outputs it. The directional sound pickup device 11 receives the physical voice position P and, according to it, receives the voice signal V generated by the sound source 2, where the voice signal V contains the operation speech associated with the operation right. For example, the directional sound pickup device 11 can be implemented as a beamforming module that strengthens the voice signal V arriving from the direction of the physical voice position P and weakens voice signals from other directions. The noise suppressor 12 stores noise models corresponding respectively to a plurality of speech generating positions, and these speech generating positions include the physical voice position P. The noise suppressor 12 therefore receives the voice signal V and the physical voice position P, and removes noise from the voice signal V according to the noise model corresponding to the physical voice position P, producing a speech recognition signal R. The noise suppressor 12 may additionally use an adaptive filter algorithm and a finite impulse response (FIR) filter to remove noise from the voice signal V and improve noise suppression efficiency. The speech recognition processor 13 receives the speech recognition signal R and generates an operation signal O accordingly. The operation signal O can be used to control shared equipment. Because the position acquisition device 10 first obtains the physical voice position P of the sound source 2 and outputs it to the directional sound pickup device 11, the directional sound pickup device 11 receives the voice signal V according to that position. As a result, when the voice signal V controls shared equipment, control is contested less often and the equipment is easier to operate, and in a complex and enclosed environment pickup quality, pickup directivity, and noise reduction are improved, which raises the accuracy of speech recognition.
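The text mentions an adaptive filter algorithm combined with an FIR filter but does not say which one or how the per-position noise model is represented. The sketch below assumes one plausible arrangement: a least-mean-squares (LMS) noise canceller whose initial FIR taps are looked up by position. The noise-only reference input, the table `NOISE_MODELS`, its position names, and its tap values are all hypothetical.

```python
import random


def lms_noise_cancel(primary, reference, weights, mu=0.05):
    """Remove noise correlated with `reference` from `primary` using an LMS-adapted FIR filter.

    primary:   microphone samples (speech plus filtered noise).
    reference: noise-only reference samples of the same length (assumed available).
    weights:   initial FIR taps, here taken from a per-position noise model.
    mu:        LMS step size.
    Returns (cleaned_samples, final_taps).
    """
    w = list(weights)
    history = [0.0] * len(w)  # most recent reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        noise_estimate = sum(wi * hi for wi, hi in zip(w, history))
        e = d - noise_estimate  # error signal = current estimate of the clean speech
        cleaned.append(e)
        # LMS update: nudge each tap in the direction that reduces the squared error.
        w = [wi + mu * e * hi for wi, hi in zip(w, history)]
    return cleaned, w


# Hypothetical per-position noise models: initial FIR taps for each seat.
NOISE_MODELS = {
    "driver": [0.4, 0.2],
    "rear_left": [0.1, 0.05],
}

rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
# Silence plus noise: the microphone hears a two-tap filtered copy of the reference.
primary = [
    0.5 * reference[n] + (0.25 * reference[n - 1] if n > 0 else 0.0)
    for n in range(len(reference))
]
cleaned, taps = lms_noise_cancel(primary, reference, NOISE_MODELS["driver"])
```

Starting from taps stored for the talker's position, rather than from zeros, is one way a position-indexed model could shorten the adaptation time; whether the patent intends this is not stated.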

In some embodiments of the present invention, the speech recognition processor 13 may be coupled to the position acquisition device 10 and the directional sound pickup device 11. When the speech recognition processor 13 has not received the speech recognition signal R for a preset period, the operation of the speech recognition device 1 is considered finished and the operation right is released. In that case, the speech recognition processor 13 controls the position acquisition device 10 to stop obtaining the physical voice position P, controls the directional sound pickup device 11 to stop receiving the physical voice position P and generating the voice signal V, and places both devices in a standby state until the position acquisition device 10 captures a new physical position of a new sound source.
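The timeout behavior can be pictured as a small state machine. The following is a minimal sketch, assuming timestamps in seconds supplied by the caller; the class `OperationSession` and its method names are hypothetical, and rejecting a second trigger while a session is active is an assumption drawn from the stated goal of reducing contention for control.

```python
class OperationSession:
    """Tracks whether the device is serving a sound source or is in standby."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.active = False
        self.position = None
        self.last_signal_time = None

    def on_trigger(self, position, now):
        # Assumption: only a device in standby accepts a new sound source.
        if self.active:
            return False
        self.active = True
        self.position = position
        self.last_signal_time = now
        return True

    def on_recognition_signal(self, now):
        # Each recognition signal restarts the idle timer.
        if self.active:
            self.last_signal_time = now

    def tick(self, now):
        # Release the operation right after `timeout_s` without a recognition signal.
        if self.active and now - self.last_signal_time >= self.timeout_s:
            self.active = False
            self.position = None


session = OperationSession(timeout_s=5.0)
first_accepted = session.on_trigger("driver", now=0.0)
second_accepted = session.on_trigger("rear_left", now=1.0)  # rejected: session is held
session.on_recognition_signal(now=3.0)
session.tick(now=7.0)   # 4 s since the last signal: still active
still_active = session.active
session.tick(now=8.0)   # 5 s since the last signal: go to standby
released = not session.active
third_accepted = session.on_trigger("rear_left", now=8.5)
```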

FIG. 2 is a circuit block diagram of the speech recognition device according to the second embodiment of the present invention. Referring to FIG. 2, the second embodiment is described below. The second embodiment differs from the first in the number of position acquisition devices 10 and trigger conditions: the second embodiment has a plurality of each. To prevent the case where the voice signal V generated by the sound source 2 is masked and a single trigger condition cannot be satisfied, the second embodiment uses different kinds of trigger conditions, such as voice-related, image-related, and application-related trigger conditions. The trigger conditions correspond respectively to the position acquisition devices 10. The present invention does not consider the case in which several trigger conditions are satisfied simultaneously. When the sound source 2 satisfies the trigger conditions in sequence, the position acquisition device 10 corresponding to the earliest-satisfied trigger condition obtains and outputs the physical voice position P.
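The arbitration rule, in which the earliest-satisfied trigger wins, can be sketched in a few lines. The event tuple layout, the device labels, and the function name `select_position` are assumptions for illustration only.

```python
def select_position(trigger_events):
    """Return (device, position) from the earliest-satisfied trigger, or None if there is none.

    trigger_events: list of (timestamp_s, device_name, position) tuples,
                    one per trigger condition that has been satisfied.
    """
    if not trigger_events:
        return None
    _, device, position = min(trigger_events, key=lambda event: event[0])
    return device, position


events = [
    (2.4, "voice", (1.0, 0.5, 1.2)),
    (1.9, "image", (1.1, 0.5, 1.2)),
    (3.0, "app", (1.0, 0.6, 1.2)),
]
winner = select_position(events)
```

The text explicitly leaves simultaneous triggers out of scope; in this sketch a tie would simply resolve to whichever event appears first in the list.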

FIG. 3 is a circuit block diagram of a speech recognition device according to a third embodiment of the present invention. Referring to FIG. 3, the third embodiment of the speech recognition device is described below. The third embodiment differs from the first embodiment in that it further includes a coordinate converter 14. In the third embodiment, the position acquisition device 10 and the directional sound pickup device 11 may use different coordinate systems, while the directional sound pickup device 11 and the noise suppressor 12 may use the same coordinate system. The coordinate converter 14 is coupled to the position acquisition device 10, the noise suppressor 12 and the directional sound pickup device 11. The coordinate converter 14 receives the physical voice position P, converts the coordinate system of the physical voice position P into the coordinate system used by the noise suppressor 12 and the directional sound pickup device 11, and then transmits the converted physical voice position P' to the noise suppressor 12 and the directional sound pickup device 11, wherein the voice generating positions also include the converted physical voice position P'. Accordingly, the directional sound pickup device 11 receives the converted physical voice position P' and, according to it, receives the voice signal V generated by the sound source 2. The noise suppressor 12 receives the voice signal V and the converted physical voice position P', and eliminates the noise of the voice signal V according to the noise model corresponding to the converted physical voice position P', so as to generate a speech recognition signal R.
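The specification does not say how the coordinate converter 14 performs the conversion. As a minimal sketch, assume the two coordinate systems differ only by a rotation about the vertical axis and a translation; the function name `convert_position` and the calibration values below are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def convert_position(p: Vec3, yaw_rad: float, offset: Vec3) -> Vec3:
    """Map P from the acquisition device's frame into the pickup device's frame.

    Assumption: the frames differ by a rotation of yaw_rad about the z axis
    followed by a translation by offset (hypothetical calibration data).
    """
    x, y, z = p
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return (c * x - s * y + offset[0],
            s * x + c * y + offset[1],
            z + offset[2])


# P = (1.0, 0.0, 1.2) in the camera frame; the pickup frame is rotated by
# 90 degrees and shifted by (0.5, 0.0, -0.2).
p_converted = convert_position((1.0, 0.0, 1.2), math.pi / 2, (0.5, 0.0, -0.2))
```

Both the directional sound pickup device 11 and the noise suppressor 12 would then look up their stored data using the same converted position P'.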

In one embodiment of the present invention, the position acquisition device 10 may be an image positioning module, and the trigger condition is an image-related trigger condition. When the image positioning module captures an image showing a user in a specific posture, such as a raised-hand posture, the trigger condition is satisfied, the user serves as the sound source 2, and the physical position of the user serves as the physical voice position P. For example, the image positioning module may divide the captured image into a plurality of blocks and assign a number to each block, so that the number of the block containing the raised-hand posture is known and is used as the physical voice position P. Alternatively, if the image positioning module has dual lenses, it may use the dual lenses to locate the user, obtain the three-dimensional coordinates of the user, and use them as the physical voice position P.
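The block-numbering example can be sketched as below. The sketch assumes that a separate posture detector (not shown, and not specified in the source) has already reported the pixel at the centre of the raised hand; the function name, the row-major numbering and the grid size are illustrative choices.

```python
def block_number(x_px: int, y_px: int, width: int, height: int,
                 cols: int, rows: int) -> int:
    """Return the row-major block number (starting at 0) that contains a pixel."""
    if not (0 <= x_px < width and 0 <= y_px < height):
        raise ValueError("pixel lies outside the image")
    col = x_px * cols // width    # column index of the block
    row = y_px * rows // height   # row index of the block
    return row * cols + col


# A 1280x720 image divided into 4 columns and 3 rows; the raised hand is
# assumed to be detected at pixel (900, 500).
p_block = block_number(900, 500, 1280, 720, cols=4, rows=3)  # block 10
```

The block number then plays the role of the physical voice position P, and the downstream devices index their stored offset periods and noise models by it.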

In another embodiment of the present invention, the position acquisition device 10 may be a voice positioning module, and the trigger condition is a voice-related trigger condition. When the voice positioning module receives, at different positions, a trigger voice generated by the sound source 2, the trigger condition is satisfied. The trigger voice may be the same as or different from the voice signal. The voice positioning module obtains the different reception time points of the trigger voice at the different positions. Because the different reception time points reflect the distances between the sound source 2 and the different positions of the voice positioning module, the voice positioning module can obtain the physical voice position P from them. For example, the voice positioning module may include a three-dimensional microphone array and a voice processor coupled to each other. The three-dimensional microphone array includes a plurality of microphones. Because the microphones are located at different positions, they receive the trigger voice generated by the sound source 2 at different time points, and the voice processor can calculate the three-dimensional coordinates of the sound source 2 from the intervals between those time points and the positions of the microphones, and use the coordinates as the physical voice position P.
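One way to turn reception time points into a position is a search over candidate positions, keeping the candidate whose predicted arrival-time differences best match the measured ones. The specification does not prescribe this method; the grid search, the microphone layout and the speed of sound below are assumptions made for illustration.

```python
import itertools
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def arrival_times(source, mics):
    """Propagation time from the source to each microphone, in seconds."""
    return [math.dist(source, mic) / SPEED_OF_SOUND for mic in mics]


def locate(mics, times, candidates):
    """Return the candidate whose arrival-time differences best fit the measurement.

    Only differences relative to the first microphone are used, so the unknown
    instant at which the trigger voice was uttered cancels out.
    """
    def cost(candidate):
        predicted = arrival_times(candidate, mics)
        return sum(((times[i] - times[0]) - (predicted[i] - predicted[0])) ** 2
                   for i in range(1, len(mics)))
    return min(candidates, key=cost)


# Four non-coplanar microphones (hypothetical layout, metres).
mics = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0)]
true_source = (1.0, 1.5, 1.0)
# Measured reception times: propagation delay plus an unknown utterance time.
measured = [0.25 + t for t in arrival_times(true_source, mics)]
# Candidate positions on a 0.5 m grid inside a 2 m cube.
grid = [(0.5 * i, 0.5 * j, 0.5 * k)
        for i, j, k in itertools.product(range(5), repeat=3)]
p_estimate = locate(mics, measured, grid)
```

If the voice generating positions form a small predefined set (for example, seats), the candidate list can simply be that set, which keeps the search cheap.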

FIG. 4 is a circuit block diagram of the position acquisition device 10, the directional sound pickup device 11 and the noise suppressor 12 according to an embodiment of the present invention. Referring to FIG. 4, the position acquisition device 10 may include a touch display panel 100 and an application processor 101, wherein the application processor 101 is coupled to the touch display panel 100, the noise suppressor 12 and the directional sound pickup device 11. The touch display panel 100 displays an operation interface of an application program, wherein the operation interface has an image corresponding to the physical voice position P. The application program is installed on the application processor 101, so the trigger condition is an application-related trigger condition. When the touch display panel 100 is pressed at the position corresponding to the image, the trigger condition is satisfied, and the application processor 101 obtains and outputs the physical voice position P. In addition, the circuit shown in FIG. 4 may be applied to FIG. 1 or to other embodiments of the present invention, but is not limited thereto. When the circuit shown in FIG. 4 is applied to the embodiment of FIG. 3, the application processor 101 is coupled to the coordinate converter 14.
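The application-related trigger amounts to a lookup from the pressed screen location to a stored position. A minimal sketch follows; the `SeatIcon` type, the icon rectangles and the positions are hypothetical, since the specification only states that the interface shows an image corresponding to the physical voice position P.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

Position = Tuple[float, float, float]


class SeatIcon(NamedTuple):
    """Rectangle on the operation interface and the position it stands for (hypothetical)."""
    left: int
    top: int
    right: int
    bottom: int
    position: Position


def position_for_touch(x: int, y: int,
                       icons: Sequence[SeatIcon]) -> Optional[Position]:
    """Return P for the icon that was pressed, or None if no icon was hit."""
    for icon in icons:
        if icon.left <= x < icon.right and icon.top <= y < icon.bottom:
            return icon.position
    return None  # trigger condition not satisfied


ICONS = [
    SeatIcon(0, 0, 200, 200, (0.5, 0.5, 1.1)),
    SeatIcon(200, 0, 400, 200, (0.5, 1.5, 1.1)),
]
p_touch = position_for_touch(250, 100, ICONS)
```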

FIG. 5 is a circuit block diagram of the position acquisition device 10, the directional sound pickup device 11 and the noise suppressor 12 according to another embodiment of the present invention. Referring to FIG. 5, the directional sound pickup device 11 may include a microphone array 110 and an audio processor 111. The microphone array 110 receives the voice signal V at different positions. The audio processor 111 is coupled to the microphone array 110, the position acquisition device 10 and the noise suppressor 12, and stores a plurality of sets of offset periods respectively corresponding to the voice generating positions. The audio processor 111 receives the physical voice position P, shifts the waveforms of the voice signal V received at the different positions to the same time point according to the set of offset periods corresponding to the physical voice position P, and adds the shifted waveforms at that time point to generate an enhanced voice signal V'. The audio processor 111 transmits the enhanced voice signal V' to the noise suppressor 12, so that the noise suppressor 12 eliminates the noise of the enhanced voice signal V' according to the noise model corresponding to the physical voice position P, so as to generate a speech recognition signal R. In addition, the circuit shown in FIG. 5 may be applied to FIG. 1 or to other embodiments of the present invention, but is not limited thereto. When the circuit shown in FIG. 5 is applied to the embodiment of FIG. 3, the audio processor 111 and the noise suppressor 12 are coupled to the coordinate converter 14, and the converted physical voice position P' replaces the physical voice position P. When the circuit shown in FIG. 5 is applied to the embodiment of FIG. 4, the audio processor 111 is coupled to the application processor 101.
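The shift-and-add operation of the audio processor 111 corresponds to what is commonly called delay-and-sum beamforming. The sketch below works on a buffered frame and expresses the offset periods in whole samples; the table `OFFSET_TABLE`, its keys and its values are hypothetical stand-ins for the stored sets of offset periods.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical stored table: one set of per-microphone offsets (in samples)
# for each voice generating position.
OFFSET_TABLE: Dict[str, Tuple[int, ...]] = {
    "seat_A": (2, 1, 0),
    "seat_B": (0, 1, 2),
}


def delay_and_sum(channels: Sequence[Sequence[float]],
                  offsets: Sequence[int]) -> List[float]:
    """Advance each channel by its offset and add the channels sample by sample."""
    length = min(len(ch) - off for ch, off in zip(channels, offsets))
    return [sum(ch[off + n] for ch, off in zip(channels, offsets))
            for n in range(length)]


# A unit pulse reaches the third microphone first, the second one sample
# later and the first two samples later.
mic1 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
mic2 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
mic3 = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
enhanced = delay_and_sum([mic1, mic2, mic3], OFFSET_TABLE["seat_A"])
# enhanced == [0.0, 3.0, 0.0, 0.0]: the three pulses line up and add.
```

With the offsets stored for a different position, the pulses do not line up, which is what makes the pickup directional.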

FIG. 6 is a schematic circuit diagram of the sound source 2 and the directional sound pickup device 11 according to an embodiment of the present invention. Referring to FIG. 6, the microphone array 110 may include microphones m1, m2 and m3, and the audio processor 111 may include time shifters 1111, 1111' and 1111", an average calculator 1112 and a parameter adjuster 1113, wherein the time shifters 1111, 1111' and 1111" are respectively coupled to the microphones m1, m2 and m3, the parameter adjuster 1113 is coupled to the time shifters 1111, 1111' and 1111", and the time shifters 1111, 1111' and 1111" are coupled to the average calculator 1112. The parameter adjuster 1113 stores a plurality of sets of offset periods respectively corresponding to the voice generating positions. Because the microphones m1, m2 and m3 are at different distances from the sound source 2, they receive the voice signal V at different time points. For example, the interval between the time points at which the microphones m2 and m3 receive the voice signal V is t1, and the interval between the time points at which the microphones m1 and m3 receive the voice signal V is t2. Assume that the converted physical voice position P' or the physical voice position P corresponds to the microphone m3, which means that the microphone m3 is the closest to the sound source 2. The parameter adjuster 1113 sets the offset periods of the time shifters 1111, 1111' and 1111" to d1, d2 and d3 respectively, such that d1=t2, d2=t1 and d3=0. Therefore, the waveforms of the voice signal V received by the microphones m1, m2 and m3 are all shifted to the time point at which the microphone m3 receives the voice signal V. Next, the average calculator 1112 receives all the voice signals V from the time shifters 1111, 1111' and 1111", and adds and averages them to generate the enhanced voice signal V'. In addition, the circuit shown in FIG. 6 may be applied to FIG. 1 or to other embodiments of the present invention, but is not limited thereto. When the circuit shown in FIG. 6 is applied to the embodiment of FIG. 1, the parameter adjuster 1113 is coupled to the position acquisition device 10. When the circuit shown in FIG. 6 is applied to the embodiment of FIG. 3, the parameter adjuster 1113 is coupled to the coordinate converter 14. When the circuit shown in FIG. 6 is applied to the embodiment of FIG. 4, the parameter adjuster 1113 is coupled to the application processor 101.
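The operation of the time shifters and the average calculator 1112 can be written compactly as follows. The symbols v_1, v_2 and v_3 for the waveforms captured by the microphones m1, m2 and m3 are notation introduced here for illustration; they do not appear in the specification.

```latex
V'(t) = \frac{1}{3}\Bigl[\, v_1(t + d_1) + v_2(t + d_2) + v_3(t + d_3) \,\Bigr],
\qquad d_1 = t_2,\quad d_2 = t_1,\quad d_3 = 0 .
```

Because the microphone m1 receives the voice t_2 later than the microphone m3, advancing v_1 by d_1 = t_2 places its voice component at the time point of m3, and likewise for v_2. Under the idealized assumption that the noise in the three channels is uncorrelated and has equal variance \sigma^2, the averaged noise has variance \sigma^2 / 3, while the aligned voice component keeps its amplitude.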

FIG. 7 is a circuit block diagram of the position acquisition device 10, the directional sound pickup device 11 and the noise suppressor 12 according to still another embodiment of the present invention. Referring to FIG. 7, the directional sound pickup device 11 may alternatively include a directional microphone 112 and an automatic rotating platform 113. The directional microphone 112 is coupled to the noise suppressor 12, the automatic rotating platform 113 is coupled to the position acquisition device 10, and the automatic rotating platform 113 supports the directional microphone 112. The automatic rotating platform 113 receives the physical voice position P and points the pickup direction of the directional microphone 112 toward the physical voice position P, and the directional microphone 112 receives the voice signal V and transmits it to the noise suppressor 12. In addition, the circuit shown in FIG. 7 may be applied to FIG. 1 or to other embodiments of the present invention, but is not limited thereto. When the circuit shown in FIG. 7 is applied to the embodiment of FIG. 3, the automatic rotating platform 113 and the noise suppressor 12 are coupled to the coordinate converter 14, and the converted physical voice position P' replaces the physical voice position P. When the circuit shown in FIG. 7 is applied to the embodiment of FIG. 4, the automatic rotating platform 113 is coupled to the application processor 101.
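Pointing the directional microphone 112 at a three-dimensional position reduces to computing two angles. The sketch below assumes a platform with a pan axis and a tilt axis located at a known point; the function name and the two-axis mechanism are assumptions, since the specification does not describe the platform's degrees of freedom.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def pan_tilt(target: Vec3, platform: Vec3 = (0.0, 0.0, 0.0)) -> Tuple[float, float]:
    """Return (pan, tilt) in degrees that aim the platform at the target.

    Pan is measured in the horizontal plane from the x axis; tilt is the
    elevation above that plane.
    """
    dx, dy, dz = (t - p for t, p in zip(target, platform))
    pan = math.degrees(math.atan2(dy, dx))
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return pan, tilt


# A speaker one metre ahead, one metre to the side and sqrt(2) metres above
# the platform: both angles come out close to 45 degrees.
pan_deg, tilt_deg = pan_tilt((1.0, 1.0, math.sqrt(2.0)))
```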

FIG. 8 is a circuit block diagram of a speech recognition device according to a fourth embodiment of the present invention. Referring to FIG. 8, the fourth embodiment of the speech recognition device is described below. The speech recognition device 3 includes a plurality of voice receivers 30, an audio processor 31, a noise suppressor 32 and a speech recognition processor 33, all of which are hardware components. The audio processor 31 is coupled to all the voice receivers 30 and stores a plurality of sets of offset periods respectively corresponding to a plurality of voice generating positions. The noise suppressor 32 is coupled to the audio processor 31 and stores noise models respectively corresponding to the voice generating positions. The speech recognition processor 33 is coupled to the noise suppressor 32. In addition, the audio processor 31 and the noise suppressor 32 use the same coordinate system.

The operation of the fourth embodiment is described below. First, the voice receivers 30 receive, at different positions, a voice signal V generated by a sound source 4, wherein the voice signal V includes an operation voice corresponding to an operation right. Because the voice receivers 30 are at different distances from the sound source 4, they receive the voice signal V at different time points. The audio processor 31 obtains the different reception time points of the voice signal V at the different positions and, from them, obtains the physical voice position P of the sound source 4, wherein the voice generating positions include the physical voice position P. The audio processor 31 shifts the waveforms of the voice signal V received at the different positions to the same time point according to the set of offset periods corresponding to the physical voice position P, and adds the shifted waveforms at that time point to generate an enhanced voice signal V'. The noise suppressor 32 receives the enhanced voice signal V' and the physical voice position P, and eliminates the noise of the enhanced voice signal V' according to the noise model corresponding to the physical voice position P, so as to generate a speech recognition signal R. The noise suppressor 32 may further use an adaptive filter algorithm and a finite impulse response (FIR) filter to eliminate the noise of the voice signal V, so as to improve the noise suppression efficiency. The speech recognition processor 33 receives the speech recognition signal R and generates an operation signal O accordingly. The operation signal O may be used to control public equipment. The audio processor 31 first obtains the physical voice position P of the sound source 4, and then generates the enhanced voice signal V' according to the physical voice position P. In this way, when the voice signal V is used to control public equipment, contention for the control right becomes less frequent and the operability of the public equipment is improved; in a complex and enclosed environment, the sound pickup quality, the pickup directivity and the noise reduction are also improved, which raises the accuracy of speech recognition.
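The specification names an adaptive filter algorithm and an FIR filter without giving details. One common combination is the least-mean-squares (LMS) noise canceller sketched below. Several points are assumptions rather than statements from the source: a separate noise reference input is available, the stored noise model for a position takes the form of initial FIR coefficients, and the table `NOISE_MODELS` with its values is hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical noise models: initial FIR coefficients per voice generating position.
NOISE_MODELS: Dict[str, List[float]] = {
    "seat_A": [0.5, 0.0],
    "seat_B": [0.0, 0.5],
}


def lms_cancel(primary: Sequence[float], reference: Sequence[float],
               initial_weights: Sequence[float],
               mu: float = 0.01) -> Tuple[List[float], List[float]]:
    """Subtract an adaptively FIR-filtered noise reference from the primary signal."""
    weights = list(initial_weights)
    history = [0.0] * len(weights)  # most recent reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - noise_estimate  # the error is the cleaned output sample
        weights = [w + 2.0 * mu * error * h for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned, weights


# Synthetic test signals: a "voice" tone plus noise that is a filtered copy
# of the reference.
N = 5000
speech = [0.5 * math.sin(0.3 * n) for n in range(N)]
reference = [math.sin(1.0 * n) for n in range(N)]
noise = [0.8 * reference[n] + (0.3 * reference[n - 1] if n > 0 else 0.0)
         for n in range(N)]
primary = [s + v for s, v in zip(speech, noise)]
cleaned, weights = lms_cancel(primary, reference, NOISE_MODELS["seat_A"])
```

Starting from position-specific coefficients rather than zeros is one plausible reading of why a noise model per position would shorten the adaptation time, but the source does not confirm it.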

In some embodiments of the present invention, the speech recognition processor 33 may be coupled to the audio processor 31. When the speech recognition processor 33 has not received the speech recognition signal R for a preset period, the operation of the speech recognition device 3 is regarded as finished and the operation right is released. In that case, the speech recognition processor 33 controls the audio processor 31 to stop obtaining the physical voice position P, to stop generating the enhanced voice signal V', and to operate in a standby state until the voice receivers 30 receive a new voice signal.
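The release of the operation right can be modelled as a small state machine driven by timestamps. The class and method names below are hypothetical, and the sketch passes time in explicitly so that it does not depend on a real clock.

```python
from typing import Optional


class OperationRight:
    """Tracks whether the audio processor is active or in standby (illustrative model)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s              # the preset period
        self.last_signal_s: Optional[float] = None
        self.standby = True                     # no one holds the operation right yet

    def on_voice_signal(self, now_s: float) -> None:
        """A voice receiver picked up a new voice signal: leave standby."""
        if self.standby:
            self.standby = False
            self.last_signal_s = now_s

    def on_recognition_signal(self, now_s: float) -> None:
        """The speech recognition processor received R: restart the timer."""
        if not self.standby:
            self.last_signal_s = now_s

    def tick(self, now_s: float) -> bool:
        """Return True if the device is in standby at time now_s."""
        if not self.standby and now_s - self.last_signal_s >= self.timeout_s:
            self.standby = True                 # operation right is released
        return self.standby


right = OperationRight(timeout_s=5.0)
right.on_voice_signal(0.0)
right.on_recognition_signal(1.0)
active_at_4s = not right.tick(4.0)   # 3 s since the last R: still active
standby_at_7s = right.tick(7.0)      # 6 s since the last R: released
```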

FIG. 9 is a schematic circuit diagram of the sound source 4, the voice receivers 30 and the audio processor 31 according to an embodiment of the present invention. Referring to FIG. 8 and FIG. 9, the voice receivers 30 may be implemented as microphones M1, M2 and M3, and the audio processor 31 may include time shifters 311, 311' and 311", an average calculator 312 and a parameter adjuster 313, wherein the time shifters 311, 311' and 311" are respectively coupled to the microphones M1, M2 and M3, the parameter adjuster 313 is coupled to the time shifters 311, 311' and 311" and to the microphones M1, M2 and M3, and the time shifters 311, 311' and 311" are coupled to the average calculator 312. The parameter adjuster 313 stores a plurality of sets of offset periods respectively corresponding to the voice generating positions. Because the microphones M1, M2 and M3 are at different distances from the sound source 4, they receive the voice signal V at different time points. For example, the interval between the time points at which the microphones M2 and M3 receive the voice signal V is t1, and the interval between the time points at which the microphones M1 and M3 receive the voice signal V is t2. Assume that the parameter adjuster 313 finds that the physical voice position P corresponds to the microphone M3, which means that the microphone M3 is the closest to the sound source 4. The parameter adjuster 313 sets the offset periods of the time shifters 311, 311' and 311" to d1, d2 and d3 respectively, such that d1=t2, d2=t1 and d3=0. Therefore, the waveforms of the voice signal V received by the microphones M1, M2 and M3 are all shifted to the time point at which the microphone M3 receives the voice signal V. Next, the average calculator 312 receives all the voice signals V from the time shifters 311, 311' and 311", and adds and averages them to generate the enhanced voice signal V'. In addition, the circuit shown in FIG. 9 may be applied to FIG. 8 or to other embodiments of the present invention, but is not limited thereto.
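In the fourth embodiment the intervals t1 and t2 must be measured from the microphone signals themselves. The specification does not say how; a common technique is to pick the lag that maximizes the cross-correlation between two channels, sketched here with lags in whole samples. The function name and the test waveforms are illustrative.

```python
from typing import Sequence


def estimate_lag(later: Sequence[float], earlier: Sequence[float],
                 max_lag: int) -> int:
    """Return the lag in samples (0..max_lag) by which `later` trails `earlier`."""
    span = len(earlier) - max_lag  # keeps later[n + lag] inside the buffer

    def score(lag: int) -> float:
        return sum(later[n + lag] * earlier[n] for n in range(span))

    return max(range(max_lag + 1), key=score)


# The same short burst reaches M3 first, M2 one sample later and M1 three
# samples later.
sig_m3 = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
sig_m2 = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
sig_m1 = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]

t1 = estimate_lag(sig_m2, sig_m3, max_lag=4)  # interval between M2 and M3
t2 = estimate_lag(sig_m1, sig_m3, max_lag=4)  # interval between M1 and M3
offsets = (t2, t1, 0)                         # (d1, d2, d3) as in the text
```

The resulting tuple matches the assignment d1=t2, d2=t1, d3=0 described for the parameter adjuster 313.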

According to the above embodiments, the speech recognition device first obtains the physical voice position of the sound source and outputs it to the directional sound pickup device, so that the directional sound pickup device receives the voice signal generated by the sound source according to the physical voice position. In this way, when the voice signal is used to control public equipment, contention for the control right becomes less frequent and the operability of the public equipment is improved; in a complex and enclosed environment, the sound pickup quality, the pickup directivity and the noise reduction are also improved, which raises the accuracy of speech recognition.

The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of its implementation. Therefore, all equivalent changes and modifications made in accordance with the shape, structure, features and spirit described in the claims of the present invention shall be included in the scope of the claims of the present invention.

1: Speech recognition device
10: Position acquisition device
100: Touch display panel
101: Application processor
11: Directional sound pickup device
110: Microphone array
111: Audio processor
1111, 1111', 1111": Time shifters
1112: Average calculator
1113: Parameter adjuster
112: Directional microphone
113: Automatic rotating platform
12: Noise suppressor
13: Speech recognition processor
14: Coordinate converter
2: Sound source
3: Speech recognition device
30: Voice receiver
31: Audio processor
311, 311', 311": Time shifters
312: Average calculator
313: Parameter adjuster
32: Noise suppressor
33: Speech recognition processor
4: Sound source
P: Physical voice position
V: Voice signal
R: Speech recognition signal
O: Operation signal
P': Converted physical voice position
V': Enhanced voice signal
m1, m2, m3: Microphones
M1, M2, M3: Microphones

FIG. 1 is a circuit block diagram of a speech recognition device according to a first embodiment of the present invention.
FIG. 2 is a circuit block diagram of a speech recognition device according to a second embodiment of the present invention.
FIG. 3 is a circuit block diagram of a speech recognition device according to a third embodiment of the present invention.
FIG. 4 is a circuit block diagram of a position acquisition device, a directional sound pickup device and a noise suppressor according to an embodiment of the present invention.
FIG. 5 is a circuit block diagram of a position acquisition device, a directional sound pickup device and a noise suppressor according to another embodiment of the present invention.
FIG. 6 is a schematic circuit diagram of a sound source and a directional sound pickup device according to an embodiment of the present invention.
FIG. 7 is a circuit block diagram of a position acquisition device, a directional sound pickup device and a noise suppressor according to still another embodiment of the present invention.
FIG. 8 is a circuit block diagram of a speech recognition device according to a fourth embodiment of the present invention.
FIG. 9 is a schematic circuit diagram of a sound source, voice receivers and an audio processor according to an embodiment of the present invention.

1: Speech recognition device

10: Position acquisition device

11: Directional sound pickup device

12: Noise suppressor

13: Speech recognition processor

2: Sound source

P: Physical voice position

V: Voice signal

R: Speech recognition signal

O: Operation signal

Claims

A speech recognition device, comprising: at least one position acquisition device corresponding to at least one trigger condition, wherein when a sound source satisfies the at least one trigger condition, the at least one position acquisition device obtains and outputs a physical voice position of the sound source, and wherein the at least one trigger condition is a voice-related trigger condition, an image-related trigger condition or an application-related trigger condition; a directional sound pickup device coupled to the at least one position acquisition device, wherein the directional sound pickup device is configured to receive the physical voice position and to receive, according to the physical voice position, a voice signal generated by the sound source; a noise suppressor coupled to the at least one position acquisition device and the directional sound pickup device, wherein the noise suppressor stores noise models respectively corresponding to a plurality of voice generating positions, the voice generating positions include the physical voice position, and the noise suppressor is configured to receive the voice signal and the physical voice position and to eliminate noise of the voice signal according to the noise model corresponding to the physical voice position, so as to generate a speech recognition signal; and a speech recognition processor coupled to the noise suppressor, wherein the speech recognition processor is configured to receive the speech recognition signal and to generate an operation signal accordingly; wherein the speech recognition processor is coupled to the at least one position acquisition device and the directional sound pickup device, and when the speech recognition processor has not received the speech recognition signal for a preset period, the speech recognition processor controls the at least one position acquisition device to stop obtaining the physical voice position, controls the directional sound pickup device to stop receiving the physical voice position and generating the voice signal, and controls the at least one position acquisition device and the directional sound pickup device to operate in a standby state.

The speech recognition device as claimed in claim 1, further comprising a coordinate converter coupled to the at least one position acquisition device, the noise suppressor and the directional sound pickup device, wherein the coordinate converter is configured to receive the physical voice position, convert the coordinate system of the physical voice position into the coordinate system corresponding to the noise suppressor and the directional sound pickup device, and then transmit the converted physical voice position to the noise suppressor and the directional sound pickup device.

The speech recognition device as claimed in claim 1, wherein the at least one position acquisition device includes a plurality of position acquisition devices, the at least one trigger condition includes a plurality of trigger conditions, the trigger conditions respectively correspond to the position acquisition devices, and when the sound source satisfies the trigger conditions in sequence, the position acquisition device corresponding to the trigger condition satisfied earliest obtains and outputs the physical voice position.

The speech recognition device as claimed in claim 1, wherein the at least one position acquisition device is an image positioning module, and when the image positioning module captures an image showing a raised-hand posture of a user, the at least one trigger condition is satisfied, the user serves as the sound source, and the physical position of the user serves as the physical voice position.

The speech recognition device as claimed in claim 1, wherein the at least one position acquisition device is a voice positioning module, the at least one trigger condition is satisfied when the voice positioning module receives, at different positions, a trigger voice generated by the sound source, and the voice positioning module is configured to obtain different reception time points of the trigger voice at the different positions and to obtain the physical voice position accordingly.

The speech recognition device as claimed in claim 1, wherein the at least one position acquisition device comprises: a touch display panel for displaying an operation interface of an application program, wherein the operation interface has an image corresponding to the physical voice position; and an application processor coupled to the touch display panel, the noise suppressor and the directional sound pickup device, the application program being installed on the application processor, wherein when the touch display panel is pressed at the position corresponding to the image, the at least one trigger condition is satisfied, and the application processor obtains and outputs the physical voice position.

The speech recognition device as claimed in claim 1, wherein the directional sound pickup device comprises: a microphone array for receiving the voice signal at different positions; and an audio processor coupled to the microphone array, the at least one position acquisition device and the noise suppressor, wherein the audio processor stores a plurality of sets of offset periods respectively corresponding to the voice generating positions, and the audio processor is configured to receive the physical voice position, to shift the waveforms of the voice signal at the different positions to a same time point according to the set of offset periods corresponding to the physical voice position, to add the voice signals at the same time point so as to generate an enhanced voice signal, and to transmit the enhanced voice signal to the noise suppressor.

The speech recognition device as claimed in claim 1, wherein the directional sound pickup device comprises: a directional microphone coupled to the noise suppressor; and an automatic rotating platform coupled to the at least one position acquisition device and supporting the directional microphone, wherein the automatic rotating platform is configured to receive the physical voice position and to point the pickup direction of the directional microphone toward the physical voice position, and the directional microphone is configured to receive the voice signal and to transmit the voice signal to the noise suppressor.

A speech recognition device, comprising: a plurality of voice receivers for receiving, at different positions, a voice signal generated by a sound source; an audio processor coupled to the voice receivers, wherein the audio processor stores a plurality of sets of offset periods respectively corresponding to a plurality of voice generating positions, the audio processor is configured to obtain different reception time points of the voice signal at the different positions and to obtain a physical voice position of the sound source accordingly, the voice generating positions include the physical voice position, and the audio processor is configured to shift the waveforms of the voice signal at the different positions to a same time point according to the set of offset periods corresponding to the physical voice position and to add the voice signals at the same time point so as to generate an enhanced voice signal; a noise suppressor coupled to the audio processor, wherein the noise suppressor stores noise models respectively corresponding to the voice generating positions, and the noise suppressor is configured to receive the enhanced voice signal and the physical voice position and to eliminate noise of the enhanced voice signal according to the noise model corresponding to the physical voice position, so as to generate a speech recognition signal; and a speech recognition processor coupled to the noise suppressor, wherein the speech recognition processor is configured to receive the speech recognition signal and to generate an operation signal accordingly; wherein the speech recognition processor is coupled to the audio processor, and when the speech recognition processor has not received the speech recognition signal for a preset period, the speech recognition processor controls the audio processor to stop obtaining the physical voice position, controls the audio processor to stop generating the enhanced voice signal, and controls the audio processor to operate in a standby state.