JP2003323191A

JP2003323191A - Access system to internet homepage adaptive to voice

Info

Publication number: JP2003323191A
Application number: JP2002130741A
Authority: JP
Inventors: Kiyoyuki Suzuki; 清幸鈴木
Original assignee: Advanced Media Inc
Current assignee: Advanced Media Inc
Priority date: 2002-05-02
Filing date: 2002-05-02
Publication date: 2003-11-14

Abstract

PROBLEM TO BE SOLVED: To provide a client device, and a voice-enabled Internet homepage access system, that recognize voice input from a client device having a voice input function and enable homepage browsing by voice.

SOLUTION: For Internet homepage browsing in which a homepage is accessed by voice from a client device having a voice input function, the voice-enabled Internet homepage access system has: (1) an acoustic analysis means that receives voice input, performs acoustic analysis such as voice encoding, extracts feature parameters, and transmits the extracted feature parameters to a homepage providing site; (2) a voice recognition means, located at the voice-enabled homepage providing site, that receives the feature parameters, performs voice recognition using an acoustic model and the like, converts the recognized voice into a symbol string, and transmits the result to the client device; and (3) a receiving means by which the client device receives the homepage and displays it on a screen or the like.

COPYRIGHT: (C)2004, JPO

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention]

The present invention relates to a voice-enabled Internet homepage access system using a client device, in which voice input in natural language from a client device that can connect to the Internet and has a voice input function is recognized so that homepages can be browsed by voice.

[0002]

[Prior art]

The main current computer-based speech processing technologies fall broadly into four groups: speech coding, speech synthesis, speech recognition, and speaker recognition. Speech coding analyzes the spectrum of speech and compresses it by removing the redundancy in the speech waveform. Speech synthesis generates human speech and is used for applications such as reading text aloud. Speech recognition recognizes speech as language, but it is still a developing technology and is not as widespread as the first two. Speaker recognition identifies who is speaking from the voice and is applied in security systems and the like, but by its nature it is not widely used.

[0003] Spectral analysis is a technique widely used in speech processing. It is the standard form of frequency analysis in acoustic analysis, and power spectrum analysis is its most widely used variant. In power spectrum analysis, the input speech signal is first sampled digitally, and the sampled data is then transformed by a DFT (discrete Fourier transform) or FFT (fast Fourier transform) to obtain its frequency components. Analyzing these frequency components and applying phonological processing yields data that can be used for speech recognition and similar technologies.
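
As a rough illustration of the power spectrum step described above, the following sketch computes a power spectrum with a direct DFT. The function name `power_spectrum`, the 1 kHz test tone, and the 320-sample frame length are assumptions chosen for the example; they are not taken from the patent.

```python
import cmath
import math


def power_spectrum(samples):
    """Return |X[k]|^2 for bins k = 0..N/2, computed with a direct DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        # DFT definition: X[k] = sum over t of x[t] * exp(-2*pi*i*k*t/N)
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(samples))
        spectrum.append(abs(acc) ** 2)
    return spectrum


SAMPLE_RATE = 16000   # samples per second (illustrative)
N = 320               # 20 ms of samples at 16 kHz (illustrative)
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(N)]

ps = power_spectrum(tone)
peak_bin = max(range(len(ps)), key=ps.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / N   # bin spacing is SAMPLE_RATE / N
```

An FFT produces the same values in fewer operations; the direct form is used here only because it mirrors the definition.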

[0004] In most speech recognition, the sampled speech spectrum is analyzed and then classified by vector quantization into about 100 groups so that it can be processed as a sequence of labels. The data is then analyzed with a statistical model called a hidden Markov model (HMM), computing likelihoods along the way, and the result is output. The HMM is the model at the core of speaker-independent continuous speech recognition, and it can be used for both phoneme models and word models.
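
The vector quantization step can be pictured as a nearest-codeword lookup. The sketch below is a minimal illustration under assumed values: the four-entry, two-dimensional `codebook` and the `frames` are invented for the example, whereas the text speaks of about 100 groups.

```python
def nearest_label(vector, codebook):
    """Return the index of the codebook entry closest to vector (squared Euclidean distance)."""
    def distance(i):
        return sum((v - c) ** 2 for v, c in zip(vector, codebook[i]))
    return min(range(len(codebook)), key=distance)


# Toy codebook with 4 entries; a real system would hold about 100 entries
# of much higher dimension.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical feature vectors, one per analysis frame.
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (1.1, 0.9)]

# The label sequence is what the later statistical model consumes.
labels = [nearest_label(f, codebook) for f in frames]
```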

[0005] The role of the HMM is to observe the input speech pattern and find the symbol string or phoneme string that matches it best. The language model uses the occurrence probabilities of symbol strings, obtained by statistically analyzing a large amount of text data with an HMM or a similar method.
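
One common way to find the best-matching state sequence in an HMM is the Viterbi algorithm. The patent does not name the search method, so the sketch below is an assumption, and the two-state model with its probabilities is invented purely for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log-probability, best state path) for a discrete HMM."""
    # best[s] holds the best log score and path among paths ending in state s.
    best = {
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            score, path = max(
                (best[prev][0] + math.log(trans_p[prev][s]), best[prev][1])
                for prev in states
            )
            new_best[s] = (score + math.log(emit_p[s][obs]), path + [s])
        best = new_best
    return max(best.values())


# Hypothetical two-phoneme model observing vector-quantized labels 0 and 1.
states = ["a", "i"]
start_p = {"a": 0.6, "i": 0.4}
trans_p = {"a": {"a": 0.7, "i": 0.3}, "i": {"a": 0.3, "i": 0.7}}
emit_p = {"a": {0: 0.9, 1: 0.1}, "i": {0: 0.2, 1: 0.8}}

log_prob, path = viterbi([0, 0, 1, 1], states, start_p, trans_p, emit_p)
```

All probabilities in the toy model are non-zero, so no special handling of log(0) is needed here.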

[0006] For example, there are speech recognition systems that use an acoustic model and a language model. One such system has a 60,000-word Japanese dictionary, and each word in it carries acoustic-model and language-model data as the statistics needed for probability calculation. Speech input from a microphone is digitized and compared against the 60,000-word dictionary to narrow down the candidates. The system then proceeds through fast coarse classification, language analysis, and detailed acoustic matching, and finally selects as the recognition result the candidate with the highest probability under the acoustic and language models.

[0007] Speech recognition of this kind can be divided broadly into two functions: acoustic analysis and speech recognition. Acoustic analysis performs speech encoding, noise processing, correction, and so on. Speech recognition applies acoustic processing and language processing to the encoded speech data produced by acoustic analysis and extracts the word or character string with the highest probability. The acoustic processing and language processing use an acoustic model and a language model, respectively. The dictionary is where the acoustic patterns, words, and character strings used by each model are registered, and enriching the dictionary can improve the recognition rate.

[0008] FIG. 1 shows the basic model of a speech recognition engine 1; the recognition decoder in the figure corresponds to the speech recognition processing described above. Speech a (analog speech) input through a microphone or the like is encoded by acoustic analysis 110 into encoded speech b (digital speech) and passed to the speech recognition decoder 120. The speech recognition decoder 120 analyzes the encoded speech b using the system acoustic model 121, the system dictionary 122, and the system language model 123, and extracts several candidate words. The candidate with the highest probability is output as the speech recognition result c.

[0009] Several devices have been developed that are operated by word-level voice commands (for example, single words such as "right" and "left"). At the natural-language level, however, a large-scale speech recognition system is required, so natural-language input is not used in simple devices. The reasons include the large volume of encoded speech data and the large size of the speech recognition decoder. The fact that acoustic analysis and the speech recognition decoder cannot be separated and processed independently also narrows the range of applications for voice data input.

[0010] Internet homepage access by voice input is not currently performed on client devices with little memory and a slow CPU. With i-mode (registered trademark), for example, Internet operations are performed not by voice but with a manually operated input device. At present, such devices have a voice input function, yet voice operation has not been realized.

[0011] One patent publication on browsing homepages by speech recognition over a telephone is "Homepage access device and homepage access method" (Japanese Patent Laid-Open No. 2001-109687). In it, a homepage is retrieved through an Internet access server on the basis of a homepage address and a keyword for the item to be viewed, both entered by voice from a facsimile terminal or an ordinary telephone terminal. In other words, voice input is made possible by placing a base station that performs the speech recognition (the access server) between the user and the homepage providing site. With this method, homepage access from a client device is also possible.

[0012]

[Problems to be solved by the invention]

In light of these practical problems, the object of the present invention is to provide a client device and method that achieve a high recognition rate for voice input without using a learning function, and that enable access to voice-enabled Internet homepages through speech recognition with a lightweight system.

[0013]

[Means for solving the problems]

To solve the above problems, the invention described in claim 1 is a voice-enabled Internet homepage access system for Internet homepage browsing in which a homepage is accessed by voice from a client device having a voice input function, characterized by having: (1) an acoustic analysis means that receives human voice input, performs acoustic analysis such as speech encoding, noise processing, and correction, extracts feature parameters from the voice data, and transmits the feature parameters to a homepage providing site; (2) a speech recognition means, located at the voice-enabled homepage providing site, that receives the feature parameters, performs speech recognition using an acoustic model, a dictionary, and a language model, converts the specific speech into a symbol string, performs the necessary operations on the homepage, and transmits the selected information to the client device; and (3) a receiving means by which the client device receives the homepage and performs screen display, voice output, or the like.

[0014] The invention described in claim 2 is the voice-enabled Internet homepage access system according to claim 1, characterized in that the speech recognition includes a means for creating, for each homepage, a rule grammar for analyzing the feature parameters and generating a symbol string, and for performing speech recognition on a per-homepage basis.

[0015] The invention described in claim 3 is a client device having a voice input function, used for Internet homepage browsing in which a homepage is accessed by voice, characterized by comprising (1) an acoustic analysis means that receives human voice input, performs acoustic analysis such as speech encoding, noise processing, and correction, extracts feature parameters from the voice data, and transmits the feature parameters to a homepage providing site.

[0016] In the system of the present invention, the acoustic analysis unit is built into the client device. The site that provides the voice-enabled Internet homepage performs speech recognition on the digital signal (encoded speech data) produced by the acoustic analysis unit, extracts the symbol string needed to operate the homepage, and performs the homepage operation. To speed up transmission of the voice data, the digital signal generated by the client device consists of feature parameters, produced by extracting feature quantities that represent the characteristics of the voice data without lowering the recognition rate. At the site that provides the voice-enabled Internet homepage, a rule grammar that performs speech recognition on the feature parameters is developed for each homepage, making the system faster and lighter.

[0017] The distinguishing feature of the system of the present invention is that the acoustic analysis unit and the speech recognition unit (speech recognition decoder) are separated, with the former placed in the client device and the latter at the site. Furthermore, in the present invention the voice-input client device (for example, an i-mode (registered trademark) mobile phone) does not transmit the speech itself; it creates feature parameters and transmits those. The feature parameters are obtained by extracting feature quantities from the voice data, to an extent that does not lower the speech recognition rate, and parameterizing the resulting set. Specifically, the feature parameters make it possible to reduce the traffic to 3.75% (1.2 KB/sec) of ordinary voice data at 32 KB/sec (16 kHz, 16 bit). The acoustic analysis function for this purpose is implemented in hardware and built into the client device.
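
The data-rate figures quoted above can be checked with a few lines of arithmetic. This sketch assumes that "KB" means 1,000 bytes and that the audio is single-channel; the variable names are illustrative only.

```python
SAMPLE_RATE_HZ = 16_000      # 16 kHz sampling, as stated in the text
BITS_PER_SAMPLE = 16         # 16-bit samples, as stated in the text

# Raw audio rate: 16,000 samples/s * 2 bytes/sample = 32,000 bytes/s.
raw_bytes_per_sec = SAMPLE_RATE_HZ * BITS_PER_SAMPLE // 8

# Feature parameter rate quoted in the text (1.2 KB/sec, taking 1 KB = 1,000 bytes).
feature_bytes_per_sec = 1_200

# Share of the raw rate that is actually transmitted, in percent.
ratio_percent = 100 * feature_bytes_per_sec / raw_bytes_per_sec
```

Under these assumptions the ratio works out to 3.75%, matching the figure in the text.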

[0018] At the site that provides the homepage, meanwhile, the speech recognition decoder performs speech recognition on the feature parameters. The decoder performs speech recognition using a command-and-control type rule grammar in the role of the usual acoustic model and language model. The rule grammar is developed for each homepage, and an acoustic model, a language model, and a dictionary are created for each homepage. This makes the site's speech recognition system lighter and faster while still achieving a high recognition rate.

[0019] At the site, the ordinary homepage processing operation is executed on the basis of the speech recognition decoder's recognition result for the feature parameters, and the screen (information) corresponding to the result of that operation is transmitted to the user's client device. Apart from voice input and the processing related to it, ordinary homepage access is used as is.

[0020]

[Embodiments of the invention]

An embodiment of the present invention is described with reference to the drawings. In the following, the client device used by the user is assumed to be an i-mode (registered trademark) mobile phone, but essentially the same description applies to any client device that has a voice input device and can connect to the Internet.

[0021] FIG. 2 is a system configuration diagram for explaining the method of accessing a voice-enabled Internet homepage according to the present invention. It is possible to connect directly to the homepage providing site 4 and browse the homepage by operating the buttons 320 of the i-mode (registered trademark) mobile phone 3, but the present invention is based on voice input. The initial connection to the homepage providing site 4 is made by ordinary operation of the buttons 320. If the connected site 4 has a speech recognition system conforming to the specification of the present invention, the site 4 indicates this on the display 330 of the mobile phone 3 or by voice output from the speaker. From then on, information can be obtained by voice. The procedure is as follows.

[0022]
1) When the phone connects to the site 4, a main menu is normally shown on the display 330, and the user selects a menu item by voice. If the items are numbered, the user may speak just the number, such as "1" or "2", or may use natural-language voice input such as "select number 1" or "I want the information for number 2".
2) When voice input is received, the hardware acoustic analysis device 310 analyzes the speech, extracts the feature parameters p, and transmits them to the site 4.
3) At the site 4, which receives the feature parameters p, the HP system 400 determines whether the input is voice input or key input, and in the case of voice input passes the feature parameters p to the speech recognition engine 410.
4) The speech recognition engine 410 performs speech recognition using the rule grammar 430 corresponding to the screen 420 currently displayed on the mobile phone 3, generates the speech recognition result c (a symbol string), and passes it to the HP system 400. The HP system 400 selects a screen 420 on the basis of the speech recognition result c and transmits it as screen i (a screen 420 of the site 4) to the mobile phone 3.
5) The transmitted screen i is shown on the display 330.
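
The site-side part of the procedure above (steps 3 and 4) can be sketched as a small dispatcher. Everything named here is hypothetical: the `Request` type, the `SCREENS` table, and the `recognize` stand-in, which simply looks up a label sequence in a per-screen table instead of running a real recognizer.

```python
from dataclasses import dataclass


@dataclass
class Request:
    kind: str        # "voice" or "key"
    payload: object  # feature parameters (voice) or the pressed key (key)
    screen: str      # screen currently shown on the client


# Hypothetical per-screen data: a toy "rule grammar" mapping label sequences
# to symbol strings, and the screen each symbol string selects.
SCREENS = {
    "main_menu": {
        "patterns": {(3, 7, 7): "1", (5, 2, 9): "2"},
        "next": {"1": "news", "2": "weather"},
    },
    "news": {"patterns": {}, "next": {}},
    "weather": {"patterns": {}, "next": {}},
}


def recognize(features, screen):
    """Toy stand-in for the speech recognition engine: table lookup per screen."""
    return SCREENS[screen]["patterns"].get(tuple(features))


def hp_system(request):
    """Return the name of the screen to send back to the client."""
    if request.kind == "voice":
        symbol = recognize(request.payload, request.screen)
    else:
        symbol = request.payload
    # Stay on the current screen if nothing was recognized.
    return SCREENS[request.screen]["next"].get(symbol, request.screen)


next_screen = hp_system(Request("voice", [3, 7, 7], "main_menu"))
```

Keeping the grammar keyed by screen is what lets each lookup stay small, which is the point the text makes about per-screen rule grammars.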

[0023] The above is the sequence of voice input steps when the present invention is used. Next, speech recognition is examined in detail. FIG. 3 shows the speech recognition process of the present invention. The acoustic analysis device 310 in the figure is in the mobile phone, and the speech recognition engine 410 is at the site. On the site side, the speech recognition decoder 411 plays the key role in the present invention, because the acoustic analysis has already been completed on the mobile phone.

[0024] When speech a is input at the mobile phone, the acoustic analysis device 310 analyzes the analog speech a and performs encoding, noise processing, correction, and so on. It then extracts only the characteristic portions from the encoded speech, within a range that does not degrade the speech recognition rate, and creates the feature parameters p. The feature parameters p can be reduced to as little as 3.75% of the size of ordinary voice data.

[0025] The feature parameters p created by the acoustic analysis device 310 of the mobile phone (the circled numbers 1, 2, 3, ... in the figure denote the feature quantities, that is, the individual feature vectors) are transmitted to the site, where speech recognition is performed. On receiving the feature parameters p, the speech recognition decoder 411 performs speech recognition by acoustic probability calculation and linguistic probability calculation while referring to the dictionary 432. The acoustic probability calculation uses the command-and-control acoustic model (rule-grammar acoustic model) 431, and the linguistic probability calculation uses the rule-grammar language model 433. In the present invention, the rule-grammar acoustic model 431 is indispensable, in particular for handling the feature parameters p.
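
A decoder that combines acoustic and linguistic probabilities typically picks the candidate with the highest combined score. The sketch below illustrates that idea only; the scoring rule (sum of log probabilities), the candidate words, and all numbers are assumptions, not details given in the patent.

```python
import math


def decode(acoustic_probs, language_probs):
    """Return the candidate with the highest combined log score.

    Only candidates present in language_probs (the grammar for the current
    screen) are considered, which is how a small grammar prunes the search.
    """
    candidates = sorted(set(acoustic_probs) & set(language_probs))
    return max(
        candidates,
        key=lambda w: math.log(acoustic_probs[w]) + math.log(language_probs[w]),
    )


# Hypothetical scores for one utterance.
acoustic_probs = {"one": 0.30, "two": 0.35, "three": 0.25}

# Hypothetical grammar for a screen that offers only two menu items.
language_probs = {"one": 0.6, "two": 0.2}

result = decode(acoustic_probs, language_probs)
```

Here "two" has the best acoustic score on its own, but "one" wins once the grammar weight is applied, and "three" is never considered because the screen's grammar does not contain it.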

[0026] The speech recognition result c produced by the speech recognition decoder 411 is a symbol string (W1, W2, W3, ... in the figure denote individual words).

[0027] In A/D conversion, the instantaneous values of the waveform sampled by the A/D converter are converted to numbers. This conversion to numbers is what is called A/D conversion, and it is also referred to as quantization. The output of A/D conversion is a pulse code (PC: a digital code whose amplitude has been quantized by the PCM method). In calculating the feature parameters, only the characteristic acoustic components are computed from the PC and then corrected to create the feature parameters. At this stage, therefore, noise and the like are all removed, and only the acoustic digital data needed for the homepage browsing operation is transmitted to the speech recognition decoder as feature parameters.

[0028] For the feature parameters, a set of feature quantities that represent the characteristics of the voice data well is used. Typically these are the power spectrum obtained by spectral analysis together with its Δ (delta) and ΔΔ (delta-delta) values, forming a vector-quantized multidimensional vector. Normally a feature vector of about 27 dimensions in total is computed every 10 ms: MFCC (12), ΔMFCC (8), ΔΔMFCC (4), power spectrum, Δ power spectrum, and ΔΔ power spectrum. In the present invention, these are ultimately reduced to about 100 groups (labels), and voice data that is normally 32 KB/sec is reduced to 3.75% of that volume (1.2 KB/sec) for transmission. Specifically, the feature parameters are feature vectors extracted every 10 ms with a 20 ms window from sampling data of 16,000 samples per second (16 kHz).
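
The framing figures above translate into sample counts as follows. This is a sketch of the bookkeeping only, not of the feature extraction itself; `frame_starts` and `FEATURE_DIMS` are names made up for the example, and frames that would run past the end of the signal are simply dropped, which is one of several possible conventions.

```python
SAMPLE_RATE_HZ = 16_000
WINDOW = SAMPLE_RATE_HZ * 20 // 1000   # 20 ms window, in samples
HOP = SAMPLE_RATE_HZ * 10 // 1000      # one frame every 10 ms, in samples


def frame_starts(num_samples, window, hop):
    """Return the start index of every full window that fits in the signal."""
    return list(range(0, num_samples - window + 1, hop))


# Frame positions for one second of audio.
starts = frame_starts(SAMPLE_RATE_HZ, WINDOW, HOP)

# Dimensions listed in the text for the conventional feature vector.
FEATURE_DIMS = {
    "MFCC": 12,
    "delta MFCC": 8,
    "delta-delta MFCC": 4,
    "power": 1,
    "delta power": 1,
    "delta-delta power": 1,
}
total_dims = sum(FEATURE_DIMS.values())
```

Because the window is twice the hop, consecutive frames overlap by half.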

[0029] The speech recognition decoder, for its part, performs acoustic analysis and language analysis. The models used are the command-and-control acoustic model (rule-grammar acoustic model) and the rule-grammar language model. They are stored in auxiliary storage together with the dictionary and are loaded onto the computer and executed as needed. Because the rule-grammar acoustic model is built more simply than an ordinary acoustic analysis model, recognition accuracy drops slightly, but memory consumption is low. Moreover, because the acoustic model is tailored to the screen, the loss of recognition accuracy can be prevented.

[0030] The rule-grammar language model is likewise simpler than an ordinary language model. Making it specific to the screen reduces its size further while maintaining recognition accuracy. Dictionaries are divided into pronunciation dictionaries and language dictionaries. Because the present invention performs speech recognition with a rule grammar on the basis of the feature parameters, the dictionaries too are built for the rule grammar: a rule-grammar pronunciation dictionary (used by the rule-grammar acoustic model) and a rule-grammar language dictionary (used by the rule-grammar language model). The rule-grammar pronunciation dictionary registered in the present invention is built from adult men and women, excluding children and the elderly, whose speech varies widely.

[0031]

[Effects of the invention]

The features of the present invention are that the acoustic analysis unit and the speech recognition decoder unit are separated; that the acoustic analysis unit is built into a client device with a voice input function; that the result of acoustic analysis is sent to the speech recognition unit as feature parameters, which represent the voice data by feature vectors alone; and that the speech recognition decoder performs speech recognition with a rule grammar (acoustic model, dictionary, language model). The resulting effects of the present invention are as follows.

[0032] Until now, acoustic analysis and speech recognition have been so closely tied as to be inseparable, and because the encoded speech data (digital speech data) produced by acoustic analysis is large, separating them was difficult and acoustic analysis could not be performed within the client device. The present invention instead uses feature parameters, created by extracting the characteristic voice data needed for recognition without degrading the speech recognition rate. With the acoustic analysis function built into the client device, acoustic analysis is performed on the client device and the feature parameters are transmitted to the site, so the voice data (feature parameters) can be transmitted in a short time. Moreover, because the transmitted voice data is digital, degradation of the data in transit can be prevented.

[0033] Providing a rule grammar for each screen (information), matched to the feature parameters, makes speech recognition possible with a high recognition rate even from feature parameters of small data volume. Creating a rule grammar suited to the application not only reduces memory consumption but also shortens speech recognition processing time, while still achieving a high speech recognition rate. In particular, because the present invention maintains a high speech recognition rate without using a learning function, the user needs no tedious initial setup, and the system can handle an unspecified large number of sites.

[Brief description of drawings]

FIG. 1 is a system configuration diagram of a conventional speech recognition engine system.

FIG. 2 is a system configuration diagram for explaining the procedure by which a client device obtains a voice-enabled Internet homepage in the present invention.

FIG. 3 is a system configuration diagram for explaining the speech recognition procedure in the present invention.

[Explanation of symbols]

1 speech recognition engine
110 acoustic analysis
120 speech recognition decoder
121 system acoustic model
122 system dictionary
123 system language model
3 i-mode (registered trademark) mobile phone (Internet-capable client device)
310 acoustic analysis device (hardware acoustic analysis device)
320 buttons or data input keys
330 display
4 homepage providing site (site)
400 HP system
410 speech recognition engine
411 speech recognition decoder
420 screen (information, content)
430 rule grammar
431 command-and-control acoustic model (rule-grammar acoustic model)
432 dictionary
433 rule-grammar language model
a speech (analog speech)
b encoded speech (digital speech)
p feature parameters (set of feature quantities); the circled numbers 1, 2, 3, ... denote feature quantities
c speech recognition result; W1, W2, W3, ... denote the individual words obtained as the recognition result
i screen (information)

Continued from front page (51) Int.Cl.⁷ Identification code FI Theme code (reference) G10L 15/28 G10L 9/00 N 19/00 9/18 F 21/02 3/00 521W 537J

Claims

[Claims]

1. A voice-enabled Internet homepage access system for Internet homepage browsing in which a homepage is accessed by voice from a client device having a voice input function, the system comprising: (1) an acoustic analysis means that receives human voice input, performs acoustic analysis such as speech encoding, noise processing, and correction, extracts feature parameters from the voice data, and transmits the feature parameters to a homepage providing site; (2) a speech recognition means, located at the voice-enabled homepage providing site, that receives the feature parameters, performs speech recognition using an acoustic model, a dictionary, and a language model, converts the specific speech into a symbol string, performs the necessary operations on the homepage, and transmits the selected information to the client device; and (3) a receiving means by which the client device receives the homepage and performs screen display, voice output, or the like.

2. The voice-enabled Internet homepage access system according to claim 1, wherein the speech recognition includes a means for creating, for each homepage, a rule grammar for analyzing the feature parameters and generating a symbol string, and for performing speech recognition on a per-homepage basis.

3. A client device having a voice input function, used for Internet homepage browsing in which a homepage is accessed by voice, the client device comprising (1) an acoustic analysis means that receives human voice input, performs acoustic analysis such as speech encoding, noise processing, and correction, extracts feature parameters from the voice data, and transmits the feature parameters to a homepage providing site.