JPH0764593A - Information processor and control method

Info

Publication number: JPH0764593A
Application number: JP5215664A
Authority: JP
Inventors: Hidenori Nagasaki (長崎秀紀)
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1993-08-31
Filing date: 1993-08-31
Publication date: 1995-03-10

Abstract

PURPOSE: To provide an information processor and control method that reflect the speaker's characteristic pronunciation state in the display of speech information, so that the display can serve as a clue when extracting only the necessary part, and that can also extract only the necessary part automatically. CONSTITUTION: The device includes a microphone 19 for inputting speech information, a display part 16 for displaying it, and a storage memory 22 for storing it. A CPU 11 detects the pronunciation state of each syllable of the speech information input through the microphone 19 (its interval, rise/fall in pitch, strength, and the like) and displays the speech information on the display part 16 in a form in which that pronunciation state can be recognized. The CPU 11 also detects a keyword, a topic-change word, or a related word in the speech information, extracts speech information according to the detection result, and stores it in the storage memory 22.

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an information processing apparatus capable of accepting voice input, and more particularly to an information processing apparatus and control method that display input voice information and extract only the necessary portion of it.

[0002]

2. Description of the Related Art

Some conventional information processing apparatuses that perform voice processing display voice information input through a microphone on a display device so that the user can see it. This is done by performing speech recognition on the input voice information and converting it into text, that is, into the corresponding character codes.

[0003]

Speech, however, is not always input in a constant tone. For example, when a sentence such as 「きょうはてんきがとてもよいですね」 ("The weather is very nice today, isn't it?") is spoken, the speaker may leave a slight pause between 「きょうは」 (kyō wa, "today") and 「てんき」 (tenki, "weather"), pronounce 「とても」 (totemo, "very") with emphasis, or raise the pitch on the sentence-final 「ですね」 (desu ne). Such habits vary from speaker to speaker.

[0004]

Conventionally, this type of information processing apparatus merely displays the text-converted voice information as a flat sequence of characters. It cannot reflect the speaker's pronunciation state in the display: the interval between syllables, the intonation (rising or falling pitch), or the accent (loudness of the sound).

[0005]

Moreover, voice input rarely consists only of necessary information from beginning to end; it also contains small talk. It is therefore necessary to extract only the necessary portion from the display screen afterward, but in conventional models the user had to perform this extraction manually.

[0006]

PROBLEMS TO BE SOLVED BY THE INVENTION

As described above, because input voice information was conventionally displayed simply as a flat sequence, the speaker's characteristic pronunciation state could not be judged from the display, and it was hard to find clues when later extracting only the necessary portion from the display screen.

[0007]

Further, because the user always had to extract the necessary portion manually, the process was very tedious, and extraction was especially difficult when there was a large amount of information (that is, when the speech was long).

[0008]

The present invention has been made in view of the above. Its object is to provide an information processing apparatus and control method that reflect the speaker's characteristic pronunciation state in the display of voice information, so that the display can serve as a clue when extracting only the necessary portion, and that can also extract only the necessary portion automatically.

[0009]

MEANS FOR SOLVING THE PROBLEMS

The present invention comprises input means for inputting voice information, display means for displaying the voice information, and storage means for storing the voice information. The voice information input through the input means is speech-recognized and converted into character code information, and for each syllable the pronunciation state (the interval, rise/fall, strength, and the like of its pronunciation) is detected. The text-converted voice information is then displayed on the display means in a form that makes this pronunciation state visible.

[0010]

Further, the voice information within a range designated on the displayed voice information is extracted, and the extracted voice information is stored in the storage means.

[0011]

The present invention further comprises input means for inputting voice information and storage means for storing the voice information. The voice information input through the input means is speech-recognized and converted into character code information, and a preset keyword, a topic-change word, or a word related to the keyword is detected in the text-converted voice information. When a topic-change word or a related word is detected, its detection position is held. When the keyword is detected, or when related words have been detected at least a specified number of times, the voice information is extracted starting from the held detection position, and the extracted voice information is stored in the storage means.

[0012]

OPERATION

With the above configuration, when voice information is input, the pronunciation state of each syllable (the interval, rise/fall, strength, and the like of its pronunciation) is detected, and when the voice information is displayed, it is rendered in a form that makes this pronunciation state visible. Therefore, when designating an extraction range on the display screen, the user can easily extract only the necessary portion by using the pronunciation state as a clue.

[0013]

Further, with the above configuration, a keyword, topic-change word, or related word is detected in the voice information. When a topic-change word or related word is detected, its detection position is held as the extraction position of the voice information, and when the keyword is detected, or when related words have been detected at least a specified number of times, the voice information is extracted from that position. Therefore, only the voice information of the necessary portion can be extracted, without any extraction operation by the user.

[0014]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the main part of an information processing apparatus capable of voice input. In FIG. 1, a CPU 11 controls the entire apparatus. Here, it accesses a ROM 12 and a RAM 13 and executes voice input control processing and the like by launching programs in response to input instructions.

[0015]

The ROM 12 contains a program area 12a for storing programs needed to start up the CPU 11 and a font area 12b for storing font data used for display. In this embodiment it also contains a dictionary area 12c for storing the speech dictionary needed for speech recognition, and a network area 12d for storing neural network information that links related words to one another. The speech dictionary associates speech waveforms with hiragana characters and also includes patterns for contracted sounds (yōon, such as きょ) and geminate consonants (sokuon, っ).

[0016]

The RAM 13 contains a table area 13a for storing a display table used to display voice information, a voice buffer 13b for temporarily storing voice information, a keyword area 13c for storing keywords, an extraction position storage area 13d for storing extraction positions, and so on. As shown in FIG. 2, the display table stores, for each syllable, the character code obtained by recognizing the voice information, a timer count indicating the pronunciation interval, rise/fall information indicating the intonation, and strength information indicating the accent.

[0017]

The input unit 14 consists of, for example, a keyboard, and is used to input information such as character codes and commands. The input control unit 15 controls input of the information entered through the input unit 14. The display unit 16 consists of, for example, a liquid crystal display (LCD) and displays various kinds of information, including voice information. The display control unit 17 controls the display unit 16 using a VRAM (video RAM) 18, which stores the display data to be shown on the display unit 16.

[0018]

The microphone 19 is used to input voice information. The voice control unit 20 controls input of the voice information entered through the microphone 19. The timer 21 counts time according to instructions from the CPU 11. The storage memory 22 stores voice information to be saved.

[0019]

Next, the operation of the embodiment is described, divided into (a) the voice information input operation, (b) the voice information display operation, (c) extraction of voice information by user instruction, and (d) automatic extraction of voice information by the system.

[0020]

(a) Voice information input operation

In the present invention, when voice information is input through the microphone 19, it is speech-recognized and the result is displayed on the display unit 16. Rather than simply displaying the result, however, the invention reflects the pronunciation state in the display: the interval between syllables, the intonation (rising or falling pitch), and the accent (loudness).

[0021]

As shown in the flowchart of FIG. 3, when input of voice information is instructed through the input unit 14 (step A1), the CPU 11 starts the timer 21 (step A2) and waits for voice information to be input (No in step A3).

[0022]

Voice information is input through the microphone 19, processed by the voice control unit 20, and passed to the CPU 11. When the CPU 11 receives the processed voice information (Yes in step A3), it cuts out one syllable and reads the current time Tn from the timer 21 (step A4). It then sets the initial value "0" as the count value in the display table in the table area 13a of the RAM 13 (step A5).

[0023]

The CPU 11 then performs speech recognition on that syllable using the speech dictionary in the dictionary area 12c of the ROM 12 (step A6) and sets the resulting character code (the hiragana character corresponding to the voice information) in the display table (step A7). The CPU 11 also detects the rise/fall information and strength information of the syllable and sets each level in the display table with the initial value "1" (steps A8 and A9).

[0024]

When the next voice information is input (Yes in step A10), the CPU 11 likewise cuts out the next syllable, reads the current time Tn+1 from the timer 21 (step A11), and sets Tn+1 - Tn (the pronunciation interval) in the display table as the count value (step A12). The CPU 11 also performs speech recognition on the syllable (step A13) and sets the resulting character code (the hiragana character corresponding to the voice information) in the display table (step A14). Further, the CPU 11 detects the rise/fall information and strength information of the syllable and sets each level in the display table as a value relative to the initial level of "1" (steps A15 and A16).

[0025]

The same processing is repeated until input of voice information ends (No in step A10). As a result, as shown in FIG. 2, the display table holds the recognition result (hiragana character), the pronunciation interval, the rise/fall information, and the strength information for each syllable.
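
The table-building loop of steps A1 to A16 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Syllable` input record, its millisecond timestamps, and the numeric `pitch` and `loudness` measurements are hypothetical stand-ins for the output of the voice control unit 20 and timer 21, whose formats the source does not specify. Each `TableRow` holds the four FIG. 2 fields, with the first syllable as the reference level 1.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Syllable:
    """One syllable cut out of the input (hypothetical format)."""
    text: str         # recognized hiragana, e.g. "きょ"
    time_ms: int      # timer reading when the syllable was cut out (steps A4/A11)
    pitch: float      # measured pitch, e.g. in Hz (assumed measurement)
    loudness: float   # measured strength, e.g. RMS amplitude (assumed measurement)


@dataclass
class TableRow:
    """One row of the display table in table area 13a (FIG. 2)."""
    char: str         # recognition result (steps A7/A14)
    count: int        # interval since the previous syllable, 0 for the first (A5/A12)
    up_down: float    # intonation level relative to the first syllable (A8/A15)
    strength: float   # accent level relative to the first syllable (A9/A16)


def build_display_table(syllables: list[Syllable]) -> list[TableRow]:
    """Fill the display table one syllable at a time, following FIG. 3.

    Assumes the first syllable's pitch and loudness are nonzero.
    """
    if not syllables:
        return []
    first = syllables[0]
    # The first syllable gets count 0 and reference levels of 1 (steps A5, A8, A9).
    table = [TableRow(first.text, 0, 1.0, 1.0)]
    prev_time = first.time_ms
    for syl in syllables[1:]:
        table.append(TableRow(
            char=syl.text,
            count=syl.time_ms - prev_time,            # Tn+1 - Tn (step A12)
            up_down=syl.pitch / first.pitch,          # relative to level 1 (A15)
            strength=syl.loudness / first.loudness,   # relative to level 1 (A16)
        ))
        prev_time = syl.time_ms
    return table
```

Expressing the levels as ratios to the first syllable is one simple way to realize "relative to the initial level of 1"; the source does not say whether ratios or differences are used.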

[0026]

(b) Voice information display operation

When input of voice information has finished, the CPU 11 executes the display processing shown in the flowchart of FIG. 4. This processing is performed according to the display table created as described above.

[0027]

That is, the CPU 11 first reads the character code of the first syllable set in the display table and displays the corresponding character pattern on the display unit 16 (step B1). If a next character exists (Yes in step B2), the CPU 11 displays it; before doing so, it reads the count value stored in the display table for that character code and displays a space corresponding to the count value (step B3). The CPU 11 then displays the next character pattern on the display unit 16 in a display form corresponding to the rise/fall information and strength information attached to that character code (step B4).

[0028]

FIG. 5 shows a display example. When sentences such as 「きょうはてんきがとてもよいですね」 ("The weather is very nice today, isn't it?") and 「さてきょうのほんだいである…であります」 ("Well, today's main topic is ...") are input by voice, the recognition result for each syllable is displayed on the screen of the display unit 16. Here, a space corresponding to the pronunciation interval is displayed between characters, so that the interval of each syllable can be seen.

[0029]

Further, just as 「ね」 (ne) is shifted upward in the figure, characters pronounced with higher or lower intonation than the initial 「き」 (ki) sound, which serves as the reference, are displayed shifted up or down according to their level.

[0030]

Similarly, just as 「とても」 (totemo, "very") is displayed enlarged in the figure, characters pronounced with a stronger or weaker accent than the reference 「き」 sound are displayed enlarged or reduced according to their level.
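
Steps B1 to B4, together with the shifting and scaling rules of [0029] and [0030], amount to a layout pass over the display table. The sketch below is one possible reading with hypothetical tuning parameters: `ms_per_space` (milliseconds of interval per blank cell), `px_per_level` (vertical shift per unit of relative pitch), and clamping bounds for the font scale. The patent gives none of these values, and the sketch returns glyph positions rather than drawing to a display.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TableRow:
    """Display-table fields used for layout (FIG. 2)."""
    char: str
    count: int        # ms since the previous syllable
    up_down: float    # pitch relative to the first syllable (first = 1)
    strength: float   # loudness relative to the first syllable (first = 1)


@dataclass
class Glyph:
    char: str
    column: int       # horizontal character cell on the line
    dy: float         # vertical offset in pixels, positive = shifted up
    scale: float      # font-size multiplier


def layout(rows: list[TableRow], ms_per_space: int = 300,
           px_per_level: float = 10.0,
           min_scale: float = 0.5, max_scale: float = 2.0) -> list[Glyph]:
    """Place each syllable following steps B1 to B4 of FIG. 4."""
    glyphs: list[Glyph] = []
    column = 0
    for i, row in enumerate(rows):
        if i > 0:
            # Step B3: advance one cell, plus blank cells for a long interval.
            column += 1 + row.count // ms_per_space
        # Step B4 / [0029]: shift up or down by intonation relative to the reference.
        dy = (row.up_down - 1.0) * px_per_level
        # Step B4 / [0030]: enlarge or reduce by accent, clamped to a legible range.
        scale = min(max(row.strength, min_scale), max_scale)
        glyphs.append(Glyph(row.char, column, dy, scale))
    return glyphs
```

The first syllable passes through the same rules, so it gets `dy` 0 and `scale` 1 by construction, matching its role as the reference sound.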

[0031]

(c) Extraction of voice information by user instruction

Next, the operation of extracting only the necessary portion from the voice information displayed on the screen is described.

[0032]

As shown in the flowchart of FIG. 6, the user designates the start point and end point of the voice information to be extracted with a cursor or the like (steps C1 and C2). The CPU 11 takes the span between the start point and the end point as the extraction range and extracts the voice information within it (step C3). Specifically, it refers to the display table stored in the table area 13a of the RAM 13 and reads the character codes in the extraction range from that table. The CPU 11 then stores the voice information extracted in this way in the storage memory 22 as voice information to be saved (step C4).
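
Steps C1 to C4 reduce to slicing the character-code column of the display table between two cursor positions. In the minimal sketch below, `chars` (a list of syllable strings) and `storage` (a list) are hypothetical stand-ins for table area 13a and storage memory 22, cursor indices are treated as inclusive like cursors 31 and 32 in FIG. 5, and accepting the cursors in either order is an added assumption.

```python
from __future__ import annotations


def extract_range(chars: list[str], start: int, end: int,
                  storage: list[str]) -> str:
    """Copy the syllables between two cursor positions (steps C1 to C4)."""
    lo, hi = sorted((start, end))         # tolerate cursors set in either order
    if lo < 0 or hi >= len(chars):
        raise IndexError("cursor outside the displayed text")
    segment = "".join(chars[lo:hi + 1])   # step C3: read codes in the range
    storage.append(segment)               # step C4: save to storage memory 22
    return segment
```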

[0033]

This is illustrated in FIG. 5. In the display of FIG. 5, if the voice information from 「にほんごしょり」 (nihongo shori, "Japanese language processing") through 「であります」 (de arimasu) is needed, the user sets cursor 31 on 「に」 (ni), the beginning of that passage, and then sets cursor 32 on 「す」 (su), its end. As a result, only the portion 「にほんごしょり…であります」 is extracted and saved in the storage memory 22.

[0034]

By designating the extraction range on the display screen in this way, the voice information of only the necessary portion can be saved. Because, as described in (a) above, the display of the voice information reflects the speaker's characteristic pronunciation state, the user can use that state as a clue to extract only the necessary portion easily.

[0035]

(d) Automatic extraction of voice information by the system

In (c) above the user extracted the voice information; here, the system extracts it automatically. The automatic extraction is based on three kinds of information: a keyword, topic-change words, and related words. The keyword is designated by the user in advance and stored in the keyword area 13c of the RAM 13. Topic-change words are words such as "well" (さて) and "by the way" (ところで). Related words are words associated with the keyword; for example, if the keyword is "Japanese processing" (日本語処理), then "natural language processing" (自然言語処理), "word processor" (ワードプロセッサ), and the like are related words. These related words are obtained from the neural network information stored in the network area 12d of the ROM 12.

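The source describes the network area 12d only as "neural network information" linking related words, without giving its structure or weights. As a stand-in, the sketch below models it as a plain undirected association graph (a dictionary of neighbor lists, which is an assumption rather than the patent's data format) and collects every word within `max_hops` links of the keyword by breadth-first search. The function name and the sample graph in the test are hypothetical.

```python
from __future__ import annotations

from collections import deque


def related_words(graph: dict[str, list[str]], keyword: str,
                  max_hops: int = 1) -> set[str]:
    """Return words reachable from `keyword` in at most `max_hops` links."""
    # Build an undirected adjacency map so links can be followed either way.
    adj: dict[str, set[str]] = {}
    for word, neighbors in graph.items():
        for other in neighbors:
            adj.setdefault(word, set()).add(other)
            adj.setdefault(other, set()).add(word)
    seen = {keyword}
    found: set[str] = set()
    queue = deque([(keyword, 0)])
    while queue:
        word, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adj.get(word, ()):
            if nxt not in seen:
                seen.add(nxt)
                found.add(nxt)
                queue.append((nxt, depth + 1))
    return found
```

A real association network would more likely score links and cut off by weight; a hop limit is just the simplest rule that yields a finite related-word set.
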
[0036]

Here it is assumed that a speech dictionary associating speech waveforms with words (a dictionary accurate enough to decompose speech into morphemes) is used to replace the input voice information with words for display, and that words such as the keyword, topic-change words, and related words are searched for among them. The voice information replaced with words is temporarily stored in the voice buffer 13b of the RAM 13.

[0037]

As shown in the flowcharts of FIGS. 7 and 8, the CPU 11 first initializes the word search counter A and the related-word appearance counter B to "0" and resets the topic-change flag F to off (step D1). The word search counter A counts how many words have been searched in the search target. The related-word appearance counter B counts how many words related to the keyword have appeared. The topic-change flag F is turned on when a topic-change word such as "well" or "by the way" is found. The counters A and B and the flag F are provided, for example, in the RAM 13.

[0038]

The CPU 11 cuts out the first word of the sentence from the voice information stored in the voice buffer 13b (step D2) and updates the word search counter A (step D3). It then checks whether the extracted word is the keyword, a topic-change word, or a word related to the keyword (steps D4 to D6).

[0039]

When the keyword is detected in step D4, the CPU 11 checks whether the topic-change flag F is on (step D9). This is because a detected keyword alone does not reveal whether it was used in the main topic or merely happened to appear outside it. Rather than extracting voice information on the keyword alone, the system therefore extracts it only after a topic change has occurred.

[0040]

If the topic-change flag F is off (No in step D9), the CPU 11 checks whether the value of the related-word appearance counter B has reached the specified value (step D10). This handles the case where the speaker enters the main topic without a topic-change word.

[0041]

If counter B has not reached the specified value (No in step D10), the CPU 11 returns to step D3, updates the word search counter A, and searches for the next word. If counter B has reached the specified value (Yes in step D10), or if the topic-change flag F is on in step D9, the CPU 11 extracts the voice information based on the extraction position kept in step D15 or step D18 described later (step D11) and stores it in the storage memory 22 (step D12).

[0042]

When a topic-change word is detected in step D5 and the topic-change flag F is off (No in step D13), the CPU 11 turns the flag F on (step D14) and keeps the detection position as the extraction position in the extraction position storage area 13d (step D15). It then returns to step D3, updates the word search counter A, and searches for the next word.

[0043]

When a word related to the keyword is detected in step D6, the CPU 11 updates the related-word appearance counter B (step D16). If the value of counter B is then "1", that is, if a related word has been detected for the first time (Yes in step D17), the CPU 11 keeps the detection position as the extraction position in the extraction position storage area 13d (step D18), returns to step D3, updates the word search counter A, and searches for the next word. When counter B reaches the specified value (Yes in step D19), the CPU 11 extracts the voice information from the extraction position kept in the extraction position storage area 13d (step D20) and stores it in the storage memory 22 (step D21).

[0044]

In this way, words are taken out one at a time from the beginning of the sentence, and each is checked for whether it is the keyword, a topic-change word, or a related word. When a topic-change word or related word is detected, its detection position is held as the extraction position of the voice information, and when the keyword is detected, or when related words have been detected at least the specified number of times, the voice information is extracted from that position.

[0045]

In case no extraction position is found even after a predetermined number of words have been searched, when the word search counter A reaches a prescribed value (step D7), the word search counter A, the related-word appearance counter B, and the topic-change flag F are all cleared (step D8), and the search starts again.
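
The counter-and-flag logic of steps D1 to D21 can be traced with the sketch below. It assumes the voice buffer 13b is already a list of words, treats "extraction" as returning the words from the kept position to the end of the buffer (the end of a passage is only discussed in [0048]), and uses hypothetical default values for the two thresholds, which the source leaves unspecified. As described, steps D15 and D18 both write the single extraction position storage area 13d, so a first related word found after a topic change replaces the kept position. One small liberty: the D19 threshold test is also applied right after D18, so a threshold of 1 still triggers.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AutoExtractConfig:
    keyword: str                     # stored in keyword area 13c
    topic_words: frozenset[str]      # e.g. "well", "by the way"
    related: frozenset[str]          # related words, e.g. from network area 12d
    related_threshold: int = 3       # "specified value" for counter B (assumed)
    search_limit: int = 200          # "prescribed value" for counter A (assumed)


def auto_extract(words: list[str], cfg: AutoExtractConfig) -> list[str] | None:
    """Return the extracted words, or None if no extraction is triggered."""
    a, b, flag_f = 0, 0, False           # step D1: clear A and B, turn F off
    kept: int | None = None              # extraction position storage area 13d
    for pos, word in enumerate(words):   # step D2: cut out the next word
        a += 1                           # step D3
        if word == cfg.keyword:          # step D4
            # Steps D9/D10: extract only after a topic change or enough related words.
            if (flag_f or b >= cfg.related_threshold) and kept is not None:
                return words[kept:]      # steps D11, D12
        elif word in cfg.topic_words:    # step D5
            if not flag_f:               # step D13
                flag_f = True            # step D14
                kept = pos               # step D15
        elif word in cfg.related:        # step D6
            b += 1                       # step D16
            if b == 1:                   # step D17
                kept = pos               # step D18
            if b >= cfg.related_threshold and kept is not None:  # step D19
                return words[kept:]      # steps D20, D21
        elif a >= cfg.search_limit:      # step D7
            # Step D8; also clearing `kept` is an assumption, not in the source.
            a, b, flag_f, kept = 0, 0, False, None
    return None
```

With the defaults, a FIG. 9(a)-style buffer is cut at the topic-change word, and a FIG. 9(b)-style buffer at the first of three related words.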

[0046]

FIG. 9 shows display examples. FIG. 9(a) shows an example in which voice information is extracted using "Japanese processing" (日本語処理) as the keyword from text containing the topic-change word "by the way" (ところで). In this case, the position of "by the way" is kept as the extraction position, and when the keyword "Japanese processing" is detected, the voice information is extracted starting from the position of "by the way".

[0047]

FIG. 9(b) shows an example in which voice information is extracted on the basis of related words from text containing neither a topic-change word such as "by the way" nor the keyword "Japanese processing". In this case, "image", "natural language processing", and "information" are found as words related to the keyword "Japanese processing" using a neural network such as that shown in FIG. 10, and when their count reaches the specified value, the voice information is extracted starting from the position of the first related word detected (the first occurrence of "image").

[0048]

Although the above embodiment describes only detecting the beginning of a passage, the end of a passage can be detected in the same way. Possible methods include detecting words that signal the end of a passage, such as "as described above" (以上のように), "after all" (結局), or "in conclusion" (結びとして), or using the neural network to detect a point where many words only distantly related to the keyword appear.
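
The two end-of-passage heuristics mentioned in [0048] could be combined as in the sketch below. It is an illustration only: the function name, the `near` set (the keyword plus its related words, standing in for a proper relatedness measure from the network), and the `unrelated_run` threshold are all hypothetical. It also assumes `words` holds content words only, since function words would otherwise be counted as unrelated.

```python
from __future__ import annotations


def find_context_end(words: list[str], start: int, end_markers: set[str],
                     near: set[str], unrelated_run: int = 5) -> int:
    """Return the inclusive index where the passage beginning at `start` ends."""
    run = 0                              # consecutive words unrelated to the keyword
    for i in range(start, len(words)):
        word = words[i]
        if word in end_markers:          # e.g. "as described above", "after all"
            return i                     # keep the marker itself in the passage
        if word in near:
            run = 0
        else:
            run += 1
            if run >= unrelated_run:     # many distantly related words in a row
                # Last word before the unrelated stretch, never before `start`.
                return max(i - run, start)
    return len(words) - 1                # no end found: passage runs to the end
```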

[0049]

EFFECTS OF THE INVENTION

As described above, according to the present invention, the pronunciation state of each syllable of the input voice information (the interval, rise/fall, strength, and the like of its pronunciation) is detected, and the voice information is displayed in a form that makes this state visible. Therefore, when an extraction range is designated on the display screen, the pronunciation state can serve as a clue for easily extracting only the necessary portion.

[0050] Further, according to the present invention, a keyword, a topic-switching word, or a related word is detected in the voice information, and the voice information is extracted according to the detection result. Only the necessary portion can therefore be extracted without any extraction operation by the user. As a result, an information processing apparatus capable of accepting voice input can automatically extract only the necessary portion of the input voice information, reducing the burden on the user.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of an information processing apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram showing the configuration of a display table of the embodiment.

FIG. 3 is a flowchart for explaining a voice information input operation of the embodiment.

FIG. 4 is a flowchart for explaining a voice information display operation of the embodiment.

FIG. 5 is a diagram showing a display example of voice information according to the embodiment.

FIG. 6 is a flowchart for explaining a voice information extraction operation according to a user's instruction in the embodiment.

FIG. 7 is a flowchart for explaining an automatic extraction operation of voice information by the system of the embodiment.

FIG. 8 is a flowchart for explaining an automatic extraction operation of voice information by the system of the embodiment.

FIG. 9 is a diagram showing a display example of voice information according to the embodiment.

FIG. 10 is a diagram showing the configuration of a neural network according to the embodiment.

[Description of Reference Numerals]

11 ... CPU, 12 ... ROM, 12c ... dictionary area, 12d ... network area, 13 ... RAM, 13a ... table area, 13b ... voice buffer, 13c ... keyword area, 13d ... extraction position storage area, 14 ... input unit, 15 ... input control unit, 16 ... display unit, 17 ... display control unit, 18 ... VRAM, 19 ... microphone, 20 ... voice control unit, 21 ... timer, 22 ... storage memory.

Continuation of the front page: (51) Int.Cl.⁶ (identification code, JPO file number, FI, technical indication): G06F 3/16 320 D 7165-5B H 7165-5B 17/22

Claims


1. An information processing apparatus comprising: input means for inputting voice information; voice processing means for performing voice recognition on the voice information input by the input means, converting it into text as character code information, and detecting the pronunciation interval for each syllable; and display control means for displaying the voice information converted into text by the voice processing means in a display form in which the pronunciation interval can be recognized.

2. An information processing apparatus comprising: input means for inputting voice information; voice processing means for performing voice recognition on the voice information input by the input means, converting it into text as character code information, and detecting the rise/fall of the pronunciation for each syllable; and display control means for displaying the voice information converted into text by the voice processing means in a display form in which the rise/fall of the pronunciation can be recognized.

3. An information processing apparatus comprising: input means for inputting voice information; voice processing means for performing voice recognition on the voice information input by the input means, converting it into text as character code information, and detecting the strength of the pronunciation for each syllable; and display control means for displaying the voice information converted into text by the voice processing means in a display form in which the strength of the pronunciation can be recognized.

4. The information processing apparatus according to claim 1, further comprising: instructing means for designating a range to be extracted from the voice information displayed by the display control means; extracting means for extracting the voice information within the designated range in accordance with the instruction from the instructing means; and storage means for storing the voice information extracted by the extracting means.

5. An information processing apparatus comprising: input means for inputting voice information; voice processing means for performing voice recognition on the voice information input by the input means, converting it into text as character code information, and detecting, for each syllable, a pronunciation state such as the pronunciation interval, rise/fall, or strength; display control means for displaying the voice information converted into text by the voice processing means in a display form in which the pronunciation state can be recognized; instructing means for designating a range to be extracted from the voice information displayed by the display control means; extracting means for extracting the voice information within the designated range in accordance with the instruction from the instructing means; and storage means for storing the voice information extracted by the extracting means.

6. A control method for an information processing apparatus comprising input means for inputting voice information, display means for displaying the voice information, and storage means for storing the voice information, the control method comprising: performing voice recognition on the voice information input by the input means and converting it into text as character code information; detecting, for each syllable, a pronunciation state such as the pronunciation interval, rise/fall, and strength; displaying the text-converted voice information on the display means in a display form in which the pronunciation state can be recognized; extracting the voice information within a designated range from the displayed voice information; and storing the extracted voice information in the storage means.

7. An information processing apparatus comprising: input means for inputting voice information; voice processing means for performing voice recognition on the voice information input by the input means and converting it into text as character code information; detecting means for detecting a preset keyword in the voice information converted into text by the voice processing means; extracting means for extracting the voice information on the basis of the keyword detected by the detecting means; and storage means for storing the voice information extracted by the extracting means.

8. An information processing apparatus comprising: input means for inputting voice information; voice processing means for performing voice recognition on the voice information input by the input means and converting it into text as character code information; detecting means for detecting a topic-switching word in the voice information converted into text by the voice processing means; extracting means for extracting the voice information on the basis of the topic-switching word detected by the detecting means; and storage means for storing the voice information extracted by the extracting means.

9. An information processing apparatus comprising: input means for inputting voice information; voice processing means for performing voice recognition on the voice information input by the input means and converting it into text as character code information; detecting means for detecting a word related to a preset keyword in the voice information converted into text by the voice processing means; extracting means for extracting the voice information on the basis of the related word detected by the detecting means; and storage means for storing the voice information extracted by the extracting means.

10. An information processing apparatus comprising: input means for inputting voice information; voice processing means for performing voice recognition on the voice information input by the input means and converting it into text as character code information; detecting means for detecting, in the voice information converted into text by the voice processing means, a preset keyword, a topic-switching word, or a word related to the keyword; extracting means for extracting the voice information from the position at which the detecting means detected the topic-switching word or the related word, when the keyword is subsequently detected or when the related words have been detected a specified number of times or more; and storage means for storing the voice information extracted by the extracting means.

11. A control method for an information processing apparatus comprising input means for inputting voice information and storage means for storing the voice information, the control method comprising: performing voice recognition on the voice information input by the input means and converting it into text as character code information; detecting, in the text-converted voice information, a preset keyword, a topic-switching word, or a word related to the keyword; holding the detection position when the topic-switching word or the related word is detected; extracting the voice information from the held detection position when the keyword is then detected or when the related words have been detected a specified number of times or more; and storing the extracted voice information in the storage means.
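The detection logic of claims 10 and 11 can be sketched in Python as a small scan that holds a detection position and releases it as the extraction start. This is one reading of the claims, not the patent's implementation. Assumptions, all hypothetical: the text is pre-tokenized so that each keyword or phrase is one token; the keyword, topic-switching, and related-word sets are disjoint; a topic-switching word replaces any held position and resets the related-word count; only the first related word sets the held position (following the FIG. 9B example); and a keyword found with nothing held starts extraction at the keyword itself.

```python
from typing import Optional, Sequence, Set


def find_extraction_start(words: Sequence[str], keywords: Set[str],
                          topic_words: Set[str], related: Set[str],
                          threshold: int) -> Optional[int]:
    """Return the index from which voice information is extracted, or None."""
    held: Optional[int] = None  # detection position held for later extraction
    related_count = 0
    for i, w in enumerate(words):
        if w in keywords:
            # Keyword detected: extract from the held position if there is one
            # (assumption: otherwise from the keyword itself).
            return held if held is not None else i
        if w in topic_words:
            # Assumption: a topic switch starts a fresh candidate context.
            held = i
            related_count = 0
        elif w in related:
            if held is None:
                held = i  # keep the first related word as the start point
            related_count += 1
            if related_count >= threshold:
                return held
    return None


words = ["hello", "by the way", "today", "Japanese language processing", "is", "fun"]
start = find_extraction_start(
    words,
    keywords={"Japanese language processing"},
    topic_words={"by the way"},
    related={"image", "natural language processing", "information"},
    threshold=3,
)
segment = words[start:] if start is not None else []
```

The extracted segment here runs to the end of the input; an end-of-context test could be used to cut it shorter.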