JP2010086074A

JP2010086074A - Speech processing apparatus, speech processing method, and speech processing program

Info

Publication number: JP2010086074A
Application number: JP2008251772A
Authority: JP
Inventors: Takashi Sumiyoshi; 貴志住吉; Masato Togami; 真人戸上; Yasunari Obuchi; 康成大淵
Original assignee: Hitachi Omron Terminal Solutions Corp
Current assignee: Hitachi Omron Terminal Solutions Corp
Priority date: 2008-09-29
Filing date: 2008-09-29
Publication date: 2010-04-15

Abstract

PROBLEM TO BE SOLVED: To efficiently create document data with speech data added thereto.
SOLUTION: A speech document correspondence unit 22 of a speech processing apparatus 1 calculates the likelihood of each word in a document included in document data 20 from a speech frame sequence 13 to be created in order to detect a first time period of the speech frame sequence 13 during which words in the document are spoken and a second time period of the speech frame sequence 13 during which contents other than the words in the document are spoken, extracts, as a symbol, the speech frame sequence 13 of the second time period, and decides, as an additional position of the extracted symbol, an adjacent position of a position of a word in the document spoken during the first time period which is temporally close with respect to the second time period during which the extracted symbol is spoken.
COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a speech processing apparatus, a speech processing method, and a speech processing program.

A document processing apparatus that attaches input audio data to document data has been proposed (Patent Document 1). In this apparatus, when the user designates an attachment position in the document data, the audio data is attached at that position. Two attachment formats are described: attaching an icon indicating that audio is attached, and attaching the audio data as a character string converted to text by speech recognition.
Patent Document 1: Japanese Patent Laid-Open No. 11-242669

In conventional document processing apparatuses such as that of Patent Document 1, the user must manually enter the attachment position for each piece of audio data, so creating document data with audio data is costly.

Furthermore, audio is often recorded continuously over a long period, for example during a meeting. In that case the audio data is not stored as one file per attachment position but as a single file covering the whole meeting. A file for each attachment position must then be cut out of the single meeting file, which further increases the cost of creating the document data.

Accordingly, the main object of the present invention is to solve the above problems and to efficiently create document data with audio data.

To solve the above problems, the present invention provides a speech processing apparatus that adds sampled waveform data to document data as a symbol, wherein
the speech processing apparatus includes a speech frame sequence detection unit, a speech document correspondence unit, and a symbol addition unit;
the speech frame sequence detection unit
reads the waveform data from a storage unit and creates a speech frame sequence by detecting speech sections based on the volume of the waveform data;
the speech document correspondence unit
calculates, from the created speech frame sequence, a likelihood for each in-document word included in the document data, thereby detecting a first time period of the speech frame sequence in which an in-document word is spoken and a second time period of the speech frame sequence in which content other than in-document words is spoken,
extracts the speech frame sequence of the second time period as a symbol, and determines, as the addition position of the extracted symbol, a position near the position of the in-document word spoken in the first time period that is temporally close to the second time period in which the extracted symbol is spoken; and
the symbol addition unit
adds the extracted symbol to the document data by placing the symbol in the document data based on the positions of the in-document words read from the storage unit and the addition position of the symbol determined by the speech document correspondence unit.
Other means will be described later.
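
As an illustration only, the position decision described above can be sketched in Python. The sketch assumes that each in-document word carries the time period in which it was spoken and an (x, y) position in the document. The names `WordInterval`, `time_gap`, `nearest_word`, and `symbol_position`, the millisecond units, the gap measure, and the fixed offset are all hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordInterval:
    """An in-document word with the (first) time period in which it was spoken."""
    word_id: str
    start_ms: int
    end_ms: int
    position: Tuple[int, int]  # assumed (x, y) position of the word in the document


def time_gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Gap in ms between two time periods; 0 when they touch or overlap."""
    return max(0, b_start - a_end, a_start - b_end)


def nearest_word(symbol_start: int, symbol_end: int,
                 words: List[WordInterval]) -> WordInterval:
    """Pick the in-document word whose time period is closest to the symbol's."""
    if not words:
        raise ValueError("at least one in-document word is required")
    return min(words, key=lambda w: time_gap(symbol_start, symbol_end,
                                             w.start_ms, w.end_ms))


def symbol_position(symbol_start: int, symbol_end: int,
                    words: List[WordInterval],
                    offset: Tuple[int, int] = (0, 20)) -> Tuple[int, int]:
    """Place the symbol near the temporally closest word (offset is an assumption)."""
    word = nearest_word(symbol_start, symbol_end, words)
    return (word.position[0] + offset[0], word.position[1] + offset[1])
```

With this measure, a symbol spoken shortly after one word and long before the next is attached next to the earlier word. Other definitions of "temporally close" are equally possible.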

According to the present invention, document data with audio data can be created efficiently.

The first and second embodiments of the present invention are described below. What the two embodiments have in common is that, while an explainer gives an explanation following a document, remarks that are not written in the document are recorded with a microphone and the recording is added to the document as supplementary data. Adding supplementary data is useful, for example, in the following applications.

For example, in the course of business an explainer explains products, important contract terms, and the like to consumers. Because the explainer often repeats the same explanation to many customers, adding the content of the explanation to the document as supplementary data makes it easy to see which passages need supplementary explanation and makes the explanation work more efficient.

An explainer may also give explanations of the kind that require legal consent, in a company, in an educational setting, at a support center, and so on. To keep the conversation that took place during the explanation, and whether the other party consented, as evidence, it is effective to record audio and other media (video, handwritten notes, and the like).

The first embodiment of the present invention is described below. In the first embodiment, separate microphones are provided for the explainer and for the listener, and the sound source of each speaker is identified from the individual microphone. The first embodiment is therefore applicable to, for example, explanations given over the telephone, as in a call center.
Specifically, it can be applied to a system that detects in real time when a call center agent utters an inappropriate word and notifies a supervisor, or to a system that detects product names and the like in real time and displays material related to that product on the screen.

FIG. 1 is a configuration diagram of the speech processing apparatus. The speech processing apparatus 1 is configured as a general computer having a CPU (Central Processing Unit) 85, a memory 86, and an interface.
The CPU 85 executes arithmetic processing and control processing for each component of the speech processing apparatus 1.
The interface of the speech processing apparatus 1 connects the explainer microphone 80, the listener microphone 81, the display 91, the mouse 92, and the speaker 93 to the speech processing apparatus 1.

The speech processing apparatus 1 is assumed here to work like a general computer, in which an arithmetic unit performs processing based on a software program stored in a storage device. The present invention does not depend on that configuration, however, and can also be applied to, for example, an architecture in which functions equivalent to the software program are implemented in hardware.

The explainer microphone 80 records (samples) the explainer's voice and outputs the result as waveform data 10.
The listener microphone 81 records the voice of the listener (the person receiving the explanation) and outputs the result as waveform data 10.
The display 91 is a display device that shows data viewed by the user, such as the document data 20 and the symbol-attached document data 43.
The mouse 92 is an input device that accepts operations of the mouse pointer displayed on the display 91.
The speaker 93 is a device that outputs the waveform data 10 as an audio signal.

The memory 86 stores the programs that the CPU 85 executes to realize each processing unit, and the data processed by each processing unit.
As processing units, the speech processing apparatus 1 has a speech frame sequence detection unit 12, a speech document correspondence unit 22, a symbol creation unit 40, a symbol addition unit 42, and a symbol reproduction unit 44.
As data, the speech processing apparatus 1 has waveform data 10, waveform frames 11, speech frame sequences 13, document data 20, an in-document word phoneme model 21, an in-document word likelihood map 22a, speech document correspondence data 22b, a general word dictionary 31, a general phoneme model 32, symbol data 41, and symbol-attached document data 43.
Each component of the speech processing apparatus 1 is described in detail below.

The in-document word phoneme model 21 is created in advance, for each in-document word appearing in the document data 20, by connecting in series the general phoneme models 32 corresponding to the phonemes that make up the word.
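
A minimal sketch of such a series connection, assuming each phoneme model can be represented as an ordered list of HMM states. The function name `build_word_model`, the dictionary layout, and the string state labels are placeholders for illustration; the specification does not describe the internal representation.

```python
from typing import Dict, List


def build_word_model(phonemes: List[str],
                     phoneme_models: Dict[str, List[str]]) -> List[str]:
    """Connect per-phoneme state sequences in series to form a word model.

    Raises KeyError if a phoneme has no model in `phoneme_models`.
    """
    states: List[str] = []
    for phoneme in phonemes:
        # Append this phoneme's states after those of the preceding phoneme.
        states.extend(phoneme_models[phoneme])
    return states
```

For instance, with three-state models for two phonemes, the word model is the six states in utterance order.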

The general word dictionary 31 is not limited to the in-document words of the document data 20; it contains general words that may be spoken.
The general phoneme model 32 is created for each word in the general word dictionary 31 by training an HMM (Hidden Markov Model) on a large number of waveforms, and is used to obtain the output probability of the HMM. The output probability can be obtained by computing the distance between the MFCC (Mel Frequency Cepstrum Coefficient) feature vectors of the general phoneme model 32 and those of the speech frame sequence 13, which is a common method in word speech recognition.

FIG. 2 is an explanatory diagram of the acoustic signal data, namely the waveform data 10, the waveform frames 11, and the speech frame sequences 13. To show the time-series waveform data 10, FIG. 2 plots a graph with time on the horizontal axis and volume on the vertical axis.

The waveform data 10 is time-series data obtained, for each sound source, by A/D-converting the acoustic signal input through a microphone (explainer microphone 80 or listener microphone 81) and recording it as a digital signal. The waveform data 10 contains time periods in which the explainer or another person is speaking as well as time periods without speech, such as silences. The sampling conditions for the waveform data 10 are, for example, a sampling frequency of 10 [kHz], 8 [bit] quantization, and linear PCM (pulse code modulation) encoding.

A waveform frame 11 is the waveform data 10 divided into time periods of a unit (fixed) length. The unit length of a waveform frame 11 is any length suitable for speech analysis, for example 500 samples (20 [ms]). Because a waveform frame 11 is merely a slice of time, it may, like the waveform data 10, fall in a period with speech or a period without speech.
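
The division into fixed-length frames can be sketched as follows. The function name `split_into_frames` is hypothetical, and dropping an incomplete trailing frame is an assumption made here for simplicity; the specification does not say how a partial frame at the end is handled.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[int], frame_len: int) -> List[List[int]]:
    """Divide sampled waveform data into consecutive frames of `frame_len` samples.

    A trailing remainder shorter than `frame_len` is dropped (assumption).
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    return [list(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```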

A speech frame sequence 13 is data that groups a run of one or more consecutive waveform frames 11. It is obtained by signal processing of the waveform frames 11 and consists of a period of continuous speech, from the waveform frame 11 at the start of an utterance to the waveform frame 11 at its end.
The signal processing applied to the waveform frames 11 is, for example, a comparison of the volume of each waveform frame 11 with thresholds. In FIG. 2, the frames from the waveform frame 11 whose volume exceeds the start threshold α to the waveform frame 11 whose volume falls below the end threshold β are extracted as one speech frame sequence 13. Using two thresholds together in this way (start threshold α and end threshold β) is called hysteresis; compared with deciding on a single threshold, it keeps the extracted speech frame sequences 13 from being chopped into small pieces.

FIG. 3(a) is an explanatory diagram showing the utterances in the waveform data 10 as text. The horizontal direction is the time axis, and the rows show the time and each sound source (explainer, listener). Three rows form one set, and the diagram wraps every 1000 [ms] of time into three sets. The periods in which text appears are the periods of the speech frame sequences 13.

FIG. 3(b) is an explanatory diagram in which the in-document word recognition results given by the speech document correspondence data 22b (Table 3, described in detail later) are overlaid on the waveform data 10 of FIG. 3(a). For example, according to the speech document correspondence data 22b, the in-document word "Notes" with ID "#1" is spoken during the speech time "100 to 300", so "#1" is assigned to "Notes" in the explainer row. As FIG. 3(b) shows, out-of-document speech is also uttered, such as supplementary explanation that is not written in the document data 20 (for example, "First" at time = 1000) and consent by the listener (for example, "I understand" at time = 2000).

FIG. 4(a) is a screen diagram showing an example of how the speech processing apparatus 1 displays the document data 20 (Table 1, described in detail later) on the display 91.
The display position (layout) of text in the document data 20, such as "Notes", is determined from the display position (the range from the upper-left coordinates to the lower-right coordinates) stored in the document data 20. Although it is called a display position here, it more precisely indicates the position of the in-document word within the document data 20, so it serves as the print position in print processing and as the storage position in file saving processing.

FIG. 4(b) is a screen diagram showing a display example of the symbol-attached document data 43, which is the result of adding the symbol data 41 (Table 5, described in detail later) to FIG. 4(a). The display position of each of the balloons 711 to 719 is determined from the "symbol display position" in the symbol data 41.
The gray balloons 713 and 716 show what the listener said. The text displayed in these balloons ("Yes", "I understand") is the "display character string" in the symbol data 41.
The white balloons 711, 712, 714, 715, and 717 to 719 show what the explainer said. Because there is not enough free space on the screen to show their text in full, it is abbreviated, for example as "..". The balloons differ in size to reflect the duration of the symbol each one represents: the longer the utterance, the larger the balloon. Reflecting the duration of each symbol in how it is displayed lets the user grasp that duration intuitively.
Besides balloon size, the duration may be reflected in the display in other ways; for example, the color of the symbol may be changed according to the duration, or the duration (in numerals) may be included in the symbol display as a character string. One such method may be used alone, or several may be combined.
In FIG. 4(b), the color of each balloon indicates its sound source; it reflects the "display color" in the symbol data 41, which is determined from the "sound source" in the symbol data 41.

Table 1 shows the document data 20. For each in-document word, the document data 20 associates its ID "#", its importance, and its display position information (upper-left coordinates, lower-right coordinates).
An in-document word is a unit obtained by splitting the text at places where the explainer may pause when speaking.
The importance may be given in advance or may be determined from an important-word dictionary, the part of speech, and so on.
The display position information is the layout information used when displaying the document data (see FIG. 4).

Table 2 is an explanatory diagram of the in-document word likelihood map 22a. It shows an excerpt, for the in-document words "the cord" (コードを) and "do not pull" (引っ張らない), of the in-document word likelihood map 22a created when 2250 waveform frames 11 are input. For each waveform frame number and each in-document word, the map records the "score" (maximum likelihood) up to that waveform frame number and the "start frame number", which is the waveform frame number at which the match producing that score begins.

The 15 waveform frames 11 with waveform frame numbers "1 to 15" contain the waveform frames 11 of the following two in-document words.
- Waveform frames 11 with waveform frame numbers "2 to 3", for the in-document word "the cord". The score of this word peaks at "7" at waveform frame number "3", whose start frame number is "2".
- Waveform frames 11 with waveform frame numbers "8 to 11", for the in-document word "do not pull". The score of this word peaks at "9" at waveform frame number "11", whose start frame number is "8".
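
How a word's frame range is read off one column of the map can be sketched as below. This is a simplification that looks only at the peak score of a single word, not the optimal path search of the later steps. The column layout (frame number mapped to a score and a start frame number) and every value other than the two peaks quoted above are made up for illustration, since Table 2 is not reproduced in this text.

```python
from typing import Dict, Tuple

# frame number -> (score, start frame number); hypothetical layout
Column = Dict[int, Tuple[int, int]]


def best_interval(column: Column) -> Tuple[int, int, int]:
    """Return (start_frame, end_frame, score) at the column's highest score."""
    end_frame = max(column, key=lambda frame: column[frame][0])
    score, start_frame = column[end_frame]
    return start_frame, end_frame, score


# Illustrative columns for the two words of the example (word IDs 5 and 6).
# Only the peaks (score 7 at frame 3, score 9 at frame 11) come from the text.
column_5 = {1: (1, 1), 2: (4, 2), 3: (7, 2), 4: (2, 4), 5: (1, 5)}
column_6 = {8: (3, 8), 9: (5, 8), 10: (8, 8), 11: (9, 8), 12: (4, 12)}
```

Applied to these columns, the function yields frames 2 to 3 for the first word and frames 8 to 11 for the second, matching the ranges listed above.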

On the other hand, the following waveform frames 11, in which none of the above in-document words is spoken, each become symbol data 41.
- The waveform frame 11 with waveform frame number "1"
- The waveform frames 11 with waveform frame numbers "4 to 7"
- The waveform frames 11 with waveform frame numbers "12 to 15"

Table 3 shows the speech document correspondence data 22b for the explainer. The speech document correspondence data 22b associates the waveform data 10 with the document data 20 based on the times at which in-document words are spoken. For each run of waveform frames 11, it associates an ID "#", the text of the in-document word (or "out-of-document speech" when the run is not an in-document word), the speech time (start time to end time) indicating when it was spoken, and the on-screen display coordinates of the in-document word.

Table 4 shows the speech document correspondence data 22b for the listener. Because everything the listener says is out-of-document speech, the "in-document word" column of Table 3 is replaced with an "out-of-document speech" column. The other columns of Table 4 have the same format as in Table 3.

Table 5 is an explanatory diagram of the symbol data 41. The symbol data 41 associates the "symbol reference numeral" indicating the symbol's ID (the reference numeral of the balloon in FIG. 4(b)), the symbol information (sound source, display color, display character string), and the symbol display position.
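
One possible in-memory shape for a row of the symbol data 41 is sketched below. The class name `SymbolData`, the field names, and the choice to derive the display color from the sound source through a lookup table are assumptions for illustration. The gray/white mapping follows the description of FIG. 4(b); the coordinates and the text assigned to a particular balloon in the usage are placeholders.

```python
from dataclasses import dataclass
from typing import Tuple

# Sound source -> display color, following the balloon colors of FIG. 4(b).
DISPLAY_COLOR = {"explainer": "white", "listener": "gray"}


@dataclass
class SymbolData:
    """One symbol: reference numeral, source, displayed text, and position."""
    code: int                      # balloon reference numeral, e.g. 711 to 719
    source: str                    # "explainer" or "listener"
    display_text: str              # display character string
    position: Tuple[int, int]      # symbol display position (assumed x, y)

    @property
    def display_color(self) -> str:
        """Display color determined from the sound source."""
        return DISPLAY_COLOR[self.source]
```

Deriving the color on access keeps it consistent with the source; storing it as a separate column, as Table 5 is described, would work equally well.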

FIG. 5 is a flowchart showing the operation of the speech frame sequence detection unit 12 (FIG. 1). As shown in FIG. 2, the speech frame sequence detection unit 12 extracts speech frame sequences 13 from the waveform frames 11. This extraction is the processing generally called speech section detection. The method described with FIG. 5 is one example and can be replaced with any speech section detection processing.

The processing of FIG. 5 is executed independently for the waveform data 10 from the explainer microphone 80 and for the waveform data 10 from the listener microphone 81.
The speech frame sequences 13 generated from the waveform data 10 of the explainer microphone 80 contain utterances of in-document words, so they serve as the input data from which the speech document correspondence unit 22 creates the in-document word likelihood map 22a.
The speech frame sequences 13 generated from the waveform data 10 of the listener microphone 81, by contrast, do not contain utterances of in-document words, so they are not used to create the in-document word likelihood map 22a.

In S101, the waveform data 10 that is continuously being sampled from the microphone (explainer microphone 80 or listener microphone 81) is divided in time-series order into unit-length pieces, each of which is read as a waveform frame 11. The loop from S101 to S110 iterates over the waveform frames 11 read in this way.

In S102, the power p, which represents the volume of the waveform frame 11 selected in S101, is calculated. The power p is the mean of the squared amplitudes of all samples in the waveform frame 11.
In S103, if the power p is greater than the start threshold α (see FIG. 2) (S103, Yes), the selected waveform frame 11 is taken as the start frame of a speech frame sequence 13, and 1 (recording) is assigned to the recording flag "F" (S104).
In S105, if the recording flag "F" is "1" (S105, Yes), the selected waveform frame 11 is recorded in the speech buffer (S106).

In S107, if the power p is less than the end threshold β (see FIG. 2) (S107, Yes), the waveform frame 11 recorded in the preceding S106 is taken as the end frame of the speech frame sequence 13, the recorded speech frame sequence 13 is output as the output data of the speech frame sequence detection unit 12, and the speech buffer is cleared so that the next speech frame sequence 13 can be recorded (S108). Then 0 (not recording) is assigned to the recording flag "F" (S109).
In S110, the loop ends.
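
The loop of S101 to S110 can be sketched in Python as follows. The function names are hypothetical, and frames are assumed to be non-empty lists of sample amplitudes. One detail is an assumption of this sketch rather than something the flowchart states: a sequence is emitted only while recording is in progress, so that silent frames before the first utterance do not produce empty sequences.

```python
from typing import List, Sequence


def frame_power(frame: Sequence[float]) -> float:
    """S102: power p as the mean of the squared amplitudes in the frame."""
    return sum(s * s for s in frame) / len(frame)


def detect_speech_frame_sequences(frames: List[Sequence[float]],
                                  alpha: float,
                                  beta: float) -> List[List[Sequence[float]]]:
    """Group frames into speech frame sequences using two thresholds."""
    recording = False          # recording flag F
    buffer: List[Sequence[float]] = []
    sequences: List[List[Sequence[float]]] = []
    for frame in frames:                       # S101: select next frame
        p = frame_power(frame)                 # S102
        if p > alpha:                          # S103
            recording = True                   # S104
        if recording:                          # S105
            buffer.append(frame)               # S106
        if p < beta and recording:             # S107 (guard on F is an assumption)
            sequences.append(buffer)           # S108: output and clear buffer
            buffer = []
            recording = False                  # S109
    return sequences                           # S110: loop finished
```

Because the frame that falls below β is recorded in S106 before S107 is evaluated, it becomes the end frame of the emitted sequence, as in the description of FIG. 2.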

The speech document correspondence unit 22 associates the speech frame sequences 13 output by the speech frame sequence detection unit 12 with the document data 20. When the sound source of a speech frame sequence 13 is the explainer, the speech document correspondence unit 22 performs the operation shown in FIG. 6 and, based on the in-document word likelihood map 22a shown in Table 2, creates the speech document correspondence data 22b shown in Table 3.

FIG. 6 is a flowchart showing the operation of the speech document correspondence unit 22 when the sound source is the explainer. The flowchart is executed each time the speech document correspondence unit 22 receives a speech frame sequence 13 output by the speech frame sequence detection unit 12. In other words, one execution of the processing of FIG. 6 processes one speech frame sequence 13.

First, in the loop of S201 to S203, the speech document correspondence unit 22 generates the in-document word likelihood map 22a as the likelihoods obtained when the speech frame sequence 13 is input to the in-document word phoneme model 21.
In S201, a loop that selects the in-document words of the document data 20 one at a time, in order, is started.
In S202, the word likelihood (score) of the in-document word selected in S201 is calculated for each waveform frame 11 of the speech frame sequence 13 and written to the in-document word likelihood map 22a.
In S203, the loop ends. For example, one iteration generates the "the cord" column of Table 2, and the next iteration generates the "do not pull" column.

Next, S211 to S213 generate the speech document correspondence data 22b by referring to the generated in-document word likelihood map 22a.
In S211, the optimal path over waveform frame numbers f is searched for each in-document word w in the in-document word likelihood map 22a.
Specifically, let score(w, f) be the word score of in-document word w at waveform frame number f, and begin(w, f) the corresponding start position. The optimal path ({wi, fi}) is then obtained by the following (Formula 1).

In the example of Table 2, {w1 = 5, f1 = 3} in the column of the in-document word “cord” and {w2 = 6, f2 = 11} in the column of the in-document word “do not pull” form the optimal path.
Table 2 shows only some of the words (w1 = 5, w2 = 6). For the other words, the waveform frames do not match the word phoneme model and the likelihood scores are low, so those words are not selected for the optimal path.

Furthermore, the formula for obtaining the optimal path may take the “importance” in the document data 20 into account. When the “importance” of the in-document word w is P(w), the optimal path ({wi, fi}) is obtained by the following (Formula 2).

Once the optimal path for an in-document word w has been found, the search for the optimal path of that word may be omitted thereafter. This omission assumes that each in-document word is spoken only once, as in a presentation; omitting the path search reduces the amount of computation.
On the other hand, when in-document words are spoken multiple times, as in a question-and-answer session, this assumption does not hold, so it is better not to omit the path search.

In S212, the waveform frame sequence is divided into in-document words and out-of-document speech according to the optimal path obtained in S211. An in-document word is the section (begin(wi, fi), fi) of a word wi selected on the optimal path. Out-of-document speech is a section of consecutive waveform frames that does not belong to any in-document word.
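The division in S212 can be illustrated with a short sketch. It assumes that the optimal path has already been reduced to a list of non-overlapping word sections with inclusive frame numbers; the function name `split_frames`, the tuple layout, and the frame numbers in the example are hypothetical and are not taken from Table 2.

```python
from typing import List, Optional, Tuple

# (first frame, last frame, in-document word, or None for out-of-document speech)
Segment = Tuple[int, int, Optional[str]]


def split_frames(
    n_frames: int, word_sections: List[Tuple[int, int, str]]
) -> List[Segment]:
    """Split frames 0..n_frames-1 into word sections and the gaps between them."""
    segments: List[Segment] = []
    cursor = 0  # first frame not yet assigned to a segment
    for begin, end, word in sorted(word_sections):
        if begin > cursor:
            # Frames before this word belong to out-of-document speech.
            segments.append((cursor, begin - 1, None))
        segments.append((begin, end, word))
        cursor = end + 1
    if cursor < n_frames:
        # Trailing frames after the last word are out-of-document speech.
        segments.append((cursor, n_frames - 1, None))
    return segments


demo_segments = split_frames(15, [(1, 3, "cord"), (8, 11, "do not pull")])
```

Every run of frames that lies between two in-document words, or before the first or after the last one, becomes a single out-of-document segment.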

In S213, the in-document words and the out-of-document speech obtained in S212 are written to the speech document correspondence data 22b (Table 3).
The “word in document” column of the speech document correspondence data 22b contains either the text of the in-document word obtained in S212 (taken from the document data 20) or a flag indicating out-of-document speech.
The “voice time” column of the speech document correspondence data 22b contains, for the waveform frames 11 that constitute each in-document word or out-of-document speech obtained in S212, the time span from the start time of the first waveform frame 11 to the end time of the last waveform frame 11.
The “display coordinates” column of the speech document correspondence data 22b contains the position, determined from the document data 20, at which the out-of-document speech obtained in S212 is displayed on the display 91 (the position of the tip of the balloon in the balloon icon).

The “display coordinates” of out-of-document speech can be obtained, for example, by the following method, based on how close in time the out-of-document speech and an in-document word were spoken.
First, when an in-document word (#2, “Notes”) is spoken after out-of-document speech (#1), the display position (upper-left and lower-right coordinates) of the in-document word (#1, “Notes”) is obtained by referring to the document data 20 (Table 1). Coordinates near (slightly to the left of) the upper-left coordinates of the in-document word are then used as the “display coordinates” of the out-of-document speech.
Next, when out-of-document speech (#3) is spoken after the in-document word (#2, “Notes”), the display position (upper-left and lower-right coordinates) of the in-document word (#1, “Notes”) is obtained by referring to the document data 20 (Table 1). Coordinates near (slightly to the right of) the upper-right coordinates of the in-document word are then used as the “display coordinates” of the out-of-document speech.
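The two placement rules above can be sketched as a small function. The bounding-box values and the horizontal offset of 10 are hypothetical values chosen for illustration; the text only says that the symbol is placed slightly to the left or slightly to the right of the word.

```python
from typing import Tuple

Point = Tuple[int, int]


def symbol_display_position(
    word_box: Tuple[Point, Point], spoken_before_word: bool, offset: int = 10
) -> Point:
    """Place a symbol next to an in-document word.

    word_box is ((left, top), (right, bottom)), i.e. the upper-left and
    lower-right coordinates of the word. `offset` is an assumed distance.
    """
    (left, top), (right, _bottom) = word_box
    if spoken_before_word:
        # Speech preceding the word: slightly left of the upper-left corner.
        return (left - offset, top)
    # Speech following the word: slightly right of the upper-right corner.
    return (right + offset, top)


notes_box = ((40, 100), (130, 120))  # hypothetical word position
before = symbol_display_position(notes_box, spoken_before_word=True)
after = symbol_display_position(notes_box, spoken_before_word=False)
```

The upper-right corner is derived from the right edge of the lower-right coordinate and the top edge of the upper-left coordinate, since only those two corners are stored for each word.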

The processing performed when the sound source of the speech frame sequence 13 is the presenter has been described above with reference to FIG. 6. When the sound source of the speech frame sequence 13 is a listener, on the other hand, the speech document correspondence unit 22 creates the speech document correspondence data 22b shown in Table 4 based on the speech document correspondence data 22b shown in Table 3.
Specifically, the speech document correspondence unit 22 (symbol text creation unit) performs word speech recognition on each out-of-document speech using the general phoneme model 32 and the general word dictionary 31.
In word speech recognition, the scores S(1), S(2), ..., S(J) of the words W1, W2, ..., WJ in the general word dictionary 31 are obtained using the general phoneme model 32. If the maximum score satisfies max(S(j)) > γ (γ is a predetermined threshold), the recognition result Wj with j = argmax(S(j)) is used as the display character string. Otherwise, a symbol such as “?” indicating recognition failure is used as the display character string.
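The threshold rule for the display character string can be sketched as below. The scores in the example are hypothetical and would in practice come from the general phoneme model 32; the function name and the dictionary representation of the scores are likewise assumptions.

```python
from typing import Dict


def display_string(scores: Dict[str, float], gamma: float) -> str:
    """Return the best-scoring dictionary word, or "?" if recognition fails."""
    if not scores:
        return "?"  # nothing to recognize against
    best_word = max(scores, key=lambda w: scores[w])  # j = argmax S(j)
    if scores[best_word] > gamma:  # max S(j) > gamma
        return best_word
    return "?"  # best score too low: treat as recognition failure


accepted = display_string({"yes": 0.8, "no": 0.3}, gamma=0.5)
rejected = display_string({"yes": 0.8, "no": 0.3}, gamma=0.9)
```

With the same scores, raising γ turns an accepted result into a recognition failure, which is the trade-off the threshold controls.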

The speech document correspondence unit 22 then determines the “display coordinates” of the listener's speech document correspondence data 22b (Table 4) by referring to the presenter's speech document correspondence data 22b (Table 3).
For example, the out-of-document speech “Yes” (voice time = 700 to 800) in Table 4 is spoken after the voice time “100 to 300” of the in-document word “Notes” in Table 3. Therefore, the display coordinates of the out-of-document speech “Yes” are set to the same position as the display position “(140,100)” of the out-of-document speech (voice time = 300 to 600) in Table 3 that immediately follows the in-document word “Notes”.

Returning to FIG. 1, the symbol creation unit 40 creates the symbol data 41 (Table 5) from the speech document correspondence data 22b (Tables 3 and 4).

From the presenter's speech document correspondence data 22b (Table 3), the symbol creation unit 40 extracts the out-of-document speech as symbols and leaves out the in-document words. For example, the symbol creation unit 40 makes the out-of-document speech (#1) with the voice time “0 to 100 [ms]” in Table 3 the symbol with the symbol code “711”, and copies the “display coordinates” of Table 3 into the “symbol display position” of Table 5.

From the listener's speech document correspondence data 22b (Table 4), the symbol creation unit 40 extracts all out-of-document speech as symbols. For example, the symbol creation unit 40 makes the out-of-document speech (#1, “Yes”) with the voice time “700 to 800 [ms]” in Table 4 the symbol with the symbol code “713”, copies the “display coordinates” of Table 4 into the “symbol display position” of Table 5, and copies the text of the out-of-document speech in Table 4 into the “display character string” of the symbol in Table 5.

The “display character string” of the symbol data 41 basically contains, as is, the speech content converted into text by the speech recognition process. However, when no speech recognition result is obtained, for example when a word that is not in the dictionary is spoken, the display character string may be chosen according to the duration of the out-of-document speech: “- (none)” if the duration is at or below a predetermined length, and “…” if it is longer.
The “display color” of the symbol data 41 is “white” if the “sound source” is the presenter and “gray” if the “sound source” is a listener.

The symbol reproduction unit 44 outputs the audio signal of the designated symbol data 41 from the speaker 93. The symbol data 41 is designated, for example, when the user uses the mouse 92 to click symbol data 41 (balloons 711 to 719 in FIG. 4(b)) of the symbol-added document data 43 displayed on the display 91.

When symbol data 41 is designated, the symbol reproduction unit 44 refers to the speech document correspondence data 22b from which the symbol data 41 was created, obtains its “voice time”, reads the waveform data 10 for that “voice time” from the memory 86, and outputs it from the speaker 93. The user can thus listen to the speech corresponding to the designated display character string.
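Reading the waveform data for a given “voice time” amounts to slicing the sample buffer. The sketch below assumes that the voice time is given in milliseconds and that the waveform is a flat list of samples at a known sampling rate; the 16 kHz rate and the function name are assumptions for illustration only.

```python
from typing import List


def samples_for_voice_time(
    waveform: List[int], start_ms: int, end_ms: int, sample_rate_hz: int
) -> List[int]:
    """Return the samples recorded between start_ms and end_ms."""
    first = start_ms * sample_rate_hz // 1000  # index of the first sample
    last = end_ms * sample_rate_hz // 1000  # index one past the last sample
    return waveform[first:last]


# One second of dummy samples at an assumed 16 kHz sampling rate.
demo_waveform = list(range(16000))
clip = samples_for_voice_time(demo_waveform, 700, 800, 16000)
```

The returned clip would then be handed to whatever audio output routine drives the speaker 93, which is outside the scope of this sketch.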

The first embodiment has been described above. By referring to the symbol data 41 (balloon icons) added to the displayed symbol-added document data 43, the user can immediately see where the presenter supplemented the explanation, where questions or replies were obtained from listeners, and what was said.
Furthermore, by displaying the symbol-added document data 43 on the screen instead of the document data 20 while explaining, the presenter can proceed efficiently while checking what has already been explained.

A second embodiment of the present invention will now be described. In the second embodiment, speech is recorded through a microphone array that can identify the sound source of the recorded speech. This removes the need to prepare a microphone for each speaker, which makes introduction easier. The second embodiment is used, for example, in a face-to-face setting such as a conference room.

FIG. 7 is a configuration diagram showing the speech processing apparatus of the second embodiment. Compared with FIG. 1, the individual microphones of FIG. 1 (presenter microphone 80, listener microphone 81) are replaced with the microphone array 82 of FIG. 7, which consists of a plurality of microphone elements 82a to 82c.
The microphone array 82 is installed near the place where the explanation is given. This removes the need for each presenter to use a close-talking microphone such as a headset microphone, so the explanation can proceed smoothly without attention to microphone placement. The sound source position is expressed as the sound source direction θ, the azimuth angle viewed from the microphone array 82.

In addition to the components of FIG. 1, the speech processing apparatus 1 of FIG. 7 includes a sound source identification unit 14. The sound source identification unit 14 detects the sound source position by examining the differences in the times at which the waveform data 10 from the sound source arrives at the microphone elements 82a to 82c. It further applies a spatial filter corresponding to the sound source position, emphasizes and extracts the speech from that position, and outputs it as the waveform data 10.

FIG. 8 is a flowchart showing the operation of the speech frame sequence detection unit 12 of the second embodiment. It differs from the flowchart of FIG. 5 in the following points.
・A loop over each sound source direction θ (S101a to S110a) is added.
・The power p of the waveform frame 11 is changed to the power p(θ) for each sound source direction θ (S102b, S103b, S107b). The sound source directions θ are taken at 5-degree intervals over the 360 degrees of azimuth around the microphone array (θ = 0, 5, ..., 355). p(θ) is obtained as p(θ) = a(θ)·x^* from the complex conjugate transpose x^* of the Fourier transform x = {xi} of the waveform frames input to the microphone elements {i = 0, 1, 2, 3} and the steering vector a(θ) corresponding to the sound source direction θ.
・The recording flag “F” is changed to a recording flag “F(θ)” for each sound source direction θ (S104b, S105b, S108b).
・In S109b, the sound source direction θ selected in S101a is output together with the speech frame sequence 13 that was output in S109. Here, outputting x(θ) = a(θ)x, in which the sound source direction θ is emphasized by delay-and-sum beamforming, reduces the influence of sound that reaches the microphone array 82 from directions other than θ.
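The per-direction power p(θ) = a(θ)·x^* can be illustrated with a single-frequency, far-field sketch. Several points here are assumptions that the text does not specify: the microphone positions (a 5 cm square), the 1 kHz analysis frequency, the speed of sound, the sign convention of the steering vector, and the use of the squared magnitude of a(θ)·x^* as the power value.

```python
import cmath
import math
from typing import List, Sequence, Tuple

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def steering_vector(
    theta_deg: float, mic_xy: Sequence[Tuple[float, float]], freq_hz: float
) -> List[complex]:
    """Phase factors a_i(theta) for a far-field source at azimuth theta."""
    theta = math.radians(theta_deg)
    ux, uy = math.cos(theta), math.sin(theta)  # unit vector toward the source
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND  # wavenumber
    return [cmath.exp(1j * k * (x * ux + y * uy)) for x, y in mic_xy]


def direction_power(
    theta_deg: float,
    x: Sequence[complex],
    mic_xy: Sequence[Tuple[float, float]],
    freq_hz: float,
) -> float:
    """Squared magnitude of a(theta) . x^* for one frequency bin."""
    a = steering_vector(theta_deg, mic_xy, freq_hz)
    response = sum(a_i * x_i.conjugate() for a_i, x_i in zip(a, x))
    return abs(response) ** 2


# Hypothetical four-element array and a simulated plane wave from 90 degrees.
MICS = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05), (0.05, 0.05)]
FREQ = 1000.0
x_sim = steering_vector(90.0, MICS, FREQ)

# Scan theta = 0, 5, ..., 355 and pick the direction with the largest power.
best_theta = max(
    range(0, 360, 5), key=lambda t: direction_power(t, x_sim, MICS, FREQ)
)
```

In this simulation the scan peaks at the direction from which the plane wave was generated, because only there do all phase factors cancel the inter-microphone phase differences. A real implementation would combine many frequency bins per waveform frame rather than one.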

FIG. 9 is a flowchart showing the operation of the speech document correspondence unit 22 of the second embodiment. It differs from the flowchart of FIG. 6 in the following points. With these changes, the sound source identification unit 14 can detect the direction of the presenter even for input speech obtained through the microphone array 82, and the explanation content can be visualized.
・The in-document word likelihood map 22a is generated and referenced for each sound source direction θ (S202b, S211b).
・The flowchart of FIG. 6 assumes that the presenter's speech frame sequence 13 is input, whereas in the flowchart of FIG. 9 it is not known at the start of execution whether the presenter is in the sound source direction θ. The sound source identification unit 14 therefore determines whether the presenter is in the sound source direction θ by processing the newly added variable S(θ) (S211c, S211d).

The variable S(θ), which indicates how presenter-like a direction is, is introduced to determine whether the presenter is in the sound source direction θ, based on whether the speech from that direction includes, or has included in the past, the text being explained.
The sound source identification unit 14 calculates the sum SP(θ) of the scores over the in-document speech sections of the optimal path by the following (Formula 3).

Using this SP(θ), S(θ) is updated as S(θ) ← γS(θ) + (1 - γ)SP(θ) (S211c). Here γ is a forgetting factor, set for example to γ = 0.9.
The sound source identification unit 14 determines whether S(θ) as updated in S211c exceeds a preset threshold ν. When it exceeds the threshold ν (S211d, Yes), it is determined that the presenter is in the sound source direction θ, and S212 is executed. When it does not exceed the threshold ν (S211d, No), it is determined that the presenter is not in the sound source direction θ. In that case, the speech document correspondence unit 22 treats the entire speech frame sequence 13 input at the start of FIG. 9 as a single out-of-document speech (S212a) and proceeds to S213.
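The update of S(θ) and the threshold test of S211d can be sketched as follows. The SP(θ) values and the threshold ν used in the example are hypothetical; only γ = 0.9 is taken from the text, and the function names are assumptions.

```python
def update_presenter_score(s_prev: float, sp: float, gamma: float = 0.9) -> float:
    """S211c: S(theta) <- gamma * S(theta) + (1 - gamma) * SP(theta)."""
    return gamma * s_prev + (1.0 - gamma) * sp


def presenter_in_direction(s: float, nu: float) -> bool:
    """S211d: the presenter is assumed to be in this direction if S exceeds nu."""
    return s > nu


# Three consecutive speech frame sequences from the same direction,
# each with a hypothetical in-document score sum SP = 10.0.
s_theta = 0.0
for sp_theta in [10.0, 10.0, 10.0]:
    s_theta = update_presenter_score(s_theta, sp_theta)

is_presenter = presenter_in_direction(s_theta, nu=2.0)
```

Because γ is close to 1, S(θ) rises slowly and decays slowly, so a direction is only treated as the presenter's after in-document words have been heard from it repeatedly, and it keeps that status for a while after the presenter pauses.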

If the presenter's sound source direction θ is given in advance, the processing of S211c by the sound source identification unit 14 may be omitted, and the test in S211d may be replaced with “the presenter is in the sound source direction θ”.

Returning to FIG. 7, the operation of the symbol creation unit 40 is the same as in the first embodiment of FIG. 1. Here, the display color may be varied for each sound source direction θ. Even when there are multiple presenters or multiple listeners, this makes it possible to show on the screen which utterances were made by the same person.

According to the first and second embodiments described above, when a presenter explains along a document, speech content that is not written in the document can be added to the document as supplementary data.
Thus, when the speech during an explanation is recorded and kept as evidence of the explanation, the presenter can, during the explanation, access information on which parts of the document material were explained and how the listeners responded to them.
By efficiently referring to speech content that is not written in the document, omissions in the explanation and failures to obtain consent can therefore be avoided. In other words, in a situation where a presenter explains to another person based on document material, easy-to-understand, carefully selected information can be provided to the presenter as symbols.

Each embodiment of the present invention is effective for structuring and visualizing meetings and the like.
One comparative example of visualizing the speech of multiple speakers is a method that detects the utterances of each speaker and displays them as a time-series graph.
Comparative examples that use document material include displaying the slides shown during a meeting in association with the speech, and using the material to adapt the language model used in speech recognition.
For visualizing a scene in which an explanation is given based on document material, each embodiment of the present invention provides a way of use in which each speaker's utterances are easier to grasp, and more intuitive than in the comparative examples above, both visually (balloon icons on the screen) and aurally (audio playback by clicking a balloon icon).

FIG. 1 is a configuration diagram showing the speech processing apparatus according to the first embodiment of the present invention.
FIG. 2 is an explanatory diagram showing acoustic signal data according to the first embodiment of the present invention.
FIG. 3 is an explanatory diagram showing the speech content of the waveform data according to the first embodiment of the present invention.
FIG. 4 is a screen diagram showing a display example of the document data output to the display according to the first embodiment of the present invention.
FIG. 5 is a flowchart showing the operation of the speech frame sequence detection unit according to the first embodiment of the present invention.
FIG. 6 is a flowchart showing the operation of the speech document correspondence unit according to the first embodiment of the present invention.
FIG. 7 is a configuration diagram showing the speech processing apparatus according to the second embodiment of the present invention.
FIG. 8 is a flowchart showing the operation of the speech frame sequence detection unit according to the second embodiment of the present invention.
FIG. 9 is a flowchart showing the operation of the speech document correspondence unit according to the second embodiment of the present invention.

Explanation of symbols

1 Speech processing apparatus
10 Waveform data
11 Waveform frame
12 Speech frame sequence detection unit
13 Speech frame sequence
14 Sound source identification unit
20 Document data
21 In-document word phoneme model
22 Speech document correspondence unit
22a In-document word likelihood map
22b Speech document correspondence data
31 General word dictionary
32 General phoneme model
40 Symbol creation unit
41 Symbol data
42 Symbol addition unit
43 Symbol-added document data
44 Symbol reproduction unit
80 Presenter microphone
81 Listener microphone
82 Microphone array
82a to 82c Microphone elements
85 CPU
86 Memory
91 Display
92 Mouse
93 Speaker

Claims

A speech processing apparatus that adds sampled waveform data to document data as symbols,
the speech processing apparatus comprising a speech frame sequence detection unit, a speech document correspondence unit, and a symbol addition unit, wherein
the speech frame sequence detection unit
reads the waveform data from a storage unit and creates a speech frame sequence by detecting speech sections of the waveform data,
the speech document correspondence unit
calculates, from the created speech frame sequence, a likelihood for each in-document word included in the document data, thereby detecting a first time period of the speech frame sequence in which an in-document word is spoken and a second time period of the speech frame sequence in which content other than the in-document words is spoken, and
extracts the speech frame sequence of the second time period as a symbol, and determines, as the addition position of the extracted symbol, a position near the position of the in-document word spoken in the first time period that is temporally close to the second time period in which the extracted symbol is spoken, and
the symbol addition unit
arranges the symbol in the document data based on the positions of the in-document words read from the storage unit and the addition position of the symbol determined by the speech document correspondence unit, thereby adding the extracted symbol to the document data.

The speech processing apparatus according to claim 1, wherein, when calculating the likelihood for each in-document word, the speech document correspondence unit reads an importance for each in-document word from the storage unit and detects, based on the likelihood and the importance, the first time period of the speech frame sequence in which an in-document word is spoken and the second time period of the speech frame sequence in which content other than the in-document words is spoken.

The speech processing apparatus according to claim 1 or 2, further comprising a symbol reproduction unit, wherein,
when a symbol displayed on the screen of a display device is selected through an input device, the symbol reproduction unit performs control to play back the speech frame sequence of the selected symbol through a speaker.

The speech processing apparatus according to any one of claims 1 to 3, further comprising a symbol text creation unit, wherein
the symbol text creation unit reads, from the storage unit, a phoneme model representing the phonemes of the words in a word dictionary, inputs the speech frame sequence of a symbol to the phoneme model to extract the words of the word dictionary contained in the speech frame sequence, converts the speech content of the symbol into text, and adds the text to the document data.

The speech processing apparatus according to any one of claims 1 to 4, wherein the symbol addition unit reflects the duration of a symbol's time period in the symbol displayed on the screen of a display device.

The speech processing apparatus according to claim 5, wherein, as the process of reflecting the duration, the symbol addition unit determines the size of the symbol displayed on the screen of the display device based on the duration of the symbol's time period.

The speech processing apparatus according to claim 5, wherein, as the process of reflecting the duration, the symbol addition unit determines the display color of the symbol displayed on the screen of the display device based on the duration of the symbol's time period.

The speech processing apparatus according to claim 5, wherein, as the process of reflecting the duration, the symbol addition unit displays a number indicating the duration of the symbol's time period on the screen of the display device.

The speech processing apparatus according to any one of claims 1 to 8, wherein, when sampling the waveform data, the speech processing apparatus samples each sound source separately, through a microphone for a presenter who reads out the document data and a microphone for speakers other than the presenter, and
when extracting symbols, the speech document correspondence unit
extracts in-document words and symbols from the speech frame sequence sampled through the microphone for the presenter, and
extracts, as symbols, the speech frame sequence sampled through the microphone for speakers other than the presenter.

The speech processing apparatus according to any one of claims 1 to 8, wherein, when sampling the waveform data, the speech processing apparatus samples through a microphone array composed of a plurality of microphone elements for identifying a sound source direction, and
when extracting symbols, the speech document correspondence unit
extracts in-document words and symbols from the speech frame sequence in the sound source direction of a presenter who reads out the document data, and
extracts, as symbols, the speech frame sequence from directions other than the sound source direction of the presenter.

The speech processing apparatus according to claim 10, further comprising a sound source identification unit, wherein
the sound source identification unit calculates, for each sound source direction of the speech frame sequence, the likelihood of each in-document word included in the document data, and identifies the sound source direction with the highest maximum likelihood as the sound source direction of the presenter.

The speech processing apparatus according to any one of claims 9 to 11, wherein the speech processing apparatus displays symbols on the screen of a display device in a display format that differs for each sound source of the symbol.

A speech processing method performed by a speech processing apparatus that adds sampled waveform data to document data as symbols,
the speech processing apparatus comprising a speech frame sequence detection unit, a speech document correspondence unit, and a symbol addition unit, wherein
the speech frame sequence detection unit
reads the waveform data from a storage unit and creates a speech frame sequence by detecting speech sections of the waveform data,
the speech document correspondence unit
calculates, from the created speech frame sequence, a likelihood for each in-document word included in the document data, thereby detecting a first time period of the speech frame sequence in which an in-document word is spoken and a second time period of the speech frame sequence in which content other than the in-document words is spoken, and
extracts the speech frame sequence of the second time period as a symbol, and determines, as the addition position of the extracted symbol, a position near the position of the in-document word spoken in the first time period that is temporally close to the second time period in which the extracted symbol is spoken, and
the symbol addition unit
arranges the symbol in the document data based on the positions of the in-document words read from the storage unit and the addition position of the symbol determined by the speech document correspondence unit, thereby adding the extracted symbol to the document data.

A speech processing program for causing the speech processing apparatus, which is a computer, to execute the speech processing method according to claim 13.