JP2012181307A - Voice processing device, voice processing method and voice processing program

Info

Publication number: JP2012181307A
Application number: JP2011043572A
Authority: JP
Inventors: Manabu Kamiyama; 学上山; Hiroto Teranishi; 博人寺西; Akira Chiyo; 章千代; Hiroki Yoshimoto; 大樹吉本; Takahiro Otsuka; 隆宏大塚
Original assignee: NEC Software Hokkaido Ltd
Current assignee: NEC Solution Innovators Ltd
Priority date: 2011-03-01
Filing date: 2011-03-01
Publication date: 2012-09-20

Abstract

PROBLEM TO BE SOLVED: To increase the efficiency of voice data recognition processing. SOLUTION: A voice processing device includes storage means for storing input voice data, voice division means for dividing the voice data stored in the storage means, voice recognition means for recognizing, using at least two voice recognition engines, a plurality of partial voice data items generated by the division by the voice division means and converting them into character data, and integration means for integrating the character data obtained as recognition results by the voice recognition means to generate document data.

Description

The present invention relates to a technology for recognizing speech.

In the above technical field, a technique for dividing input voice data and recognizing the divided data is known, as disclosed in Patent Document 1.

Patent Document 1: JP 2000-089786 A

However, in the above prior art, the voice recognition processing means recognizes the divided voice data sequentially, so the processing efficiency is poor.

An object of the present invention is to provide a technique that solves the above problem.

In order to achieve the above object, a device according to the present invention comprises:
storage means for storing input voice data;
voice division means for dividing the voice data stored in the storage means;
voice recognition means for recognizing, using at least two voice recognition engines, a plurality of partial voice data items generated by the division by the voice division means, and converting them into character data; and
integration means for integrating the character data obtained as recognition results by the voice recognition means to generate document data.

In order to achieve the above object, a method according to the present invention includes:
a voice division step of dividing voice data stored in storage means;
a voice recognition step of recognizing, using at least two voice recognition engines, a plurality of partial voice data items generated by the division in the voice division step, and converting them into character data; and
an integration step of integrating the character data obtained as recognition results in the voice recognition step to generate document data.

In order to achieve the above object, a program according to the present invention causes a computer to execute:
a voice division step of dividing voice data stored in storage means;
a voice recognition step of recognizing, using at least two voice recognition engines, a plurality of partial voice data items generated by the division in the voice division step, and converting them into character data; and
an integration step of integrating the character data obtained as recognition results in the voice recognition step to generate document data.

According to the present invention, recognition processing of voice data can be made more efficient.

FIG. 1 is a block diagram showing the configuration of a voice processing device according to a first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of a voice processing device according to a second embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of a voice recognition unit according to the second embodiment of the present invention.
FIG. 4 is a diagram showing the structure of the voice data stored in the voice processing device according to the second embodiment of the present invention.
FIG. 5 is a diagram showing the result of dividing voice data in the voice processing device according to the second embodiment of the present invention.
FIG. 6 is a diagram for explaining a method of dividing voice data in the voice processing device according to the second embodiment of the present invention.
FIG. 7 is a diagram for explaining a method of dividing voice data in the voice processing device according to the second embodiment of the present invention.
FIG. 8 is a diagram showing a graphical user interface displayed by the voice processing device according to the second embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of a voice recognition unit according to a third embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail, by way of example, with reference to the drawings. However, the components described in the following embodiments are merely examples, and the technical scope of the present invention is not limited to them.

(First Embodiment)
A voice processing device 100 as a first embodiment of the present invention will be described with reference to FIG. 1. The voice processing device 100 is a device that recognizes voice data and converts it into document data.

As shown in FIG. 1, the voice processing device 100 includes a storage unit 101, a voice division unit 102, a voice recognition unit 103, and a recognition result integration unit 105. The storage unit 101 stores input voice data. The voice division unit 102 reads the voice data from the storage unit 101 and divides it into a plurality of partial voice data items. The voice recognition unit 103 recognizes the plurality of partial voice data items generated by the division by the voice division unit 102, using at least two voice recognition engines 131 to 13n, and converts them into character data. The recognition result integration unit 105 then integrates the character data output from the voice recognition unit 103 to generate document data.

According to the above configuration, the voice data is processed in parallel by a plurality of voice recognition engines, so voice recognition can be performed very efficiently.
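The flow of the first embodiment (divide, recognize in parallel, integrate) can be sketched roughly as follows. This is only an illustration in Python, not the implementation described in the patent: the names `split_voice_data` and `process_voice`, the round-robin assignment of parts to engines, and the engine interface (a callable that takes samples and returns text) are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

# Hypothetical engine interface: a callable mapping samples to recognized text.
Engine = Callable[[Sequence[int]], str]


def split_voice_data(samples: Sequence[int], n_parts: int) -> List[Sequence[int]]:
    """Role of the voice division unit: cut the samples into contiguous parts."""
    if not samples:
        return []
    size = -(-len(samples) // n_parts)  # ceiling division
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def process_voice(samples: Sequence[int], engines: List[Engine], n_parts: int) -> str:
    """Divide, recognize the parts in parallel, then join the texts in order."""
    parts = split_voice_data(samples, n_parts)
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # Assumption: parts are handed to the engines in round-robin order.
        futures = [pool.submit(engines[i % len(engines)], part)
                   for i, part in enumerate(parts)]
        texts = [f.result() for f in futures]  # kept in the original part order
    return "".join(texts)
```

Collecting the futures in submission order keeps the integrated text in time order even when the engines finish at different times.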

(Second Embodiment)
A second embodiment of the present invention will be described with reference to FIG. 2 and subsequent figures. FIG. 2 is a block diagram showing the configuration of a voice processing system 200 according to the present embodiment. In FIG. 2, the voice processing system 200 is connected to a microphone 210, a speaker 220, a display 230, an operation unit (such as a mouse or keyboard) 240, and the like. The voice processing system 200 includes a voice storage unit 201, a voice recognition unit 202, a document processing unit 203, a voice playback unit 207, and the operation unit 240. The voice processing system 200 is a system that recognizes voice input from the microphone 210 and, while outputting the input voice and the recognition result to the display 230 and the speaker 220, allows errors in the document obtained as the recognition result to be corrected and edited. Among the components of the voice processing system 200, the voice storage unit 201 stores the voice data input from the microphone 210. The voice recognition unit 202 recognizes the voice data stored in the voice storage unit 201 and converts it into document data. The document processing unit 203 inserts the document data generated by the voice recognition unit 202 into a prepared GUI form to generate display data. The generated display data is displayed on the display 230.

The operation unit 240 accepts user operations, and the document data is edited and corrected while the document data generated by the document processing unit 203 is displayed on the display 230.

FIG. 3 is a diagram showing the detailed configuration of the voice recognition unit 202. The voice recognition unit 202 includes a voice input unit 310, a voice division unit 320, a plurality of recognition engines 331 to 33n, a recognition result combining unit 340, and a document output unit 350.

The voice input unit 310 reads voice data (a voice file) from the voice storage unit 201 and passes it to the voice division unit 320. The voice division unit 320 divides the received voice data into partial voice data items. The partial voice data items are sent to the plurality of recognition engines 331 to 33n, where each is subjected to voice recognition processing and converted into character data.

Here, a plurality of voice recognition engines may each perform recognition processing on a single partial voice data item. In that case, the most reliable voice recognition result can be adopted. Alternatively, two adjacent partial voice data items may be input to each voice recognition engine, and the voice recognition engine may recognize the combination of the two. For example, if a series of partial voice data items 1 to 3 is generated by the division, recognition accuracy may be improved by having the voice recognition engine 331 recognize the combination of partial voice data 1 and partial voice data 2, and having the voice recognition engine 332 recognize the combination of partial voice data 2 and partial voice data 3.
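The pairing of adjacent partial voice data items in the example above can be illustrated with a short Python sketch. The function name `adjacent_pairs` and the representation of each part as a list of samples are assumptions for illustration only.

```python
from typing import List, Tuple


def adjacent_pairs(parts: List[List[int]]) -> List[Tuple[int, List[int]]]:
    """Return (index, combined samples) for each pair of neighbouring parts.

    Pair k joins part k and part k+1, so pair 0 could be given to one engine
    and pair 1 to another, as in the example with parts 1 to 3.
    """
    return [(i, parts[i] + parts[i + 1]) for i in range(len(parts) - 1)]
```

With three parts this yields two combinations, each of which contains the middle part, so the middle part is recognized twice with different neighbouring context.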

The recognition result combining unit 340 integrates one or more recognition results in time series to create an integrated recognition result. When the recognition results for the same partial voice data overlap (for example, when a plurality of recognition engines using different dictionaries or recognition methods have recognized one partial voice file), the recognition results are compared and the one with the higher reliability (the one with more feature matches) is adopted to create the integrated recognition result. The integrated recognition result is then output to the document output unit 350. The document output unit 350 generates a graphical interface in which the user can edit the document obtained as the recognition result, and outputs the integrated recognition result.
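A minimal Python sketch of this integration step is shown below. The `Result` record, its `confidence` field (standing in for the reliability measure, which the text describes as the number of feature matches), and the function name `integrate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Result:
    seq: int           # sequence number of the partial voice data item
    text: str          # recognized character data
    confidence: float  # hypothetical reliability score (higher is better)


def integrate(results: Iterable[Result]) -> str:
    """Keep the most reliable result per part, then join them in time order."""
    best: Dict[int, Result] = {}
    for r in results:
        if r.seq not in best or r.confidence > best[r.seq].confidence:
            best[r.seq] = r
    return "".join(best[seq].text for seq in sorted(best))
```

For example, if two engines return "hello " (0.8) and "hollow " (0.5) for part 0 and one engine returns "world" for part 1, the integrated result is "hello world", regardless of the order in which the results arrive.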

FIG. 4 is a diagram showing the voice data stored in the voice storage unit 201. Because a system that supports the creation of meeting minutes is assumed here, one voice data table 400 is created for each meeting. In the voice data table 400, in addition to a storage sequence number 401 for identifying the meeting, a comment 402 such as the place, speakers, and content is attached, and digital voice files 403 are stored together with time stamps indicating the date and time when each voice file was generated.

FIG. 5 shows a partial voice data table 500 for managing the divided voice data. Each partial voice data item has a length of, for example, 10 ms or 1 s, and is given a sequential number together with a time stamp. This makes it possible to integrate the recognition results accurately. Each partial voice data item is assigned a recognition engine number that identifies the recognition engine that is to recognize it. The partial voice data table 500 also has an area for storing the recognition results from the recognition engines 331 to 33n, which identifies the character indicated by each syllable.
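One possible in-memory shape for such a table is sketched below in Python. The field names, the use of a millisecond offset as the time stamp, and the round-robin engine assignment are assumptions for illustration; the patent does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PartialVoiceRow:
    seq: int                    # sequential number of the part
    start_ms: int               # time stamp, here an offset from the start
    engine_no: int              # recognition engine assigned to this part
    text: Optional[str] = None  # recognition result, filled in later


def build_table(total_ms: int, part_ms: int, n_engines: int) -> List[PartialVoiceRow]:
    """Create one row per part of length part_ms (the last part may be shorter)."""
    return [
        PartialVoiceRow(seq=i, start_ms=start, engine_no=i % n_engines)
        for i, start in enumerate(range(0, total_ms, part_ms))
    ]
```

Because every row carries both a sequence number and a time stamp, results that come back out of order can still be placed correctly during integration.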

FIG. 6 is a diagram for explaining methods of dividing voice data. Two methods are conceivable: a method 601 that equally divides the voice data before division, including its silent parts, and a method 602 that equally divides only the voiced parts that remain after the silent parts are removed from the voice data before division. The method 601 has the advantage that the division processing is fast, and the method 602 has the advantage that the voice recognition is fast. An instruction from the user regarding the division interval may be accepted. In other words, the method 602 can also be described as a method of searching for silent sections in the voice file and dividing the voice at the timing when a silent section starts or ends. The division time may be changed for each voiced section according to the length of that voiced section. The division processing may also be performed after the voiced sections are concatenated.
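The two division methods can be contrasted with a small Python sketch. It is deliberately simplified: silence is judged per sample by an amplitude threshold, which is an assumption made here for brevity (the text does not say how silence is detected), and the function names are hypothetical.

```python
from typing import List


def split_equal(samples: List[int], part_len: int) -> List[List[int]]:
    """Method 601 (sketch): equal division, silent parts included."""
    return [samples[i:i + part_len] for i in range(0, len(samples), part_len)]


def split_voiced(samples: List[int], part_len: int, threshold: int) -> List[List[int]]:
    """Method 602 (sketch): drop silent samples, then divide what remains equally.

    Assumption: a sample counts as silent when its absolute value is below
    the threshold. The voiced samples are concatenated before division.
    """
    voiced = [s for s in samples if abs(s) >= threshold]
    return split_equal(voiced, part_len)
```

Method 601 does no inspection of the samples, which is why its division is fast, while method 602 hands the engines less data to recognize.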

Alternatively, the ratio of silent parts in the voice data before division may be calculated in advance, and the method may be switched so that the method 601 is adopted when the silence ratio is smaller than a predetermined value and the method 602 is adopted when the silence ratio is larger than the predetermined value.
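This switching rule can be written as a short Python sketch. The per-sample silence test and the function names are assumptions, and because the text does not say what happens when the ratio equals the predetermined value, the sketch arbitrarily chooses method 602 in that case.

```python
from typing import List


def silence_ratio(samples: List[int], threshold: int) -> float:
    """Fraction of samples whose absolute value is below the threshold."""
    if not samples:
        return 0.0
    silent = sum(1 for s in samples if abs(s) < threshold)
    return silent / len(samples)


def choose_method(samples: List[int], threshold: int, ratio_limit: float) -> int:
    """Return 601 when little silence is present, otherwise 602.

    Assumption: a ratio exactly equal to ratio_limit selects 602.
    """
    return 601 if silence_ratio(samples, threshold) < ratio_limit else 602
```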

Furthermore, as shown in FIG. 7, a division method may be adopted in which each divided voice has, before it, after it, or both, a partial section of the voice from the preceding or following time. With this overlapping method, the divided voices share overlapping portions and each partial voice data item can be made longer, so misrecognition caused by poor division timing can be avoided. If the partial voice data is made longer still, voice recognition can be performed more accurately by also taking the preceding and following context into account. Of course, a non-overlapping method can also be adopted, in which the parts have no sections in common. Because the same voice is never converted more than once, resources such as CPU time and memory usage can be reduced.
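The overlapping division can be sketched in Python as follows. The function name `split_with_overlap` and the choice to extend each part by the same number of samples on both sides are assumptions; setting `overlap` to 0 gives the non-overlapping method.

```python
from typing import List


def split_with_overlap(samples: List[int], part_len: int, overlap: int) -> List[List[int]]:
    """Divide into parts of part_len, each extended by `overlap` samples
    borrowed from the preceding and following parts where they exist."""
    parts = []
    for start in range(0, len(samples), part_len):
        lo = max(0, start - overlap)
        hi = min(len(samples), start + part_len + overlap)
        parts.append(samples[lo:hi])
    return parts
```

The overlapped samples are recognized twice, which is the extra CPU and memory cost that the non-overlapping method avoids.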

FIG. 8 is a diagram showing a graphical user interface 800 that is generated by the document processing unit 203 and displayed on the display 230. This is one example, and the present invention is not limited to it. The graphical user interface 800 includes a voice waveform display field 801, a title display field 802, and a document data display field 803 for the recognition result. The graphical user interface 800 further includes a time stamp display field 804 that shows the start time of the voice corresponding to each piece of document data (the elapsed time from the start of the entire voice).

Below the document data display field 803, a play button 805, a volume button 806, a repeat button 807, and the like are provided, and are used for playback, volume change, and repeat operations, respectively. Each character displayed in the document data display field 803 is linked to the position in the voice data from which that character was recognized. Therefore, if the cursor is moved to any position in the document displayed in the document data display field 803 and the play button 805 is clicked in that state, the voice data corresponding to that position is played back. In other words, the sentences displayed as document data do not represent units of voice data; they are arranged as table rows, one per line, purely for convenience in editing the document. The voice data is not divided row by row, and the graphical user interface 800 plays back a single piece of voice data.
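The link between a cursor position in the text and a playback position in the voice data could be realized in many ways. The Python sketch below shows one simplified possibility, in which the link is kept per recognized segment rather than per character; that granularity, the millisecond offsets, and the function names are all assumptions for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple


def build_char_index(segments: List[Tuple[int, str]]) -> Tuple[List[int], List[int]]:
    """segments: (start_ms, text) pairs in time order.

    Returns the character offset at which each segment begins in the joined
    document, together with the matching start times.
    """
    offsets, starts, pos = [], [], 0
    for start_ms, text in segments:
        offsets.append(pos)
        starts.append(start_ms)
        pos += len(text)
    return offsets, starts


def playback_position(cursor: int, offsets: List[int], starts: List[int]) -> int:
    """Start time (ms) of the segment that contains the cursor position."""
    i = bisect_right(offsets, cursor) - 1
    return starts[max(i, 0)]
```

Playback then seeks within the single voice file to the returned offset, which matches the point that the rows of text are not separate pieces of voice data.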

Because the operation panel is placed below the document data display field 803 as shown in FIG. 8, the user can easily play back the voice data while checking and editing the document data. In the document data display field 803 the document data is arranged in time series from top to bottom, but it may instead be displayed in time series from bottom to top. In that case, it is desirable to place the operation buttons 805 to 807 above the document data display field 803. Alternatively, if the user can change the display order of the document data, the position of the operation buttons may be changed according to the display order setting.

As described above, according to the present embodiment, the voice data is divided and processed in parallel by a plurality of voice recognition engines, so voice recognition processing can be performed very efficiently.

(Third Embodiment)
A voice processing system according to a third embodiment of the present invention will be described with reference to FIG. 9. FIG. 9 is a diagram showing the internal configuration of a voice recognition unit 902 included in the voice processing system according to the present embodiment. The configuration of the voice processing system other than the voice recognition unit 902 is the same as in the second embodiment, so its description is omitted here.

The voice recognition unit 902 differs from the voice recognition unit 202 of the second embodiment in that it has one or more voice recognition engines 931 to 93n with a learning function and a recognition engine learning control unit 960. The rest of the configuration is the same as in the second embodiment, so the same components are denoted by the same reference numerals and their detailed description is omitted.

The recognition engines 931 to 93n have a function of converting input voice into a character string. The voice is input from the voice division unit 320, and the converted character string is input to the recognition result combining unit 340. The recognition engines 931 to 93n also have a function of learning characteristics of the voice, such as the speaker's gender and speech habits, each time voice is converted, and of autonomously improving the recognition rate. The recognition engine learning control unit 960 controls the learning performance of each recognition engine. Instead of starting a plurality of voice recognition engines when the system starts, a single voice recognition engine may be started at first, and after a certain amount of voice recognition processing, its learned results may be duplicated to start a plurality of engines.
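The idea of training one engine first and then duplicating its learned state can be illustrated with a toy Python sketch. The `LearningEngine` class is purely hypothetical: its `samples_seen` counter merely stands in for whatever speaker characteristics a real engine would learn, and `copy.deepcopy` is one assumed way of duplicating that state.

```python
import copy
from typing import List


class LearningEngine:
    """Toy stand-in for a recognition engine with a learning function."""

    def __init__(self) -> None:
        self.samples_seen = 0  # placeholder for learned speaker characteristics

    def recognize(self, part: List[int]) -> str:
        self.samples_seen += len(part)  # "learn" from every conversion
        return f"<{len(part)} samples>"


def warm_up_and_replicate(parts: List[List[int]], warmup_count: int,
                          n_engines: int) -> List[LearningEngine]:
    """Run one engine over the first parts, then clone its learned state."""
    seed = LearningEngine()
    for part in parts[:warmup_count]:
        seed.recognize(part)
    return [copy.deepcopy(seed) for _ in range(n_engines)]
```

Each clone starts from the same learned state but then learns independently, since a deep copy shares nothing with the original.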

According to the present embodiment, because the voice recognition engines 931 to 93n have a learning function, performing a certain amount of voice recognition processing lets them learn characteristics such as gender and speech habits, which helps improve the recognition rate.

(Other Embodiments)
Embodiments of the present invention have been described in detail above, but a system or device that combines, in any way, the separate features included in the respective embodiments is also included in the scope of the present invention.

The present invention may be applied to a system composed of a plurality of devices, or to a single device. The present invention is also applicable to a case where an information processing program that realizes the functions of the embodiments is supplied to a system or device directly or remotely. Therefore, a program installed on a computer in order to realize the functions of the present invention on that computer, a medium storing the program, and a WWW (World Wide Web) server from which the program is downloaded are also included in the scope of the present invention.

Claims

1. A voice processing device comprising:
storage means for storing input voice data;
voice division means for dividing the voice data stored in the storage means;
voice recognition means for recognizing, using at least two voice recognition engines, a plurality of partial voice data items generated by the division by the voice division means, and converting them into character data; and
integration means for integrating the character data obtained as recognition results by the voice recognition means to generate document data.

2. The voice processing device according to claim 1, wherein the voice recognition means causes at least two voice recognition engines to perform recognition processing on one partial voice data item generated by the voice division means.

3. The voice processing device according to claim 1, wherein the voice division means equally divides the voice data before division, including silent parts.

4. The voice processing device according to claim 1, wherein the voice division means equally divides a voiced part obtained by removing silent parts from the voice data before division.

5. The voice processing device according to claim 1, wherein the voice division means receives an instruction from a user and changes the division interval according to the instruction.

6. The voice processing device wherein the voice division means searches for a silent section existing in the voice data and divides the voice data at a timing when the silent section starts or ends.

7. The voice processing device according to claim 6, wherein the voice division means changes the division interval for each voiced section according to the length of the voiced section existing in the voice data.

8. The voice processing device according to claim 1, wherein the voice division means calculates the ratio of silent sections in the voice data, equally divides the voice data before division including silent parts when the silence ratio is smaller than a predetermined value, and equally divides a voiced part obtained by removing silent parts from the voice data before division when the silence ratio is larger than the predetermined value.

9. The voice processing device according to any one of claims 1 to 8, wherein the voice division means divides the voice data so that end portions of the partial voice data items contain overlapping voice data.

10. A voice processing method comprising:
a voice division step of dividing voice data stored in storage means;
a voice recognition step of recognizing, using at least two voice recognition engines, a plurality of partial voice data items generated by the division in the voice division step, and converting them into character data; and
an integration step of integrating the character data obtained as recognition results in the voice recognition step to generate document data.

11. A voice processing program that causes a computer to execute:
a voice division step of dividing voice data stored in storage means;
a voice recognition step of recognizing, using at least two voice recognition engines, a plurality of partial voice data items generated by the division in the voice division step, and converting them into character data; and
an integration step of integrating the character data obtained as recognition results in the voice recognition step to generate document data.