JP2014134640A

JP2014134640A - Transcription device and program

Info

Publication number: JP2014134640A
Application number: JP2013001960A
Authority: JP
Inventors: Yuya Fujita; 悠哉藤田; Akio Kobayashi; 彰夫小林; Shoe Sato; 庄衛佐藤
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2013-01-09
Filing date: 2013-01-09
Publication date: 2014-07-24

Abstract

PROBLEM TO BE SOLVED: To provide a transcription device and a program capable of improving accuracy and performance of voice recognition. SOLUTION: A model storage unit stores an acoustic model and a language model. A voice recognition unit performs recognition processing based on feature quantity extracted from input voice data, while referring to the acoustic model and the language model. A correction unit enables correction of a recognition result text to be done based on an operation of a user, and creates a correction result text as a result of the correction. A pronunciation imparting unit imparts a phoneme string corresponding to the correction result text. An alignment unit associates the imparted phoneme string and the input voice data in time, and aligns them. An adaptation unit adapts the acoustic model and the language model based on the aligned phoneme string, the input voice data and the correction result text.

Description

The present invention relates to a transcription apparatus that transcribes speech and outputs the resulting text, and to a program therefor.

It is common practice to record speech, such as remarks by meeting participants or statements at a press conference, on some medium and to convert it into text exactly as the speakers spoke it. This work of converting speech into text is called transcription.

Performing transcription manually, that is, having a transcription operator listen to the recorded speech and type its content on a keyboard or the like, requires a great deal of labor. Efforts are therefore being made to reduce this manual effort by using speech recognition technology.

Patent Documents 1 and 2 disclose configurations in which the text of a speech recognition result obtained by speech recognition processing is edited and written to a storage device.

Japanese Patent No. 4020083
Japanese Patent No. 3859612

However, the techniques disclosed in the prior art documents listed above do not achieve sufficient speech recognition accuracy. When the recognition result contains errors, those parts must be corrected manually, so work efficiency does not improve.

The present invention has been made in recognition of the above problem, and provides a transcription apparatus and a program therefor that can improve the accuracy and performance of speech recognition. In particular, the present invention provides a transcription apparatus and a program therefor that can adaptively improve the accuracy and performance of speech recognition even while the apparatus is in operation.

[1] To solve the above problem, a transcription apparatus according to one aspect of the present invention comprises: a model storage unit that stores an acoustic model and a language model; a voice recognition unit that performs recognition processing based on feature quantities extracted from input voice data while referring to the acoustic model and the language model stored in the model storage unit, and outputs a recognition result text; a correction unit that enables correction of the recognition result text based on a user operation and creates a correction result text obtained as a result of the correction; a pronunciation imparting unit that assigns a phoneme string corresponding to the pronunciation of the correction result text; an alignment unit that aligns the phoneme string assigned by the pronunciation imparting unit with the input voice data by associating them in time; and an adaptation unit that adapts or creates an acoustic model, based on the correspondence between the phoneme string aligned by the alignment unit and the input voice data and on the feature quantities of the input voice data, and updates the model storage unit, and that also adapts or creates a language model based on the correction result text and updates the model storage unit.

With this configuration, a pronunciation sequence is assigned automatically based on the recognition result text and on the correction result text obtained by correcting it. The assigned pronunciation sequence and the input voice data are also aligned in the time direction. Consequently, both the acoustic model and the language model can be adapted based on the recognition result text and the correction result text, which improves the accuracy of speech recognition.

[2] In one aspect of the present invention, in the above transcription apparatus, when the recognition result text is not corrected based on a user operation, the correction unit outputs the recognition result text as-is as the correction result text.

With this configuration, the models can be adapted even when all or part of the recognition result text is left uncorrected.

[3] In one aspect of the present invention, the above transcription apparatus further comprises a control unit that performs control so that a series of processes is repeated a plurality of times, the series including: the recognition processing by the voice recognition unit; the process in which the correction unit creates the correction result text; the process in which the pronunciation imparting unit assigns the phoneme string; the process in which the alignment unit aligns the phoneme string with the input voice data; and the process in which the adaptation unit updates the model storage unit.

This makes it possible to repeat the transcription process for given input voice data a plurality of times. With such a transcription apparatus, the user can progressively build up the transcription result while adapting the models, thereby raising the accuracy of the speech recognition results.

[4] One aspect of the present invention is a program for causing a computer to function as: a model storage unit that stores an acoustic model and a language model; voice recognition means that performs recognition processing based on feature quantities extracted from input voice data while referring to the acoustic model and the language model stored in the model storage unit, and outputs a recognition result text; correction means that enables correction of the recognition result text based on a user operation and creates a correction result text obtained as a result of the correction; pronunciation imparting means that assigns a phoneme string corresponding to the pronunciation of the correction result text; alignment means that aligns the phoneme string assigned by the pronunciation imparting means with the input voice data by associating them in time; and adaptation means that adapts or creates an acoustic model, based on the correspondence between the phoneme string aligned by the alignment means and the input voice data and on the feature quantities of the input voice data, and updates the model storage unit, and that also adapts or creates a language model based on the correction result text and updates the model storage unit.

According to the present invention, the models are updated based on the speech recognition result and the correction result, so speech recognition accuracy can be improved. This in turn improves the efficiency of transcription work. Furthermore, according to the present invention, the accuracy and performance of speech recognition can be adaptively improved even while the apparatus is in operation.

A block diagram showing the schematic functional configuration of a transcription apparatus according to an embodiment of the present invention.
A schematic diagram showing the structure of the input voice data stored in the input voice storage unit according to the embodiment.
A schematic diagram showing the data structure of the feature quantities extracted by the feature quantity extraction unit according to the embodiment.
A schematic diagram showing the structure of the data stored in the recognition result storage unit according to the embodiment.
A schematic diagram showing the structure of the correction result data output by the correction result output unit according to the embodiment.
A flowchart showing the overall processing procedure of the transcription apparatus according to the embodiment.
A schematic diagram showing an example screen of the user interface provided by the correction unit according to the embodiment.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is a block diagram showing the schematic functional configuration of the transcription apparatus according to the present embodiment. As shown, the transcription apparatus 1 comprises a voice acquisition unit 11, an input voice storage unit 12, a feature quantity extraction unit 14, a model storage unit 15, a voice recognition unit 16, a recognition result storage unit 18, a correction unit 20, a correction result output unit 22, a pronunciation imparting unit 24, an alignment unit 26, an adaptation unit 28, and a control unit 40.

The voice acquisition unit 11 acquires voice data from outside and writes it to the input voice storage unit 12.
The input voice storage unit 12 stores the voice data to be transcribed.
The feature quantity extraction unit 14 reads the input voice data stored in the input voice storage unit 12 and extracts its acoustic feature quantities. The feature quantity extraction unit 14 has the extracted acoustic feature data stored in association with the input voice data.

The model storage unit 15 internally comprises an acoustic model storage unit 151 and a language model storage unit 152. An acoustic model represents the relationship between acoustic feature quantities extracted from speech and phoneme symbols. A language model represents statistical tendencies in the occurrence of language elements (characters, words, and so on). In the present embodiment, the acoustic model storage unit 151 stores HMM (hidden Markov model) acoustic models, and the language model storage unit 152 stores n-gram language models.
The voice recognition unit 16 performs recognition processing on the input voice data using the acoustic feature quantities extracted by the feature quantity extraction unit 14, while referring to the acoustic model and the language model stored in the model storage unit 15. The voice recognition unit 16 writes the text data that is the recognition result to the recognition result storage unit 18.
The recognition result storage unit 18 stores the text data of the recognition result output from the voice recognition unit 16.

The correction unit 20 has a function for editing and correcting the recognition result text data stored in the recognition result storage unit 18, and enables the recognition result text to be corrected based on user operations. The correction unit 20 then creates the correction result text obtained as a result of the correction. Specifically, the correction unit 20 provides a user interface through which the user can correct erroneous parts of the recognition result text data. In doing so, the correction unit 20 displays positions in the recognition result text data and positions in the input voice data to the user in association with each other by time. The correction unit 20 also displays the waveform corresponding to the input voice data on the screen. In addition, the correction unit 20 has a function for cueing the input voice data to any position specified by the user and playing back the audio from there.

The correction result output unit 22 outputs the text data of the recognition result or, when corrections have been made, the text data reflecting those corrections. The output destination of the correction result output unit 22 is an information storage medium, an external device, or the like. The correction result output unit 22 also outputs this text data to the pronunciation imparting unit 24. The text data output by the correction result output unit 22 is the transcription result data (hereinafter sometimes called the "transcription text"). When no part of the recognition result has been corrected, the text data output by the correction result output unit 22 is identical to the text data of the recognition result.

The pronunciation imparting unit 24 assigns a pronunciation sequence to the transcription text. A pronunciation sequence is typically data consisting of a string of symbols representing phonemes.
The alignment unit 26 associates positions (times) in the pronunciation sequence assigned by the pronunciation imparting unit 24 with positions (times) in the input voice data (alignment). At this point, the input voice data and the corresponding sequence of acoustic feature data (shown in FIG. 3) are passed from the feature quantity extraction unit 14 to the alignment unit 26.

The adaptation unit 28 adapts the models stored in the model storage unit 15 (the acoustic model, the language model, or both) using: the transcription text output from the correction result output unit 22 or the recognition result text read from the recognition result storage unit 18; the phoneme string aligned to that text by the alignment unit 26; and the input voice data together with the corresponding sequence of acoustic feature data (shown in FIG. 3). Specifically, the adaptation unit 28 adapts or creates an acoustic model based on the correspondence between the phoneme string aligned by the alignment unit 26 and the input voice data (a relationship established through relative time) and on the feature quantities of the input voice data, and updates the model storage unit 15. The adaptation unit 28 also adapts or creates a language model based on the correction result text and updates the model storage unit 15. The adaptation unit 28 may either update an existing model already registered in the model storage unit 15 or create a new model and register it in the model storage unit 15. The models adapted here can be referred to in subsequent recognition processing.
The control unit 40 controls the operation of each unit in the transcription apparatus 1. The control unit 40 also controls the user interface (output to the computer display screen, character input from the keyboard, instruction input from a pointing device such as a mouse, and so on).
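The language-model side of the adaptation described above can be pictured with a small sketch. The source does not specify the adaptation algorithm, so the following simply accumulates bigram counts from corrected text and re-estimates probabilities by relative frequency; the function names, the sentence markers `<s>` and `</s>`, and the assumption that the text is already split into words are all illustrative and not taken from the patent.

```python
from collections import Counter

def update_bigram_counts(counts, words):
    """Add the bigrams of one corrected utterance to the running counts."""
    padded = ["<s>"] + list(words) + ["</s>"]   # hypothetical sentence markers
    counts.update(zip(padded, padded[1:]))
    return counts

def bigram_probability(counts, prev, word):
    """Relative-frequency estimate of P(word | prev); 0.0 if prev is unseen."""
    total = sum(c for (p, _), c in counts.items() if p == prev)
    return counts[(prev, word)] / total if total else 0.0

counts = Counter()
update_bigram_counts(counts, ["a", "b"])
update_bigram_counts(counts, ["a", "c"])
```

Each newly corrected utterance can be fed through `update_bigram_counts`, so the estimates shift toward the vocabulary and phrasing of the material being transcribed. A practical n-gram model would also need smoothing, which is omitted here.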

Each of the above units is implemented using, for example, electronic circuits. Each storage unit that stores data is implemented using a semiconductor memory, a magnetic disk device, or the like.

FIG. 2 is a schematic diagram showing the structure of the input voice data stored in the input voice storage unit 12. As one example, the data stored in the input voice storage unit 12 is held in an object-oriented database. As shown, the data is a table with the fields utterance number, relative time, and voice waveform data, and each row of the table corresponds to one utterance. Here, an utterance is one segment of the words a person speaks. Segment boundaries are the start and end of the speech, punctuation in between, pauses, and the like. The utterance number is a serial number that uniquely identifies each utterance. The relative time indicates the position of the utterance relative to the whole of the input voice data and is expressed in the format "HH:MM:SS.hh" (hours, minutes, seconds, hundredths of a second). The voice waveform data represents the speech waveform; specifically, it is a sequence of amplitude values sampled at a predetermined sampling frequency.
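As an illustration of the row structure in FIG. 2, the following sketch models one utterance record and converts the "HH:MM:SS.hh" relative time into seconds. The class name, field names, and helper function are hypothetical; the source specifies only the three fields and the time format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    utterance_no: int      # serial number that uniquely identifies the utterance
    relative_time: str     # "HH:MM:SS.hh", position within the whole input audio
    waveform: List[int]    # amplitude values sampled at a fixed sampling frequency

def relative_time_to_seconds(stamp: str) -> float:
    """Convert "HH:MM:SS.hh" (hundredths of a second) to seconds."""
    hours, minutes, rest = stamp.split(":")
    seconds, hundredths = rest.split(".")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(hundredths) / 100

row = Utterance(utterance_no=1, relative_time="00:01:02.50", waveform=[0, 12, -7])
offset = relative_time_to_seconds(row.relative_time)
```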

FIG. 3 is a schematic diagram showing the data structure of the result of feature extraction by the feature quantity extraction unit 14. As one example, the feature extraction result data is held in an object-oriented database. As shown, the data is a table with the fields utterance number, relative time, voice waveform data, and feature data, and each row of the table corresponds to one utterance. The utterance number, relative time, and voice waveform data fields are as described for FIG. 2. The feature data field holds the acoustic feature quantities extracted from the voice waveform data. Mel-frequency cepstral coefficients (MFCC), for example, are used as the acoustic feature quantities; specifically, the feature data field holds a time series of these coefficients. The length of the window used when performing the Fourier transform for feature extraction is, for example, 10 milliseconds.
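The windowing that precedes the Fourier transform can be sketched as follows. Only the 10-millisecond window length comes from the source; the 16 kHz sampling rate and the use of non-overlapping windows are assumptions made for illustration, and the MFCC computation itself is not shown.

```python
def split_into_windows(samples, sampling_rate_hz=16000, window_ms=10):
    """Cut a sample sequence into consecutive, non-overlapping analysis windows."""
    window_len = sampling_rate_hz * window_ms // 1000   # samples per window
    last_start = len(samples) - window_len
    # Windows that would run past the end of the data are dropped.
    return [samples[i:i + window_len] for i in range(0, last_start + 1, window_len)]

windows = split_into_windows(list(range(400)))
```

At the assumed 16 kHz rate, a 10 ms window holds 160 samples, so 400 samples give two full windows and the trailing 80 samples are discarded.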

FIG. 4 is a schematic diagram showing the data structure of the recognition result storage unit 18. This data is written by the voice recognition unit 16. As one example, the recognition result data is held in an object-oriented database. As shown, the data is a table with the fields utterance number, relative time, voice waveform data, feature data, and recognition result (text), and each row of the table corresponds to one utterance. The utterance number, relative time, voice waveform data, and feature data fields are as described for FIGS. 2 and 3. The recognition result field holds the text produced when the voice recognition unit 16 recognized the utterance. Relative time data is associated with each word or character in the recognition result text. As a result, each word or character in the recognition result text is linked to the data at the corresponding time in the voice waveform data and in the feature data.

FIG. 5 is a schematic diagram showing the structure of the correction result data output by the correction result output unit 22. As one example, the correction result data is held in an object-oriented database. As shown, the data is a table with the fields utterance number, relative time, voice waveform data, feature data, recognition result (text), and correction result (text), and each row of the table corresponds to one utterance. The utterance number, relative time, voice waveform data, feature data, and recognition result (text) fields are as described for FIGS. 2 to 4. The correction result field holds the data resulting from the correction processing in the correction unit 20. As already explained, the correction result is the data obtained when the user corrects the recognition result. A correction here means an operation in which, in units of words or characters, part of the recognition result is deleted, part of the recognition result is replaced with another word or character, or a new word or character not contained in the recognition result is inserted. The pair consisting of the recognition result text and the correction result text itself carries the information about which correction operations were performed. Additional data representing the correction operations performed may also be stored alongside it.
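The remark that the text pair itself encodes the correction operations can be illustrated with a character-level diff. This is a sketch using Python's standard `difflib`, not a mechanism described in the source; the function name is hypothetical.

```python
from difflib import SequenceMatcher

def edit_operations(recognized, corrected):
    """List the delete/replace/insert operations that turn recognized into corrected."""
    matcher = SequenceMatcher(None, recognized, corrected, autojunk=False)
    operations = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":   # keep only the spans that actually changed
            operations.append((tag, recognized[i1:i2], corrected[j1:j2]))
    return operations
```

For instance, `edit_operations("abcd", "abxd")` yields a single replace of `"c"` by `"x"`. The same function works on lists of words when word-level operations are wanted.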

FIG. 6 is a flowchart showing the processing procedure of the transcription apparatus 1. The flow of processing is described below with reference to this flowchart. It is assumed that the input voice data acquired by the voice acquisition unit 11 has already been stored in the input voice storage unit 12, and that a plurality of acoustic models and language models have already been saved in the model storage unit 15. Because acoustic models depend on the speaker and language models depend on the topic, several of each are prepared.

First, in step S201, the control unit 40 selects the input voice data to be transcribed, based on an instruction from the user.
Next, in step S202, the control unit 40 selects, based on an instruction from the user, the acoustic model corresponding to the speaker to be transcribed and the language model corresponding to the topic. If the best-suited acoustic model or language model is not present in the model storage unit 15, the acoustic model closest to that speaker, or the language model closest to that topic, is selected. Alternatively, a general-purpose acoustic model or language model may be selected.
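The fallback order in step S202 might be sketched as below. How the "closest" model is determined is not stated in the source, so this sketch takes it as a caller-supplied hint; the function name, the dictionary layout, and the `"generic"` key are all hypothetical.

```python
def select_model(models, requested, nearest=None, generic="generic"):
    """Pick a model by key, falling back to a nearby model and then a general-purpose one."""
    if requested in models:
        return models[requested]
    if nearest is not None and nearest in models:
        return models[nearest]
    return models[generic]

acoustic_models = {"speaker_a": "AM-A", "speaker_b": "AM-B", "generic": "AM-generic"}
exact = select_model(acoustic_models, "speaker_a")
close = select_model(acoustic_models, "speaker_c", nearest="speaker_b")
fallback = select_model(acoustic_models, "speaker_c")
```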

Next, in step S203, speech recognition processing is executed. Specifically, the feature quantity extraction unit 14 extracts the acoustic feature quantities of the selected input voice data, and the voice recognition unit 16 performs recognition processing on that input voice data using the selected models. The voice recognition unit 16 then writes the recognition result data to the recognition result storage unit 18. As already described, the recognition result text is written to the recognition result storage unit 18 in a form that is associated by time with the original voice data and the extracted feature data.

Next, in step S204, the control unit 40 determines, based on an operation by the user, whether the recognition result is to be corrected. If the user indicates that no correction is needed (step S204: NO), the process proceeds to step S205. If the user indicates that a correction is to be made (step S204: YES), the process proceeds to step S206.

When the process proceeds to step S205, the correction result output unit 22 saves and outputs the recognition result text as-is as the transcription result. In this case, the correction unit 20 passes the recognition result data read from the recognition result storage unit 18 to the correction result output unit 22 without modification. In other words, when the recognition result text is not corrected based on a user operation, the correction unit 20 and the correction result output unit 22 treat the recognition result data as the correction result data and save and output it.
This processing lets the user prioritize how quickly the transcription result is output over the accuracy of the recognition result. Even when the recognition result is left uncorrected in this way, the models can still be adapted using the recognition result, and the models adapted this time can be used when speech recognition is performed again later.

When the process proceeds to step S206, the correction unit 20 corrects the text. Specifically, the correction unit 20 first displays the recognition result text on the screen in association with the input voice data. In doing so, the correction unit 20 displays the input voice data graphically on the screen as a speech waveform. An example of the screen display is described later. In response to an instruction from the user via a pointing device or the like, the correction unit 20 also cues the input voice data to the specified position (time) and plays back the audio from there. Based on the displayed information and the reproduced audio, the user identifies the erroneous parts and corrects the text by entering characters from a keyboard or the like. The corrected text is saved and output by the correction result output unit 22 as the transcription result, in the format described for FIG. 5. Specifically, the correction result output unit 22 outputs the correction result data to an external device or the like, or writes it to a storage medium inside the transcription apparatus.

Regardless of whether the process proceeded to step S205 or to step S206, the correction result output unit 22 passes the correction result data to the pronunciation imparting unit 24 as transcription data. The transcription result data holds the speech waveform data, the feature data, and the correction result text data in a form that allows them to be associated with one another by time. After the processing of step S205 or step S206 is completed, the process proceeds to step S207 in either case.

Next, in step S207, the pronunciation imparting unit 24 assigns a pronunciation sequence (phoneme string) to the correction result text (including the case where no correction was made and the recognition result text is therefore treated as the correction result text as is). The alignment unit 26 then performs alignment processing that determines in which time interval of the input speech data each pronunciation was uttered.

The specific procedure by which the pronunciation imparting unit 24 assigns the pronunciation sequence is as follows. The pronunciation imparting unit 24 converts the correction result text, which is given as a sentence in mixed kanji and kana, into text consisting only of hiragana. A morphological analyzer such as "MeCab" is used for this conversion, for example. "MeCab" is an open-source morphological analysis engine (that is, one made publicly available in source form) [reference URL: http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html]. After the text has been converted to hiragana only, the pronunciation imparting unit 24 refers to a prestored table that maps hiragana to phonetic symbols and converts the hiragana text into phonetic symbols (a string of symbols representing phonemes). When pronunciations are assigned using existing techniques of this kind, the accuracy is about 95%. In other words, the resulting phoneme string contains errors of about 5%, but an accuracy of 95% is sufficiently good for information that serves as the basis for adapting the models.
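The table lookup in the second half of this procedure can be illustrated with a minimal sketch. It assumes the text has already been converted to hiragana (for example by a morphological analyzer), and the table `KANA_TO_PHONEMES`, the phoneme symbols, and the function name `hiragana_to_phonemes` are hypothetical names introduced only for this illustration; the embodiment does not specify the table contents.

```python
# Hypothetical, heavily abbreviated hiragana-to-phoneme table. A real table
# would cover the full syllabary, long vowels, the sokuon, and so on.
KANA_TO_PHONEMES = {
    "きゃ": ["ky", "a"], "きょ": ["ky", "o"],
    "しゃ": ["sh", "a"], "しょ": ["sh", "o"],
    "あ": ["a"], "い": ["i"], "う": ["u"], "え": ["e"], "お": ["o"],
    "か": ["k", "a"], "き": ["k", "i"], "く": ["k", "u"],
    "け": ["k", "e"], "こ": ["k", "o"],
    "が": ["g", "a"], "ぎ": ["g", "i"],
    "さ": ["s", "a"], "し": ["sh", "i"],
    "ん": ["N"],
}


def hiragana_to_phonemes(text):
    """Convert hiragana-only text into a flat list of phoneme symbols."""
    phonemes = []
    i = 0
    while i < len(text):
        # Prefer two-character units (such as "しゃ") over single kana.
        for width in (2, 1):
            unit = text[i:i + width]
            if len(unit) == width and unit in KANA_TO_PHONEMES:
                phonemes.extend(KANA_TO_PHONEMES[unit])
                i += width
                break
        else:
            raise KeyError(f"no table entry for {text[i]!r}")
    return phonemes
```

Trying the longer unit first keeps contracted sounds such as "しゃ" from being split into "し" followed by an unmatched small kana.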

An example of the processing by which the alignment unit 26 aligns the input speech data with the phoneme string is as follows. The alignment unit 26 searches for the state transition sequence of the acoustic model (HMM) that maximizes the probability of outputting the phoneme string assigned by the pronunciation imparting unit 24. This processing is called Viterbi alignment and is an existing technique. Executing the Viterbi alignment yields time information for each phoneme contained in the assigned phoneme string. In other words, the alignment unit 26 outputs data that amounts to associating the correction result text, the phoneme string corresponding to that text, the input speech data, and the features of the speech data with one another on a common time axis. The size of the "time interval" mentioned above is 10 milliseconds, which equals the length of the time window used when extracting features from the input speech data. Put differently, the minimum unit of the alignment processing by the alignment unit 26 is 10 milliseconds.
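The idea of Viterbi alignment can be illustrated with a simplified sketch. It assumes, purely for illustration, one HMM state per phoneme, no transition probabilities, and a precomputed table `log_lik[t][j]` holding the log-likelihood of frame t under the j-th phoneme of the assigned string; an actual acoustic model would typically use several states per phoneme. The function name `forced_align` is hypothetical.

```python
FRAME_MS = 10  # alignment unit stated in the text: one frame per 10 ms


def forced_align(log_lik):
    """Return (start_ms, end_ms) for each phoneme of a left-to-right sequence.

    log_lik[t][j] is the log-likelihood of frame t under phoneme j.
    Every phoneme must occupy at least one frame.
    """
    n_frames, n_phones = len(log_lik), len(log_lik[0])
    if n_frames < n_phones:
        raise ValueError("fewer frames than phonemes")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_phones for _ in range(n_frames)]
    advanced = [[False] * n_phones for _ in range(n_frames)]
    score[0][0] = log_lik[0][0]  # the path must start in the first phoneme
    for t in range(1, n_frames):
        for j in range(n_phones):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else neg_inf
            best = max(stay, move)
            if best == neg_inf:
                continue  # state not reachable at this frame
            advanced[t][j] = move > stay
            score[t][j] = best + log_lik[t][j]
    # Trace back from the last frame of the last phoneme.
    starts = [0] * n_phones
    ends = [0] * n_phones
    j = n_phones - 1
    ends[j] = n_frames
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][j]:
            starts[j] = t      # phoneme j begins at frame t
            j -= 1
            ends[j] = t        # previous phoneme ends just before it
    return [(s * FRAME_MS, e * FRAME_MS) for s, e in zip(starts, ends)]
```

With five frames whose first three favor the first phoneme and whose last two favor the second, the sketch places the boundary at 30 ms, so each phoneme receives a start and end time on the same axis as the speech data.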

Next, in step S208, the adaptation unit 28 adapts a model (the acoustic model, the language model, or both) based on the alignment result output by the alignment unit 26, the transcription text (correction result text), and the input speech data. Specifically, the adaptation unit 28 either creates a new model and registers it in the model storage unit 15, or updates the original model (the model selected and used in the immediately preceding speech recognition process). Specific examples of the processing by which the adaptation unit 28 adapts the models are as follows.

For the acoustic model, when an acoustic model matching the actual speaker had been selected, the adaptation unit 28 adapts the model using the MAP (maximum a posteriori) method or the MLLR (Maximum Likelihood Linear Regression) method. Both the MAP method and the MLLR method are existing techniques that are available for use. When the acoustic model does not match the actual speaker, the adaptation unit 28 trains a new acoustic model using the Baum-Welch algorithm.
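As an illustration of MAP adaptation, the following sketch updates the mean of a single Gaussian from the feature frames that the alignment assigned to it. It uses the commonly cited MAP mean update, in which the adapted mean is (tau * prior mean + sum of frames) / (tau + number of frames). The weight `tau`, its default value, and the function name are assumptions made for this example; the embodiment does not state which model parameters are adapted or how the prior is weighted.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP update of one Gaussian mean.

    prior_mean: mean vector of the existing model.
    frames: feature vectors aligned to this Gaussian (hard assignment).
    tau: hypothetical prior weight; larger values trust the old model more.
    """
    dim = len(prior_mean)
    n = len(frames)
    sums = [sum(frame[d] for frame in frames) for d in range(dim)]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]
```

With few adaptation frames the result stays close to the existing mean, and it moves toward the mean of the new data as more frames accumulate, which is why this kind of update suits the case where the selected model already matches the speaker.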

For the language model, when a language model matching the actual topic subjected to recognition had been selected, the adaptation unit 28 adapts the model using the MAP method or the model mixture method. In the model mixture method, the n-gram occurrence counts of the existing language model are recounted so as to include the n-gram occurrence counts in the newly obtained text, and the language model is updated with the newly calculated n-gram probabilities. When the language model does not match the actual topic, the adaptation unit 28 trains a new language model based on the acquired text.
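The recounting step of the model mixture method can be sketched for bigrams as follows. The sketch assumes that the existing model's counts are available, uses plain relative frequencies without smoothing, and introduces a hypothetical `weight` parameter for scaling the new text's counts; none of these details are specified in the embodiment.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count bigrams over tokenized sentences, with sentence boundary marks."""
    counts = Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


def mix_and_estimate(base_counts, new_counts, weight=1.0):
    """Add the new text's counts to the existing counts and re-estimate
    P(word | history) as a relative frequency (no smoothing)."""
    merged = Counter(base_counts)
    for bigram, count in new_counts.items():
        merged[bigram] += weight * count
    history_totals = Counter()
    for (history, _), count in merged.items():
        history_totals[history] += count
    return {bg: c / history_totals[bg[0]] for bg, c in merged.items()}
```

For example, if the existing counts contain "a b" and "a c" once each and the corrected text contains "a c" twice, the updated probability of "c" after "a" becomes 3/4.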

Next, in step S209, the control unit 40 determines, based on a user operation, whether to complete the transcription process. When the user instructs that the process be completed (step S209: YES), the processing of this flowchart ends. When the user instructs that the process not be completed (step S209: NO), the process returns to step S203 and the transcription process continues using the adapted models (including any newly trained models). That is, the series of processes from step S203 to step S208 is repeated.
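The repetition controlled in step S209 can be outlined as a simple loop. This is only a schematic sketch: the callables `recognize`, `correct`, `adapt`, and `is_finished` are hypothetical stand-ins for the units described above, and the mapping of step numbers before S205 to these callables is an assumption.

```python
def transcribe_iteratively(audio, models, recognize, correct, adapt, is_finished):
    """Repeat recognition, correction, and adaptation until the user finishes.

    recognize(audio, models) -> recognition result text
    correct(text)            -> correction result text (unchanged if the
                                user chooses not to correct, as in S205)
    adapt(models, audio, text) -> adapted models (S207 and S208 combined)
    is_finished(text)        -> True when the user ends the process (S209)
    """
    passes = []
    while True:
        hypothesis = recognize(audio, models)   # uses the latest models
        text = correct(hypothesis)              # S205 or S206
        models = adapt(models, audio, text)     # S207 and S208
        passes.append(text)
        if is_finished(text):                   # S209
            return passes, models
```

Because each pass feeds the adapted models back into recognition, later passes start from models that already reflect the earlier corrections.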

In this way, the transcription can be refined gradually over multiple repetitions. As an example, consider transcribing text for use in news scripts, subtitles, or the like from a speaker's utterances at a press conference. At the initial stage, promptness takes priority, and it is sufficient that only the important parts of the utterances are transcribed. Even if the recognition results for the other parts are wrong, those parts may not be used at that time. At a later stage, it becomes necessary to transcribe more of the utterances more accurately. In short, it is useful for the control unit 40 to allow transcription of the input speech data to be repeated multiple times, and it is not always necessary to transcribe every utterance completely and accurately on the first pass. The control unit 40 controls the processing according to the procedure described above so that the apparatus can also be used in a way that takes these circumstances into account.

FIG. 7 is a schematic diagram showing an example of the user interface screen presented by the correction unit 20. In the figure, reference numeral 80 denotes the correction screen. The correction screen is displayed on a display device, for example as a single window managed by the operating system of a personal computer. Reference numeral 81 denotes a part of the screen 80 and is an area for displaying, as a waveform, the input speech data being transcribed. Reference numeral 82 denotes a part of the screen 80 and is an area for displaying the recognition result text output from the speech recognition unit 16. Reference numeral 83 likewise denotes a part of the screen 80 and is an editing area for correcting the recognition result text. The user can correct and edit the text in the editing area 83 using a pointing device or a keyboard. The area 81 for displaying the speech waveform, the area 82 for displaying the recognition result text, and the editing area 83 for correction are associated with one another by vertical broken lines. In the figure, the horizontal direction corresponds to the time axis. The broken lines drawn vertically between the area 81 and the area 82, and those drawn vertically between the area 82 and the editing area 83, each indicate a given time.

The functions of the transcription apparatus in the embodiment described above may be realized by a computer. In that case, a program for realizing these functions may be recorded on a computer-readable recording medium, and the functions may be realized by having a computer system read and execute the program recorded on the recording medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" may further include media that hold the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, as well as media that hold the program for a certain period, such as volatile memory inside a computer system serving as the server or client in that case. The program may realize only part of the functions described above, or may realize those functions in combination with a program already recorded in the computer system.

Although a plurality of embodiments have been described above, the present invention can also be implemented in the following modifications.

(Modification 1) In the configuration described above, the pronunciation imparting unit 24 assigns pronunciations (phoneme strings) to all of the correction result text output by the correction result output unit 22, and the adaptation unit 28 adapts the models using all of them. In this modification, the pronunciation imparting unit 24 assigns pronunciations only to the utterances selected by the user. The adaptation unit 28 then adapts the models using only the correction result text, pronunciations, and input speech data (and features) that correspond to the utterances selected by the user. In this case, the correction result output unit 22 or the pronunciation imparting unit 24 displays a screen on which the user can select utterances, based on the correction result data shown in FIG. 5, and accepts the user's selection operation.

With this configuration, the models are adapted using only part of the recognition results and correction results rather than all of them. This reduces the amount of computation required for adaptation.

(Modification 2) As already noted, the accuracy of automatic pronunciation assignment by the pronunciation imparting unit 24 is about 95%. In this modification, the pronunciation imparting unit 24 displays the automatically assigned pronunciations on a screen or the like and allows the user to correct the pronunciations (phoneme strings) through input operations. In the subsequent processing, the models are adapted using the pronunciations that were assigned automatically by the pronunciation imparting unit 24 and then corrected by the user. This allows the models to be adapted more effectively and further improves speech recognition accuracy.

(Modification 3) In the embodiment described above, speech recognition is performed offline on a recorded body of input speech data, and transcription is then performed using the recognition result text and the correction result text. In this modification, speech recognition is performed online on a speech stream input from outside, and transcription is performed using the recognition result text and the correction result text. Pronunciation assignment, alignment, and model adaptation based on the correction result text are performed in the same manner as in the embodiment described above. That is, the models are adapted in parallel with the online speech recognition.

The control performed by the control unit 40 in the embodiment and modifications described above is summarized below.
(a) After speech recognition is performed, the control unit 40 controls, based on an instruction from the user, whether the recognition result text is corrected.
(b) When the recognition result text is corrected, the control unit 40 controls, based on an instruction from the user, whether adaptation uses all of the correction result text or only part of it.
(c) Based on an instruction from the user, the control unit 40 controls whether, for a given set of input speech data, model adaptation based on the correction result text is performed only once or repeated multiple times.
(d) Based on an instruction from the user, the control unit 40 controls whether speech recognition is performed online or offline.

The embodiment of the present invention has been described in detail above with reference to the drawings. The specific configuration is not limited to this embodiment, however, and also includes designs and the like that do not depart from the gist of the invention.

The present invention can be used to make transcription work more efficient. Specific applications include creating meeting minutes, creating records from press conferences and interviews, and transcribing the audio of broadcast programs.

DESCRIPTION OF SYMBOLS
1 Transcription apparatus
11 Speech acquisition unit
12 Input speech storage unit
14 Feature extraction unit
15 Model storage unit
16 Speech recognition unit
18 Recognition result storage unit
20 Correction unit
22 Correction result output unit
24 Pronunciation imparting unit
26 Alignment unit
28 Adaptation unit
40 Control unit
151 Acoustic model storage unit
152 Language model storage unit

Claims

A transcription apparatus comprising:
a model storage unit that stores an acoustic model and a language model;
a speech recognition unit that, while referring to the acoustic model and the language model stored in the model storage unit, performs recognition processing based on features extracted from input speech data and outputs a recognition result text;
a correction unit that enables the recognition result text to be corrected based on an operation by a user and creates a correction result text obtained as a result of the correction;
a pronunciation imparting unit that assigns a phoneme string corresponding to the pronunciation of the correction result text;
an alignment unit that aligns the phoneme string assigned by the pronunciation imparting unit with the input speech data by associating them in time; and
an adaptation unit that adapts or creates an acoustic model based on the correspondence between the phoneme string aligned by the alignment unit and the input speech data and on the features of the input speech data and updates the model storage unit, and that adapts or creates a language model based on the correction result text and updates the model storage unit.

The transcription apparatus according to claim 1, wherein, when the recognition result text is not corrected based on an operation by the user, the correction unit outputs the recognition result text as is as the correction result text.

The transcription apparatus according to claim 1, further comprising:
a control unit that performs control so that a series of processes is repeated a plurality of times, the series including the recognition processing by the speech recognition unit, the processing in which the correction unit creates the correction result text, the processing in which the pronunciation imparting unit assigns the phoneme string, the processing in which the alignment unit aligns the phoneme string with the input speech data, and the processing in which the adaptation unit updates the model storage unit.

A program for causing a computer to function as:
a model storage unit that stores an acoustic model and a language model;
speech recognition means for performing recognition processing based on features extracted from input speech data and outputting a recognition result text, while referring to the acoustic model and the language model stored in the model storage unit;
correction means for enabling the recognition result text to be corrected based on an operation by a user and creating a correction result text obtained as a result of the correction;
pronunciation imparting means for assigning a phoneme string corresponding to the pronunciation of the correction result text;
alignment means for aligning the phoneme string assigned by the pronunciation imparting means with the input speech data by associating them in time; and
adaptation means for adapting or creating an acoustic model based on the correspondence between the phoneme string aligned by the alignment means and the input speech data and on the features of the input speech data and updating the model storage unit, and for adapting or creating a language model based on the correction result text and updating the model storage unit.