JP6327745B2

JP6327745B2 - Speech recognition apparatus and program

Info

Publication number: JP6327745B2
Application number: JP2014033024A
Authority: JP
Inventors: 彰夫小林
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2014-02-24
Filing date: 2014-02-24
Publication date: 2018-05-23
Anticipated expiration: 2034-02-24
Also published as: JP2015158582A

Description

The present invention relates to a speech recognition apparatus and a program.

Technology that uses speech recognition to produce captions for live broadcast programs is in practical use. Broadcast captions are created by manually correcting the results of speech recognition applied to the program audio (see, for example, Patent Document 1).

Patent Document 1: JP 2004-226910 A

Speech recognition of broadcast programs is mainly intended to provide information accessibility for hearing-impaired and elderly viewers. The recognition target is only the spoken language in the program audio. However, the audio of many broadcast programs does not consist of spoken language alone. For production reasons, acoustic events such as non-verbal vocal sounds (for example, laughter), applause, and background music are added. Like spoken language, acoustic events are considered to play an important role in conveying information: they supplement the description of a scene or signal a change of scene. Acoustic events are therefore one of the elements that viewers need in order to understand a program.

However, current caption production based on speech recognition does not take acoustic events into account, so the information needed to understand a program may not be fully conveyed to viewers. If the information carried by acoustic events were reflected in the captions, it would add supplementary information such as color, emphasis, and nuance, and would contribute greatly to viewers' understanding of the program. This requires producing captions to which acoustic event information has been added.

The present invention has been made in view of these circumstances, and provides a speech recognition apparatus and a program capable of producing captions to which acoustic event information has been added.

One aspect of the present invention is a speech recognition apparatus comprising: a speech recognition unit that performs speech recognition on audio data and outputs character string data indicating the utterance content of the speech recognition result; an acoustic event recognition unit that calculates posterior probabilities of acoustic events based on acoustic features obtained from the audio data and outputs character string data representing the acoustic events detected based on the calculated posterior probabilities; and a recognition result correction unit that causes a correction terminal to display the character string data of the utterance content output by the speech recognition unit and the character string data representing the acoustic events output by the acoustic event recognition unit, receives from the correction terminal an annotation insertion instruction indicating an annotation insertion position designated in the displayed character string of the utterance content and a character string representing an acoustic event selected from those displayed, and generates annotated caption data by inserting the character string data representing the acoustic event into the character string data indicating the utterance content in accordance with the received annotation insertion instruction.
According to this invention, the speech recognition apparatus causes the correction terminal to display a character string indicating the utterance content obtained by speech recognition of the audio data and character strings representing the acoustic events detected in that audio data. The apparatus then generates annotated captions by inserting a character string representing an acoustic event into the utterance content, in accordance with the annotation insertion position that the corrector designated in the utterance content at the correction terminal and the acoustic event character string that the corrector selected as the annotation to insert.
As a result, the speech recognition apparatus can generate captions with acoustic event information added through a simple operation: while viewing the display of the correction terminal, the corrector selects the position in the utterance content at which to insert an annotation and the character string representing the acoustic event to insert.

One aspect of the present invention is the speech recognition apparatus described above, further comprising an acoustic event section detection unit that divides the audio data into frames and detects sections containing acoustic events by comparing the acoustic features of each frame with the acoustic features of silence, acoustic events, and spoken language, wherein the acoustic event recognition unit calculates posterior probabilities of acoustic events based on acoustic features obtained from the audio data of the sections detected by the acoustic event section detection unit, and outputs character string data representing the acoustic events detected based on the calculated posterior probabilities.
According to this invention, the speech recognition apparatus detects sections containing acoustic events in the audio data and performs acoustic event recognition on the audio data of the detected sections.
Because only sections that contain acoustic events are subject to acoustic event recognition, the accuracy of acoustic event recognition can be improved.
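The section detection idea in this aspect can be sketched as follows. This is a simplified illustration rather than the method of the embodiment: it assumes that a per-frame score for each of the three classes (silence, acoustic event, spoken language) is already available, labels each frame with its best-scoring class, and merges runs of consecutive "event" frames into sections. The function names, the score layout, and the `min_frames` threshold are all hypothetical.

```python
from typing import List, Tuple

# Class order assumed for each per-frame score tuple (an assumption of this sketch).
CLASSES = ("silence", "event", "speech")


def classify_frames(scores: List[Tuple[float, float, float]]) -> List[str]:
    """Label each frame with the class whose score is highest."""
    labels = []
    for frame_scores in scores:
        best = max(range(len(CLASSES)), key=lambda i: frame_scores[i])
        labels.append(CLASSES[best])
    return labels


def event_intervals(labels: List[str], min_frames: int = 2) -> List[Tuple[int, int]]:
    """Merge consecutive 'event' frames into half-open [start, end) frame intervals.

    Runs shorter than min_frames are discarded (a hypothetical smoothing rule).
    """
    intervals = []
    start = None
    for i, label in enumerate(labels):
        if label == "event" and start is None:
            start = i  # a run of event frames begins here
        elif label != "event" and start is not None:
            if i - start >= min_frames:
                intervals.append((start, i))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(labels) - start >= min_frames:
        intervals.append((start, len(labels)))
    return intervals
```

Only the frames inside the returned intervals would then be passed on to acoustic event recognition.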

One aspect of the present invention is the speech recognition apparatus described above, wherein the acoustic event recognition unit calculates posterior probabilities of acoustic events by arranging the acoustic features of the time-ordered frames into which the audio data is divided and inputting them to a convolutional neural network; the convolutional neural network has an input layer, a hidden layer, a pooling layer, and an output layer; the input layer receives the acoustic features of the frames arranged in time order; each unit of the hidden layer is connected to a predetermined number of frames of the input layer, with a shift of a predetermined number of frames between units, and holds the result of a convolution operation on the acoustic features of the input-layer frames to which it is connected; each unit of the pooling layer is connected to a number of hidden-layer units determined by the number of units in the pooling layer, and the maximum value among the connected hidden-layer units is propagated to it; and each unit of the output layer corresponds to a different type of acoustic event and is connected to all units of the pooling layer by respective weights for calculating the posterior probability of the corresponding acoustic event.
According to this invention, the speech recognition apparatus divides the audio data into frames, which are the processing units for acoustic features in acoustic event recognition, and calculates the posterior probability of each acoustic event by arranging the acoustic features of the frames in time order and inputting them to the convolutional neural network.
As a result, the speech recognition apparatus can obtain the posterior probability of each acoustic event from the acoustic features of the frames obtained from the audio data.
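The layer structure described in this aspect can be sketched as a forward pass. The sketch below is a minimal illustration under several assumptions that this passage does not specify: a single convolution filter, a tanh activation in the hidden layer, non-overlapping pooling groups, and a softmax in the output layer to turn scores into posterior probabilities. All function and parameter names are hypothetical.

```python
import math
from typing import List


def conv_hidden(frames: List[List[float]], kernel: List[List[float]],
                bias: float, shift: int = 1) -> List[float]:
    """Hidden layer: each unit covers len(kernel) consecutive input frames,
    and successive units are offset by `shift` frames."""
    width = len(kernel)
    hidden = []
    for start in range(0, len(frames) - width + 1, shift):
        total = bias
        for k in range(width):
            total += sum(w * x for w, x in zip(kernel[k], frames[start + k]))
        hidden.append(math.tanh(total))  # activation choice is an assumption
    return hidden


def max_pool(hidden: List[float], pool_size: int) -> List[float]:
    """Pooling layer: each unit propagates the maximum of its group of hidden units."""
    return [max(hidden[i:i + pool_size]) for i in range(0, len(hidden), pool_size)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def event_posteriors(pooled: List[float], out_weights: List[List[float]],
                     out_bias: List[float]) -> List[float]:
    """Output layer: one unit per acoustic event type, connected to every pooling unit."""
    logits = [b + sum(w * p for w, p in zip(ws, pooled))
              for ws, b in zip(out_weights, out_bias)]
    return softmax(logits)
```

With six input frames, a kernel spanning three frames, and a shift of one frame, the hidden layer has four units; pooling in groups of two leaves two pooling units, each of which feeds every output unit.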

One aspect of the present invention is the speech recognition apparatus described above, wherein the acoustic features are features in the time-frequency domain.
According to this invention, the speech recognition apparatus recognizes acoustic events using time-frequency-domain features of the audio data.
Because the speech recognition apparatus can recognize acoustic events by concatenating frequency-domain features over a predetermined length of time or longer, the accuracy of acoustic event recognition can be improved.
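One common way to build a time-frequency feature from per-frame frequency-domain features is to concatenate each frame with its neighbors. The sketch below illustrates that idea only; the function name, the symmetric context window, and the edge handling (repeating the first or last frame) are assumptions of this sketch, not details taken from the text.

```python
from typing import List


def splice_frames(features: List[List[float]], context: int) -> List[List[float]]:
    """Concatenate each frame's feature vector with `context` frames on either side.

    Frames near the start or end reuse the boundary frame as padding.
    """
    n = len(features)
    spliced = []
    for t in range(n):
        window: List[float] = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to the valid frame range
            window.extend(features[idx])
        spliced.append(window)
    return spliced
```

Each output vector has `(2 * context + 1)` times the dimension of a single frame, so the classifier sees how the spectrum evolves over that span rather than a single instant.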

One aspect of the present invention is a program for causing a computer to function as a speech recognition apparatus comprising: speech recognition means for performing speech recognition on audio data and outputting character string data indicating the utterance content of the speech recognition result; acoustic event recognition means for calculating posterior probabilities of acoustic events based on acoustic features obtained from the audio data and outputting character string data representing the acoustic events detected based on the calculated posterior probabilities; and recognition result correction means for causing a correction terminal to display the character string data of the utterance content output by the speech recognition means and the character string data representing the acoustic events output by the acoustic event recognition means, receiving from the correction terminal an annotation insertion instruction indicating an annotation insertion position designated in the displayed character string of the utterance content and a character string representing an acoustic event selected from those displayed, and generating annotated caption data by inserting the character string data representing the acoustic event into the character string data indicating the utterance content in accordance with the received annotation insertion instruction.

According to the present invention, captions to which acoustic event information has been added can be produced.

FIG. 1 is a diagram comparing the caption production method according to an embodiment of the present invention with a conventional caption production method.
FIG. 2 is a functional block diagram showing the configuration of the caption production system according to the embodiment.
FIG. 3 is a diagram showing the overall processing flow of the speech recognition apparatus according to the embodiment.
FIG. 4 is a diagram showing the HMM for acoustic event section detection according to the embodiment.
FIG. 5 is a diagram showing the acoustic event section detection processing flow of the acoustic event section detection unit according to the embodiment.
FIG. 6 is a diagram showing the neural network for acoustic event recognition according to the embodiment.
FIG. 7 is a diagram showing the acoustic event recognition processing flow of the acoustic event recognition unit according to the embodiment.
FIG. 8 is a diagram showing the correction work screen displayed on the display unit of the correction terminal according to the embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings.
In speech recognition for caption production, outputting recognition result character strings without delay is regarded as important. Conventionally, only spoken language, which is important for conveying information to viewers, has been converted from speech into character strings for captioning; non-verbal sounds such as acoustic events have been excluded. This is because, particularly in live broadcasts, there is not enough time to correct speech recognition errors, which has made it difficult to caption information other than spoken language.

In programs such as news, spoken language carries most of the weight, and acoustic events such as sound effects are almost entirely absent. Captioning the spoken language alone is therefore enough to convey the necessary information to viewers. In sports programs and information programs, on the other hand, acoustic events such as laughter, applause, and cheering, which are non-verbal sounds, play a larger role. One reason is that, whereas news focuses on conveying facts, other programs give greater importance to non-verbal sounds for production reasons, such as conveying a sense of presence. Although acoustic events are important to a program's production, they have tended to be given little weight in conventional caption production for live broadcasts. However, from the standpoint of helping hearing-impaired and elderly viewers enjoy or understand broadcast programs more fully, it is only natural that captions should also cover acoustic events.

FIG. 1(a) shows a conventional caption production method. Because the conventional method targets only the spoken language in the input audio that can be converted to text, it detects speech sections containing spoken language in the input audio and cuts out those sections. Each extracted speech section is then speech-recognized, and text data of the word string obtained as the recognition result is output. Because the recognition result usually contains recognition errors, the errors are corrected manually and the corrected result is sent out as broadcast captions.
This series of steps is performed sequentially each time a speech section is cut out, so captions can be produced with low delay.

To insert acoustic events in the conventional caption production method based on speech recognition, the corrector could interpret what a non-verbal sound represents and then use an input method such as a keyboard to insert a character string representing the acoustic event into the speech recognition result as an annotation. However, keyboard input takes time, and in practice it is very difficult for the corrector to perform this additional keyboard input while also correcting the speech recognition result.
The speech recognition apparatus of the present embodiment solves this problem and produces captions that convey information about acoustic events to viewers.

To this end, the speech recognition apparatus of the present embodiment outputs acoustic event recognition results as annotations, together with speech recognition results like those of the conventional caption production method. Here, an "annotation" is a textual (character string) expression of an acoustic event, which is supplementary information to the spoken language. A conventional caption based on the speech recognition result of spoken language into which annotations have been inserted is referred to as an "annotated caption".

FIG. 1(b) shows the caption production method used by the speech recognition apparatus of the present embodiment.
As shown in the figure, in this method the acoustic event section detection process and the acoustic event recognition process are executed in parallel with the conventional speech section detection process and speech recognition process. The acoustic event section detection process detects sections of the input audio that contain acoustic events and cuts them out. The acoustic event recognition process recognizes the acoustic events in each extracted acoustic event section and outputs text data of a word string representing the recognized acoustic event. Because the speech recognition process and the acoustic event recognition process run in parallel, an optimal algorithm can be implemented independently for each recognition process. If recognition of acoustic events is not needed, the speech recognition apparatus can be configured not to run the program that executes the acoustic event recognition process. The caption production method can thus be chosen to suit the needs of the caption producer.
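The parallel arrangement described here can be sketched with two independent pipelines that receive the same audio. This is only an illustration of the control flow: the two pipeline callables stand in for section detection plus recognition, the thread pool is one possible way to run them concurrently, and passing `None` for the event pipeline models the option of not running acoustic event recognition. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, Tuple

SpeechPipeline = Callable[[Sequence[float]], str]
EventPipeline = Callable[[Sequence[float]], List[str]]


def run_pipelines(audio: Sequence[float],
                  speech_pipeline: SpeechPipeline,
                  event_pipeline: Optional[EventPipeline] = None) -> Tuple[str, List[str]]:
    """Run speech recognition and, if enabled, acoustic event recognition in parallel.

    Both pipelines receive the same audio; neither depends on the other's output.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        speech_future = pool.submit(speech_pipeline, audio)
        event_future = pool.submit(event_pipeline, audio) if event_pipeline else None
        text = speech_future.result()
        events = event_future.result() if event_future else []
    return text, events
```

Because the two callables share nothing but the input, either one can be swapped for a different algorithm without touching the other, which is the independence the paragraph above points to.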

In the caption production method of the present embodiment, the speech recognition result and the acoustic event recognition result are integrated during the manual correction of the speech recognition result to produce annotated broadcast captions, that is, annotated captions for broadcast. As described above, when the speech recognition apparatus executes the speech recognition process and the acoustic event recognition process in parallel, the final speech recognition result must be integrated with the acoustic event recognition result given as annotations. Normally, the speech recognition result is corrected manually at the correction terminal. When the corrector enters correction instructions for the speech recognition result displayed on the correction terminal, the speech recognition apparatus of the present embodiment also displays the annotations, which are the acoustic event recognition results, on the correction terminal and provides an efficient interface for inserting them into the speech recognition result. This interface reduces the labor of creating acoustic event character strings by keyboard input.

Through the parallel execution of the speech recognition process and the acoustic event recognition process described above, and the integration of the speech recognition result and the acoustic event recognition result during correction, the speech recognition apparatus of the present embodiment enables efficient production of captions annotated with acoustic events, which has been difficult in the past.

FIG. 2 is a block diagram showing the configuration of a caption production system according to an embodiment of the present invention; only the functional blocks relevant to the present embodiment are shown. As shown in the figure, the caption production system comprises a speech recognition apparatus 1 and correction terminals 5, which are connected via a network. The figure shows a system with two correction terminals 5, but the system may have only one correction terminal 5 or three or more. The two correction terminals 5 are referred to as correction terminals 5-1 and 5-2.

The speech recognition apparatus 1 is implemented by a computer. As shown in the figure, it comprises a storage unit 10, an audio branching unit 11, a speech section detection unit 12, a speech recognition unit 13, an acoustic event section detection unit 14, an acoustic event recognition unit 15, and a recognition result correction unit 16.

The storage unit 10 stores a statistical acoustic model for speech section detection, and a statistical acoustic model and a statistical language model for speech recognition. It also stores a statistical acoustic model for acoustic event section detection and a neural network for acoustic event recognition. The audio branching unit 11 splits the audio data D1 input to the speech recognition apparatus 1 into two branches and outputs them to the speech section detection unit 12 and the acoustic event section detection unit 14.

Using the statistical acoustic model for speech section detection stored in the storage unit 10, the speech section detection unit 12 detects spoken-language sections, that is, sections of spoken language to be converted to text, in the audio data D1 input from the audio branching unit 11. It outputs spoken-language section data D2, which is the detected spoken-language section of the audio data D1, to the speech recognition unit 13. The speech recognition unit 13 performs speech recognition on the spoken-language section data D2 using the statistical acoustic model and statistical language model for speech recognition stored in the storage unit 10, and outputs speech recognition result data D3, which contains the speech recognition result of the utterance content, to the recognition result correction unit 16.

Using the statistical acoustic model for acoustic event section detection stored in the storage unit 10, the acoustic event section detection unit 14 detects acoustic event sections, that is, sections that contain acoustic events, in the audio data D1 input from the audio branching unit 11. It outputs acoustic event section data D4, which is the detected acoustic event section of the audio data D1, to the acoustic event recognition unit 15. The acoustic event recognition unit 15 recognizes the acoustic events in the acoustic event section data D4 using the neural network for acoustic event recognition stored in the storage unit 10, and outputs acoustic event recognition result data D5, which contains the acoustic event recognition result, to the recognition result correction unit 16. The acoustic event recognition result is a textual expression (character string) representing the detected acoustic event.

The recognition result correction unit 16 outputs the speech recognition result data D3 from the speech recognition unit 13 and the acoustic event recognition result data D5 from the acoustic event recognition unit 15 to the correction terminals 5 for display. It corrects the speech recognition result based on correction instructions received from a correction terminal 5, inserts annotation character strings into the speech recognition result based on annotation insertion instructions received from a correction terminal 5, and thereby generates annotated broadcast caption data D6. A correction instruction indicates a location to correct in the speech recognition result and the correction to make there, such as deleting, inserting, or replacing characters. An annotation insertion instruction indicates an annotation insertion location in the speech recognition result and the annotation character string to insert there. The annotation character string is one that the corrector selected from the textual expressions of acoustic events in the acoustic event recognition result data D5 displayed on the correction terminal 5. The recognition result correction unit 16 outputs the generated annotated broadcast caption data D6.
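The handling of an annotation insertion instruction can be sketched as a small data structure plus a function that applies it. This is an illustrative sketch only: the text states that the instruction carries an insertion location and an annotation character string chosen from the displayed candidates, while the character-offset representation of the location, the class and function names, and the validation rules are assumptions made here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotationInsertion:
    """Hypothetical shape of an annotation insertion instruction."""
    position: int    # character offset in the recognized text (assumed representation)
    annotation: str  # event string selected from the displayed candidates


def apply_annotation(text: str, instruction: AnnotationInsertion,
                     candidates: List[str]) -> str:
    """Insert the selected acoustic event string into the recognized text."""
    if instruction.annotation not in candidates:
        raise ValueError("annotation was not among the displayed candidates")
    if not 0 <= instruction.position <= len(text):
        raise ValueError("insertion position is outside the recognized text")
    pos = instruction.position
    return text[:pos] + instruction.annotation + text[pos:]
```

For example, applying `AnnotationInsertion(9, " (applause)")` to the text `"Thank you very much"` yields `"Thank you (applause) very much"`. The corrector only picks a position and a candidate, so no annotation text has to be typed.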

The correction terminal 5 is implemented by a computer such as a personal computer. It comprises a control unit 51, a display unit 52, an input unit 53, and an audio output unit 54. The display unit 52 is a display that shows screens. The input unit 53 is a keyboard, mouse, or the like, and accepts operations by the corrector. The present embodiment is described using the example of a correction terminal 5 equipped with a touch panel and a keyboard; the touch panel serves as both the display unit 52 and the input unit 53. The audio output unit 54 is headphones or a speaker and outputs the playback sound of the audio data D1. The control unit 51 causes the display unit 52 to display the speech recognition result data D3 and the acoustic event recognition result data D5 received from the speech recognition apparatus 1. The control unit 51 also outputs to the speech recognition apparatus 1 the correction instructions for the speech recognition result and the annotation insertion instructions that the corrector enters through the input unit 53. In addition, the control unit 51 causes the audio output unit 54 to output the playback sound of the audio data D1.

Next, the operation of the speech recognition apparatus 1 is described.
First, the speech recognition apparatus 1 stores in the storage unit 10 the statistical acoustic models for speech section detection and for acoustic event section detection, the statistical acoustic model and statistical language model for speech recognition, and the neural network for acoustic event recognition. The statistical acoustic model for speech section detection and the statistical acoustic model and statistical language model for speech recognition can be the same as those used conventionally. In the present embodiment, an HMM (Hidden Markov Model) and GMMs (Gaussian Mixture Models) are used as the statistical acoustic model for acoustic event section detection. The HMM and GMMs for acoustic event section detection are trained by the same training methods as in the prior art, using as training data audio data labeled with one of three classes: speech, acoustic event, and silence. The speech label is attached to audio data of spoken language. In the GMM for acoustic events, for example, each of the mixed Gaussian distributions is made to represent the characteristics of a different type of acoustic event. The neural network for acoustic event recognition is likewise trained by the same training methods as in the prior art, using audio data labeled with each acoustic event as training data. The HMM for acoustic event section detection is described later with reference to FIG. 4, and the neural network for acoustic event recognition with reference to FIG. 6.

FIG. 3 is a diagram showing the overall processing flow of the speech recognition apparatus 1. The speech recognition apparatus 1 performs the processing shown in the figure every time audio data D1 is input.
When audio data D1 of a broadcast program is input to the speech recognition apparatus 1, the audio branching unit 11 splits the input audio data D1 into two streams, one as input to speech recognition and one as input to acoustic event recognition. This is because spoken language and acoustic events can overlap. Separating the speech recognition processing from the acoustic event recognition processing allows the most suitable recognition algorithm to be applied to each independently. The audio branching unit 11 outputs one of the two streams of audio data D1 to the speech section detection unit 12, which performs preprocessing for speech recognition, and the other to the acoustic event section detection unit 14, which performs preprocessing for acoustic event recognition (step S1).

Using conventional techniques, the speech section detection unit 12 detects and cuts out the spoken-language sections of the audio data D1 that need to be converted into text (step S2). These spoken-language sections may overlap with acoustic events such as background sounds. In the present embodiment, speech sections are detected by the techniques described in JP 2007-233148 A and JP 2007-233149 A. The speech section detection unit 12 outputs spoken-language section data D2, which is the detected spoken-language section of the audio data D1, to the speech recognition unit 13.

Specifically, every time audio data D1 is input, the speech section detection unit 12 divides the audio represented by the audio data D1 into input frames, each of which is one processing unit of a predetermined duration. The speech section detection unit 12 calculates the acoustic features of each of a predetermined number of input frames selected in chronological order. The state transition network for utterance section detection is a left-to-right HMM that passes, without skipping, through the three states of non-spoken-language, spoken language, and silence from the start to the end of an utterance. A non-spoken-language state may be used in place of the silence state. The speech section detection unit 12 reads the acoustic models for non-spoken-language and spoken language from the storage unit 10 and uses them to calculate the acoustic score (log likelihood) of each input frame. The non-spoken-language acoustic model consists of HMMs for silence, acoustic events, and the like. The spoken-language acoustic model consists of a phoneme HMM for each phoneme. The speech section detection unit 12 stores a record of the state transitions for each input frame, traces that record back from the current state toward the start state, and uses the state transition network to calculate the accumulated acoustic score of each state sequence from the input frame at the start of processing (the start point). If the difference between the largest of these accumulated acoustic scores and the acoustic score at the start point exceeds a threshold, the speech section detection unit 12 sets, as the utterance start time, the time a predetermined period before the time at which the sequence with the largest accumulated acoustic score was last in the non-spoken-language state.
For the input frames after the utterance start time has been detected, the speech section detection unit 12 likewise calculates the accumulated acoustic score of each state sequence from the input frame at the start of processing up to the current input frame. The speech section detection unit 12 determines whether the difference between the largest accumulated acoustic score over all state sequences and the largest accumulated acoustic score among the state sequences that pass from spoken language to the final non-spoken-language state exceeds a threshold. If the threshold remains exceeded for a predetermined period, the speech section detection unit 12 sets, as the utterance end time, the time a predetermined period before the time at which that period elapsed.
The speech section detection unit 12 outputs spoken-language section data D2, which collects the input frames in the section from the utterance start time to the utterance end time.

Using conventional techniques, the speech recognition unit 13 performs speech recognition on the spoken-language section data D2 with the statistical acoustic model and statistical language model for speech recognition stored in the storage unit 10 (step S3). In the present embodiment, the speech recognition unit 13 uses an HMM and GMMs as the statistical acoustic model. Also in the present embodiment, the speech recognition unit 13 obtains the recognition result by multi-pass speech recognition using a word n-gram language model as the statistical language model. The recognition result is text segmented into words, and the speech recognition unit 13 attaches to each word the time at which the word was uttered. The speech recognition unit 13 outputs speech recognition result data D3 containing the speech recognition result to the recognition result correction unit 16 (step S4).

Meanwhile, the acoustic event section detection unit 14 detects and cuts out acoustic event sections, which contain acoustic events such as background sounds, from the audio data D1 (step S5). These acoustic event sections may overlap with portions that need to be converted into text by speech recognition. The acoustic event section detection unit 14 detects acoustic event sections by the same algorithm as the speech section detection unit 12, using the GMMs and HMM for acoustic event section detection stored in the storage unit 10. The difference is that the speech section detection unit 12 targets sections of spoken language (spoken-language sections), whereas the acoustic event section detection unit 14 targets sections of non-linguistic sound. In addition, the HMM for acoustic event section detection is used in place of the state transition network for utterance section detection.

FIG. 4 is a diagram showing the HMM for acoustic event section detection stored in the storage unit 10. In the present embodiment, the HMM is configured as a so-called ergodic HMM. As shown in the figure, this ergodic HMM expresses the transitions among three classes: speech, acoustic event, and silence. Each transition is assigned a transition probability obtained by training.
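As a minimal sketch of the three-class ergodic topology described above, the structure can be written as a fully connected transition table. The probability values and the names `TRANSITIONS` and `is_ergodic` are illustrative assumptions; the description only says the probabilities are obtained by training.

```python
STATES = ("speech", "acoustic_event", "silence")

# Hypothetical transition probabilities (placeholders, not values from the patent).
TRANSITIONS = {
    "speech":         {"speech": 0.90, "acoustic_event": 0.05, "silence": 0.05},
    "acoustic_event": {"speech": 0.05, "acoustic_event": 0.90, "silence": 0.05},
    "silence":        {"speech": 0.10, "acoustic_event": 0.10, "silence": 0.80},
}


def is_ergodic(transitions):
    """Return True if every state can reach every state (itself included) in one step."""
    return all(
        transitions[src].get(dst, 0.0) > 0.0
        for src in transitions
        for dst in transitions
    )
```

Unlike the left-to-right network used for utterance detection, no transition is forbidden here, so the decoder can move between speech, acoustic event, and silence in any order.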

FIG. 5 is a diagram showing the acoustic event section detection processing flow of the acoustic event section detection unit 14, and shows the detailed processing of step S5 in FIG. 3. First, every time audio data D1 is input, the acoustic event section detection unit 14 divides the audio data D1 into input frames D11, each of which is one processing unit of a predetermined duration.

From the input frames D11 that have not yet been processed, the acoustic event section detection unit 14 acquires a predetermined number of input frames D11 in chronological order (step S51). The acoustic event section detection unit 14 calculates the acoustic features of each acquired input frame D11. The acoustic event section detection unit 14 reads from the storage unit 10 the GMMs for speech, acoustic event, and silence, which are the states of the HMM. The acoustic event section detection unit 14 matches these GMMs against the acoustic features of each input frame D11 to calculate the acoustic score of each input frame D11, and makes transitions between HMM states where necessary (step S52). If the number of input frames required for traceback has not yet been processed (step S53: NO), the acoustic event section detection unit 14 returns to step S51, acquires new input frames D11, and calculates their acoustic scores.

Once the number of input frames required for traceback has been processed (step S53: YES), the acoustic event section detection unit 14 obtains by traceback a list of the state sequences leading to the current state (step S54). That is, the acoustic event section detection unit 14 traces the record of state transitions back from the current state toward the start state and, using the ergodic HMM shown in FIG. 4, calculates the accumulated acoustic score of each state sequence from the state of the input frame D11 at the start of processing (the start state) to the current state. At this point, the acoustic event section detection unit 14 sorts the sequences in descending order of accumulated acoustic score.

From the HMM state sequences obtained by traceback, the acoustic event section detection unit 14 compares the first-ranked sequence with the second-ranked sequence (step S55). If the difference between their accumulated acoustic scores is equal to or less than a predetermined threshold, the acoustic event section detection unit 14 judges that the section cannot yet be confirmed (step S56: NO), returns to step S51, and calculates acoustic scores for new input frames D11. If the difference between the accumulated acoustic scores exceeds the predetermined threshold (step S56: YES), the acoustic event section detection unit 14 takes the first-ranked sequence as the confirmed section. Finally, the acoustic event section detection unit 14 outputs, as acoustic event section data D4, the frame sequence obtained by gathering the frames of the confirmed acoustic event section (step S57).
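The confirmation test in steps S54 to S56 can be sketched roughly as follows. This is one possible reading, not the patented implementation: it assumes the candidate list holds the best Viterbi path ending in each state, and the function names, the score format (one dict of per-state log likelihoods per frame), and the threshold values are all hypothetical.

```python
def viterbi_candidates(frame_scores, log_trans, states):
    """Best path ending in each state, sorted by accumulated score (best first).

    frame_scores: list of {state: log-likelihood}, one dict per input frame.
    log_trans:    {src: {dst: log transition probability}}.
    """
    best = {s: (frame_scores[0][s], [s]) for s in states}
    for scores in frame_scores[1:]:
        updated = {}
        for dst in states:
            prev_score, prev_path = max(
                ((best[src][0] + log_trans[src][dst], best[src][1]) for src in states),
                key=lambda cand: cand[0],
            )
            updated[dst] = (prev_score + scores[dst], prev_path + [dst])
        best = updated
    return sorted(best.values(), key=lambda cand: cand[0], reverse=True)


def confirm_section(candidates, threshold):
    """Return the top path if it beats the runner-up by more than threshold, else None."""
    (top_score, top_path), (second_score, _) = candidates[0], candidates[1]
    return top_path if top_score - second_score > threshold else None
```

A `None` result corresponds to step S56: NO, where the detector goes back to step S51 and scores more frames before trying again.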

In FIG. 3, the acoustic event recognition unit 15 uses the neural network for acoustic event recognition stored in the storage unit 10 to recognize acoustic events from the acoustic event section data D4 obtained by the acoustic event section detection unit 14 (step S6). To do so, the acoustic event recognition unit 15 first concatenates the acoustic event frame sequences that make up the acoustic event section data D4 until their total length reaches a predetermined N frames or more. This is because a frame sequence that is too short makes it difficult to capture how the frequency characteristics of an acoustic event change over time, and therefore difficult to estimate the acoustic event accurately. Once frame concatenation yields an input frame sequence of N or more frames, the acoustic event recognition unit 15 performs acoustic event recognition using the neural network stored in the storage unit 10.

FIG. 6 is a diagram showing the neural network for acoustic event recognition stored in the storage unit 10. As shown in the figure, in the present embodiment the acoustic event recognition unit 15 uses a convolutional neural network, which is one type of neural network, for acoustic event recognition. An example of a convolutional neural network is described in Andrew L. Maas et al., "Word-level Acoustic Modeling with Convolutional Vector Regression", ICML Representation Learning Workshop, 2012.

The convolutional neural network shown in the figure consists of four layers: an input layer, a hidden layer, a pooling layer, and an output layer. The input layer corresponds to the chronologically ordered frames output by the acoustic event section detection unit 14, and the values of the input layer are time-frequency acoustic features, such as the mel-frequency cepstrum, obtained from the corresponding frames. Each acoustic feature is represented, for example, as a vector. In the present embodiment, the total number of frames of acoustic features in the input layer, N_s (≥ N), is variable.

Each unit (element) of the hidden layer is connected to only n_s consecutive frames out of the N_s frames (elements) of the input layer. The n_s input frames connected to a given hidden unit correspond to later times than the n_s frames connected to the preceding adjacent unit, but the window is shifted by k frames at a time (k < n_s) so that the two sets partly overlap. For example, suppose that the i-th to (i+2)-th frames of the input layer are connected to the i-th unit of the hidden layer. The value of the i-th hidden unit is then the sum (convolution operation) of the values of the i-th to (i+2)-th input frames. The connection weights (the weights used in the sum) between the i-th hidden unit and the i-th, (i+1)-th, and (i+2)-th input frames need not be equal to one another. For example, the 1st to 3rd input frames are connected to the 1st hidden unit, the 2nd to 4th input frames to the 2nd hidden unit, and the 3rd to 5th input frames to the 3rd hidden unit. In this case, (connection weight from the 1st input frame to the 1st hidden unit) = (connection weight from the 2nd input frame to the 2nd hidden unit) = (connection weight from the 3rd input frame to the 3rd hidden unit) = ... . Similarly, (connection weight from the 2nd input frame to the 1st hidden unit) = (connection weight from the 3rd input frame to the 2nd hidden unit) = (connection weight from the 4th input frame to the 3rd hidden unit) = ... . In other words, the connections between hidden units and input frames use the same connection weights at every position while keeping a shift of k frames. The number of hidden units N_h depends on the number of units in the input layer.
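The weight sharing described above can be illustrated with a small pure-Python sketch. It assumes a single filter, no padding, and no nonlinearity, and takes N_h = (N_s - n_s) // k + 1; the description says only that N_h depends on the input length, so that formula and the name `conv_hidden_layer` are assumptions.

```python
def conv_hidden_layer(frames, window_weights, shift):
    """Compute hidden-unit values with one weight vector per window position.

    frames:         list of N_s feature vectors (lists of floats).
    window_weights: list of n_s weight vectors, reused for every hidden unit.
    shift:          k, the number of frames the window moves between units.
    """
    n_s = len(window_weights)
    n_hidden = (len(frames) - n_s) // shift + 1  # assumed: no padding at the edges
    hidden = []
    for unit in range(n_hidden):
        start = unit * shift
        total = 0.0
        for offset, weights in enumerate(window_weights):
            frame = frames[start + offset]
            total += sum(w * x for w, x in zip(weights, frame))
        hidden.append(total)
    return hidden
```

With `shift=1` and three window positions this reproduces the 1st to 3rd, 2nd to 4th, 3rd to 5th frame pattern from the example, and every unit uses the same three weight vectors.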

The pooling layer above the hidden layer consists of a predetermined, fixed number of units N_p. Each unit of the pooling layer is connected to a variable number n_h = N_p / N_h of the hidden-layer units. The connection between pooling-layer units and hidden-layer units has the property that, among the values of the hidden units connected to the same pooling unit, only the maximum value is propagated to the pooling layer.
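A minimal sketch of max pooling from a variable number of hidden units to a fixed number of pooling units follows. How the hidden units are divided among pooling units when the counts do not divide evenly is not stated in the description; the contiguous, near-equal grouping below and the name `max_pool_fixed` are assumptions.

```python
def max_pool_fixed(hidden, n_pool):
    """Reduce a variable-length list of hidden values to exactly n_pool maxima."""
    n_hidden = len(hidden)
    if n_hidden < n_pool:
        raise ValueError("need at least one hidden unit per pooling unit")
    pooled = []
    for p in range(n_pool):
        # Assumed grouping: contiguous blocks whose sizes differ by at most one.
        start = p * n_hidden // n_pool
        end = (p + 1) * n_hidden // n_pool
        pooled.append(max(hidden[start:end]))
    return pooled
```

Because the output length is always `n_pool`, the layers above can use fixed-size weight matrices even though the input frame count varies.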

The pooling layer and the output layer are fully connected: every unit of one is connected to every unit of the other. The values of the output layer are calculated by applying to the values of the pooling layer a weight coefficient matrix, which represents the weight between each pooling-layer unit and each output-layer unit, and then normalizing the output of each output-layer unit with the softmax function. Each unit of the output layer represents a text expression (character string) corresponding to an acoustic event, and gives the posterior probability of that text expression given the acoustic features.
In the present embodiment the pooling layer is connected directly to the output layer, but any number of hidden layers and pooling layers can be inserted between them.
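The output-layer computation described above (a weight matrix applied to the pooled values, followed by softmax normalization) can be sketched as below. The function name `output_posteriors` and the absence of a bias term are assumptions; subtracting the maximum logit is a standard numerical-stability step and does not change the result.

```python
import math


def output_posteriors(pooled, weight_matrix):
    """Return softmax-normalized posteriors, one per output unit.

    weight_matrix: one row per output unit, one column per pooling unit.
    """
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in weight_matrix]
    peak = max(logits)  # shift for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```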

FIG. 7 is a diagram showing the acoustic event recognition processing flow of the acoustic event recognition unit 15, and shows the detailed processing of step S6 in FIG. 3.
So that the input features of the convolutional neural network are sufficiently long, the acoustic event recognition unit 15 concatenates in chronological order the frame sequences of the acoustic event section data D4 output by the acoustic event section detection unit 14, and generates an input frame sequence (step S61). If the length of the input frame sequence has not reached N (step S62: NO), the acoustic event recognition unit 15 returns to step S61 and concatenates the frame sequences of new acoustic event section data D4 until an input frame sequence of N or more frames is obtained. When the length of the input frame sequence reaches the N frames or more required for acoustic event recognition (step S62: YES), the acoustic event recognition unit 15 performs acoustic event recognition with the convolutional neural network stored in the storage unit 10 (step S63). The acoustic event recognition unit 15 calculates the acoustic features of each frame in the input frame sequence. The acoustic event recognition unit 15 then feeds the acoustic features calculated for each frame into the input layer of the convolutional neural network shown in FIG. 6 and calculates the values of the units of the hidden layer, the pooling layer, and the output layer.
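The buffering loop of steps S61 and S62 can be sketched as a generator. It assumes that the buffer is emptied after each recognition pass, which the description does not state; `collect_input_frames` and `min_frames` (standing in for N) are hypothetical names.

```python
def collect_input_frames(sections, min_frames):
    """Concatenate section frame lists in time order; yield once N frames are reached.

    sections:   iterable of frame lists (each standing in for one D4 output).
    min_frames: N, the minimum sequence length needed for recognition.
    """
    buffer = []
    for section in sections:
        buffer.extend(section)         # step S61: frame concatenation
        if len(buffer) >= min_frames:  # step S62: YES
            yield buffer
            buffer = []                # assumed: start a fresh sequence afterwards
```

Each yielded list may be longer than N, which matches the variable input length N_s (≥ N) of the network.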

Finally, the acoustic event recognition unit 15 selects units of the output layer of the convolutional neural network based on the posterior probability given by the output of each unit. For example, the acoustic event recognition unit 15 may select a predetermined number of units in descending order of posterior probability, may select the units whose posterior probability is at or above a predetermined value, or may select up to a predetermined number of units in descending order of posterior probability from among those at or above a predetermined value. The storage unit 10 stores in advance, in association with each other, the number of each output-layer unit and the text expression that the user has chosen for the acoustic event represented by that unit. The acoustic event recognition unit 15 reads from the storage unit 10 the text expressions of the acoustic events corresponding to the selected units.
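The three selection policies described above (top-N, threshold, or threshold followed by top-N) can be expressed with one helper. The unit-to-text table `UNIT_TEXT` and both function names are made up for illustration; the actual text expressions are those stored in the storage unit 10.

```python
# Hypothetical mapping from output-unit number to annotation text.
UNIT_TEXT = {0: "(applause)", 1: "(laughter)", 2: "(music)"}


def select_units(posteriors, top_k=None, min_posterior=None):
    """Return (unit, posterior) pairs in descending order of posterior probability."""
    ranked = sorted(enumerate(posteriors), key=lambda item: item[1], reverse=True)
    if min_posterior is not None:
        ranked = [(u, p) for u, p in ranked if p >= min_posterior]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked


def annotation_candidates(posteriors, **policy):
    """Attach a rank (1 = most probable) to the text of each selected unit."""
    selected = select_units(posteriors, **policy)
    return [(rank, UNIT_TEXT[unit]) for rank, (unit, _) in enumerate(selected, start=1)]
```

For instance, `annotation_candidates(p, top_k=2)` keeps the two most probable expressions, while `annotation_candidates(p, min_posterior=0.15, top_k=1)` applies the combined policy.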

In the present embodiment, text expressions of acoustic events that follow the classification shown in Tables 1 to 5 below are used.

Tables 1 to 5 show examples of text expressions for the corresponding acoustic events, but it is difficult to determine a unique text expression for a given acoustic event. Therefore, the text of past closed-caption broadcasts is analyzed, and frequently occurring, representative expressions are chosen as the text expressions. For example, these are the expressions that appear as stage directions (notes explaining the scene) in closed-caption broadcasts. The number of each output-layer unit and the text expression chosen for the acoustic event represented by that unit are then stored in the storage unit 10 in association with each other.

In FIG. 3, the acoustic event recognition unit 15 ranks the text expressions of the acoustic events it has read in descending order of posterior probability. The acoustic event recognition unit 15 sets the ranked text expressions of the acoustic events, referred to as annotation character strings, in the acoustic event recognition result data D5 and outputs the data to the recognition result correction unit 16 (step S7).

The recognition result correction unit 16 integrates the speech recognition result indicated by the speech recognition result data D3 with the annotation character strings indicated by the acoustic event recognition result data D5 to create the final broadcast captions (step S8). The speech recognition apparatus 1 of the present embodiment provides an interface with which both tasks, correcting the recognition result and inserting annotations, can be carried out efficiently. There are two ways of providing this interface, as follows.

The first way of providing the interface is for the corrector to insert annotations while correcting the recognition result. The recognition result is corrected based on operator input, using the correction terminal 5 realized by a computer device equipped with a touch panel.

FIG. 8 shows the correction work screen 8, a computer display screen shown on the display unit 52 of the correction terminal 5. The correction work screen 8 includes a speech recognition result display window 80, an acoustic event recognition result display window 83, an acoustic event recognition result candidate window 86, and a history display window 87.
The speech recognition result display window 80 displays the speech recognition result and the character strings obtained by correcting the speech recognition result or inserting annotation character strings into it. The acoustic event recognition result display window 83 displays annotation character strings. The annotation character string displayed in the acoustic event recognition result display window 83 is the one with the highest rank set in the acoustic event recognition result data D5. The acoustic event recognition result candidate window 86 displays candidate annotation character strings. The candidates are the annotation character strings ranked second or lower in the acoustic event recognition result data D5. The history display window 87 displays the corrected character strings for the speech recognition result.

The recognition result correction unit 16 of the speech recognition apparatus 1 outputs the speech recognition result data D3 from the speech recognition unit 13 and the acoustic event recognition result data D5 from the acoustic event recognition unit 15 to the correction terminals 5 as they become available. At this time, the recognition result correction unit 16 also outputs the audio data D1 corresponding to the speech recognition result data D3 to the correction terminals 5. The recognition result correction unit 16 treats the speech recognition result indicated by the speech recognition result data D3 output to the correction terminals 5 as the working caption.

The control unit 51 of each correction terminal 5 outputs the reproduced sound of the received audio data D1 from the audio output unit 54. The control unit 51 displays the speech recognition result read from the received speech recognition result data D3 on the bottom row of the speech recognition result display window 80 as the character string to be corrected. The control unit 51 displays the speech recognition result as a character string with a vertical bar between words. If corrected speech recognition results already fill the speech recognition result display window 80 down to the bottom row, the control unit 51 erases the earliest of the displayed corrected speech recognition results. After erasing it, the control unit 51 moves the remaining corrected speech recognition results up one row and displays the speech recognition result read from the received speech recognition result data D3 on the bottom row of the speech recognition result display window 80.
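The scrolling behavior of the speech recognition result display window 80 resembles a fixed-length queue, which can be sketched with `collections.deque`. The class name `RecognitionWindow`, its methods, and the row count are illustrative assumptions, not part of the described terminal.

```python
from collections import deque


class RecognitionWindow:
    """Keep the most recent rows; the newest result is always the bottom row."""

    def __init__(self, rows):
        # A deque with maxlen drops the oldest row automatically when full.
        self.lines = deque(maxlen=rows)

    def show(self, words):
        """Append a recognition result, with words separated by vertical bars."""
        self.lines.append("|".join(words))

    def render(self):
        """Return the rows from top to bottom."""
        return list(self.lines)
```

Appending to a full window discards the earliest row and shifts the rest up, which mirrors the erase-and-move-up sequence described above.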

The control unit 51 of each correction terminal 5 also displays annotation character strings in the acoustic event recognition result display window 83 with the latest one at the right end. That is, each time new acoustic event recognition result data D5 is received from the speech recognition apparatus 1, the control unit 51 shifts the annotation character strings displayed in the acoustic event recognition result display window 83 to the left. The control unit 51 then displays the highest-ranked annotation character string read from the newly received acoustic event recognition result data D5 at the right end of the acoustic event recognition result display window 83. The control unit 51 also displays, as a menu in the acoustic event recognition result candidate window 86, the annotation character strings ranked second or lower in the received acoustic event recognition result data D5.
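
The handling of ranked acoustic event results can be sketched in the same way. The sketch assumes that each piece of acoustic event recognition result data D5 arrives as a list of annotation strings ordered best first. The class name `AcousticEventTicker`, the `width` parameter, and the sample annotations other than "(laughter)" are hypothetical.

```python
from collections import deque


class AcousticEventTicker:
    """Hypothetical model of windows 83 (ticker) and 86 (candidate menu)."""

    def __init__(self, width: int) -> None:
        # Leftmost element is the oldest annotation, rightmost the newest.
        self.shown = deque(maxlen=width)
        self.candidates = []

    def receive(self, ranked_annotations):
        # ranked_annotations is assumed to be ordered best first.
        if not ranked_annotations:
            return
        # Appending on the right shifts the earlier annotations to the left.
        self.shown.append(ranked_annotations[0])
        # Annotations ranked second or lower go to the candidate menu.
        self.candidates = list(ranked_annotations[1:])


ticker = AcousticEventTicker(width=3)
ticker.receive(["(laughter)", "(applause)", "(music)"])
ticker.receive(["(applause)", "(cheering)"])
print(list(ticker.shown), ticker.candidates)
```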

Speech recognition results are corrected as follows. While listening to the program audio, the corrector touches, with a finger or the like, the part of the character string shown by the display unit 52 in the speech recognition result display window 80 that contains the characters to be corrected. The corrector may slide the finger to touch multiple characters. The input unit 53 outputs the screen position at which contact was detected to the control unit 51. The control unit 51 selects the word containing the character displayed at that screen position and transmits indication information identifying the selected word to the speech recognition apparatus 1. For example, the time at which the word was uttered can be used as the indication information. Suppose that the recognition result correction unit 16 of the speech recognition apparatus 1 receives the indication information from the correction terminal 5-1 first. The recognition result correction unit 16 instructs each correction terminal 5 to change the display of the character string indicated by the indication information received from the correction terminal 5-1 to a selection color such as red. The control unit 51 of each correction terminal 5 changes the display of the selected character string to the selection color in accordance with the instruction from the speech recognition apparatus 1. In addition, the recognition result correction unit 16 instructs the correction terminal 5-2 to apply a correction guard together with the change to the selection color. While the correction guard is in effect, neither correction work nor annotation insertion work can be performed on the correction terminal 5-2.
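
The first-come arbitration and the correction guard can be sketched as a small lock held by the recognition result correction unit 16. The class `SelectionArbiter`, its method names, and the terminal identifiers used as strings are illustrative assumptions. The specification only states that the terminal whose indication information arrives first gets to work and that the other terminal is guarded until that work ends.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class SelectionArbiter:
    """Hypothetical first-come lock over the working subtitle."""

    terminals: Set[str]
    owner: Optional[str] = None                # terminal currently correcting
    selected_word_time: Optional[float] = None  # utterance time as word identifier
    guarded: Set[str] = field(default_factory=set)

    def indicate(self, terminal_id: str, word_time: float) -> bool:
        # Only the first indication is accepted; later ones are rejected
        # until the owner finishes.
        if self.owner is not None:
            return False
        self.owner = terminal_id
        self.selected_word_time = word_time
        # Every other terminal receives the correction guard.
        self.guarded = self.terminals - {terminal_id}
        return True

    def finish(self, terminal_id: str) -> None:
        # Finishing the work releases the guard on the other terminals.
        if terminal_id == self.owner:
            self.owner = None
            self.selected_word_time = None
            self.guarded = set()


arbiter = SelectionArbiter(terminals={"5-1", "5-2"})
first = arbiter.indicate("5-1", word_time=12.3)   # accepted
second = arbiter.indicate("5-2", word_time=12.3)  # rejected, 5-2 is guarded
guarded_during_work = set(arbiter.guarded)
arbiter.finish("5-1")
```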

The corrector using the correction terminal 5-1 uses the input unit 53 to perform correction work such as replacement, insertion, or deletion on the character string displayed in the selection color. For example, the corrector types characters on the keyboard while a word is selected. When the correction work is finished, the corrector presses a key such as Enter on the keyboard as the correction end operation. On receiving the correction end operation, the control unit 51 transmits the content of the correction work to the speech recognition apparatus 1. The recognition result correction unit 16 of the speech recognition apparatus 1 corrects the selected character string in the working subtitle in accordance with the correction content received from the correction terminal 5-1 and generates a new working subtitle. The recognition result correction unit 16 transmits the new working subtitle, together with the character string that the corrector typed on the keyboard during the correction work, to each correction terminal 5. The control unit 51 of each correction terminal 5 replaces the speech recognition result displayed in the speech recognition result display window 80 with the working subtitle received from the speech recognition apparatus 1. The control unit 51 of each correction terminal 5 also displays the character string that the corrector typed on the keyboard in the history display window 87 as a history of the work. The correction terminal 5-2 releases the correction guard.

Annotations are inserted as follows. While listening to the program audio, the corrector inserts any annotation character string displayed in the acoustic event recognition result display window 83 at any position in the character string displayed in the speech recognition result display window 80.
For example, to insert the annotation character string 84 "(laughter)" displayed in the acoustic event recognition result display window 83 immediately after the speech recognition result (or corrected speech recognition result) "You are good at cooking." indicated by the character string 81, the corrector performs the following operation. The corrector touches the last character "." of the character string 81 after which the annotation character string is to be inserted. The input unit 53 outputs the screen position at which contact was detected to the control unit 51. The control unit 51 selects the word "." containing the character displayed at that screen position and transmits indication information identifying the selected word to the speech recognition apparatus 1. In this case, the indication information indicates the annotation insertion position. Suppose that the recognition result correction unit 16 of the speech recognition apparatus 1 receives the indication information from the correction terminal 5-1 first. The recognition result correction unit 16 instructs each correction terminal 5 to change the display of the character string indicated by the indication information received from the correction terminal 5-1 to a selection color such as red. The control unit 51 of each correction terminal 5 changes the display of the selected character string to the selection color in accordance with the instruction from the speech recognition apparatus 1. In addition, the recognition result correction unit 16 instructs the correction terminal 5-2 to apply a correction guard together with the change to the selection color.

The corrector using the correction terminal 5-1 presses the "Insert" key on the keyboard and then touches any character of the annotation character string 84 "(laughter)". The input unit 53 outputs the press of the "Insert" key and the screen position at which contact was detected to the control unit 51. The control unit 51 determines which annotation character string contains the character displayed at that screen position and transmits, to the speech recognition apparatus 1, insertion annotation information that either identifies that annotation character string or contains the annotation character string itself. The indication information transmitted earlier and the insertion annotation information together correspond to an annotation insertion instruction. The recognition result correction unit 16 of the speech recognition apparatus 1 inserts the annotation character string identified or indicated by the insertion annotation information immediately after the selected word "." in the working subtitle and generates the new working subtitle "You are good at cooking. (laughter)". The recognition result correction unit 16 transmits the new working subtitle to each correction terminal 5. The control unit 51 of each correction terminal 5 replaces the speech recognition result (or corrected speech recognition result) displayed in the speech recognition result display window 80 with the working subtitle received from the speech recognition apparatus 1. The correction terminal 5-2 releases the correction guard.
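
How the recognition result correction unit 16 might apply an annotation insertion instruction can be sketched as follows. The sketch assumes that the working subtitle is held as a list of words, each tagged with its utterance time, and that the indication information carries that time, which the description names as one possible identifier. The `Word` type, the function `insert_annotation`, the word segmentation, and the time values are hypothetical. The sentence is the original-language example shown as character string 81.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    time: float  # utterance time in seconds (made-up values below)


def insert_annotation(subtitle: List[Word], word_time: float,
                      annotation: str) -> List[Word]:
    """Return a new working subtitle with the annotation placed
    immediately after the word identified by word_time."""
    for i, word in enumerate(subtitle):
        if word.time == word_time:
            # The annotation inherits the time of the word it follows.
            note = Word(annotation, word.time)
            return subtitle[: i + 1] + [note] + subtitle[i + 1:]
    raise ValueError("no word at the indicated time")


working = [Word("お料理", 0.0), Word("が", 0.4), Word("上手", 0.5),
           Word("です", 0.9), Word("ね", 1.1), Word("。", 1.2)]
# Indication information: the time of the word "。"; inserted annotation: "（笑い）".
updated = insert_annotation(working, word_time=1.2, annotation="（笑い）")
print("".join(w.text for w in updated))
```

Returning a new list instead of editing in place mirrors the description, in which a new working subtitle is generated and then sent to every correction terminal.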

To insert the annotation character string "(laughter)", the corrector may touch any character of the annotation character string 85 "(laughter)" instead of the annotation character string 84 "(laughter)" displayed in the acoustic event recognition result display window 83.
Likewise, to insert an annotation character string immediately after the corrected recognition result "What is Mr. XX's hobby?" indicated by the character string 82 displayed in the speech recognition result display window 80, the corrector touches the last character of the character string 82.

If the acoustic event recognition result is wrong, the correct annotation character string cannot be selected from the acoustic event recognition result display window 83. In that case, the operator selects the annotation character string to insert from the list of candidate annotation character strings displayed as a menu in the acoustic event recognition result candidate window 86.

The second interface providing method inserts annotation character strings when the corrected character strings are decorated. In subtitle production for information programs and sports broadcasts, subtitles are color-coded in white, blue, yellow, and so on according to the speaker (program performer). The color coding is often done on the corrected subtitles by a separate operator. In this case, on the screen shown in FIG. 8, instead of correcting the character strings, the operator designates an appropriate color for each line of the displayed character strings and, at the same time, inserts the appropriate acoustic event recognition result from the acoustic event recognition result display window 83. The following describes the case in which the correction terminal 5-1 is used to correct the speech recognition results and the correction terminal 5-2 is used to decorate the corrected speech recognition results, focusing on the differences from the first interface providing method.

The recognition result correction unit 16 of the speech recognition apparatus 1 outputs the speech recognition result data D3 from the speech recognition unit 13, the corresponding audio data D1, and the acoustic event recognition result data D5 from the acoustic event recognition unit 15 to the correction terminals 5 as they become available. The control unit 51 of each correction terminal 5 plays back the received audio data D1 through the audio output unit 54 and displays the correction work screen 8 shown in FIG. 8. The correction of speech recognition results by the corrector at the correction terminal 5-1 is the same as in the first interface providing method. However, while the recognition result correction unit 16 of the speech recognition apparatus 1 transmits a correction guard to any other correction terminal 5 that corrects speech recognition results, it need not transmit a correction guard to the correction terminal 5-2, which decorates the corrected speech recognition results.

Subsequently, the recognition result correction unit 16 of the speech recognition apparatus 1 outputs newly generated speech recognition result data D3 and the corresponding audio data D1 to the speech recognition apparatus 1. The control unit 51 of each correction terminal 5 plays back the newly received audio data D1 through the audio output unit 54. Furthermore, as in the first interface providing method, the control unit 51 displays the speech recognition result read from the received speech recognition result data D3 on the bottom line of the speech recognition result display window 80 as the character string to be corrected.

While listening to the program audio, the corrector at the correction terminal 5-2 touches, with a finger or the like, the part of the character strings shown by the display unit 52 in the speech recognition result display window 80 that contains the corrected speech recognition result whose color is to be changed (for example, the character string 82), and inputs a character color. The character color may be input from the keyboard or the like, or by touching a color selection button provided in the speech recognition result display window 80. The input unit 53 outputs the screen position at which contact was detected to the control unit 51. The control unit 51 selects the line containing the character displayed at that screen position and transmits, to the speech recognition apparatus 1, decoration information indicating the information that identifies the selected line and the input character color. The recognition result correction unit 16 of the speech recognition apparatus 1 changes the character string of the line in the working subtitle indicated by the decoration information received from the correction terminal 5-2 to the character color indicated by the decoration information, and generates a new working subtitle. The recognition result correction unit 16 instructs each correction terminal 5 to display the character string of the selected line in the changed character color. The control unit 51 of each correction terminal 5 displays the character string of the designated line (corrected speech recognition result) in the speech recognition result display window 80 in the changed character color in accordance with the instruction from the speech recognition apparatus 1.
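
Applying decoration information can be sketched as a per-line color update. The sketch assumes that the decoration information amounts to a line index plus a color name. The `SubtitleLine` type and the function `apply_decoration` are hypothetical, and white is used as the default color only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubtitleLine:
    text: str
    color: str = "white"  # illustrative default, not stated in the specification


def apply_decoration(lines: List[SubtitleLine], line_index: int,
                     color: str) -> List[SubtitleLine]:
    """Return a new working subtitle in which only the indicated line
    has its character color changed."""
    return [SubtitleLine(line.text, color) if i == line_index else line
            for i, line in enumerate(lines)]


subtitle = [SubtitleLine("You are good at cooking."),
            SubtitleLine("What is Mr. XX's hobby?")]
decorated = apply_decoration(subtitle, line_index=1, color="yellow")
print([(line.text, line.color) for line in decorated])
```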

Furthermore, the corrector at the correction terminal 5-2 inserts any annotation character string displayed in the acoustic event recognition result display window 83 at any position in the corrected speech recognition results displayed in the speech recognition result display window 80.
For example, to insert the annotation character string 84 "(laughter)" immediately after the corrected speech recognition result "What is Mr. XX's hobby?" indicated by the character string 82, the corrector presses the "Insert" key on the keyboard and then touches the last character of the character string 82. The input unit 53 outputs the press of the "Insert" key and the screen position at which contact was detected to the control unit 51. The control unit 51 selects the word containing the character displayed at that screen position and generates annotation insertion position information identifying the selected word. The corrector then touches any character of the annotation character string 84 "(laughter)". The input unit 53 outputs the screen position at which contact was detected to the control unit 51. The control unit 51 determines which annotation character string contains the character displayed at that screen position and generates insertion annotation information that either identifies that annotation character string or contains the annotation character string itself. The control unit 51 transmits an annotation insertion instruction containing the annotation insertion position information and the insertion annotation information to the speech recognition apparatus 1. The recognition result correction unit 16 of the speech recognition apparatus 1 uses the annotation insertion position information to identify the word in the working subtitle after which the annotation is to be inserted. The recognition result correction unit 16 inserts the annotation character string identified or indicated by the insertion annotation information immediately after that word in the working subtitle and generates a new working subtitle. The recognition result correction unit 16 transmits the new working subtitle to each correction terminal 5. The control unit 51 of each correction terminal 5 replaces the corrected speech recognition result displayed in the speech recognition result display window 80 with the working subtitle received from the speech recognition apparatus 1.

In FIG. 2, the recognition result correction unit 16 of the speech recognition apparatus 1 outputs annotated broadcast subtitle data D6 containing the working subtitle in which the above correction of speech recognition results and insertion of annotations are reflected (step S9). The annotated broadcast subtitle data D6 is superimposed on the broadcast wave in the broadcasting station and broadcast.

As described above, the corrector can produce annotated subtitles by inserting annotations, which are text representations of acoustic events, into the speech recognition results with simple operations. The work is therefore far more efficient than inserting annotation character strings by keyboard input.

If the subtitle production system has only one correction terminal 5, then in the first interface providing method the recognition result correction unit 16 of the speech recognition apparatus 1 does not perform those of the operations described above that involve correction terminals 5 other than the one that transmitted the indication information first.
The recognition result correction unit 16 may also output the acoustic event recognition result data D5 to the correction terminals 5 for display only at the timing when the acoustic event recognition result changes. This prevents the same annotation character string from being displayed repeatedly in succession in the acoustic event recognition result display window 83.
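
Output only on change can be sketched as a filter that drops consecutive duplicates. The function name `emit_on_change` and the sample annotation strings are illustrative assumptions.

```python
from typing import Iterable, Iterator, Optional


def emit_on_change(results: Iterable[str]) -> Iterator[str]:
    """Yield an acoustic event result only when it differs from the
    immediately preceding one."""
    previous: Optional[str] = None
    for result in results:
        if result != previous:
            yield result
        previous = result


stream = ["(laughter)", "(laughter)", "(applause)", "(laughter)"]
sent = list(emit_on_change(stream))
print(sent)
```

Only consecutive repeats are suppressed, so an event that recurs after a different event is sent again.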

According to the present embodiment, the speech recognition apparatus 1 performs acoustic event recognition in parallel with conventional speech recognition and displays both recognition results on the correction terminal 5, and the corrector designates, from the display on the correction terminal 5, the annotation insertion position and the annotation to insert (the text representation of an acoustic event). This greatly reduces the burden of producing annotated subtitles by hand. In addition, because the speech recognition apparatus 1 can obtain text representations of various types of acoustic events as recognition results, inserting those text representations into subtitles as annotations enables richer subtitle expression.

The speech recognition apparatus 1 described above contains a computer system. The steps of the operation of the speech recognition apparatus 1 are stored in a computer-readable recording medium in the form of a program, and the above processing is performed when the computer system reads and executes this program. The computer system referred to here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

The "computer system" also includes a homepage providing environment (or display environment) if a WWW system is used.
The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" further includes media that hold the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or a communication line such as a telephone line, and media that hold the program for a certain period of time, such as volatile memory inside a computer system serving as a server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS 1 ... speech recognition apparatus, 5 ... correction terminal, 10 ... storage unit, 11 ... audio branching unit, 12 ... speech section detection unit, 13 ... speech recognition unit, 14 ... acoustic event section detection unit, 15 ... acoustic event recognition unit, 16 ... recognition result correction unit, 51 ... control unit, 52 ... display unit, 53 ... input unit, 54 ... audio output unit

Claims

A speech recognition apparatus comprising:
a speech recognition unit that performs speech recognition on audio data and outputs character string data indicating the utterance content of the speech recognition result;
an acoustic event recognition unit that calculates a posterior probability of an acoustic event based on acoustic feature amounts obtained from the audio data and outputs character string data representing the acoustic event detected based on the calculated posterior probability; and
a recognition result correction unit that causes a correction terminal to display the character string data of the utterance content output by the speech recognition unit and the character string data representing the acoustic event output by the acoustic event recognition unit, receives from the correction terminal an annotation insertion instruction indicating an annotation insertion position in the character string of the utterance content designated from the display and a character string representing the acoustic event selected from the display, and, in accordance with the received annotation insertion instruction, generates annotated subtitle data in which the character string data representing the acoustic event is inserted into the character string data indicating the utterance content,
wherein the acoustic event recognition unit calculates the posterior probability of the acoustic event by inputting, into a convolutional neural network, the acoustic feature amounts of the respective frames, in time order, obtained by dividing the audio data,
the convolutional neural network has an input layer, a hidden layer, a pooling layer, and an output layer,
the input layer takes as input the acoustic feature amounts of the respective frames arranged in time order,
each unit of the hidden layer is coupled to a predetermined number of frames of the input layer while maintaining a shift of a predetermined number of frames, and indicates the result of a convolution operation on the acoustic feature amounts of the coupled frames of the input layer,
each unit of the pooling layer is coupled to a number of units of the hidden layer corresponding to the number of units of the pooling layer, and propagates the maximum value of the coupled units of the hidden layer, and
each unit of the output layer corresponds to a different type of acoustic event and is coupled to all the units of the pooling layer by respective weights for calculating the posterior probability of the corresponding acoustic event.

The speech recognition apparatus according to claim 1, further comprising an acoustic event section detection unit that divides the audio data into frames and compares the acoustic feature amount of each frame with the acoustic feature amounts of silence, acoustic events, and spoken language to detect a section containing an acoustic event,
wherein the acoustic event recognition unit calculates the posterior probability of the acoustic event based on the acoustic feature amounts obtained from the audio data of the section detected by the acoustic event section detection unit, and outputs character string data representing the acoustic event detected based on the calculated posterior probability.

The speech recognition apparatus according to claim 1 or claim 2, wherein the acoustic feature amount is a feature amount in the time-frequency domain.

A program for causing a computer to function as a speech recognition apparatus comprising:
speech recognition means for performing speech recognition on audio data and outputting character string data indicating the utterance content of the speech recognition result;
acoustic event recognition means for calculating a posterior probability of an acoustic event based on acoustic feature amounts obtained from the audio data and outputting character string data representing the acoustic event detected based on the calculated posterior probability; and
recognition result correction means for causing a correction terminal to display the character string data of the utterance content output by the speech recognition means and the character string data representing the acoustic event output by the acoustic event recognition means, receiving from the correction terminal an annotation insertion instruction indicating an annotation insertion position in the character string of the utterance content designated from the display and a character string representing the acoustic event selected from the display, and, in accordance with the received annotation insertion instruction, generating annotated subtitle data in which the character string data representing the acoustic event is inserted into the character string data indicating the utterance content,
wherein the acoustic event recognition means calculates the posterior probability of the acoustic event by inputting, into a convolutional neural network, the acoustic feature amounts of the respective frames, in time order, obtained by dividing the audio data,
the convolutional neural network has an input layer, a hidden layer, a pooling layer, and an output layer,
the input layer takes as input the acoustic feature amounts of the respective frames arranged in time order,
each unit of the hidden layer is coupled to a predetermined number of frames of the input layer while maintaining a shift of a predetermined number of frames, and indicates the result of a convolution operation on the acoustic feature amounts of the coupled frames of the input layer,
each unit of the pooling layer is coupled to a number of units of the hidden layer corresponding to the number of units of the pooling layer, and propagates the maximum value of the coupled units of the hidden layer, and
each unit of the output layer corresponds to a different type of acoustic event and is coupled to all the units of the pooling layer by respective weights for calculating the posterior probability of the corresponding acoustic event.