JP2012181358A - Text display time determination device, text display system, method, and program

Info

Publication number: JP2012181358A
Application number: JP2011044232A
Authority: JP
Inventors: Keiko Inagaki; 敬子稲垣
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2011-03-01
Filing date: 2011-03-01
Publication date: 2012-09-20

Abstract

PROBLEM TO BE SOLVED: To provide a text display time determination device, a text display system, a text display time determination method, and a text display time determination program that can generate captions that are easy for users to read and understand when input speech is sequentially converted into text and displayed. SOLUTION: A text display device comprises a recognition result creation means 81 and a display time determination means 82. The recognition result creation means 81 creates a recognition result by sequentially recognizing input speech and converting it into text. The display time determination means 82 determines a display time for each sentence contained in the recognition result on the basis of the utterance time of the speech.

Description

The present invention relates to a text display time determination device, a text display system, a text display time determination method, and a text display time determination program that convert input speech in real time and determine the display time of the converted text.

To streamline the creation of captions for television, films, and the like, methods are known that automatically convert speech into text using speech recognition technology. Manually correcting the recognition result saves time and labor compared with creating captions while a person checks the content of pre-recorded video.

Because using speech recognition technology for caption creation makes the work more efficient, captions can now also be added to live broadcast programs such as news, whose video content is decided just before airing. Patent Document 1 describes a caption shift correction device for caption broadcasting of live programs such as news. The correction device described in Patent Document 1 has a function for converting captions into data immediately before broadcast by a relay method and pre-editing the resulting captions. It also has a switching function based on block management in order to handle real-time input. Consequently, when captions must be replaced urgently, for example when checking and correcting captions for a program delivered shortly before broadcast or for a live broadcast, pre-processing that was previously difficult can be carried out smoothly.

Patent Document 2 describes a notation character string conversion method for performing speech recognition more accurately. In this method, when a notation character string is created by extracting features from speech data input by an operator, the character string is displayed on a display means in an unconfirmed state. When a conversion command is then given for a portion of interest in the displayed character string, conversion is performed according to that command.

Patent Document 3 describes a caption output device that outputs captions corresponding to speech uttered on the basis of text prepared in advance for a broadcast program. By making the text collation range for a given recognition result longer than the recognition result itself, this device can reliably output the text immediately preceding a commercial as a caption.

Patent Document 4 and Non-Patent Document 1 describe methods for detecting speech rate.

Patent Document 1: JP 2007-202094 A
Patent Document 2: JP 2000-10971 A
Patent Document 3: JP 2009-182859 A
Patent Document 4: JP-A-9-146575

Non-Patent Document 1: Ohno (大野誠寛) and four others, "Real-time caption generation based on simultaneous monologue speech summarization", Information Processing Society of Japan (IPSJ) research report, v.SLP-62-10, 2006, pp. 51-56

In general, a system that displays, as captions, recognition results created concurrently with the speech being recognized displays each recognition result as soon as one recognition process finishes. The display time of a recognition result therefore depends on the processing time of the speech recognition performed immediately after the recognition process that produced it. If the utterance immediately following a recognition result is short and its recognition finishes quickly, the next result may be displayed right away even when the previously displayed recognition result is long (has many displayed characters). In such cases, users were sometimes unable to finish reading the recognition result.

Recently, there has also been a growing need to convert speech into text as it is spoken at seminars, lectures, university classes, conferences, and similar events. However, a method for captioning news programs, such as the correction device described in Patent Document 1, requires many personnel, and dedicated equipment must be installed at the venue and a place for editing work must be secured. Because of the cost, the difficulty of securing personnel, and the burden of installing equipment, such a method is difficult to use for these purposes.

Furthermore, the correction device described in Patent Document 1 presupposes that caption text is input in advance by a relay method using multiple terminals. When this device is used to handle an urgent live broadcast, the captions must be input sequentially. It is therefore difficult to always apply the correction device described in Patent Document 1 to situations in which input speech is sequentially recognized and displayed as text captions.

In the notation character string conversion method described in Patent Document 2, after a notation character string is created from speech data, the operator gives a conversion command. With a method in which the operator issues a command for the character string each time, it is difficult to keep providing speech recognition results that are produced continuously.

The caption output device described in Patent Document 3 presupposes that the performers in a television program utter nothing other than the content of the predetermined text. It therefore cannot be applied to a method that recognizes the speech as it is output and displays the recognition result as captions.

Accordingly, an object of the present invention is to provide a text display time determination device, a text display system, a text display time determination method, and a text display time determination program that can generate captions that are easy for users to read and understand when input speech is sequentially converted into text and displayed.

A text display time determination device according to the present invention comprises: recognition result creation means for creating a recognition result by sequentially recognizing input speech and converting it into text; and display time determination means for determining a display time for each sentence included in the recognition result on the basis of the utterance time of the speech.

A text display system according to the present invention comprises: a speech input device that inputs speech; a speech recognition device that recognizes the speech input to the speech input device; and a recognition result display device that displays the result of speech recognition by the speech recognition device. The speech recognition device includes: recognition result creation means for creating a recognition result by sequentially recognizing the speech input to the speech input device and converting it into text; and display time determination means for determining a display time for each sentence included in the recognition result on the basis of the utterance time of the speech.

A text display time determination method according to the present invention comprises: creating a recognition result by sequentially recognizing input speech and converting it into text; and determining a display time for each sentence included in the recognition result on the basis of the utterance time of the speech.

A text display time determination program according to the present invention causes a computer to execute: a recognition result creation process of creating a recognition result by sequentially recognizing input speech and converting it into text; and a display time determination process of determining a display time for each sentence included in the recognition result on the basis of the utterance time of the speech.

According to the present invention, captions that are easy for users to read and understand can be generated when input speech is sequentially converted into text and displayed.

FIG. 1 is a block diagram showing an example of the text display system according to the first embodiment of the present invention.
FIG. 2 is an explanatory diagram showing an example of a recognition result after speech recognition.
FIG. 3 is a block diagram showing an example of the speech recognition means and the speech recognition dictionary storage means.
FIG. 4 is an explanatory diagram showing an example of the information defined in the dictionary.
FIG. 5 is an explanatory diagram showing an example of the information stored by the conversion database storage means.
FIG. 6 is a flowchart showing an example of the operation of the text display system according to the first embodiment.
FIG. 7 is an explanatory diagram showing an example of the speech recognition process and the recognition result conversion process.
FIG. 8 is an explanatory diagram showing an example of the text display system according to a modification of the first embodiment.
FIG. 9 is a block diagram showing an example of the text display system according to the second embodiment of the present invention.
FIG. 10 is an explanatory diagram showing an example of the information stored by the important word database storage means.
FIG. 11 is a flowchart showing an example of the operation of the text display system according to the second embodiment.
FIG. 12 is a block diagram showing an example of the minimum configuration of the text display time determination device according to the present invention.
FIG. 13 is a block diagram showing an example of the minimum configuration of the text display system according to the present invention.

Embodiments of the present invention are described below with reference to the drawings.

Embodiment 1.
FIG. 1 is a block diagram showing an example of the text display system according to the first embodiment of the present invention. The text display system of this embodiment comprises speech input means 1, speech recognition means 2, speech recognition dictionary storage means 3, recognition result conversion means 4, conversion database storage means 5 (hereinafter, conversion DB 5), display time determination means 6, and text display means 7.

The speech input means 1, the speech recognition means 2, the recognition result conversion means 4, the display time determination means 6, and the text display means 7 operate under program control. All of these means may be contained in a single terminal. Alternatively, each means may be contained in a separate terminal, with the terminals connected to one another via the Internet, a LAN (Local Area Network), or the like.

The speech input means 1 passes the input speech to the speech recognition means 2. A file representing speech (hereinafter, a speech file) may be input to the speech input means 1.

The speech recognition dictionary storage means 3 stores various information that the speech recognition means 2 uses when performing speech recognition. The contents of the speech recognition dictionary storage means 3 are described later.

The speech recognition means 2 creates a recognition result by sequentially recognizing the speech input to the speech input means 1 and converting it into text. The speech recognition means 2 also calculates the length of time for which each word in the speech was uttered (hereinafter, the utterance duration). Specifically, the speech recognition means 2 analyzes the input speech file and calculates acoustic features and the utterance duration of each word contained in the speech file. The speech recognition means 2 then refers to the speech recognition dictionary storage means 3 and extracts, from the stored words or word strings, the word or word string closest to the acoustic features of the speech file, and outputs the extracted word or word string as the speech recognition result. In doing so, the speech recognition means 2 associates, word by word, the utterance duration calculated from the speech file with the notation, reading, and part of speech stored in the speech recognition dictionary storage means 3, and passes the associated content to the recognition result conversion means 4 as the speech recognition result.

FIG. 2 is an explanatory diagram showing an example of a recognition result after speech recognition. The recognition result illustrated in FIG. 2 shows that a sentence in the speech data contains words SP1 to SPn and that an utterance duration, notation, reading, and part of speech are associated with each word.
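The per-word record described for FIG. 2 can be pictured as a small data structure. The following is a minimal sketch only: the class names, field names, and duration values are illustrative assumptions and do not appear in the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecognizedWord:
    """One word of a recognition result (field names are hypothetical)."""
    notation: str    # surface form shown in the caption
    reading: str     # reading of the word
    pos: str         # part of speech
    duration: float  # utterance duration in seconds


@dataclass
class RecognizedSentence:
    """One sentence: the words SP1..SPn with their per-word attributes."""
    words: List[RecognizedWord]

    def text(self) -> str:
        # Japanese text is written without spaces, so notations are joined directly.
        return "".join(w.notation for w in self.words)

    def total_duration(self) -> float:
        return sum(w.duration for w in self.words)


# Illustrative values only; the durations are made up for this sketch.
sentence = RecognizedSentence([
    RecognizedWord("えーと", "えーと", "filler", 0.4),
    RecognizedWord("VPC", "ぶいぴーしー", "noun", 0.8),
    RecognizedWord("です", "です", "auxiliary verb", 0.3),
])
```

Keeping the duration on each word, rather than only on the sentence, is what later allows the display time to be computed from the words that remain after conversion.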

The speech recognition means 2 and the speech recognition dictionary storage means 3 are now described in more detail. FIG. 3 is a block diagram showing an example of the speech recognition means 2 and the speech recognition dictionary storage means 3. The speech recognition means 2 includes a speech detection unit 21, a speech analysis unit 22, and a speech collation unit 23. The speech recognition dictionary storage means 3 stores an acoustic model 31, a language model 32, and a dictionary 33.

The acoustic model 31 contains a standard pattern for each Japanese phoneme.

The language model 32 is a model that encodes the occurrence probabilities of the words contained in the dictionary 33, which is described below. The language model 32 includes information defining the connection relationships between Japanese words and between phonemes, grammar rules defining the connection relationships between words, and the like.

The dictionary 33 is a data representation of the words to be recognized, in which the notation, reading, part-of-speech information, and the like are defined for each word or word string. FIG. 4 is an explanatory diagram showing an example of the information defined in the dictionary 33. In the example shown in FIG. 4, the dictionary 33 contains a notation, a reading, and a part of speech for each word. For example, the notation "VPC" illustrated in FIG. 4 has the reading 「ぶいぴーしー」 and the part of speech "noun". The notation "SaaS" has the reading 「さーす」 and the part of speech "noun".

The acoustic model 31, the language model 32, and the dictionary 33 are stored in the speech recognition dictionary storage means 3 in advance by a user or the like.

The speech detection unit 21 separates speech from noise in the input speech file, detects the sections that contain speech, and sends them to the speech analysis unit 22. As a speech detection method, the speech detection unit 21 can use, for example, a method based on the power of the speech represented by the speech file. In this method, the power of the speech is calculated sequentially. The point at which the power has continuously exceeded a predetermined threshold for a certain period is judged to be the speech start point, and the point at which the power has continuously fallen below the predetermined threshold for a certain period is judged to be the speech end point. The speech detection unit 21 sequentially sends the span from the speech start point to the speech end point to the speech analysis unit 22 as speech representing one sentence. The speech detection unit 21 may treat a word (or word string) delimited by a full stop as one sentence, or may treat the word (or word string) up to a break in the speech as one sentence.
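The power-based detection described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the detection unit's actual implementation: the function name, the use of per-frame power values, the `min_frames` parameter standing in for "a certain period", and the choice to place each boundary at the first frame of the qualifying run are all assumptions.

```python
from typing import List, Tuple


def detect_speech_segments(power: List[float], threshold: float,
                           min_frames: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of speech sections; end is exclusive.

    Speech starts once the power has stayed above `threshold` for
    `min_frames` consecutive frames, and ends once it has stayed below
    `threshold` for `min_frames` consecutive frames.
    """
    segments: List[Tuple[int, int]] = []
    in_speech = False
    start = 0
    run = 0  # length of the current run of frames that contradict the state
    for i, p in enumerate(power):
        if not in_speech:
            run = run + 1 if p > threshold else 0
            if run >= min_frames:
                in_speech = True
                start = i - min_frames + 1  # first frame of the loud run
                run = 0
        else:
            run = run + 1 if p < threshold else 0
            if run >= min_frames:
                in_speech = False
                segments.append((start, i - min_frames + 1))  # first quiet frame
                run = 0
    if in_speech:
        # Input ended while still in speech: close the open section.
        segments.append((start, len(power)))
    return segments


# A one-frame dip (index 6) is shorter than min_frames, so it does not end the section.
frames = [0, 0, 5, 5, 5, 5, 0, 5, 5, 0, 0, 0, 5]
segments = detect_speech_segments(frames, threshold=1, min_frames=2)
```

Requiring a sustained run on both sides keeps short pauses inside a sentence and short noise bursts from splitting or creating sections.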

The speech analysis unit 22 performs acoustic analysis of the speech extracted by the speech detection unit 21 and sends acoustic features representing the characteristics of the speech to the speech collation unit 23.

On receiving the acoustic features of the speech from the speech analysis unit 22, the speech collation unit 23 performs speech recognition using the standard patterns of Japanese phonemes stored in the acoustic model 31 together with the language model 32, and outputs the speech recognition result as text.

The conversion DB 5 stores predetermined words or word strings in association with notations that are shorter than those words or word strings (hereinafter, converted notations). Specifically, of the words or word strings contained in speech recognition results, the conversion DB 5 stores those whose notation needs to be converted, in association with their converted notations. A converted notation may also be empty, meaning that nothing is displayed. In addition to the notation, the conversion DB 5 may store the reading of the notation and an attribute of the notation. For example, when the notation is a word, the attribute is set to the corresponding part of speech. When the notation is a word string, the attribute indicates that it is a word string. The converted notations are stored in the conversion DB 5 in advance by a user or the like.

FIG. 5 is an explanatory diagram showing an example of the information stored in the conversion DB 5. In the example shown in FIG. 5, the notation, reading, and attribute of a word or word string are stored in association with its converted notation. For example, the word 「えーと」 ("um") illustrated in FIG. 5 has the reading 「えーと」 and the attribute "filler", which denotes its part of speech. Leaving the converted-notation field of this word blank (that is, displaying nothing) means that the word is deleted. Similarly, the word string 「というわけです」 illustrated in FIG. 5 has the reading 「というわけです」. Setting its attribute field to "word string" indicates that it is a word string made up of multiple words. Its converted notation is 「です」.

The recognition result conversion means 4 converts the notation of a word or word string contained in the recognition result into the converted notation associated with that word or word string in the conversion DB 5. Specifically, the recognition result conversion means 4 refers to the conversion DB 5 and, for each word or word string in the recognition result generated by the speech recognition means 2 that needs conversion, either replaces it with its converted notation or deletes it.

For example, when the recognition result from the speech recognition means 2 is 「えーと、その件につきましてはこれから検討というわけです。」 (roughly, "Um, so as for that matter, it is to be examined from now on."), the recognition result conversion means 4 deletes the word 「えーと」 and converts the word string 「というわけです」 into 「です」. As a result, the recognition result is converted into 「その件につきましては、これから検討です。」 ("As for that matter, it will be examined from now on.").
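The conversion step can be sketched as a lookup over word sequences. The sketch below is an illustration under assumptions: the table layout (a dict keyed by tuples of notations), the longest-match-first strategy, the function name, and the word segmentation of the example sentence are all hypothetical, and punctuation is left out for simplicity.

```python
from typing import Dict, List, Tuple

# Hypothetical conversion table: key = sequence of word notations,
# value = converted notation ("" means the entry is deleted from the caption).
CONVERSION_DB: Dict[Tuple[str, ...], str] = {
    ("えーと",): "",
    ("という", "わけ", "です"): "です",
}


def convert(words: List[str], db: Dict[Tuple[str, ...], str]) -> List[str]:
    """Replace or delete words and word strings that have an entry in `db`."""
    max_len = max((len(key) for key in db), default=0)
    out: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate word string first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in db:
                if db[key]:           # an empty converted notation means "delete"
                    out.append(db[key])
                i += n
                break
        else:
            out.append(words[i])      # no entry: keep the word unchanged
            i += 1
    return out


# Assumed segmentation of the example sentence (punctuation omitted).
tokens = ["えーと", "その件", "につきましては", "これから", "検討",
          "という", "わけ", "です"]
converted = "".join(convert(tokens, CONVERSION_DB))
```

Matching the longest word string first ensures that a multi-word entry such as 「というわけです」 is replaced as a unit rather than word by word.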

The display time determination means 6 determines a display time for each sentence contained in the recognition result on the basis of the utterance time of the speech recognized by the speech recognition means 2. Specifically, the display time determination means 6 calculates a display time for each sentence to be displayed on the basis of the recognition result converted by the recognition result conversion means 4. A sentence to be displayed may contain one word or several words. One sentence may be taken to be the span from the speech start point to the speech end point determined by the speech detection unit 21. The display time determination means 6 calculates the display time for each sentence on the basis of, for example, Equation (1) below.

T = S × W ... Equation (1)
(where S = S1 + S2 + ... + Sn)

Here, n is the number of words contained in one sentence to be displayed (that is, the sentence after conversion by the recognition result conversion means 4), and Sn is the utterance duration of the word, before conversion, that corresponds to the n-th word in the sentence to be displayed. S is the sum of these utterance durations, and W is a weight applied to S. Hereinafter, the weight W is also referred to as the display weight.

The display weight W can be calculated by, for example, Equation (2) below.

W = (speech rate of the sentence) / (average speech rate) ... Equation (2)

The speech rate here is the speech rate of the sentence before conversion and can be expressed, for example, as the number of phonemes per unit time, as described in Patent Document 4 and Non-Patent Document 1. The number of phonemes per unit time can be calculated as (number of phonemes in one sentence) / (utterance duration of that sentence). The average speech rate may be the mean of the speech rates calculated for the sentences up to the one immediately before the display weight W is calculated. The method of calculating the display weight W is not limited to the above. The display time determination means 6 may instead determine the value of W from measured values of the optimal display time.
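Equations (1) and (2) can be worked through in a few lines. The sketch below is illustrative only: the function names and the numbers are made up, the weight of 1.0 used when no earlier sentence exists is an assumption the source does not state, and it follows the reading that only the durations of words remaining in the displayed sentence are summed into S.

```python
from typing import List


def speech_rate(num_phonemes: int, utterance_seconds: float) -> float:
    """Phonemes per second for one sentence (before conversion)."""
    return num_phonemes / utterance_seconds


def display_weight(sentence_rate: float, previous_rates: List[float]) -> float:
    """Equation (2): W = speech rate of the sentence / average speech rate."""
    if not previous_rates:
        return 1.0  # assumption: neutral weight when there is no history yet
    average = sum(previous_rates) / len(previous_rates)
    return sentence_rate / average


def display_time(word_durations: List[float], weight: float) -> float:
    """Equation (1): T = S x W, where S = S1 + S2 + ... + Sn."""
    return sum(word_durations) * weight


# Illustrative numbers only.
rate = speech_rate(36, 3.0)                     # 36 phonemes in 3.0 s
weight = display_weight(rate, [8.0, 12.0])      # earlier sentences average 10.0
seconds = display_time([0.5, 1.0, 1.0], weight) # durations of displayed words
```

With these numbers the sentence is spoken at 12 phonemes per second against an average of 10, so W is 1.2 and the 2.5 s of displayed words are shown for about 3.0 s. A sentence spoken faster than average is therefore kept on screen longer than its own utterance time.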

Because the display time is calculated by weighting the measured utterance durations of the individual words, the display time can be said to correlate with the speech rate (that is, number of phonemes / utterance time of the sentence).

The text display means 7 receives the converted recognition result and displays it for the calculated display time.

The speech recognition means 2, the recognition result conversion means 4, and the display time determination means 6 are realized by the CPU of a computer that operates according to a program (the text display time determination program). For example, the program may be stored in a storage unit (not shown) of the speech recognition device, and the CPU may read the program and operate as the speech recognition means 2, the recognition result conversion means 4, and the display time determination means 6 according to the program. Alternatively, the speech recognition means 2, the recognition result conversion means 4, and the display time determination means 6 may each be realized by dedicated hardware.

The speech recognition dictionary storage means 3 and the conversion DB 5 are realized by, for example, a magnetic disk. The speech input means 1 is realized by, for example, a microphone, and the text display means 7 is realized by, for example, a display device.

Next, the operation is described. FIG. 6 is a flowchart showing an example of the operation of the text display system according to the first embodiment.

First, when speech is input via the speech input means 1 (step A1), the speech recognition means 2 receives the speech data from the speech input means 1 and recognizes the speech by referring to the speech recognition dictionary storage means 3 (step A2). At this time, along with performing speech recognition, the speech recognition means 2 calculates the utterance duration of each word or word string contained in the speech recognition result.

Next, on receiving a recognition result containing words or word strings from the speech recognition means 2, the recognition result conversion means 4 refers to the conversion DB 5 and determines whether the recognition result contains a matching word (step A3). If a matching word exists in the conversion DB 5, the recognition result conversion means 4 converts the word into the corresponding notation (that is, the converted notation) and notifies the display time determination means 6 (step A4). If no matching word exists in the conversion DB 5, the recognition result conversion means 4 performs no conversion and passes the recognition result as is to the display time determination means 6.

The display time determination means 6 determines how long to display the received recognition result, based on the recognition result received from the recognition result conversion means 4 and the utterance times of the words it contains (step A5). The text display means 7 displays the recognition result for the length of time determined by the display time determination means 6 (step A6).

FIG. 7 is an explanatory diagram showing an example of the speech recognition process and the recognition result conversion process. FIG. 7 illustrates the notation obtained when the speech recognition means 2 recognizes the speech in step A2 of FIG. 6, and the notation obtained when the recognition result conversion means 4 converts (deletes) an unnecessary word in step A4.

For example, suppose that the speech recognition in step A2 yields the notation "えーと、それではただいまから合同会議を開催いたします。" ("Um, well then, we will now begin the joint meeting."), and that in step A3 the recognition result conversion means 4, referring to the conversion DB 5, finds an entry (a conversion target term) corresponding to the word "えーと" ("um") contained in the recognition result. In step A4, the recognition result conversion means 4 then deletes the unnecessary word "えーと" and produces the converted result "それではただいまから合同会議を開催いたします。" ("Well then, we will now begin the joint meeting.").
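The lookup-and-delete behavior of steps A3 and A4 can be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionary `CONVERSION_DB`, the function `convert_recognition_result`, the use of an empty string as the deletion marker, and the way the sentence is split into words are all hypothetical and are not taken from the specification.

```python
# Hypothetical conversion DB: maps a recognized word to its post-conversion
# notation. An empty string is assumed here to mean "delete this word".
CONVERSION_DB = {
    "えーと": "",  # filler word from the example above
}


def convert_recognition_result(words, conversion_db=CONVERSION_DB):
    """Apply the conversion DB to a list of recognized words (steps A3, A4)."""
    converted = []
    for word in words:
        if word in conversion_db:        # step A3: is the word registered?
            replacement = conversion_db[word]
            if replacement:              # step A4: swap in the shorter notation
                converted.append(replacement)
            # An empty replacement means the word is dropped entirely.
        else:
            converted.append(word)       # no entry: pass the word through as is
    return converted


# Assumed word segmentation of the example sentence (punctuation omitted).
words = ["えーと", "それでは", "ただいまから", "合同会議を", "開催いたします"]
print("".join(convert_recognition_result(words)))
```

A plain dictionary lookup is used only because it is the simplest structure that matches the description; an actual conversion DB would presumably also handle word strings spanning several words.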

As described above, according to this embodiment, the recognition result conversion means 4 sequentially recognizes input speech and creates a recognition result converted into text. The display time determination means 6 then determines a display time for each sentence included in the recognition result, based on the utterance time of the speech. Specifically, the recognition result conversion means 4 converts the notation of a word or word string contained in the recognition result of the input speech into the post-conversion notation associated with that word or word string in the conversion DB 5. The display time determination means 6 then determines a display time for each sentence included in the converted recognition result. With this configuration, when input speech is sequentially converted into text and displayed, subtitles that are easy for the user to read and understand can be generated.

In other words, the text display system according to this embodiment determines both the content and the duration of the text displayed on the terminal while taking the actual utterance content and utterance time into account, and can therefore generate subtitles that are easy to read and understand.

When spoken content is recognized sequentially and displayed as subtitles, the user is presented with both the speech currently being uttered and the subtitles of the recognition result. Specifically, the recognition result is always displayed after the utterance has ended. If the subtitles appear too early or too late, the user may find them unnatural, which can hinder understanding of what is being said. It is therefore desirable to display the recognition result in a way that follows the utterance time of the original speech as closely as possible. In this embodiment, the recognition result conversion means 4 deletes unnecessary (meaningless) utterances and shortens words of little importance, so that subtitles that are easy for the user to read and understand can be displayed.

Next, a modification of this embodiment will be described. The text display system in this modification is realized by a plurality of devices connected to one another via the Internet. FIG. 8 is an explanatory diagram showing an example of the text display system in this modification. The text display system illustrated in FIG. 8 includes a voice sending terminal 10, a speech recognition server 20, and a recognition result display terminal 30.

The voice sending terminal 10 includes the voice input means 1. The speech recognition server 20 includes the speech recognition means 2, the speech recognition dictionary storage means 3, the recognition result conversion means 4, the conversion DB 5, and the display time determination means 6. The recognition result display terminal 30 includes the text display means 7. The voice input means 1, the speech recognition means 2, the speech recognition dictionary storage means 3, the recognition result conversion means 4, the conversion DB 5, the display time determination means 6, and the text display means 7 are the same as in the first embodiment.

Thus, even in a configuration connected via the Internet, subtitles that are easy for the user to read and understand can be generated when input speech is sequentially converted into text and displayed.

Embodiment 2.
FIG. 9 is a block diagram showing an example of a text display system according to the second embodiment of the present invention. Components identical to those of the first embodiment are given the same reference numerals as in FIG. 1, and their description is omitted. The text display system in this embodiment includes the voice input means 1, the speech recognition means 2, the speech recognition dictionary storage means 3, the recognition result conversion means 4, the conversion DB 5, the display time determination means 6, the text display means 7, important word extraction means 8, and important word database storage means 9 (hereinafter referred to as the important word DB 9).

That is, the text display system in this embodiment adds the important word extraction means 8 and the important word DB 9 to the configuration of the text display system of the first embodiment.

The important word DB 9 stores, for each word, a weight value by which the utterance time length is multiplied. Specifically, for those words in a speech recognition result that are assumed to be important, the important word DB 9 stores the weight value to be applied to the word's utterance time length. This weight value is hereinafter also referred to as the utterance time length weight value.

FIG. 10 is an explanatory diagram showing an example of the information stored in the important word DB 9. In the example of FIG. 10, the notation of a word, its reading, its part of speech, and the weight value applied to the word are stored in association with one another. For example, the notation "収益" ("revenue") illustrated in FIG. 10 has the reading "しゅうえき" and the part of speech "noun", and the weight value applied to the utterance time length of the word is 1.3.

The utterance time length weight value for each word is stored in the important word DB 9 in advance, for example by a user. For instance, personal names, numerals, or product names may be extracted from the speech recognition dictionary 3 and stored in the important word DB 9. Alternatively, words extracted from a user dictionary for speech recognition may be stored, or words contained in a list of important words prepared in advance by the user may be stored in the important word DB 9.

Any value may be set as the utterance time length weight value for each word. For example, the frequency of occurrence of each word may be obtained from the corpus used to create the language model, and a larger weight may be set for a word that occurs more frequently. Alternatively, an arbitrary value may be assigned to each word based on the user's experience.
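One way the frequency-based option could look is sketched below. The specification only says that more frequent words receive larger weights; the linear scaling, the base weight of 1.0, the `max_boost` parameter, and the function name `frequency_weights` are assumptions made purely for illustration.

```python
from collections import Counter


def frequency_weights(corpus_tokens, candidates, max_boost=0.5):
    """Assign each candidate word a weight that grows with its corpus frequency.

    Assumed scheme (not from the specification): the most frequent candidate
    gets 1.0 + max_boost, and the others are scaled linearly by their count.
    """
    counts = Counter(corpus_tokens)
    top = max((counts[w] for w in candidates), default=0)
    if top == 0:
        # None of the candidates occur in the corpus: leave all weights neutral.
        return {w: 1.0 for w in candidates}
    return {w: 1.0 + max_boost * counts[w] / top for w in candidates}


# Toy corpus standing in for the language-model training corpus.
tokens = ["収益"] * 4 + ["会議"] * 2 + ["製品"]
print(frequency_weights(tokens, ["収益", "会議", "製品"]))
```

Any monotonically increasing mapping from frequency to weight would satisfy the description equally well; the linear form is chosen only for readability.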

The important word extraction means 8 multiplies the utterance time length of each word contained in the speech recognition result generated by the speech recognition means 2 by the weight value of the corresponding word stored in the important word DB 9. Specifically, the important word extraction means 8 refers to the important word DB 9 and, when a word contained in the recognition result generated by the speech recognition means 2 exists in the important word DB 9, multiplies the utterance time length in the recognition result by the corresponding weight. The important word extraction means 8 may, for example, apply the weight value to the utterance time length according to Equation 3 below.

Sm' = Sm × I   ... Equation (3)

Here, m is the number of words contained in the recognition result produced by the speech recognition means 2, and Sm is the utterance time length of word m. I is the weight (utterance time length weight value) of word m stored in the important word DB 9, and Sm' is the utterance time length of word m after the important word extraction means 8 has applied the weight. For example, suppose the recognition result contains the word "収益" ("revenue") and the utterance time length of that word is 1.0. If the weight value is 1.3, the important word extraction means 8 calculates the processed utterance time length Sm' as 1.0 × 1.3 = 1.3.
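Equation 3 amounts to a per-word multiplication, which might be sketched as follows. The `RecognizedWord` type, the `IMPORTANT_WORD_DB` dictionary, and the function `apply_weight` are hypothetical names introduced for this illustration; treating unlisted words as having weight 1.0 is an assumption that simply leaves their utterance time length unchanged.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    surface: str     # notation of the word in the recognition result
    duration: float  # utterance time length Sm (unit not stated in the text)


# Hypothetical important word DB: notation -> utterance time length weight I.
IMPORTANT_WORD_DB = {"収益": 1.3}


def apply_weight(word, db=IMPORTANT_WORD_DB):
    """Return a copy of the word with Sm' = Sm * I applied (Equation 3)."""
    weight = db.get(word.surface, 1.0)  # assumption: unlisted words are unchanged
    return RecognizedWord(word.surface, word.duration * weight)


weighted = apply_weight(RecognizedWord("収益", 1.0))
print(weighted.surface, weighted.duration)
```

Returning a new object rather than mutating the input keeps the original utterance time length available to later steps, which may matter if both values are needed.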

The speech recognition means 2, the recognition result conversion means 4, the display time determination means 6, and the important word extraction means 8 are realized by the CPU of a computer operating according to a program (a text display program). Alternatively, the speech recognition means 2, the recognition result conversion means 4, the display time determination means 6, and the important word extraction means 8 may each be realized by dedicated hardware. The speech recognition dictionary storage means 3, the conversion DB 5, and the important word DB 9 are realized by, for example, a magnetic disk.

Next, the operation will be described. FIG. 11 is a flowchart showing an example of the operation of the text display system in the second embodiment. The processing of steps A1 and A2, from the input of speech through speech recognition, is the same as in the first embodiment.

Next, the important word extraction means 8 refers to the important word DB 9 and applies the weight of each word contained in the recognition result generated by the speech recognition means 2 to the utterance time length in the recognition result (step B1).

The subsequent processing, in which the recognition result is converted, the display time is determined, and the text is displayed, is the same as the processing of step A3 to step A7 in the first embodiment.

As described above, according to this embodiment, the important word extraction means 8 multiplies the utterance time length of each word contained in the recognition result of the input speech by the corresponding weight value stored in the important word DB 9. Therefore, in addition to the effects of the first embodiment, a recognition result that contains an important word can be displayed longer than usual, which makes the subtitles easier to read.

Next, an example of the minimum configuration of the present invention will be described. FIG. 12 is a block diagram showing an example of the minimum configuration of the text display time determination device according to the present invention. The text display time determination device according to the present invention includes recognition result creation means 81 (for example, the speech recognition means 2) that sequentially recognizes input speech and creates a recognition result converted into text, and display time determination means 82 (for example, the display time determination means 6) that determines, based on the utterance time of the speech, a display time for each sentence included in the recognition result.

With this configuration, subtitles that are easy for the user to read and understand can be generated when input speech is sequentially converted into text and displayed.

The text display time determination device may also include post-conversion notation storage means (for example, the conversion DB 5) that stores a predetermined word or word string in association with a post-conversion notation, which is a notation shorter than that word or word string, and recognition result conversion means (for example, the recognition result conversion means 4) that converts the notation of a word or word string contained in the recognition result of the input speech into the post-conversion notation associated with that word or word string in the post-conversion notation storage means. The display time determination means 82 may then determine a display time for each sentence included in the recognition result converted into the post-conversion notation.

The post-conversion notation storage means may store, as a post-conversion notation, a notation indicating that the word or word string is to be deleted (for example, a blank). When the post-conversion notation of a word or word string is the notation indicating deletion, the recognition result conversion means may delete that word or word string from the recognition result. Because unnecessary (meaningless) utterances can be deleted in this way, subtitles that are easier for the user to read and understand can be displayed.

The text display time determination device may also include utterance time length calculation means (for example, the speech recognition means 2) that calculates, for each word, the utterance time length, that is, the time taken to utter a word contained in the recognition result of the input speech. The display time determination means 82 may then determine the display time of a sentence included in the recognition result converted into the post-conversion notation based on the utterance time lengths of the words in that sentence before they were converted into the post-conversion notation.

The display time determination means 82 may also determine, as the display time, the value obtained by multiplying the sum of the utterance time lengths of the words by the ratio of the utterance speed of the sentence to be displayed to the average utterance speed (for example, the display weight value W); for example, the display time may be determined based on Equation 1 and Equation 2.
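The calculation described in the two preceding paragraphs can be sketched as below. This is not a reproduction of Equation 1 or Equation 2; it only follows the verbal description given here. How utterance speed is measured is not specified in this passage, so the sketch takes both speeds as inputs, and the function name `display_time` and its parameters are hypothetical.

```python
def display_time(word_durations, sentence_speed, average_speed):
    """Display time = W * (sum of per-word utterance time lengths).

    word_durations: utterance time lengths of the words in the sentence,
        measured before any conversion to a post-conversion notation.
    sentence_speed: utterance speed of the sentence to be displayed.
    average_speed:  average utterance speed (same unit as sentence_speed).
    """
    if average_speed <= 0:
        raise ValueError("average_speed must be positive")
    display_weight = sentence_speed / average_speed  # display weight value W
    return display_weight * sum(word_durations)


# Illustrative numbers only: three words totaling 2.0 time units, spoken at
# 1.2 times the average speed.
print(display_time([0.4, 0.6, 1.0], sentence_speed=6.0, average_speed=5.0))
```

Under this reading, a sentence spoken faster than average is displayed for longer than it took to say, which gives the reader more time for densely spoken text.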

The text display time determination device may also include weight value storage means (for example, the important word DB 9) that stores, for each predetermined word, an utterance time length weight value (for example, the utterance time weight value I), which is a weight value by which the utterance time length is multiplied, and weight value application means (for example, the important word extraction means 8) that multiplies the utterance time length of each word contained in the recognition result of the input speech by the corresponding utterance time length weight value. With this configuration, a recognition result that contains an important word can be displayed longer than usual, which makes the subtitles easier to read.

FIG. 13 is a block diagram showing an example of the minimum configuration of the text display system according to the present invention. The text display system according to the present invention includes a voice input device 70 (for example, the voice sending terminal 10) for inputting speech, a speech recognition device 80 (for example, the speech recognition server 20) that recognizes the speech input to the voice input device 70, and a recognition result display device 90 (for example, the recognition result display terminal 30) that displays the result of speech recognition by the speech recognition device 80.

The speech recognition device 80 includes the recognition result creation means 81 (for example, the speech recognition means 2) and the display time determination means 82 (for example, the display time determination means 6). The recognition result creation means 81 and the display time determination means 82 are the same as those illustrated in FIG. 12. With this configuration as well, subtitles that are easy for the user to read and understand can be generated when input speech is sequentially converted into text and displayed.

The present invention is suitably applied to a text display system that converts input speech in real time and displays the converted text.

Description of Symbols
1 Voice input means
2 Speech recognition means
3 Speech recognition dictionary storage means
4 Recognition result conversion means
5 Conversion database storage means
6 Display time determination means
7 Text display means
8 Important word extraction means
9 Important word database storage means
10 Voice sending terminal
20 Speech recognition server
21 Speech detection unit
22 Speech analysis unit
23 Speech collation unit
30 Recognition result display terminal
31 Acoustic model
32 Language model
33 Dictionary
100 Internet

Claims

1. A text display time determination device comprising:
recognition result creation means for sequentially recognizing input speech and creating a recognition result converted into text; and
display time determination means for determining, based on the utterance time of the speech, a display time for each sentence included in the recognition result.

2. The text display time determination device according to claim 1, further comprising:
post-conversion notation storage means for storing a predetermined word or word string in association with a post-conversion notation, which is a notation shorter than the word or word string; and
recognition result conversion means for converting the notation of a word or word string included in the recognition result of the input speech into the post-conversion notation associated with that word or word string in the post-conversion notation storage means,
wherein the display time determination means determines a display time for each sentence included in the recognition result converted into the post-conversion notation.

3. The text display time determination device according to claim 2, wherein the post-conversion notation storage means stores, as a post-conversion notation, a notation indicating that a word or word string is to be deleted, and
the recognition result conversion means deletes the word or word string from the recognition result when the post-conversion notation of the word or word string is the notation indicating deletion.

4. The text display time determination device according to claim 2 or 3, further comprising utterance time length calculation means for calculating, for each word, an utterance time length, which is the time taken to utter a word included in the recognition result of the input speech,
wherein the display time determination means determines the display time of a sentence included in the recognition result converted into the post-conversion notation based on the utterance time lengths of the words in the sentence before they were converted into the post-conversion notation.

5. The text display time determination device according to claim 4, wherein the display time determination means determines, as the display time, a value obtained by multiplying the sum of the utterance time lengths of the words by the ratio of the utterance speed of the sentence to be displayed to the average utterance speed.

6. The text display time determination device according to claim 4 or 5, further comprising:
weight value storage means for storing, for each predetermined word, an utterance time length weight value, which is a weight value by which the utterance time length is multiplied; and
weight value application means for multiplying the utterance time length of each word included in the recognition result of the input speech by the corresponding utterance time length weight value.

7. A text display system comprising:
a voice input device for inputting speech;
a speech recognition device for recognizing the speech input to the voice input device; and
a recognition result display device for displaying the result of speech recognition by the speech recognition device,
wherein the speech recognition device comprises:
recognition result creation means for sequentially recognizing the speech input to the voice input device and creating a recognition result converted into text; and
display time determination means for determining, based on the utterance time of the speech, a display time for each sentence included in the recognition result.

8. A text display time determination method comprising:
sequentially recognizing input speech and creating a recognition result converted into text; and
determining, based on the utterance time of the speech, a display time for each sentence included in the recognition result.

9. The text display time determination method according to claim 8, further comprising converting the notation of a word or word string included in the recognition result of the input speech into the corresponding post-conversion notation stored in post-conversion notation storage means, which stores a predetermined word or word string in association with a post-conversion notation that is a notation shorter than the word or word string,
wherein the display time is determined for each sentence included in the recognition result converted into the post-conversion notation.

10. A text display time determination program for causing a computer to execute:
a recognition result creation process of sequentially recognizing input speech and creating a recognition result converted into text; and
a display time determination process of determining, based on the utterance time of the speech, a display time for each sentence included in the recognition result.

11. The text display time determination program according to claim 10, further causing the computer to execute a recognition result conversion process of converting the notation of a word or word string included in the recognition result of the input speech into the corresponding post-conversion notation stored in post-conversion notation storage means, which stores a predetermined word or word string in association with a post-conversion notation that is a notation shorter than the word or word string,
wherein, in the display time determination process, the display time is determined for each sentence included in the recognition result converted into the post-conversion notation.