JP2004212799A - Dictation assisting device

Info

Publication number: JP2004212799A
Application number: JP2003001352A
Authority: JP
Inventors: Eiji Sawamura; 英治沢村; Takao Monma; 隆雄門馬; Noriyoshi Uratani; 則好浦谷; Katsuhiko Shirai; 克彦白井
Original assignee: NEC Corp; Nippon Hoso Kyokai NHK; Telecommunications Advancement Organization; NHK Engineering Services Inc; Japan Broadcasting Corp
Current assignee: NEC Corp; Telecommunications Advancement Organization; Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2003-01-07
Filing date: 2003-01-07
Publication date: 2004-07-29
Anticipated expiration: 2023-01-07
Also published as: JP4314376B2

Abstract

PROBLEM TO BE SOLVED: To make program audio easier to hear and transcribe by using detected pauses to apply processing such as appropriate division, speed reduction where necessary, and repetition to the program audio signal.

SOLUTION: A dictation assisting device, which assists in transcribing text data from speech while the speech is being listened to, is equipped with a pause detection part 1 that detects pauses in an input speech signal, and a division part 11 that, on the basis of the detected pauses, divides the input speech signal into segments of a duration or number of characters that the person transcribing can hold in short-term memory.

COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、ポーズ部分の検出を利用して番組音声等の書起しを支援する書起し支援装置に関する。
【０００２】
［発明の概要］
本発明は、ポーズ部分の検出を利用した番組音声等の書起しを支援する装置に関するものであり、特に番組音声を聞いてこれを書起す際に、検出したポーズを利用して適切な分割を行い、必要に応じて低速化、あるいは繰返しなどの処理を番組音声信号に施し、聞取り易く、書起しし易いように支援するものである。
【０００３】
この処理を行うことにより、書起しに要する時間や作業に伴う緊張感や疲労を低減することができる。特に、この処理法を字幕用テキストの書起しに適用すると、字幕番組制作時間の短縮、作業の平易化による書起し作業者層の拡大が可能となり、字幕放送番組の拡大やコスト低減に寄与する。
【０００４】
【従来の技術】
字幕つきテレビ放送番組を受信者が利用する際、字幕が読みやすく、理解しやすいものであることが重要となる。そのため、字幕番組制作における字幕原稿作成では、熟練した人手を使い、多大の労力・時間をかけて、より読みやすく、理解しやすい字幕となるよう制作している。
【０００５】
【発明が解決しようとする課題】
しかしながら、今後適用番組分野・番組数などの拡大を進めている字幕放送において、この熟練した人手、多大の労力・時間を要するこのような形態の字幕番組制作システムでは、字幕番組制作上の大きなネックとなっており、その改善が急がれている。
【０００６】
現在最も多く行われている字幕番組制作形態では、タイムコードを映像にスーパーした番組テープと必要な場合は番組台本を素材とし、これを放送関係経験者など専門知識のある人によって、１）番組スピーチの要約書起し、２）字幕表示イメージ化（別途定めのある字幕原稿作成要領による）、および３）その開始・終了タイムコード記入を行い、字幕用の原稿を作成している。この字幕原稿を基に、オペレーターが電子化字幕データを作成し、担当の字幕制作責任者、原稿作成者、電子化したオペレーター立ち会いのもとで、試写・校正を行って完成字幕としている。
【０００７】
この作業の中で最も多くの時間を必要とするのは、１）の番組スピーチなどを聴取して、字幕原稿を書起し作成する点であり、ほとんどが人間の知能と手作業によっている。
【０００８】
番組スピーチの具体的な書起し作業例では、番組テープを再生操作して音声を聴取し、音声中のスピーチ開始点からテープを再生聴取しつつ、ワープロや筆記で書起しを行う。実際には、書起し作業者の書起し速度や内容確認などのため、一区切りのスピーチ区間を対象として録音テープの頭出し・再生操作を繰り返し、書起し作業が行われる。
【０００９】
したがって書起し作業は、テープの頭出し・再生操作の繰り返しといった煩雑なテープ操作と、スピーチの聴取確認、書起しといった人間の知能に負う負担の多い業務であり、必要な能力・労力を低減できる適切な支援機能が必要であり、その開発が望まれている。
【００１０】
本発明は上記事情に鑑み、検出したポーズを利用して適切な分割を行い、必要に応じて低速化、あるいは繰返しなどの処理を番組音声信号に施し、聞取り易く、書起しし易いように支援することのできる書起し支援装置を提供することを目的としている。
【００１１】
【課題を解決するための手段】
上記の目的を達成するために請求項１の発明は、音声を聞きつつ、その音声からテキストデータを書起す作業を支援する書起し支援装置であって、入力音声信号のポーズを検出するポーズ検出部と、検出されたポーズに基づいて、書起し作業者の短期記憶が可能な時間または文字数で前記入力音声信号を分割する分割部とを備えたことを特徴としている。
【００１２】
請求項２の発明は、請求項１に記載の書起し支援装置において、前記分割部は、前記ポーズ検出部によって検出されたポーズの中で、最も信頼性の高いポーズにより前記入力音声信号を分割する第１次分割手段と、この第１次分割手段による分割の結果、短期記憶が可能な所定の時間以上または所定の文字数以上である部分については、次に信頼性の高いポーズにより前記入力音声信号を分割する第２次分割手段とから成ることを特徴としている。
【００１３】
請求項３の発明は、請求項１または２に記載の書起し支援装置において、前記分割部は、前記第２次分割手段による分割の結果、短期記憶が可能な所定の時間以上または所定の文字数以上である部分については、次の次に信頼性の高いポーズにより前記入力音声信号を分割するか、または予め決定された時間若しくは文字数で機械的に分割する第３次分割手段を含むことを特徴としている。
【００１４】
請求項４の発明は、請求項１乃至３のいずれかに記載の書起し支援装置において、前記短期記憶が可能な時間は、約２．５秒前後であり、または前記短期記憶が可能な文字数は、モーラ数で約２５モーラであることを特徴としている。
【００１５】
請求項５の発明は、請求項１乃至４のいずれかに記載の書起し支援装置において、前記分割部は、分割された音声の前後部分を所定の時間または文字数だけ重複させる重複手段を含むことを特徴としている。
【００１６】
請求項６の発明は、請求項１乃至５のいずれかに記載の書起し支援装置において、前記ポーズ検出部は、前記入力音声信号の振幅レベルまたはパワーレベルを示すレベル信号から、通常の速度におけるスピーチの変動特性を示す特定周波数成分を抽出する特定周波数成分抽出手段と、抽出された特定周波数成分のエンベロープ波形を求め、求められたエンベロープ波形信号に対して所定のスライスレベルを設定してポーズ区間を検出するポーズ区間検出手段とから成ることを特徴としている。
【００１７】
【発明の実施の形態】
＜本発明の背景＞
前述したように、番組音声の書起し作業は、テープの頭出し・繰り返し再生といった煩雑なテープ操作や、スピーチの聴取確認、書起しといった人間の知能に負う負担の多い業務であり、必要な能力・労力を低減できる適切な支援機能が必要である。
【００１８】
例えば前記「一区切りのスピーチ区間」やその「繰り返し」については、これを一つの文とするとその前後には通常ポーズ部分があるので、このポーズを検出して自動的に区切りを付与し、必要回数繰返しを行うように番組音声を処理する。その結果、スピーチ区間は聞取り易く、書起しし易くなる。
【００１９】
一方、書起し不要な非スピーチ区間は自動的に除去できるので、書起し作業者はスピーチ部分の書起し作業に専念することができ、字幕用テキストの正確・迅速な作成に大いに貢献することができる。
【００２０】
また、この処理は必要な能力・労力の低減にも寄与するので、書起し作業者は一般的なワープロ作業者レベルの方々まで拡大できる可能性がある。
【００２１】
なお、ポーズの検出に関しては特願２００２−１６０２５５「音声のスピーチ／ポーズ区間検出装置」で提案済みであり、このような手法（この手法を第１の手法として後述する）で検出したポーズを本件番組音声の分割などの処理に適用できる。
【００２２】
《番組音声分割処理の手法》
前記「一区切りのスピーチ区間」の意義をさらに検討すると、書起しのための適切な分割単位の設定と必要に応じて行う低速化処理は、この分割時間内にこの単位となるスピーチの書起し作業を完了出来るようにすることであり、分割と必要回数の繰り返しも同様である。
【００２３】
まず、聞き易さの点では、番組音声におけるスピーチのないところ、もしくは息継ぎなどによる途切れ箇所（書起しのためのポーズとする）を利用して分割するのが合理的である。
【００２４】
どの程度の長さに分割するかについては後述するが、この処理で所要の分割長に収まらない場合は、追加の分割処理を考える必要がある。
【００２５】
そのため、検出した比較的長いポーズ（例えば１〜２秒）でまず分割し、分割が足りない部分はより短いポーズ（例えば０．５〜１秒）で分割し、なお分割が足りない部分は他の方法（より短いポーズ、所定時間など）によることとした。
【００２６】
《書起しのための分割の長さについての考察》
書起しのための分割の長さについては、未知語や固有名詞なども含めて、短期記憶が可能である長さが望ましいと考えているが、簡単な実験からも妥当性がある。
【００２７】
短期記憶可能な長さの具体的なイメージは一義的には定めがたいものであるが、身近な事例として俳句や和歌、標語の例がある。これらは、かたかな（モーラ）で１７文字（モーラ）や３１文字（モーラ）であるので、一応２５文字（モーラ）程度を目標として検討した。これは、漢字の割合を１／３としたかな漢字混じり文では約１９文字になり、１モーラを０．１秒とすると時間は２．５秒に相当する。
【００２８】
聞取り易く、書起しし易いような番組音声の分割法として、分割の位置やここで上げた分割長としてのモーラ数や時間の妥当性を検証するために、まず実際の番組テキストを利用して検討を行った。
【００２９】
《実際の番組テキストを利用した分割法の検討》
図２は、ニュース番組を例として、手作業により書起しのための分割をしたものである。この図２に示す例は番組テキストなので、通常存在する句読点にまず着目するとともに、文節も考慮して分割した。
【００３０】
図３はドキュメンタリー番組を例として手作業による書起しのための分割例である。図３に示す例も同様に、まず句読点に着目するとともに、文節も考慮して分割した。ここで、両図で使用される語句を以下のように定義する。
【００３１】
開始：対応する番組音声の開始時間
終了：対応する番組音声の終了時間
時間：対応する番組音声の経過時間
カナ：テキストのカナ数
漢字：テキストの漢字等の数
モーラ：テキストのモーラ数（漢字等は、１．８６モーラとして計算）
／Ｓ：モーラ数／経過時間で１秒あたりのモーラ数である。
【００３２】
図２の場合、経過時間つまり時間長は平均２．４５秒（最小：１．３秒、最大：３．４秒）であり、モーラ数は平均２４．４（最小：１２．３、最大：３３．３で３９０／１６）であった。総モーラ数（モーラ数の合計）／総経過時間（経過時間の合計）は３９０／３９．２＝９．９５であり、これはスピーチの速度と関係する値で、やや早いスピーチである。
【００３３】
図３の場合、経過時間つまり時間長は平均２．２６秒（最小：０．９秒、最大：３．５秒）であり、モーラ数は平均１７．５（最小：９．５、最大：２５．２）であった。総モーラ数（モーラ数の合計）／総経過時間（経過時間の合計）は７．７６であり、これはスピーチの速度と関係する値で、ほぼ標準のスピードである。
【００３４】
これら二つの番組例について、図２、図３のようにテキストを手作業で分割し、この分割テキストに対応するよう番組音声を分割し、必要回数繰り返し再生しながら、簡単な書起し実験を行った。
【００３５】
その結果、一部の長い経過時間部分を除き、かなり聞取り易く、書起しし易く、しかも速いとの評価が得られ、平均２５モーラ、２．５秒程度での分割の妥当性を確認することが出来た。
【００３６】
《実際の番組音声におけるスピーチ／ポーズ部分の検出》
しかし、実際の書起しの対象は番組音声中のスピーチであり、句読点や文節があるわけでなく、さらに目的のスピーチだけでなくさまざまな背景音があり、まったく事情が異なる。
【００３７】
番組音声の処理では、望ましくは正しく各スピーチ区間の開始・終了時間を求め、この開始・終了時間を基に前記の説明に準じて適切な時間長となるよう分割することとなる。
【００３８】
しかし、実際の番組音声では目的のスピーチだけでなく様々な背景音が重なっており、各スピーチ区間の開始・終了時間を正しく求めることは一般には至難なことである。一方、非スピーチ区間の場合は、無音区間は確実であり、背景音がある場合でもそのレベル、区間長によっては信頼性が大きいと判断できる可能性がある。
【００３９】
そこで、本発明では、入力音声信号の振幅レベルまたはパワーレベルを示すレベル信号の分析により、通常の速度におけるスピーチの変動特性を示す特定周波数成分を抽出し、抽出された特定周波数成分のエンベロープ波形を求め、求められたエンベロープ波形信号に対して所定のスライスレベルを設定してポーズ区間を検出する第１の手法、または音響データ内の複数のＬＰＣケプストラムベクトルを、基準フレームから相互に比較することで、音響データ内容の切り替わり点を、より安定に検出する第２の手法（ブロックケプストラムフラックス法）によりポーズを検出するようにしている。
【００４０】
＜実施の形態の説明＞
図１は、本発明に係る書起し支援装置の実施形態を示すブロック図である。
【００４１】
同図に示す書起し支援装置は、入力音声信号のポーズを検出するポーズ検出部１と、検出されたポーズに基づいて、書起し作業者の短期記憶が可能な時間または文字数で前記入力音声信号を分割する分割部１１とを備えている。
【００４２】
ポーズ検出部１は、入力音声信号の振幅レベルまたはパワーレベルを示すレベル信号から、通常の速度におけるスピーチの変動特性を示す特定周波数成分を抽出する特定周波数成分抽出手段３と、抽出された特定周波数成分のエンベロープ波形を求め、求められたエンベロープ波形信号に対して所定のスライスレベルを設定してそのレベル以下のポーズ区間を検出するポーズ区間検出手段５とから構成されている。
【００４３】
また、分割部１１は、ポーズ検出部１によって検出されたポーズの中で、最も信頼性の高いポーズにより前記入力音声信号を分割する第１次分割手段１３と、この第１次分割手段１３による分割の結果、短期記憶が可能な所定の時間以上または所定の文字数以上である部分については、次に信頼性の高いポーズにより前記入力音声信号を分割する第２次分割手段１５と、この第２次分割手段による分割の結果、短期記憶が可能な所定の時間以上または所定の文字数以上である部分については、次の次に信頼性の高いポーズにより前記入力音声信号を分割するか、または予め決定された時間若しくは文字数で機械的に分割する第３次分割手段１７と、分割された音声の前後部分を所定の時間または文字数だけ重複させる重複手段１９とから構成されている。
【００４４】
次に、この実施の形態の作用をポーズ検出部１の処理と、分割部１１の処理に分けて説明する。
【００４５】
《ポーズ検出部１の処理》
図４乃至図７はポーズ検出処理の手順を示す説明図である。
【００４６】
図４のフローチャートにおいて、先ず、図５（Ａ）に示すような振幅波形を持つ番組音声信号を取り込んで、音声振幅値の基準化処理を実行する（ステップＳＴ１，ＳＴ３）。この処理では、取り込まれた番組音声信号中からそのエンベロープの低域成分のみを取り出して（ステップＳＴ５）、この低域成分の振幅値を基として信号成分を所定レベルの大きさにする処理である。一般に、低域成分のレベルが大きい場合には高域成分のレベルも大きいと考えられる。レベルに違いがあると検出精度に影響を与えるので、ある程度のレベル基準化を図る必要があるからである。
【００４７】
こうして音声振幅値の基準化がされた音声信号は、次に、絶対値化処理がされて、図５（Ｂ）に示すような＋側に折り返した振幅値波形信号（Ｌｃｈ，Ｒｃｈのステレオ音声信号がある場合は、その和と差の信号を利用してスピーチ部分を伸長する）と成る（ステップＳＴ７）。図６（Ａ）の波形は、図５（Ｂ）に矢印で示した範囲の時間軸を１０倍に拡大して示したものである。
【００４８】
次いで、絶対値化された振幅値波形信号からディジタル高域ろ波処理（ステップＳＴ９）、ディジタル帯域ろ波処理（ステップＳＴ１１）が実行される。ディジタル高域ろ波処理では、例えば２Ｈｚ以上の周波数の信号が取り出され、ディジタル帯域ろ波処理では、例えば４〜７Ｈｚの周波数の信号が取り出される。図６（Ｂ）の波形は、４〜７Ｈｚ成分の信号波形である。
【００４９】
スピーチに関する時間軸方向の変動特性をスピーチの発音記号列と比較すると、母音の発音記号に対応する音声パワーが他の部分よりも大きくなる傾向があることが知られている。そして、通常速度のスピーチにおける信号波形の時間軸方向への変動は、４〜７Ｈｚ程度の周波数となっていることを先願で示した（特願２００２−１６０２５５）。この実施形態は、この波形の時間軸方向の変動に着目し、大まかな周期性を捉えることでポーズ部分の検出をする。
【００５０】
ディジタル帯域ろ波処理で４〜７Ｈｚの周波数の信号が取り出され、さらに絶対値化処理（ステップＳＴ１２）をすると、スピーチに類似した波形の信号となる。一方、演算処理（ステップＳＴ１３）が実行されると、これら取り出された信号の差分として主なスピーチ成分以外の成分が抽出され、さらに絶対値化処理（ステップＳＴ１５）、ディジタル低域ろ波処理（ステップＳＴ１７）が実行される。レベル補正処理（ステップＳＴ１９）では、低域ろ波処理で生成された０．５Ｈｚ以下の低域成分の波形は主にスピーチ成分以外の成分が多いので、このレベルを参照して、ステップＳＴ１２出力のスピーチ類似波形信号のレベルを逆方向に補正し、結果的にスピーチ成分を強調する処理がされる。この処理後のスピーチ近似波形を図７に示す。
【００５１】
こうして、レベル補正がされたエンベロープ波形信号は、その適否検証のためディスプレイ上に波形表示される（ステップＳＴ２１）。そして、図７に示すように、所定のスライスレベル（閾値）でスライスされる（ステップＳＴ２３）。
【００５２】
次に、微小ポーズ区間除去処理が実行される（ステップＳＴ２５）。この処理では、スライス処理された音声振幅値信号中から、例えば、ちょっとした母音区間の変動や息継ぎ程度の区間は検出対象から除外するためである。検出時間範囲として、例えば、“０．２〜２秒”程度を設定し、その以下の時間を検出対象外として除去する処理である。これにより、意味を持たない信頼性が低いポーズ検出が効果的に防止できる。
【００５３】
こうして検出されたポーズ区間が画面表示される（ステップＳＴ２７）。ポーズの検出精度やスライス（ステップＳＴ２３）レベル設定の最適化などの目的でスピーチ区間を実測し、図７に示すように、実測スピーチ区間と比較される（ステップＳＴ２９）。これにより、実測されたスピーチ区間から導かれるポーズ区間と、検出されたポーズ区間とが比較され、比較によって、ポーズ検出精度をチェックしたり、スライスレベルが最適となるように変更することができる。
【００５４】
《分割部１１の処理》
上述のようにしてポーズが検出されると、次に、図８のフローチャートに示すように、比較的長いポーズが一番信頼性が高いといえるので、先ず、１〜２秒程度以上の長いポーズにより入力音声信号を分割する（ステップＳＴ４１）。この処理は第１次分割手段１３で実行される。
【００５５】
この分割の結果、作業者にとって短期記憶が可能な所定の時間以上または所定の文字数以上である部分については、次に長いポーズ（０．５〜１秒程度）により入力音声信号を分割する（ステップＳＴ４３，ＳＴ４５）。この処理は第２次分割手段１５で実行される。
【００５６】
この分割の結果、作業者にとって短期記憶が可能な所定の時間以上または所定の文字数以上である部分については、次の次に長いポーズ（例えば０．２〜０．５秒）により前記入力音声信号を分割するか、または予め決定された時間若しくは文字数で機械的に分割する（ステップＳＴ４７，ＳＴ４９）。この処理は第３次分割手段１７で実行される。
【００５７】
また、分割部１１における分割に際しては、信頼性上問題がある可能性のある分割手段１５，１７による分割の場合には、重複手段１９により、分割された音声の前後部分を所定の時間または文字数だけ重複させる（ステップＳＴ５１）。例えば１秒程度の重複部分を設けることにより、ポーズ検出の信頼性の低さによって引起こされる書起し作業者の聴取不能や書起しの欠落を防止することができる。
【００５８】
《ブロックケプストラムフラックス法によるポーズの検出》
スピーチ区間を検出する第２の手法としてのブロックケプストラムフラックス（ＢｌｏｃｋＣｅｐｓｔｒｕｍＦｌｕｘ）法について簡単に説明する。
【００５９】
この手法は、音響データ内の複数のＬＰＣケプストラムベクトルを、基準フレームから相互に比較することで、音響データ内容の切り替わり点を、より安定に検出するものである。
【００６０】
図９のグラフには、平均的な背景音レベルと考えられる実際のテレビ番組の音声をこの両手法で分析処理した結果を示した。
【００６１】
図９において、（Ａ）はブロックケプストラムフラックス法による処理値（ｃｆｌｘ解析値）、（Ｂ）は音声エンベロープの４〜７Ｈｚ成分抽出（第１の手法による）値、（Ｃ）は（Ｂ）の波形を適切なレベルでスライスし、２値化したデータである。
【００６２】
この（Ｃ）の波形では、高レベル範囲はスピーチ、低レベル範囲は非スピーチ（ポーズ）区間に対応するものとし、この場合の（Ｃ）は、実測したスピーチ区間とかなりよく合致しているといえる。
【００６３】
したがって、（Ｃ）の波形から音声中のスピーチ区間をある程度正確に把握することができる。（Ｃ）の波形中でＰ１は短いポーズ、Ｐ３は比較的長いポーズであり、このＰ３が最も信頼性の高いポーズ、Ｐ２が次に信頼性の高いポーズといえる。
【００６４】
なお、（Ａ）のブロックケプストラムフラックス法（第２の手法）についても同様にして比較したが、第１の手法よりやや悪い結果であるが、場合によっては適用可能である。
【００６５】
＜実際の番組音声のポーズ部分検出を利用した分割実験＞
実際の番組音声におけるスピーチのないところ、もしくは息継ぎなどによる途切れ箇所を番組音声の適切な処理により検出（書起しのためのポーズとする）し、このポーズなどを目安にして書き起こすべきスピーチを作業がし易い単位に分割し、また必要に応じて低速化処理などを行った。
【００６６】
書起し作業がし易いスピーチの時間は前記のように２．５秒程度が望ましいので、この程度となるよう分割した。対象にした番組はかなり高レベルの背景音（音楽）があるため、ここの分割手法は、まず比較的長いポーズ（例えば２秒程度以上）を検出して分割した。
【００６７】
図１０は、対象にした番組の音声（かなり高レベルの背景音（音楽）がある）からのポーズ部分検出を説明したものである。
【００６８】
対象番組のオリジナル音声波形を下端に示したが、前述した第１の手法を用いて、この波形のエンベロープからその４〜７Ｈｚの周波数範囲成分を抽出し、その振幅値を「処理後の音声波形」として示した。この「処理後の音声波形」を適当なスレシホルドレベルで大小を比較し、「検出ポーズ」とした。この「検出ポーズ」の適否を評価するために、参考に実測スピーチのデータも表示した。少なくとも比較的長いポーズ（例えば２秒程度以上）は正しく検出されているといえる。
【００６９】
このポーズ検出結果を利用し、ポーズが１〜２秒での第１分割、０．５〜１秒の第２分割を行った。この分割でも２．５秒程度以上となる部分は、以下のいずれもの手法で２．５秒程度となるよう分割した。
【００７０】
（１）より短いポーズ（例えば０．２〜０．５秒以上）を検出し、これを利用してさらに分割。
（２）３秒程度となるように、この部分を等分割。
【００７１】
なお、少なくとも（１）または（２）の処理を適用した部分は、１秒程度の重複部分を設けて、欠落や聴取不能を回避するようにした。また重複部分を示すため、その部分の冒頭に重複表示音ＣＨを挿入して重複であることが分かるようにした。
【００７２】
＜分割・重複処理例＞
実際の処理例を文章で示すと下記のようになる。下線部分は重複処理を示す。また、ＰＰ：１〜２秒以上の検出ポーズ、ＳＰ：０．５〜１秒以上の検出ポーズ、ＣＨ：重複表示音（下線部分は重複部分）をそれぞれ示す。
【００７３】
（１）の処理例１秒重複１８分割
ＰＰ大自然、そこにはいつも、生きもの
ＣＨ生きものたちが織りなす不思議と感動があります。ＳＰ
さあ、大自然の真っただ中へ、生きもの地球紀行。ＳＰ
新しい出会いを求めて、とっておきの
ＣＨとっておきの地球の旅の始まりです。ＰＰ
今日の舞台は中米コスタリカに
ＣＨ米コスタリカに広がる熱帯の森です。ＳＰ
この森では、多くの生きものたちが
ＣＨ生きものたちが、工夫を凝らして生きています。ＳＰ
中でも、独特の知恵をもって暮らして
ＣＨをもって暮らしているのが、ノドジロオマキザルです。ＰＰ
ノドジロオマキザルの知恵は様々です。ＳＰ
木の実を石にたたきつけて、中
ＣＨきつけて、中の種を取り出します。ＳＰ
更に、種類の違うサルと平和に暮らす
ＣＨ平和に暮らす、駆け引きの知恵まであるのです。ＰＰ
ノドジロオマキザルが、森の暮らしの中で見せる、ＳＰ知恵
ＣＨ見せる、知恵の数々を見つめました。ＰＰ
【００７４】
（２）処理例ＰＰ間等分割、１秒重複１６分割
ＰＰ大自然、そこにはいつも、生きものたち
生きものたちが織りなす不思議と感動があります。さあ
さあ、大自然の真っただ中へ、生きもの地
生きもの地球紀行。新しい出会いを求めて、
会いを求めて、とっておきの地球の旅の始まりです。ＰＰ
今日の舞台は中米コスタリカに広がる
カに広がる熱帯の森です。この森では、
この森では、多くの生きものたちが、工夫を凝らして生きてい
凝らして生きています。中でも、独特の知恵をも
特の知恵をもって暮らしているのが、ノドジロオマキザルです。ＰＰ
ノドジロオマキザルの知恵は様々です。木
木の実を石にたたきつけて、中の種を取り出し
の種を取り出します。更に、種類の違うサルと平
違うサルと平和に暮らす、駆け引きの知恵まであるのです。ＰＰ
ノドジロオマキザルが、森の暮らしの中で見せる
の中で見せる、知恵の数々を見つめました。ＰＰ
【００７５】
《分割番組音声を利用した書起し実験》
実験対象にした番組は全体で７８秒、スピーチ部分は４８秒（６２％）である。
【００７６】
前記の分割方法による番組音声を利用して書起し実験を行った。実験は、図１１に示すように、ＭＩＤＩＰＬＡＹＥＲ（シェアウエアソフト）に上記のように処理した分割音声を入れ、このソフトの操作機能を利用して行った。具体的には、各分割音声は繰り返し再生状態とし、ハッチした部分の書起しが完了するとキー操作で次の分割音声に進むようになっている。図１１は上記（１）の処理例を示す。
【００７７】
図１１のソフトにより実験した結果、
（１）の方法による書起し時間：約１３５秒各分割音声の繰り返し回数約２．８回
（２）の方法による書起し時間：約１５２秒各分割音声の繰り返し回数約３．２回
であり、一応検出したポーズをできるだけ活用する分割方法が望ましいことが分かった。
【００７８】
【発明の効果】
以上説明したように本発明によれば、検出したポーズを利用すると番組音声の適切な分割、必要ならば低速化、あるいは繰返しなどの処理を番組音声信号に施し、聞取り易く、書起しし易いように支援することが可能となる。
【００７９】
また、重複手段により、分割された特定音声の前後部分を所定の時間または文字数だけ重複させるようにしているので、書起し作業者の聴取不能や書起しの欠落を効果的に防止することができる。
【図面の簡単な説明】
【図１】本発明に係る書起し支援装置に一例を示すブロック図である。
【図２】ニュース番組の書起しのための分割実験例を示す説明図である。
【図３】ドキュメンタリー番号見の書起しで検証された分割例を示す説明図である。
【図４】ポーズ検出の手順を示すフローチャートである。
【図５】番組音声信号波形と、この波形に対応する振幅レベル信号またはパワーレベル信号を示す説明図である。
【図６】図５に示した振幅レベル信号またはパワーレベル信号を時間軸を拡大して示すと共に特性周波数範囲の抽出成分値を示す説明図である。
【図７】図６に示した特定周波数範囲の振幅レベル信号またはパワーレベル信号のエンベロープ波形と実測ポーズ（スピーチ）部分を示す説明図である。
【図８】ポーズによる番組音声の分割の処理手順を示す説明図である。
【図９】音声信号を特殊処理したスピーチ近似データの説明図である。
【図１０】ポーズ部分検出の説明図である。
【図１１】分割された番組音声を音声再生ソフトを使用して再生する際に使用されるコンピュータの画面例を示す説明図である。
【符号の説明】
１ポーズ検出部
３特定周波数成分抽出手段
５ポーズ区間検出手段
１１分割部
１３第１次分割手段
１５第２次分割手段
１７第３次分割手段
１９重複手段

[0001]
[Technical Field of the Invention]
The present invention relates to a transcription support device that uses pause detection to support the transcription of program audio and the like.
[0002]
[Summary of the Invention]
The present invention relates to a device that supports the transcription of program audio and the like by using pause detection. In particular, when an operator listens to program audio and transcribes it, the device uses the detected pauses to divide the audio appropriately and, where necessary, applies processing such as slowing down or repetition to the program audio signal, so that the audio is easy to hear and easy to transcribe.
[0003]
This processing reduces the time required for transcription and the tension and fatigue that accompany the work. In particular, applying it to the transcription of subtitle text shortens subtitle program production time and, by simplifying the work, broadens the pool of people who can do transcription, which contributes to the expansion of subtitled broadcast programs and to cost reduction.
[0004]
[Prior art]
When viewers use subtitled television programs, it is important that the subtitles be easy to read and understand. For this reason, subtitle manuscripts for subtitled programs are prepared by skilled staff, with great effort and time, so that the subtitles are easier to read and understand.
[0005]
[Problems to be solved by the invention]
However, subtitled broadcasting is being extended to more program genres and more programs, and a production system of this kind, which requires skilled staff and a great deal of labor and time, has become a major bottleneck in subtitle program production. Improvement is urgently needed.
[0006]
In the most common form of subtitle program production today, the materials are a program tape with time code superimposed on the video and, where needed, the program script. A person with specialized knowledge, such as someone with broadcasting experience, uses these to 1) transcribe a summary of the program speech, 2) lay it out as it will appear on screen (following separately defined guidelines for preparing subtitle manuscripts), and 3) enter the start and end time codes, producing a subtitle manuscript. From this manuscript an operator creates the electronic subtitle data, which is then previewed and proofread in the presence of the person responsible for subtitle production, the manuscript author, and the operator who digitized it, to yield the finished subtitles.
[0007]
The most time-consuming part of this work is step 1), listening to the program speech and transcribing it into a subtitle manuscript, which depends almost entirely on human intelligence and manual work.
[0008]
In a typical example of transcribing program speech, the operator plays back the program tape and listens to the audio and, playing the tape from the point where speech starts, transcribes it on a word processor or by hand. In practice, because of the operator's transcription speed and the need to check the content, the operator repeatedly cues and replays the tape for each segment of speech while transcribing.
[0009]
Transcription therefore combines cumbersome tape handling, namely repeated cueing and playback, with listening to, confirming, and transcribing speech, tasks that place a heavy load on human intelligence. An appropriate support function that reduces the skill and labor required is needed, and its development is desired.
[0010]
In view of the above, an object of the present invention is to provide a transcription support device that uses detected pauses to divide program audio appropriately and, where necessary, applies processing such as slowing down or repetition to the program audio signal, so that the audio is easy to hear and easy to transcribe.
[0011]
[Means for Solving the Problems]
To achieve the above object, the invention of claim 1 is a transcription support device that supports the work of transcribing text data from speech while listening to that speech, comprising a pause detection unit that detects pauses in an input audio signal, and a division unit that, based on the detected pauses, divides the input audio signal into segments of a duration or character count that the transcriber can hold in short-term memory.
[0012]
The invention of claim 2 is the transcription support device of claim 1, wherein the division unit comprises primary division means that divides the input audio signal at the most reliable pauses among those detected by the pause detection unit, and secondary division means that, for any portion that after the primary division is still at least a predetermined duration or a predetermined character count that can be held in short-term memory, divides the input audio signal at the next most reliable pauses.
[0013]
The invention of claim 3 is the transcription support device of claim 1 or 2, wherein the division unit includes tertiary division means that, for any portion that after division by the secondary division means is still at least the predetermined duration or character count that can be held in short-term memory, either divides the input audio signal at pauses of the next level of reliability after that, or divides it mechanically at a predetermined duration or character count.
[0014]
The invention of claim 4 is the transcription support device of any one of claims 1 to 3, wherein the duration that can be held in short-term memory is about 2.5 seconds, or the character count that can be held in short-term memory is about 25 morae.
[0015]
The invention of claim 5 is the transcription support device of any one of claims 1 to 4, wherein the division unit includes overlap means that makes the preceding and following portions of the divided audio overlap by a predetermined duration or character count.
[0016]
The invention of claim 6 is the transcription support device of any one of claims 1 to 5, wherein the pause detection unit comprises specific frequency component extraction means that extracts, from a level signal indicating the amplitude level or power level of the input audio signal, a specific frequency component that characterizes the fluctuation of speech at normal speed, and pause section detection means that obtains the envelope waveform of the extracted specific frequency component and detects pause sections by applying a predetermined slice level to the envelope waveform signal.
[0017]
[Embodiments of the Invention]
<Background of the present invention>
As described above, transcribing program audio combines cumbersome tape handling, such as cueing and repeated playback, with listening to, confirming, and transcribing speech, tasks that place a heavy load on human intelligence, and an appropriate support function that reduces the skill and labor required is needed.
[0018]
Take, for example, the "one segment of speech" and its "repetition" mentioned above. If the segment is a single sentence, there is normally a pause before and after it, so the program audio is processed by detecting these pauses, inserting segment boundaries automatically, and repeating each segment as many times as needed. As a result, the speech sections become easy to hear and easy to transcribe.
[0019]
Non-speech sections, which need no transcription, can be removed automatically, so the transcriber can concentrate on transcribing the speech portions. This contributes greatly to creating subtitle text accurately and quickly.
[0020]
Because this processing also reduces the skill and labor required, the pool of transcribers could potentially be widened to include people at the level of ordinary word-processor operators.
[0021]
Pause detection has already been proposed in Japanese Patent Application No. 2002-160255, "Speech/Pause Section Detection Device for Audio", and pauses detected by that technique (described later as the first method) can be applied to processing such as the division of program audio in the present case.
[0022]
《Method of program audio division processing》
Looking more closely at what "one segment of speech" means: the purpose of setting an appropriate division unit for transcription, and of slowing down where necessary, is to let the transcriber finish transcribing the speech in that unit within the time of the segment. The same is true of division combined with repetition as many times as needed.
[0023]
First, from the standpoint of ease of listening, it is reasonable to divide program audio at places where there is no speech or where speech is interrupted, for example by a breath (these are treated as pauses for transcription purposes).
[0024]
The target segment length is discussed below. Where this processing does not bring a segment within the required length, additional division must be considered.
[0025]
For this reason, the audio is first divided at relatively long detected pauses (for example, 1 to 2 seconds); portions that are still too long are divided at shorter pauses (for example, 0.5 to 1 second); and portions that are still too long after that are divided by other means (even shorter pauses, a fixed duration, and so on).
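The tiered division described in this paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions, not the device's actual implementation: the `Pause` class, the function names, the choice to cut at the midpoint of a pause, and the use of the lower bound of each pause range (1.0 s, 0.5 s, 0.2 s) as a tier threshold are all hypothetical. The 2.5-second target is the segment length discussed in the following section.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pause:
    start: float  # seconds
    end: float    # seconds

    @property
    def length(self) -> float:
        return self.end - self.start


def split_at(segment, pauses, min_len):
    """Cut one (start, end) segment at the midpoint of each qualifying pause inside it."""
    start, end = segment
    cuts = sorted(
        (p.start + p.end) / 2
        for p in pauses
        if p.length >= min_len and start < p.start and p.end < end
    )
    bounds = [start, *cuts, end]
    return list(zip(bounds[:-1], bounds[1:]))


def divide(total, pauses, target=2.5, tiers=(1.0, 0.5, 0.2)):
    """Divide [0, total] tier by tier, re-splitting only segments longer than target."""
    segments = [(0.0, total)]
    for min_len in tiers:  # longest (assumed most reliable) pauses first
        refined = []
        for seg in segments:
            too_long = seg[1] - seg[0] > target
            refined.extend(split_at(seg, pauses, min_len) if too_long else [seg])
        segments = refined
    # Fallback: mechanical equal division of anything still longer than target.
    result = []
    for start, end in segments:
        n = max(1, math.ceil((end - start) / target))
        step = (end - start) / n
        bounds = [start + i * step for i in range(n)] + [end]
        result.extend(zip(bounds[:-1], bounds[1:]))
    return result
```

For example, `divide(12.0, [Pause(3.0, 4.5), Pause(6.0, 6.6), Pause(9.0, 9.3)])` first cuts at the 1.5-second pause, then at the 0.6-second and 0.3-second pauses inside the portions that are still too long, and finally halves any remaining portion longer than 2.5 seconds.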
[0026]
《Consideration on length of division for transcription》
Segments for transcription should preferably be short enough to be held in short-term memory, even when they contain unknown words or proper nouns, and simple experiments also support this.
[0027]
It is hard to pin down exactly what a length that can be held in short-term memory looks like, but haiku, waka, and slogans are familiar examples. Written in kana, these are 17 or 31 characters (morae) long, so a target of about 25 characters (morae) was adopted for the study. This corresponds to about 19 characters in mixed kana-kanji text in which one third of the characters are kanji, and to 2.5 seconds if one mora is taken to be 0.1 second.
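The figures in this paragraph can be checked with a short calculation. The conversion factor of 1.86 morae per kanji is taken from the term definitions in paragraph [0031]; that this paragraph used the same factor is an assumption. With $n$ the number of characters in mixed kana-kanji text:

```latex
n\left(\tfrac{2}{3}\cdot 1 + \tfrac{1}{3}\cdot 1.86\right) = 25
\;\Longrightarrow\;
n = \frac{25}{1.287} \approx 19.4,
\qquad
25\ \text{morae} \times 0.1\ \text{s/mora} = 2.5\ \text{s}.
```

Under that assumption the result is consistent with the stated values of about 19 characters and 2.5 seconds.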
[0028]
To verify the division positions, and the mora count and duration proposed above as the segment length, as a way of dividing program audio so that it is easy to hear and transcribe, a study was first carried out using actual program text.
[0029]
《Examination of segmentation method using actual program text》
FIG. 2 shows a news program divided by hand for transcription. Because this example is program text, the division was made by looking first at the punctuation normally present, and also taking phrase boundaries into account.
[0030]
FIG. 3 shows a documentary program divided by hand for transcription. As in FIG. 2, the division looked first at punctuation and also took phrase boundaries into account. The terms used in both figures are defined as follows.
[0031]
Start: Start time of the corresponding program audio
End: End time of the corresponding program audio
Time: Elapsed time of the corresponding program audio
Kana: Number of kana in the text
Kanji: Number of kanji and similar characters in the text
Mora: Number of morae in the text (each kanji or similar character is counted as 1.86 morae)
/S: Mora count divided by elapsed time, i.e. the number of morae per second
[0032]
In FIG. 2, the elapsed time (segment duration) averaged 2.45 seconds (minimum 1.3 seconds, maximum 3.4 seconds), and the mora count averaged 24.4 (minimum 12.3, maximum 33.3; 390/16). The total mora count divided by the total elapsed time is 390/39.2 = 9.95 morae per second. This value reflects the speaking rate and indicates somewhat fast speech.
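The quantities defined above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the totals (390 morae, 39.2 seconds, 16 segments) are the FIG. 2 values quoted in the text.

```python
KANJI_MORAE = 1.86  # morae counted per kanji or similar character, per the definitions above


def mora_count(kana: int, kanji: int) -> float:
    """Estimated mora count of a text with the given numbers of kana and kanji."""
    return kana + KANJI_MORAE * kanji


def speech_rate(total_morae: float, total_seconds: float) -> float:
    """Morae per second (the "/S" column)."""
    return total_morae / total_seconds


# Totals quoted for FIG. 2: 390 morae over 39.2 s in 16 segments.
print(round(speech_rate(390, 39.2), 2))  # 9.95 morae per second
print(round(390 / 16, 1))                # 24.4 morae per segment on average
print(round(39.2 / 16, 2))               # 2.45 s per segment on average
```

The three printed values match the rate and the two averages reported for FIG. 2.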
[0033]
In FIG. 3, the elapsed time (segment duration) averaged 2.26 seconds (minimum 0.9 seconds, maximum 3.5 seconds), and the mora count averaged 17.5 (minimum 9.5, maximum 25.2). The total mora count divided by the total elapsed time is 7.76 morae per second. This value reflects the speaking rate and indicates roughly standard speed.
[0034]
For these two program examples, the text was divided by hand as in FIGS. 2 and 3, the program audio was divided to match the divided text, and a simple transcription experiment was carried out while replaying each segment as many times as needed.
[0035]
As a result, apart from a few segments with long elapsed times, the audio was judged to be quite easy to hear and transcribe, and transcription was fast, confirming that division at an average of about 25 morae and about 2.5 seconds is appropriate.
[0036]
《Detection of speech / pause part in actual program audio》
However, what is actually transcribed is the speech in program audio, which has no punctuation or phrase marks and which, besides the target speech, contains a variety of background sounds, so the situation is entirely different.
[0037]
In processing program audio, the start and end times of each speech section would ideally be determined correctly and then used to divide the audio into segments of appropriate duration, as described above.
[0038]
In actual program audio, however, various background sounds overlap the target speech, and determining the start and end times of each speech section correctly is generally very difficult. For non-speech sections, on the other hand, silent sections are certain, and even when there is background sound, a section may be judged highly reliable depending on its level and length.
[0039]
Therefore, in the present invention pauses are detected by one of two methods. In the first method, a level signal indicating the amplitude level or power level of the input audio signal is analyzed to extract a specific frequency component that characterizes the fluctuation of speech at normal speed, the envelope waveform of that component is obtained, and pause sections are detected by applying a predetermined slice level to the envelope waveform signal. In the second method (the block cepstrum flux method), multiple LPC cepstrum vectors in the acoustic data are compared with one another starting from a reference frame, so that points where the content of the acoustic data changes are detected more stably.
[0040]
<Description of Embodiment>
FIG. 1 is a block diagram showing an embodiment of a transcription support device according to the present invention.
[0041]
The transcription support device shown in FIG. 1 comprises a pause detection unit 1 that detects pauses in the input audio signal, and a division unit 11 that, based on the detected pauses, divides the input audio signal into segments of a duration or character count that the transcriber can hold in short-term memory.
[0042]
The pause detection unit 1 comprises specific frequency component extraction means 3 that extracts, from a level signal indicating the amplitude level or power level of the input audio signal, a specific frequency component that characterizes the fluctuation of speech at normal speed, and pause section detection means 5 that obtains the envelope waveform of the extracted specific frequency component, applies a predetermined slice level to the envelope waveform signal, and detects pause sections at or below that level.
[0043]
The division unit 11 comprises: primary division means 13 that divides the input audio signal at the most reliable pauses among those detected by the pause detection unit 1; secondary division means 15 that, for any portion that after division by the primary division means 13 is still at least the predetermined duration or character count that can be held in short-term memory, divides the input audio signal at the next most reliable pauses; tertiary division means 17 that, for any portion that after division by the secondary division means is still at least that duration or character count, either divides the input audio signal at pauses of the next level of reliability after that or divides it mechanically at a predetermined duration or character count; and overlap means 19 that makes the preceding and following portions of the divided audio overlap by a predetermined duration or character count.
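As a rough illustration of what the overlap means 19 does, the sketch below moves the start of a segment earlier whenever the boundary in front of it is considered unreliable. The function name, the tuple layout, and the `reliable_start` flag are hypothetical; the default of 1 second follows the example value the description gives for the overlap, and treating boundaries from the primary division as reliable is an assumption.

```python
def add_overlap(segments, overlap=1.0):
    """Extend segment starts backwards across unreliable boundaries.

    segments: time-ordered list of (start, end, reliable_start) tuples, where
    reliable_start is True when the boundary before the segment came from a
    reliable (for example, long) pause.
    Returns a list of (start, end) tuples.
    """
    result = []
    for i, (start, end, reliable_start) in enumerate(segments):
        if i > 0 and not reliable_start:
            prev_start = segments[i - 1][0]
            # Replay the tail of the previous segment, but never reach back past its start.
            start = max(prev_start, start - overlap)
        result.append((start, end))
    return result


segments = [(0.0, 2.5, True), (2.5, 5.0, False), (6.0, 8.0, True)]
print(add_overlap(segments))  # [(0.0, 2.5), (1.5, 5.0), (6.0, 8.0)]
```

Only the second segment is extended, because only its leading boundary is flagged as unreliable; replaying the last second of the preceding segment guards against words lost at a misplaced cut.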
[0044]
Next, the operation of this embodiment is described separately for the processing of the pause detection unit 1 and the processing of the division unit 11.
[0045]
《Processing of pause detection unit 1》
FIGS. 4 to 7 are explanatory diagrams showing the procedure of the pause detection processing.
[0046]
In the flowchart of FIG. 4, a program audio signal with an amplitude waveform like that in FIG. 5(A) is first captured, and its amplitude is normalized (steps ST1, ST3). In this processing, only the low-frequency component of the envelope is extracted from the captured signal (step ST5), and the signal is scaled to a predetermined level based on the amplitude of that component. In general, when the level of the low-frequency component is high, the level of the high-frequency component can be expected to be high as well. Differences in level affect detection accuracy, so some degree of level normalization is needed.
[0047]
The normalized audio signal is then rectified (its absolute value is taken), giving an amplitude waveform folded onto the positive side as shown in FIG. 5(B) (step ST7); when Lch and Rch stereo signals are available, their sum and difference signals are used to expand the speech portion. The waveform in FIG. 6(A) shows the range marked by the arrow in FIG. 5(B) with the time axis magnified ten times.
[0048]
Next, digital high-pass filtering (step ST9) and digital band-pass filtering (step ST11) are applied to the rectified amplitude waveform signal. The high-pass filtering extracts, for example, frequencies of 2 Hz and above, and the band-pass filtering extracts, for example, 4 to 7 Hz. The waveform in FIG. 6(B) is the 4 to 7 Hz component.
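Rectification followed by 4 to 7 Hz band-pass filtering can be sketched as below. The text does not specify the filter design, so the second-order biquad used here (the widely used "cookbook" band-pass form with unity peak gain) is a stand-in chosen for illustration, and the 2 Hz high-pass step is omitted for brevity. All function names, and the synthetic test signal from `modulated_tone`, are hypothetical.

```python
import math


def rectify(samples):
    """Fold the waveform onto the positive side (absolute value)."""
    return [abs(x) for x in samples]


def bandpass_biquad(samples, fs, f_lo=4.0, f_hi=7.0):
    """Second-order band-pass centred on the geometric mean of f_lo and f_hi."""
    f0 = math.sqrt(f_lo * f_hi)
    q = f0 / (f_hi - f_lo)
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b0, b2 = alpha / a0, -alpha / a0  # the b1 coefficient is zero for this form
    a1, a2 = -2 * math.cos(w0) / a0, (1 - alpha) / a0
    out, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def modulated_tone(f_mod, fs, seconds, carrier=200.0):
    """Synthetic test signal: a carrier whose loudness fluctuates at f_mod Hz."""
    n = int(fs * seconds)
    return [
        (1.0 + math.sin(2 * math.pi * f_mod * i / fs))
        * math.sin(2 * math.pi * carrier * i / fs)
        for i in range(n)
    ]


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))
```

Feeding `bandpass_biquad(rectify(modulated_tone(5.0, 1000, 10.0)), 1000)` a tone whose loudness fluctuates at 5 Hz should give a much larger output than the same tone fluctuating at 0.5 Hz, which is the property the pause detector relies on. A production design would likely use a steeper filter than this single biquad.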
[0049]
It is known that when the fluctuation characteristics of speech along the time axis are compared with the phonetic symbol sequence of the speech, the voice power corresponding to vowel phonetic symbols tends to be larger than that of other parts. A prior application (Japanese Patent Application No. 2002-160255) has shown that, in speech at normal speed, the fluctuation of the signal waveform along the time axis has a frequency of about 4 to 7 Hz. This embodiment focuses on this fluctuation of the waveform along the time axis and detects pause portions by capturing its rough periodicity.
[0050]
A signal with a frequency of 4 to 7 Hz is extracted by the digital band-pass filtering, and applying absolute value processing to it (step ST12) yields a signal whose waveform approximates speech. Separately, arithmetic processing (step ST13) extracts the components other than the main speech components as the difference between the extracted signals, and this difference is subjected to absolute value processing (step ST15) and digital low-pass filtering (step ST17). The waveform of the low-frequency component of 0.5 Hz or less produced by the low-pass filtering consists mainly of components other than speech. In the level correction processing (step ST19), the level of the speech-like waveform signal is therefore corrected in the opposite direction to this component, with the result that the speech component is emphasized. FIG. 7 shows the speech approximation waveform after this processing.
[0051]
The level-corrected envelope waveform signal is displayed as a waveform on a display so that its suitability can be verified (step ST21). Then, as shown in FIG. 7, it is sliced at a predetermined slice level (threshold) (step ST23).
[0052]
Next, minute pause section removal processing is executed (step ST25). In this processing, sections in the sliced audio amplitude value signal that amount to no more than a slight fluctuation within a vowel or a breath are excluded from the detection targets. For example, the detection time range is set to about "0.2 to 2 seconds", and sections shorter than this are excluded from detection and removed. This effectively prevents the detection of meaningless, low-reliability pauses.
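The slicing and minute pause removal steps can be sketched as follows. This is a minimal illustration, assuming the envelope is available as a list of samples at a known rate; the function name `detect_pauses` and its parameters are hypothetical, and only the 0.2 second lower bound comes from the text.

```python
def detect_pauses(envelope, fs, slice_level, min_pause=0.2):
    """Return (start_s, end_s) pairs where the envelope stays below
    slice_level for at least min_pause seconds."""
    pauses = []
    start = None
    # A sentinel at the slice level closes a pause that runs to the end.
    for i, v in enumerate(list(envelope) + [slice_level]):
        below = v < slice_level
        if below and start is None:
            start = i                      # a candidate pause begins
        elif not below and start is not None:
            if (i - start) / fs >= min_pause:
                pauses.append((start / fs, i / fs))
            start = None                   # shorter candidates are dropped
    return pauses
```

Runs below the slice level that are shorter than `min_pause` are simply discarded, which corresponds to removing breath-length gaps from the detection result.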
[0053]
The detected pause sections are displayed on the screen (step ST27). To optimize the pause detection accuracy and the slice level setting (step ST23), the speech sections are actually measured and compared with the detection result, as shown in FIG. 7 (step ST29). By comparing the pause sections derived from the measured speech sections with the detected pause sections, the pause detection accuracy can be checked and the slice level can be adjusted to its optimum value.
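One way to make this comparison concrete is to score each candidate slice level by how often the binarized envelope agrees with the measured speech sections. The sketch below does this frame by frame. The agreement measure and the names `agreement` and `best_slice_level` are assumptions for illustration; the source does not say how the comparison is quantified.

```python
def agreement(detected, measured):
    """Fraction of frames on which two speech/pause flag sequences agree."""
    if len(detected) != len(measured):
        raise ValueError("sequences must have the same length")
    return sum(d == m for d, m in zip(detected, measured)) / len(measured)

def best_slice_level(envelope, measured_speech, candidates):
    """Pick the candidate level whose binarized envelope (True = speech)
    best matches the measured speech flags."""
    def score(level):
        return agreement([v >= level for v in envelope], measured_speech)
    return max(candidates, key=score)
```

A level that is too low marks nearly everything as speech, and one that is too high marks nearly everything as pause, so the score peaks somewhere in between.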
[0054]
<< Processing of Dividing Unit 11 >>
Once pauses have been detected as described above, the input audio signal is first divided at relatively long pauses, which can be said to be the most reliable, as shown in the flowchart of FIG. (step ST41). This processing is executed by the primary dividing means 13.
[0055]
For any portion resulting from this division that is longer than a predetermined time or a predetermined number of characters that the operator can hold in short-term memory, the input audio signal is divided at the next longest pauses (about 0.5 to 1 second) (steps ST43 and ST45). This processing is executed by the secondary dividing means 15.
[0056]
For any portion resulting from this division that is still longer than the predetermined time or number of characters that the operator can hold in short-term memory, the input audio signal is divided at the next longest pauses after those (for example, 0.2 to 0.5 seconds), or is divided mechanically at a predetermined time or number of characters (steps ST47 and ST49). This processing is executed by the tertiary dividing means 17.
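The three-stage division can be sketched as below. This is an illustration under stated assumptions, not the patented procedure itself: pauses are given as (start, end) times in seconds, each cut is placed at the midpoint of a pause (the source does not say where within a pause the cut falls), the tier thresholds of 1.0, 0.5 and 0.2 seconds and the 2.5 second limit are taken from the example values in the text, and the names `split_at` and `divide` are hypothetical.

```python
import math

def split_at(segment, cuts):
    """Split (start, end) at every cut point strictly inside it."""
    start, end = segment
    points = [start] + sorted(c for c in cuts if start < c < end) + [end]
    return list(zip(points[:-1], points[1:]))

def divide(total, pauses, max_len=2.5, tiers=(1.0, 0.5, 0.2)):
    """Divide [0, total] seconds at pauses, longest (most reliable) tier first."""
    segments = [(0.0, total)]
    for min_pause in tiers:
        cuts = [(s + e) / 2 for s, e in pauses if e - s >= min_pause]
        result = []
        for seg in segments:
            too_long = seg[1] - seg[0] > max_len
            # Only portions that are still too long are divided further.
            result.extend(split_at(seg, cuts) if too_long else [seg])
        segments = result
    # Mechanical fallback: equal pieces no longer than max_len.
    result = []
    for start, end in segments:
        n = max(1, math.ceil((end - start) / max_len))
        step = (end - start) / n
        points = [start + k * step for k in range(n)] + [end]
        result.extend(zip(points[:-1], points[1:]))
    return result
```

Because shorter pauses are consulted only for portions that remain too long, a portion that already fits within the limit is never cut at a less reliable pause.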
[0057]
When division is performed by the secondary dividing means 15 or the tertiary dividing means 17, whose division points may be less reliable, the overlapping means 19 overlaps the preceding and succeeding portions of the divided audio by a predetermined time or number of characters (step ST51). For example, providing an overlap of about 1 second prevents the transcriber from failing to catch speech, or leaving gaps in the transcript, because of unreliable pause detection.
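The overlap step can be illustrated with a small sketch. It assumes that the overlap is realized by starting the later portion early, so that it replays the tail of the preceding portion, and that the caller supplies one flag per boundary saying whether that boundary came from a less reliable division. The function name `add_overlap` and the flag list are hypothetical.

```python
def add_overlap(segments, unreliable, overlap=1.0):
    """Return playback ranges for the divided audio.

    unreliable[i] is True when the boundary between segment i and
    segment i + 1 came from a less reliable division; in that case
    segment i + 1 starts `overlap` seconds early.
    """
    ranges = []
    for i, (start, end) in enumerate(segments):
        if i > 0 and unreliable[i - 1]:
            prev_start = segments[i - 1][0]
            # Never reach back past the start of the preceding segment.
            start = max(prev_start, start - overlap)
        ranges.append((start, end))
    return ranges
```

Boundaries flagged as reliable are left untouched, so the overlap adds playback time only where a word might otherwise have been cut in two.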
[0058]
《Pause detection by block cepstrum flux method》
The block cepstrum flux (Block Cepstrum Flux) method, a second technique for detecting speech sections, is briefly described here.
[0059]
This method detects the points at which the content of the audio data changes more stably, by comparing a plurality of LPC cepstrum vectors in the audio data with one another, starting from a reference frame.
[0060]
The graph of FIG. 9 shows the results of analyzing, by both of these methods, the audio of an actual television program whose background sound level is considered to be average.
[0061]
In FIG. 9, (A) is the value processed by the block cepstrum flux method (cflx analysis value), (B) is the value extracted from the 4 to 7 Hz component of the audio envelope (by the first method), and (C) is data obtained by slicing the waveform of (B) at an appropriate level and binarizing it.
[0062]
In the waveform (C), the high-level ranges correspond to speech and the low-level ranges correspond to non-speech (pause) sections. In this case, it can be said that (C) agrees quite well with the actually measured speech sections.
[0063]
Therefore, the speech sections in the audio can be grasped fairly accurately from the waveform of (C). In the waveform of (C), P1 is a short pause and P3 is a relatively long pause. It can be said that P3 is the most reliable pause and P2 is the next most reliable pause.
[0064]
Note that the result of the block cepstrum flux method (A) was compared in the same way. It is slightly worse than that of the first method, but the method can be applied in some cases.
[0065]
<Division experiment using pause detection of actual program audio>
Places in the actual program audio where there is no speech, or breaks caused by breathing and the like (pauses for transcription), were detected by appropriate processing of the program audio. Based on these pauses, the speech to be transcribed was divided into units that are easy to work with, and its speed was reduced as necessary.
[0066]
As described above, a speech duration of about 2.5 seconds is desirable for ease of transcription. Because the target program has a considerably high level of background sound (music), the division here began by detecting relatively long pauses (for example, about 2 seconds or more) and dividing at them.
[0067]
FIG. 10 illustrates the detection of pause portions from the audio of the target program (which has background sound (music) at a considerably high level).
[0068]
The original audio waveform of the target program is shown at the bottom. Using the first technique described above, the 4 to 7 Hz frequency component was extracted from the envelope of this waveform, and its amplitude value is shown as the "processed audio waveform". The "processed audio waveform" was compared against an appropriate threshold level to obtain the "detected pauses". To evaluate the suitability of the "detected pauses", measured speech data is also displayed for reference. It can be said that at least relatively long pauses (for example, about 2 seconds or more) are correctly detected.
[0069]
Using the result of the pause detection, the first division was performed at pauses of 1 to 2 seconds, and the second division at pauses of 0.5 to 1 second. Any portion that still lasted about 2.5 seconds or more after this division was divided into pieces of about 2.5 seconds by one of the following methods.
[0070]
(1) Detect shorter pauses (for example, 0.2 to 0.5 seconds or more) and divide further at the detected pauses.
(2) Divide this part equally so that it takes about 3 seconds.
[0071]
In addition, at least the portions to which the processing of (1) or (2) is applied are given an overlap of about 1 second so that nothing is missed or left unheard. To mark the overlap, an overlap indication sound CH is inserted at its beginning so that the transcriber can recognize it.
[0072]
<Example of division / duplication processing>
An example of the actual processing is shown below as text. The underlined portions indicate the overlap processing. PP denotes a detected pause of 1 to 2 seconds or more, SP a detected pause of 0.5 to 1 second or more, and CH the overlap indication sound.
[0073]
Processing example of (1): 1-second overlap, 18 divisions
PP nature, there is always,Creature
CHCreatureThere is mystery and impression that we weave. SP
Come, travel to the earth in the middle of nature. SP
In search of new encounters,Special
CHSpecialIt is the beginning of the journey of the earth. PP
Today's stage is mediumIn Costa Rica
CHIn Costa RicaIt is a spreading tropical forest. SP
In this forest, manyCreatures
CHCreatures, I live with elaboration. SP
Above all, unique wisdomLiving with
CHLiving withThere is a capuchin capuchin monkey. PP
The wisdom of white-faced capuchin monkeys varies. SP
Tree nuts on stoneTighten, medium
CHTighten, mediumTake out the seeds. SP
In addition, with different types of monkeysLive in peace
CHLive in peaceThere is even the wisdom of bargaining. PP
White-faced capuchin monkey in a forest lifeshow, SP wisdom
CHShow, WisdomStared at a number of. PP
[0074]
Processing example of (2): equal division between PPs, 1-second overlap, 16 divisions
PP nature, there is always,Creatures
CreaturesThere is mystery and impression that weave.here we go
here we go, Into the middle of natureLand of living things
Land of living thingsBall travel. New outIn search of a meeting,
In search of a meeting,It is the beginning of a special earth journey. PP
Today's stage is Central America CostaSpread over mosquitoes
Spread over mosquitoesIt is a tropical forest.In this forest,
In this forest,Many creatures are creativeLiving hard
Living hardYou. Above all, GermanyWith special wisdom
With special wisdomThe giant capuchin monkey lives. PP
The wisdom of white-faced capuchin monkeys varies.wood
woodHit the stones on the stone, insideTake out the seeds
Take out the seedsYou. In addition,Different monkeys and flats
Different monkeys and flatsThere is even the wisdom of bargaining to live in harmony. PP
White-faced capuchin monkey living in the forestShow in
Show in, Staring at the wisdom. PP
[0075]
《Transcription experiment using divided program audio》
The experimental program was 78 seconds in total, and the speech part was 48 seconds (62%).
[0076]
A transcription experiment was conducted using the program audio divided by the methods described above. As shown in FIG. 11, the divided audio processed as described above was loaded into MIDIPLAYER (shareware software), and the operating functions of this software were used. Specifically, each divided portion was played back repeatedly, and when transcription of the hatched portion was completed, a key operation advanced playback to the next divided portion. FIG. 11 shows an example for the processing of (1).
[0077]
The results of the experiment using the software shown in FIG. were as follows.
Method (1): transcription time about 135 seconds; each divided portion repeated about 2.8 times
Method (2): transcription time about 152 seconds; each divided portion repeated about 3.2 times
It was therefore found that a dividing method that uses the detected pauses as much as possible is desirable.
[0078]
[Effects of the invention]
As described above, according to the present invention, the detected pauses are used to apply processing such as appropriate division and, where necessary, slowing down or repetition to the program audio signal, which makes the audio easier to hear and easier to transcribe.
[0079]
In addition, because the overlapping means overlaps the preceding and succeeding portions of the divided audio by a predetermined time or number of characters, it is possible to effectively prevent the transcriber from failing to catch speech or leaving gaps in the transcript.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an example of a transcription support device according to the present invention.
FIG. 2 is an explanatory diagram showing an example of a division experiment for transcribing a news program.
FIG. 3 is an explanatory diagram showing an example of division verified by transcribing a documentary program.
FIG. 4 is a flowchart illustrating a procedure of pause detection.
FIG. 5 is an explanatory diagram showing a program audio signal waveform and an amplitude level signal or a power level signal corresponding to the waveform.
FIG. 6 is an explanatory diagram showing the amplitude level signal or the power level signal shown in FIG. 5 with the time axis enlarged, and showing extracted component values in a characteristic frequency range.
FIG. 7 is an explanatory diagram showing an envelope waveform and an actual measurement pause (speech) portion of an amplitude level signal or a power level signal in a specific frequency range shown in FIG. 6.
FIG. 8 is an explanatory diagram showing a processing procedure for dividing program audio by a pause.
FIG. 9 is an explanatory diagram of speech approximation data obtained by specially processing an audio signal.
FIG. 10 is an explanatory diagram of pause portion detection.
FIG. 11 is an explanatory diagram illustrating an example of a computer screen used when reproducing the divided program audio using audio reproduction software.
[Explanation of symbols]
1 Pause detecting unit
3 Specific frequency component extraction means
5 Pause section detection means
11 Dividing unit
13 Primary dividing means
15 Secondary dividing means
17 Tertiary dividing means
19 Overlapping means

Claims

A transcription support device that supports the task of transcribing speech into text data while listening to the speech, comprising:
a pause detection unit that detects pauses in an input audio signal; and
a dividing unit that, based on the detected pauses, divides the input audio signal into portions of a time or number of characters that the transcriber can hold in short-term memory.

The transcription support device according to claim 1, wherein the dividing unit includes:
primary dividing means for dividing the input audio signal at the most reliable pauses among the pauses detected by the pause detection unit; and
secondary dividing means for dividing the input audio signal at the next most reliable pauses, for any portion resulting from the division by the primary dividing means that is at least a predetermined time or a predetermined number of characters that can be held in short-term memory.

The transcription support device according to claim 1 or 2, wherein the dividing unit includes:
tertiary dividing means for dividing the input audio signal at the next most reliable pauses after those, or mechanically at a predetermined time or number of characters, for any portion resulting from the division by the secondary dividing means that is at least a predetermined time or a predetermined number of characters that can be held in short-term memory.

The transcription support device according to any one of claims 1 to 3,
wherein the time that can be held in short-term memory is about 2.5 seconds, or the number of characters that can be held in short-term memory is about 25 mora.

The transcription support device according to any one of claims 1 to 4, wherein the dividing unit includes:
overlapping means for overlapping the preceding and succeeding portions of the divided audio by a predetermined time or number of characters.

The transcription support device according to any one of claims 1 to 5, wherein the pause detection unit includes:
specific frequency component extraction means for extracting, from a level signal indicating the amplitude level or power level of the input audio signal, a specific frequency component that reflects the fluctuation characteristics of speech at normal speed; and
pause section detection means for obtaining an envelope waveform of the extracted specific frequency component, setting a predetermined slice level for the obtained envelope waveform signal, and detecting pause sections.