JP4314376B2

JP4314376B2 - Writing support device

Info

Publication number: JP4314376B2
Application number: JP2003001352A
Authority: JP
Inventors: 英治沢村; 隆雄門馬; 則好浦谷; 克彦白井
Original assignee: NEC Corp; National Institute of Information and Communications Technology; NHK Engineering Services Inc; Japan Broadcasting Corp
Current assignee: NEC Corp; National Institute of Information and Communications Technology; Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2003-01-07
Filing date: 2003-01-07
Publication date: 2009-08-12
Anticipated expiration: 2023-01-07
Also published as: JP2004212799A

Abstract

PROBLEM TO BE SOLVED: To make program audio easier to listen to and transcribe by using detected pauses to apply processing such as appropriate division, speed reduction where necessary, and repetition to the program audio signal.

SOLUTION: A transcription support device, which supports transcribing text data from speech while the speech is being listened to, is equipped with a pause detection part 1 that detects pauses in an input speech signal and a division part 11 that, on the basis of the detected pauses, divides the input speech signal by a time or number of characters that the transcriber can hold in short-term memory.

COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、ポーズ部分の検出を利用して番組音声等の書起しを支援する書起し支援装置に関する。
【０００２】
［発明の概要］
本発明は、ポーズ部分の検出を利用した番組音声等の書起しを支援する装置に関するものであり、特に番組音声を聞いてこれを書起す際に、検出したポーズを利用して適切な分割を行い、必要に応じて低速化、あるいは繰返しなどの処理を番組音声信号に施し、聞取り易く、書起しし易いように支援するものである。
【０００３】
この処理を行うことにより、書起しに要する時間や作業に伴う緊張感や疲労を低減することができる。特に、この処理法を字幕用テキストの書起しに適用すると、字幕番組制作時間の短縮、作業の平易化による書起し作業者層の拡大が可能となり、字幕放送番組の拡大やコスト低減に寄与する。
【０００４】
【従来の技術】
字幕つきテレビ放送番組を受信者が利用する際、字幕が読みやすく、理解しやすいものであることが重要となる。そのため、字幕番組制作における字幕原稿作成では、熟練した人手を使い、多大の労力・時間をかけて、より読みやすく、理解しやすい字幕となるよう制作している。
【０００５】
【発明が解決しようとする課題】
しかしながら、今後適用番組分野・番組数などの拡大を進めている字幕放送において、この熟練した人手、多大の労力・時間を要するこのような形態の字幕番組制作システムでは、字幕番組制作上の大きなネックとなっており、その改善が急がれている。
【０００６】
現在最も多く行われている字幕番組制作形態では、タイムコードを映像にスーパーした番組テープと必要な場合は番組台本を素材とし、これを放送関係経験者など専門知識のある人によって、１）番組スピーチの要約書起し、２）字幕表示イメージ化（別途定めのある字幕原稿作成要領による）、および３）その開始・終了タイムコード記入を行い、字幕用の原稿を作成している。この字幕原稿を基に、オペレーターが電子化字幕データを作成し、担当の字幕制作責任者、原稿作成者、電子化したオペレーター立ち会いのもとで、試写・校正を行って完成字幕としている。
【０００７】
この作業の中で最も多くの時間を必要とするのは、１）の番組スピーチなどを聴取して、字幕原稿を書起し作成する点であり、ほとんどが人間の知能と手作業によっている。
【０００８】
番組スピーチの具体的な書起し作業例では、番組テープを再生操作して音声を聴取し、音声中のスピーチ開始点からテープを再生聴取しつつ、ワープロや筆記で書起しを行う。実際には、書起し作業者の書起し速度や内容確認などのため、一区切りのスピーチ区間を対象として録音テープの頭出し・再生操作を繰り返し、書起し作業が行われる。
【０００９】
したがって書起し作業は、テープの頭出し・再生操作の繰り返しといった煩雑なテープ操作と、スピーチの聴取確認、書起しといった人間の知能に負う負担の多い業務であり、必要な能力・労力を低減できる適切な支援機能が必要であり、その開発が望まれている。
【００１０】
本発明は上記事情に鑑み、検出したポーズを利用して適切な分割を行い、必要に応じて低速化、あるいは繰返しなどの処理を番組音声信号に施し、聞取り易く、書起しし易いように支援することのできる書起し支援装置を提供することを目的としている。
【００１１】
【課題を解決するための手段】
上記の目的を達成するために請求項１の発明は、音声を聞きつつ、その音声からテキストデータを書起す作業を支援する書起し支援装置であって、入力音声信号のポーズを検出するポーズ検出部と、検出されたポーズに基づいて、書起し作業者の短期記憶が可能な時間または前記時間から換算された文字数で前記入力音声信号を分割する分割部とを備え、前記分割部は、前記ポーズ検出部によって検出されたポーズの中で、最も信頼性の高いポーズにより前記入力音声信号を分割する第１次分割手段と、前記第１次分割手段による分割の結果、短期記憶が可能な所定の時間以上又は前記時間から換算された文字数以上である部分については、次に信頼性の高いポーズにより前記入力音声信号を分割する第２次分割手段と、前記第２次分割手段による分割の結果、短期記憶が可能な所定の時間以上又は前記時間から換算された文字数以上である部分については、次の次に信頼性の高いポーズにより前記入力音声信号を分割するか、または予め決定された時間若しくは前記時間から換算された文字数で機械的に分割する第３次分割手段とを備えたことを特徴としている。
【００１４】
請求項２の発明は、請求項１に記載の書起し支援装置において、前記短期記憶が可能な時間は、約２．５秒前後であり、または前記短期記憶が可能な前記期間から換算された文字数は、モーラ数で約２５モーラであることを特徴としている。
【００１５】
請求項３の発明は、請求項１または２に記載の書起し支援装置において、前記分割部は、分割された音声の前後部分を所定の時間または前記時間から換算された文字数だけ重複させる重複手段を含むことを特徴としている。
【００１６】
請求項４の発明は、請求項１乃至３のいずれかに記載の書起し支援装置において、前記ポーズ検出部は、前記入力音声信号の振幅レベルまたはパワーレベルを示すレベル信号から、通常の速度におけるスピーチの変動特性を示す特定周波数成分を抽出する特定周波数成分抽出手段と、抽出された特定周波数成分のエンベロープ波形を求め、求められたエンベロープ波形信号に対して所定のスライスレベルを設定してポーズ区間を検出するポーズ区間検出手段とから成ることを特徴としている。
【００１７】
【発明の実施の形態】
＜本発明の背景＞
前述したように、番組音声の書起し作業は、テープの頭出し・繰り返し再生といった煩雑なテープ操作や、スピーチの聴取確認、書起しといった人間の知能に負う負担の多い業務であり、必要な能力・労力を低減できる適切な支援機能が必要である。
【００１８】
例えば前記「一区切りのスピーチ区間」やその「繰り返し」については、これを一つの文とするとその前後には通常ポーズ部分があるので、このポーズを検出して自動的に区切りを付与し、必要回数繰返しを行うように番組音声を処理する。その結果、スピーチ区間は聞取り易く、書起しし易くなる。
【００１９】
一方、書起し不要な非スピーチ区間は自動的に除去できるので、書起し作業者はスピーチ部分の書起し作業に専念することができ、字幕用テキストの正確・迅速な作成に大いに貢献することができる。
【００２０】
また、この処理は必要な能力・労力の低減にも寄与するので、書起し作業者は一般的なワープロ作業者レベルの方々まで拡大できる可能性がある。
【００２１】
なお、ポーズの検出に関しては特願2002-160255「音声のスピーチ／ポーズ区間検出装置」で提案済みであり、このような手法（この手法を第１の手法として後述する）で検出したポーズを本件番組音声の分割などの処理に適用できる。
【００２２】
《番組音声分割処理の手法》
前記「一区切りのスピーチ区間」の意義をさらに検討すると、書起しのための適切な分割単位の設定と必要に応じて行う低速化処理は、この分割時間内にこの単位となるスピーチの書起し作業を完了出来るようにすることであり、分割と必要回数の繰り返しも同様である。
【００２３】
まず、聞き易さの点では、番組音声におけるスピーチのないところ、もしくは息継ぎなどによる途切れ箇所（書起しのためのポーズとする）を利用して分割するのが合理的である。
【００２４】
どの程度の長さに分割するかについては後述するが、この処理で所要の分割長に収まらない場合は、追加の分割処理を考える必要がある。
【００２５】
そのため、検出した比較的長いポーズ（例えば１〜２秒）でまず分割し、分割が足りない部分はより短いポーズ（例えば０．５〜１秒）で分割し、なお分割が足りない部分は他の方法（より短いポーズ、所定時間など）によることとした。
【００２６】
《書起しのための分割の長さについての考察》
書起しのための分割の長さについては、未知語や固有名詞なども含めて、短期記憶が可能である長さが望ましいと考えているが、簡単な実験からも妥当性がある。
【００２７】
短期記憶可能な長さの具体的なイメージは一義的には定めがたいものであるが、身近な事例として俳句や和歌、標語の例がある。これらは、かたかな（モーラ）で１７文字（モーラ）や３１文字（モーラ）であるので、一応２５文字（モーラ）程度を目標として検討した。これは、漢字の割合を1/3としたかな漢字混じり文では約１９文字になり、１モーラを0.1秒とすると時間は2.5秒に相当する。
【００２８】
聞取り易く、書起しし易いような番組音声の分割法として、分割の位置やここで上げた分割長としてのモーラ数や時間の妥当性を検証するために、まず実際の番組テキストを利用して検討を行った。
【００２９】
《実際の番組テキストを利用した分割法の検討》
図２は、ニュース番組を例として、手作業により書起しのための分割をしたものである。この図２に示す例は番組テキストなので、通常存在する句読点にまず着目するとともに、文節も考慮して分割した。
【００３０】
図３はドキュメンタリー番組を例として手作業による書起しのための分割例である。図３に示す例も同様に、まず句読点に着目するとともに、文節も考慮して分割した。ここで、両図で使用される語句を以下のように定義する。
【００３１】
開始：対応する番組音声の開始時間
終了：対応する番組音声の終了時間
時間：対応する番組音声の経過時間
カナ：テキストのカナ数
漢字：テキストの漢字等の数
モーラ：テキストのモーラ数（漢字等は、1.86モーラとして計算）
／Ｓ：モーラ数／経過時間で１秒あたりのモーラ数である。
【００３２】
図２の場合、経過時間つまり時間長は平均2.45秒（最小：1.3秒、最大：3.4秒）であり、モーラ数は平均24.4（最小：12.3、最大：33.3で390/16）であった。総モーラ数（モーラ数の合計）/総経過時間（経過時間の合計）は390/39.2=9.95であり、これはスピーチの速度と関係する値で、やや早いスピーチである。
【００３３】
図３の場合、経過時間つまり時間長は平均2.26秒（最小：0.9秒、最大：3.5秒）であり、モーラ数は平均17.5（最小：9.5、最大：25.2）であった。総モーラ数（モーラ数の合計）/総経過時間（経過時間の合計）は7.76であり、これはスピーチの速度と関係する値で、ほぼ標準のスピードである。
【００３４】
これら二つの番組例について、図２、図３のようにテキストを手作業で分割し、この分割テキストに対応するよう番組音声を分割し、必要回数繰り返し再生しながら、簡単な書起し実験を行った。
【００３５】
その結果、一部の長い経過時間部分を除き、かなり聞取り易く、書起しし易く、しかも速いとの評価が得られ、平均25モーラ、2.5秒程度での分割の妥当性を確認することが出来た。
【００３６】
《実際の番組音声におけるスピーチ／ポーズ部分の検出》
しかし、実際の書起しの対象は番組音声中のスピーチであり、句読点や文節があるわけでなく、さらに目的のスピーチだけでなくさまざまな背景音があり、まったく事情が異なる。
【００３７】
番組音声の処理では、望ましくは正しく各スピーチ区間の開始・終了時間を求め、この開始・終了時間を基に前記の説明に準じて適切な時間長となるよう分割することとなる。
【００３８】
しかし、実際の番組音声では目的のスピーチだけでなく様々な背景音が重なっており、各スピーチ区間の開始・終了時間を正しく求めることは一般には至難なことである。一方、非スピーチ区間の場合は、無音区間は確実であり、背景音がある場合でもそのレベル、区間長によっては信頼性が大きいと判断できる可能性がある。
【００３９】
そこで、本発明では、入力音声信号の振幅レベルまたはパワーレベルを示すレベル信号の分析により、通常の速度におけるスピーチの変動特性を示す特定周波数成分を抽出し、抽出された特定周波数成分のエンベロープ波形を求め、求められたエンベロープ波形信号に対して所定のスライスレベルを設定してポーズ区間を検出する第１の手法、または音響データ内の複数のＬＰＣケプストラムベクトルを、基準フレームから相互に比較することで、音響データ内容の切り替わり点を、より安定に検出する第２の手法（ブロックケプストラムフラックス法）によりポーズを検出するようにしている。
【００４０】
＜実施の形態の説明＞
図１は、本発明に係る書起し支援装置の実施形態を示すブロック図である。
【００４１】
同図に示す書起し支援装置は、入力音声信号のポーズを検出するポーズ検出部１と、検出されたポーズに基づいて、書起し作業者の短期記憶が可能な時間または文字数で前記入力音声信号を分割する分割部１１とを備えている。
【００４２】
ポーズ検出部１は、入力音声信号の振幅レベルまたはパワーレベルを示すレベル信号から、通常の速度におけるスピーチの変動特性を示す特定周波数成分を抽出する特定周波数成分抽出手段３と、抽出された特定周波数成分のエンベロープ波形を求め、求められたエンベロープ波形信号に対して所定のスライスレベルを設定してそのレベル以下のポーズ区間を検出するポーズ区間検出手段５とから構成されている。
【００４３】
また、分割部１１は、ポーズ検出部１によって検出されたポーズの中で、最も信頼性の高いポーズにより前記入力音声信号を分割する第１次分割手段１３と、この第１次分割手段１３による分割の結果、短期記憶が可能な所定の時間以上または所定の文字数以上である部分については、次に信頼性の高いポーズにより前記入力音声信号を分割する第２次分割手段１５と、この第２次分割手段による分割の結果、短期記憶が可能な所定の時間以上または所定の文字数以上である部分については、次の次に信頼性の高いポーズにより前記入力音声信号を分割するか、または予め決定された時間若しくは文字数で機械的に分割する第３次分割手段１７と、分割された音声の前後部分を所定の時間または文字数だけ重複させる重複手段１９とから構成されている。
【００４４】
次に、この実施の形態の作用をポーズ検出部１の処理と、分割部１１の処理に分けて説明する。
【００４５】
《ポーズ検出部１の処理》
図４乃至図７はポーズ検出処理の手順を示す説明図である。
【００４６】
図４のフローチャートにおいて、先ず、図５（Ａ）に示すような振幅波形を持つ番組音声信号を取り込んで、音声振幅値の基準化処理を実行する（ステップＳＴ１，ＳＴ３）。この処理では、取り込まれた番組音声信号中からそのエンベロープの低域成分のみを取り出して（ステップＳＴ５）、この低域成分の振幅値を基として信号成分を所定レベルの大きさにする処理である。一般に、低域成分のレベルが大きい場合には高域成分のレベルも大きいと考えられる。レベルに違いがあると検出精度に影響を与えるので、ある程度のレベル基準化を図る必要があるからである。
【００４７】
こうして音声振幅値の基準化がされた音声信号は、次に、絶対値化処理がされて、図５（Ｂ）に示すような＋側に折り返した振幅値波形信号（Ｌｃｈ，Ｒｃｈのステレオ音声信号がある場合は、その和と差の信号を利用してスピーチ部分を伸長する）と成る（ステップＳＴ７）。図６（Ａ）の波形は、図５（Ｂ）に矢印で示した範囲の時間軸を１０倍に拡大して示したものである。
【００４８】
次いで、絶対値化された振幅値波形信号からディジタル高域ろ波処理（ステップＳＴ９）、ディジタル帯域ろ波処理（ステップＳＴ１１）が実行される。ディジタル高域ろ波処理では、例えば２Ｈｚ以上の周波数の信号が取り出され、ディジタル帯域ろ波処理では、例えば４〜７Ｈｚの周波数の信号が取り出される。図６（Ｂ）の波形は、４〜７Ｈｚ成分の信号波形である。
【００４９】
スピーチに関する時間軸方向の変動特性をスピーチの発音記号列と比較すると、母音の発音記号に対応する音声パワーが他の部分よりも大きくなる傾向があることが知られている。そして、通常速度のスピーチにおける信号波形の時間軸方向への変動は、４〜７Ｈｚ程度の周波数となっていることを先願で示した（特願２００２−１６０２５５）。この実施形態は、この波形の時間軸方向の変動に着目し、大まかな周期性を捉えることでポーズ部分の検出をする。
【００５０】
ディジタル帯域ろ波処理で４〜７Ｈｚの周波数の信号が取り出され、さらに絶対値化処理（ステップＳＴ１２）をすると、スピーチに類似した波形の信号となる。一方、演算処理（ステップＳＴ１３）が実行されると、これら取り出された信号の差分として主なスピーチ成分以外の成分が抽出され、さらに絶対値化処理（ステップＳＴ１５）、ディジタル低域ろ波処理（ステップＳＴ１７）が実行される。レベル補正処理（ステップＳＴ１９）では、低域ろ波処理で生成された０．５Ｈｚ以下の低域成分の波形は主にスピーチ成分以外の成分が多いので、このレベルを参照して、ステップＳＴ１２出力のスピーチ類似波形信号のレベルを逆方向に補正し、結果的にスピーチ成分を強調する処理がされる。この処理後のスピーチ近似波形を図７に示す。
【００５１】
こうして、レベル補正がされたエンベロープ波形信号は、その適否検証のためディスプレイ上に波形表示される（ステップＳＴ２１）。そして、図７に示すように、所定のスライスレベル（閾値）でスライスされる（ステップＳＴ２３）。
【００５２】
次に、微小ポーズ区間除去処理が実行される（ステップＳＴ２５）。この処理では、スライス処理された音声振幅値信号中から、例えば、ちょっとした母音区間の変動や息継ぎ程度の区間は検出対象から除外するためである。検出時間範囲として、例えば、“０．２〜２秒”程度を設定し、それ以下の時間を検出対象外として除去する処理である。これにより、意味を持たない信頼性が低いポーズ検出が効果的に防止できる。
【００５３】
こうして検出されたポーズ区間が画面表示される（ステップＳＴ２７）。ポーズの検出精度やスライス（ステップＳＴ２３）レベル設定の最適化などの目的でスピーチ区間を実測し、図７に示すように、実測スピーチ区間と比較される（ステップＳＴ２９）。これにより、実測されたスピーチ区間から導かれるポーズ区間と、検出されたポーズ区間とが比較され、比較によって、ポーズ検出精度をチェックしたり、スライスレベルが最適となるように変更することができる。
【００５４】
《分割部１１の処理》
上述のようにしてポーズが検出されると、次に、図８のフローチャートに示すように、比較的長いポーズが一番信頼性が高いといえるので、先ず、１〜２秒程度以上の長いポーズにより入力音声信号を分割する（ステップＳＴ４１）。この処理は第１次分割手段１３で実行される。
【００５５】
この分割の結果、作業者にとって短期記憶が可能な所定の時間以上または所定の文字数以上である部分については、次に長いポーズ（０．５〜１秒程度）により入力音声信号を分割する（ステップＳＴ４３，ＳＴ４５）。この処理は第２次分割手段１５で実行される。
【００５６】
この分割の結果、作業者にとって短期記憶が可能な所定の時間以上または所定の文字数以上である部分については、次の次に長いポーズ（例えば０．２〜０．５秒）により前記入力音声信号を分割するか、または予め決定された時間若しくは文字数で機械的に分割する（ステップＳＴ４７，ＳＴ４９）。この処理は第３次分割手段１７で実行される。
【００５７】
また、分割部１１における分割に際しては、信頼性上問題がある可能性のある分割手段１５，１７による分割の場合には、重複手段１９により、分割された音声の前後部分を所定の時間または文字数だけ重複させる（ステップＳＴ５１）。例えば１秒程度の重複部分を設けることにより、ポーズ検出の信頼性の低さによって引起こされる書起し作業者の聴取不能や書起しの欠落を防止することができる。
【００５８】
《ブロックケプストラムフラックス法によるポーズの検出》
スピーチ区間を検出する第２の手法としてのブロックケプストラムフラックス（Block Cepstrum Flux）法について簡単に説明する。
【００５９】
この手法は、音響データ内の複数のＬＰＣケプストラムベクトルを、基準フレームから相互に比較することで、音響データ内容の切り替わり点を、より安定に検出するものである。
【００６０】
図９のグラフには、平均的な背景音レベルと考えられる実際のテレビ番組の音声をこの両手法で分析処理した結果を示した。
【００６１】
図９において、（A）はブロックケプストラムフラックス法による処理値（cflx解析値）、（B）は音声エンベロープの４〜７Hz成分抽出（第１の手法による）値、（C）は（B）の波形を適切なレベルでスライスし、２値化したデータである。
【００６２】
この（C）の波形では、高レベル範囲はスピーチ、低レベル範囲は非スピーチ（ポーズ）区間に対応するものとし、この場合の（C）は、実測したスピーチ区間とかなりよく合致しているといえる。
【００６３】
したがって、（C）の波形から音声中のスピーチ区間をある程度正確に把握することができる。(C)の波形中でＰ１は短いポーズ、Ｐ３は比較的長いポーズであり、このＰ３が最も信頼性の高いポーズ、Ｐ２が次に信頼性の高いポーズといえる。
【００６４】
なお、（A）のブロックケプストラムフラックス法（第２の手法）についても同様にして比較したが、第１の手法よりやや悪い結果であるが、場合によっては適用可能である。
【００６５】
＜実際の番組音声のポーズ部分検出を利用した分割実験＞
実際の番組音声におけるスピーチのないところ、もしくは息継ぎなどによる途切れ箇所を番組音声の適切な処理により検出（書起しのためのポーズとする）し、このポーズなどを目安にして書き起こすべきスピーチを作業がし易い単位に分割し、また必要に応じて低速化処理などを行った。
【００６６】
書起し作業がし易いスピーチの時間は前記のように２．５秒程度が望ましいので、この程度となるよう分割した。対象にした番組はかなり高レベルの背景音（音楽）があるため、ここの分割手法は、まず比較的長いポーズ（例えば２秒程度以上）を検出して分割した。
【００６７】
図１０は、対象にした番組の音声（かなり高レベルの背景音（音楽）がある）からのポーズ部分検出を説明したものである。
【００６８】
対象番組のオリジナル音声波形を下端に示したが、前述した第１の手法を用いて、この波形のエンベロープからその４〜７Ｈｚの周波数範囲成分を抽出し、その振幅値を「処理後の音声波形」として示した。この「処理後の音声波形」を適当なスレシホルドレベルで大小を比較し、「検出ポーズ」とした。この「検出ポーズ」の適否を評価するために、参考に実測スピーチのデータも表示した。少なくとも比較的長いポーズ（例えば２秒程度以上）は正しく検出されているといえる。
【００６９】
このポーズ検出結果を利用し、ポーズが１〜２秒での第１分割、０．５〜１秒の第２分割を行った。この分割でも２．５秒程度以上となる部分は、以下のいずれかの手法で２．５秒程度となるよう分割した。
【００７０】
（１）より短いポーズ（例えば０．２〜０．５秒以上）を検出し、これを利用してさらに分割。
（２）３秒程度となるように、この部分を等分割。
【００７１】
なお、少なくとも（１）または（２）の処理を適用した部分は、１秒程度の重複部分を設けて、欠落や聴取不能を回避するようにした。また重複部分を示すため、その部分の冒頭に重複表示音ＣＨを挿入して重複であることが分かるようにした。
【００７２】
＜分割・重複処理例＞
実際の処理例を文章で示すと下記のようになる。下線部分は重複処理を示す。また、PP ：１〜２秒以上の検出ポーズ、SP：0.５〜１秒以上の検出ポーズ、CH：重複表示音（下線部分は重複部分）をそれぞれ示す。
【００７３】
（１）の処理例１秒重複 18分割
PP大自然、そこにはいつも、生きもの
CH生きものたちが織りなす不思議と感動があります。SP
さあ、大自然の真っただ中へ、生きもの地球紀行。SP
新しい出会いを求めて、とっておきの
CHとっておきの地球の旅の始まりです。PP
今日の舞台は中米コスタリカに
CH米コスタリカに広がる熱帯の森です。SP
この森では、多くの生きものたちが
CH生きものたちが、工夫を凝らして生きています。SP
中でも、独特の知恵をもって暮らして
CHをもって暮らしているのが、ノドジロオマキザルです。PP
ノドジロオマキザルの知恵は様々です。SP
木の実を石にたたきつけて、中
CHきつけて、中の種を取り出します。SP
更に、種類の違うサルと平和に暮らす
CH平和に暮らす、駆け引きの知恵まであるのです。PP
ノドジロオマキザルが、森の暮らしの中で見せる、 SP 知恵
CH見せる、知恵の数々を見つめました。PP
【００７４】
（２）処理例 PP間等分割、１秒重複 16分割
PP大自然、そこにはいつも、生きものたち
生きものたちが織りなす不思議と感動があります。さあ
さあ、大自然の真っただ中へ、生きもの地
生きもの地球紀行。新しい出会いを求めて、
会いを求めて、とっておきの地球の旅の始まりです。PP
今日の舞台は中米コスタリカに広がる
カに広がる熱帯の森です。この森では、
この森では、多くの生きものたちが、工夫を凝らして生きてい
凝らして生きています。中でも、独特の知恵をも
特の知恵をもって暮らしているのが、ノドジロオマキザルです。PP
ノドジロオマキザルの知恵は様々です。木
木の実を石にたたきつけて、中の種を取り出し
の種を取り出します。更に、種類の違うサルと平
違うサルと平和に暮らす、駆け引きの知恵まであるのです。PP
ノドジロオマキザルが、森の暮らしの中で見せる
の中で見せる、知恵の数々を見つめました。PP
【００７５】
《分割番組音声を利用した書起し実験》
実験対象にした番組は全体で７８秒、スピーチ部分は４８秒（62％）である。
【００７６】
前記の分割方法による番組音声を利用して書起し実験を行った。実験は、図１１に示すように、MIDIPLAYER（シェアウエアソフト）に上記のように処理した分割音声を入れ、このソフトの操作機能を利用して行った。具体的には、各分割音声は繰り返し再生状態とし、ハッチした部分の書起しが完了するとキー操作で次の分割音声に進むようになっている。図１１は上記（１）の処理例を示す。
【００７７】
図１１のソフトにより実験した結果、
（１）の方法による書起し時間：約１３５秒各分割音声の繰り返し回数約2.8回
（２）の方法による書起し時間：約１５２秒各分割音声の繰り返し回数約3.2回
であり、一応検出したポーズをできるだけ活用する分割方法が望ましいことが分かった。
【００７８】
【発明の効果】
以上説明したように本発明によれば、検出したポーズを利用すると番組音声の適切な分割、必要ならば低速化、あるいは繰返しなどの処理を番組音声信号に施し、聞取り易く、書起しし易いように支援することが可能となる。
【００７９】
また、重複手段により、分割された特定音声の前後部分を所定の時間または文字数だけ重複させるようにしているので、書起し作業者の聴取不能や書起しの欠落を効果的に防止することができる。
【図面の簡単な説明】
【図１】本発明に係る書起し支援装置の一例を示すブロック図である。
【図２】ニュース番組の書起しのための分割実験例を示す説明図である。
【図３】ドキュメンタリー番組の書起しで検証された分割例を示す説明図である。
【図４】ポーズ検出の手順を示すフローチャートである。
【図５】番組音声信号波形と、この波形に対応する振幅レベル信号またはパワーレベル信号を示す説明図である。
【図６】図５に示した振幅レベル信号またはパワーレベル信号を時間軸を拡大して示すと共に特定周波数範囲の抽出成分値を示す説明図である。
【図７】図６に示した特定周波数範囲の振幅レベル信号またはパワーレベル信号のエンベロープ波形と実測ポーズ（スピーチ）部分を示す説明図である。
【図８】ポーズによる番組音声の分割の処理手順を示す説明図である。
【図９】音声信号を特殊処理したスピーチ近似データの説明図である。
【図１０】ポーズ部分検出の説明図である。
【図１１】分割された番組音声を音声再生ソフトを使用して再生する際に使用されるコンピュータの画面例を示す説明図である。
【符号の説明】
１ポーズ検出部
３特定周波数成分抽出手段
５ポーズ区間検出手段
１１分割部
１３第１次分割手段
１５第２次分割手段
１７第３次分割手段
１９重複手段
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a transcription support apparatus that uses detection of pause portions to support the transcription of program audio and the like.
[0002]
[Summary of Invention]
The present invention relates to an apparatus that supports the transcription of program audio and the like by using detection of pause portions. In particular, when an operator listens to program audio and transcribes it, the apparatus uses the detected pauses to divide the program audio signal appropriately and, as necessary, applies processing such as slowing down or repetition, so that the audio is easy to listen to and easy to transcribe.
[0003]
This processing reduces the time required for transcription and the tension and fatigue that accompany the work. In particular, applying it to the transcription of subtitle text shortens subtitle program production time and, by making the work easier, broadens the pool of people who can do transcription, which contributes to the expansion of subtitled broadcast programs and to cost reduction.
[0004]
[Prior art]
When viewers use a subtitled TV broadcast program, it is important that the subtitles be easy to read and understand. For this reason, subtitle manuscripts for subtitle program production are prepared by skilled personnel, who spend a great deal of labor and time making the subtitles easier to read and understand.
[0005]
[Problems to be solved by the invention]
However, subtitle broadcasting is set to expand in both the range of program genres and the number of programs it covers, and a production system of this form, which requires skilled personnel and a great deal of labor and time, has become a major bottleneck in subtitle program production. Improvement is urgently needed.
[0006]
In the most common form of subtitle program production today, the materials are a program tape with time code superimposed on the video and, where necessary, the program script. A person with specialist knowledge, such as someone with broadcasting experience, uses these to 1) transcribe a summary of the program speech, 2) lay it out as subtitle display images (following separately defined guidelines for preparing subtitle manuscripts), and 3) enter the start and end time codes, thereby producing a subtitle manuscript. Based on this manuscript, an operator creates the digitized subtitle data, which is then previewed and proofread in the presence of the person responsible for subtitle production, the manuscript author, and the operator who digitized it, to yield the finished subtitles.
[0007]
The most time-consuming part of this work is step 1), listening to the program speech and transcribing it into a subtitle manuscript, which depends almost entirely on human intelligence and manual work.
[0008]
In a concrete example of transcribing program speech, the operator plays the program tape and listens to the audio, then plays the tape from the point where speech begins and transcribes it with a word processor or by hand while listening. In practice, because of the operator's transcription speed and the need to check the content, the operator repeatedly cues and replays the tape for each single speech segment while transcribing.
[0009]
Transcription is therefore a demanding task that combines cumbersome tape operations, such as repeated cueing and playback, with work that relies heavily on human intelligence, such as listening to and confirming speech and writing it down. An appropriate support function that reduces the ability and labor required is needed, and its development is desired.
[0010]
In view of the above circumstances, an object of the present invention is to provide a transcription support apparatus that uses the detected pauses to divide the program audio signal appropriately and, as necessary, applies processing such as slowing down or repetition, so that the audio is easy to listen to and easy to transcribe.
[0011]
[Means for Solving the Problems]
To achieve the above object, the invention of claim 1 is a transcription support apparatus that supports the work of transcribing text data from audio while listening to that audio, comprising: a pause detection unit that detects pauses in an input audio signal; and a division unit that, based on the detected pauses, divides the input audio signal by a time that the transcription operator can hold in short-term memory or by a number of characters converted from that time, wherein the division unit comprises: primary division means for dividing the input audio signal at the most reliable pauses among the pauses detected by the pause detection unit; secondary division means for dividing, at the next most reliable pauses, any portion that after division by the primary division means is still at least the predetermined time that can be held in short-term memory or at least the number of characters converted from that time; and tertiary division means for dividing any portion that after division by the secondary division means is still at least that predetermined time or number of characters, either at the pauses of the next level of reliability after that or mechanically by a predetermined time or a number of characters converted from that time.
[0014]
The invention of claim 2 is the transcription support apparatus according to claim 1, wherein the time that can be held in short-term memory is approximately 2.5 seconds, or the number of characters converted from that time is approximately 25 mora.
[0015]
The invention of claim 3 is the transcription support apparatus according to claim 1 or 2, wherein the division unit includes overlapping means for making the portions before and after each division point overlap by a predetermined time or by a number of characters converted from that time.
[0016]
The invention of claim 4 is the transcription support apparatus according to any one of claims 1 to 3, wherein the pause detection unit comprises: specific frequency component extraction means for extracting, from a level signal indicating the amplitude level or power level of the input audio signal, a specific frequency component that characterizes the fluctuation of speech at normal speed; and pause section detection means for obtaining an envelope waveform of the extracted specific frequency component and detecting pause sections by applying a predetermined slice level to the obtained envelope waveform signal.
[0017]
DETAILED DESCRIPTION OF THE INVENTION
<Background of the present invention>
As described above, transcribing program audio is a demanding task that combines cumbersome tape operations, such as cueing and repeated playback, with work that relies heavily on human intelligence, such as listening to and confirming speech and writing it down. An appropriate support function that reduces the ability and labor required is therefore needed.
[0018]
For example, consider the "single speech segment" and its "repetition" mentioned above. If such a segment is taken to be one sentence, there is normally a pause before and after it, so the program audio is processed by detecting these pauses, automatically marking the boundaries, and repeating each segment the required number of times. As a result, the speech segments become easy to listen to and easy to transcribe.
[0019]
Meanwhile, non-speech sections that need no transcription can be removed automatically, so the operator can concentrate on transcribing the speech portions, which contributes greatly to accurate and rapid creation of subtitle text.
[0020]
Moreover, since this processing also reduces the ability and labor required, the pool of transcription operators could potentially be widened to people at the level of ordinary word-processor operators.
[0021]
Pause detection has already been proposed in Japanese Patent Application No. 2002-160255, "Speech/Pause Section Detection Device for Audio", and pauses detected by that method (described later as the first method) can be applied to processing such as the division of program audio in the present case.
[0022]
《Program audio division processing method》
Looking further at the significance of the "single speech segment": setting an appropriate division unit for transcription, and slowing the audio down where necessary, is meant to let the operator finish transcribing the speech in one unit within the duration of that unit. The same holds for division combined with the required number of repetitions.
[0023]
First, in terms of ease of listening, it is reasonable to divide at places in the program audio where there is no speech, or at breaks caused by breathing and the like (treated here as pauses for transcription).
[0024]
How long each division should be is discussed later, but where this processing does not bring a portion within the required division length, additional division processing needs to be considered.
[0025]
Accordingly, the audio is first divided at the relatively long detected pauses (for example, 1 to 2 seconds); portions that are still too long are divided at shorter pauses (for example, 0.5 to 1 second); and portions that are still too long after that are divided by another method (even shorter pauses, a fixed time, and so on).
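The tiered division described in this paragraph can be sketched roughly as follows. This is an illustrative sketch only, not the implementation in the patent: the function names, the representation of pauses as `(start, end)` pairs in seconds, the choice to cut at the midpoint of each pause, and the decision to apply the shortest-pause tier first and fall back to equal division afterwards are all assumptions made for the example. The 2.5-second target length and the pause thresholds are taken from the text.

```python
import math

def split_by_pauses(start, end, pauses, min_pause):
    """Cut [start, end] at the midpoint of every pause at least min_pause long.

    Cutting at the midpoint is an assumption; the text only says the audio
    is divided "at" the pauses.
    """
    cuts = sorted(
        (p0 + p1) / 2.0
        for p0, p1 in pauses
        if p1 - p0 >= min_pause and start < (p0 + p1) / 2.0 < end
    )
    bounds = [start] + cuts + [end]
    return list(zip(bounds[:-1], bounds[1:]))

def divide(total, pauses, max_len=2.5, tiers=(1.0, 0.5, 0.2)):
    """Three-tier division followed by a mechanical fallback (hypothetical API)."""
    # Tier 1: the longest (most reliable) pauses are applied to the whole signal.
    segments = split_by_pauses(0.0, total, pauses, tiers[0])
    # Tiers 2 and 3: only pieces that are still too long are divided further.
    for min_pause in tiers[1:]:
        refined = []
        for s, e in segments:
            if e - s > max_len:
                refined.extend(split_by_pauses(s, e, pauses, min_pause))
            else:
                refined.append((s, e))
        segments = refined
    # Fallback: equal division of anything that is still longer than max_len.
    final = []
    for s, e in segments:
        n = max(1, math.ceil((e - s) / max_len))
        step = (e - s) / n
        bounds = [s + i * step for i in range(n)] + [e]
        final.extend(zip(bounds[:-1], bounds[1:]))
    return final

if __name__ == "__main__":
    demo_pauses = [(2.0, 3.0), (5.0, 5.6), (7.4, 7.7)]
    for seg in divide(10.0, demo_pauses):
        print(seg)
```

In this sketch every returned piece is at most `max_len` seconds long, and pieces are contiguous, so nothing in the input is dropped.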
[0026]
《Consideration about the length of division for transcription》
As for the length of each division for transcription, a length that can be held in short-term memory, even when it includes unknown words and proper nouns, is considered desirable, and simple experiments also support this.
[0027]
It is difficult to pin down exactly what length can be held in short-term memory, but haiku, waka, and slogans are familiar examples. Written in kana, these have 17 or 31 characters (mora), so a target of about 25 characters (mora) was adopted for this study. This corresponds to about 19 characters in mixed kana-kanji text in which kanji make up one third of the characters, and to 2.5 seconds if one mora takes 0.1 seconds.
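The conversion between mora, mixed-script characters, and seconds can be reproduced with a short calculation. The sketch below assumes that one kana counts as one mora (a simplification) and uses the figure of 1.86 mora per kanji that the description gives later, in the legend at [0031]; the function names are hypothetical.

```python
MORA_PER_KANJI = 1.86    # value stated in the figure legend at [0031]
MORA_PER_KANA = 1.0      # assumption: one kana is counted as one mora
SECONDS_PER_MORA = 0.1   # value stated in the paragraph above

def mixed_text_chars(mora, kanji_ratio=1.0 / 3.0):
    """Number of kana-kanji characters needed to carry the given mora count."""
    mora_per_char = kanji_ratio * MORA_PER_KANJI + (1.0 - kanji_ratio) * MORA_PER_KANA
    return mora / mora_per_char

def duration_seconds(mora):
    """Speaking time for the given mora count at 0.1 s per mora."""
    return mora * SECONDS_PER_MORA

if __name__ == "__main__":
    print(mixed_text_chars(25))   # a little over 19 characters
    print(duration_seconds(25))   # 2.5 seconds
```

With one third kanji, each character carries about 1.29 mora on average, which is where the figure of roughly 19 characters for 25 mora comes from.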
[0028]
To verify the division positions, and the mora count and duration proposed here as the division length, for a method of dividing program audio so that it is easy to listen to and transcribe, a study was first carried out using actual program text.
[0029]
<< Examination of division method using actual program text >>
FIG. 2 shows a news program divided by hand for transcription. Because the example in FIG. 2 is program text, it was divided by first looking at the punctuation marks normally present and also taking phrase boundaries into account.
[0030]
FIG. 3 shows an example of manual division for transcription, using a documentary program. As with FIG. 2, it was divided by first looking at punctuation marks and also taking phrase boundaries into account. The terms used in both figures are defined as follows.
[0031]
Start: start time of the corresponding program audio
End: end time of the corresponding program audio
Time: elapsed time of the corresponding program audio
Kana: number of kana in the text
Kanji: number of kanji and similar characters in the text
Mora: number of mora in the text (each kanji or similar character counted as 1.86 mora)
/S: mora count divided by elapsed time, that is, mora per second.
[0032]
In the case of FIG. 2, the elapsed time (segment duration) averaged 2.45 seconds (minimum 1.3 seconds, maximum 3.4 seconds), and the mora count averaged 24.4 (390/16; minimum 12.3, maximum 33.3). The total mora count divided by the total elapsed time is 390/39.2 = 9.95, a value related to speaking rate that indicates somewhat fast speech.
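The figures quoted for FIG. 2 can be cross-checked with a few lines of arithmetic. The totals (390 mora, 16 segments, 39.2 seconds) come from the paragraph above; the variable names are only illustrative.

```python
total_mora = 390       # sum of mora over all segments in FIG. 2
segment_count = 16     # number of segments in FIG. 2
total_seconds = 39.2   # sum of elapsed time over all segments

mean_mora = total_mora / segment_count        # average mora per segment
mean_seconds = total_seconds / segment_count  # average segment duration
mora_per_second = total_mora / total_seconds  # overall speaking rate

if __name__ == "__main__":
    print(round(mean_mora, 1), round(mean_seconds, 2), round(mora_per_second, 2))
```

The three values agree with the averages and the speaking rate stated in the text when rounded to the same precision.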
[0033]
In the case of FIG. 3, the elapsed time (segment duration) averaged 2.26 seconds (minimum 0.9 seconds, maximum 3.5 seconds), and the mora count averaged 17.5 (minimum 9.5, maximum 25.2). The total mora count divided by the total elapsed time is 7.76, a value related to speaking rate that indicates roughly standard speed.
[0034]
For these two program examples, the text was divided by hand as in FIG. 2 and FIG. 3, the program audio was divided to match the divided text, and a simple transcription experiment was carried out while replaying each piece as many times as needed.
[0035]
As a result, apart from a few portions with long elapsed times, the audio was judged to be quite easy to listen to, easy to transcribe, and fast to work with, confirming that division at an average of about 25 mora, or about 2.5 seconds, is appropriate.
[0036]
<< Speech / pause detection in actual program audio >>
However, what actually has to be transcribed is the speech within program audio, which has no punctuation marks or phrase boundaries and, besides the target speech, contains all kinds of background sound, so the situation is entirely different.
[0037]
In processing program audio, ideally the start and end times of each speech section would be determined correctly, and the audio would then be divided into pieces of appropriate duration based on those times, in line with the description above.
[0038]
However, in actual program audio the target speech is overlaid with various background sounds, and it is generally extremely difficult to determine the start and end times of each speech section correctly. By contrast, for non-speech sections, silent sections are certain, and even where background sound is present a section may be judged highly reliable depending on its level and length.
[0039]
Therefore, in the present invention, pauses are detected by one of two methods. The first method analyzes a level signal indicating the amplitude level or power level of the input audio signal to extract a specific frequency component that characterizes the fluctuation of speech at normal speed, obtains the envelope waveform of that component, and applies a predetermined slice level to the envelope waveform signal to detect pause sections. The second method (the block cepstrum flux method) compares multiple LPC cepstrum vectors in the acoustic data against one another, starting from a reference frame, in order to detect more stably the points where the content of the acoustic data changes.
[0040]
<Description of Embodiment>
FIG. 1 is a block diagram showing an embodiment of a transcription support apparatus according to the present invention.
[0041]
The transcription support apparatus shown in the figure comprises a pause detection unit 1 that detects pauses in the input audio signal and a division unit 11 that, based on the detected pauses, divides the input audio signal by a time or a number of characters that the transcription operator can hold in short-term memory.
[0042]
The pause detection unit 1 includes a specific frequency component extracting unit 3 that extracts a specific frequency component indicating a fluctuation characteristic of speech at a normal speed from a level signal indicating the amplitude level or power level of the input audio signal, and the extracted specific frequency. It comprises a pause section detecting means 5 that obtains a component envelope waveform, sets a predetermined slice level for the obtained envelope waveform signal, and detects a pause section below that level.
[0043]
The division unit 11 comprises: primary division means 13, which divides the input audio signal at the most reliable pauses among those detected by the pause detection unit 1; secondary division means 15, which divides, at the next most reliable pauses, any portion that after division by the primary division means 13 is still at least the predetermined time or predetermined number of characters that can be held in short-term memory; tertiary division means 17, which divides any portion that after division by the secondary division means is still at least that predetermined time or number of characters, either at the pauses of the next level of reliability after that or mechanically by a predetermined time or number of characters; and overlapping means 19, which makes the portions before and after each division point overlap by a predetermined time or number of characters.
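The role of the overlapping means 19 can be illustrated with a small sketch. It assumes that segments are `(start, end)` pairs in seconds and that the caller knows which segments begin at a low-reliability cut; the function name, the index-set parameter, and the 1-second default are choices made for this example, and moving the start of the later segment backwards is one possible reading of "overlap", not necessarily how the apparatus realizes it.

```python
def with_overlap(segments, low_confidence_starts, overlap=1.0):
    """Extend selected segments backwards so that they replay the tail of the
    preceding audio.

    segments: list of (start, end) times in seconds.
    low_confidence_starts: indices of segments whose start boundary came from
        a less reliable cut (hypothetical bookkeeping, not from the source).
    """
    result = []
    for index, (start, end) in enumerate(segments):
        if index in low_confidence_starts:
            start = max(0.0, start - overlap)  # never reach before the signal start
        result.append((start, end))
    return result

if __name__ == "__main__":
    pieces = [(0.0, 2.5), (2.5, 3.9), (3.9, 5.3)]
    print(with_overlap(pieces, {2}))
```

Only the flagged segment changes; segments that start at a reliable pause are played back exactly as divided.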
[0044]
Next, the operation of this embodiment will be described separately for the process of the pause detection unit 1 and the process of the division unit 11.
[0045]
<< Processing of Pause Detection Unit 1 >>
FIGS. 4 to 7 are explanatory diagrams showing the procedure of the pause detection processing.
[0046]
In the flowchart of FIG. 4, a program audio signal with an amplitude waveform like that shown in FIG. 5(A) is first captured, and its amplitude is normalized (steps ST1, ST3). In this processing, only the low-frequency component of the envelope is extracted from the captured program audio signal (step ST5), and the signal is scaled to a predetermined level based on the amplitude of that low-frequency component. In general, when the level of the low-frequency component is large, the level of the high-frequency component can also be expected to be large. Because differences in level affect detection accuracy, some degree of level normalization is needed.
[0047]
The audio signal whose amplitude has been normalized in this way is next converted to absolute values, yielding an amplitude waveform signal folded over to the positive side as shown in FIG. 5(B) (step ST7). Where a stereo signal with Lch and Rch is available, its sum and difference signals are used to expand the speech portion. The waveform in FIG. 6(A) shows the range marked by the arrow in FIG. 5(B) with the time axis enlarged tenfold.
[0048]
Next, digital high-pass filtering (step ST9) and digital band-pass filtering (step ST11) are applied to the absolute-value amplitude waveform signal. The digital high-pass filtering extracts the signal at frequencies of, for example, 2 Hz and above, and the digital band-pass filtering extracts the signal at frequencies of, for example, 4 to 7 Hz. The waveform in FIG. 6(B) is the signal waveform of the 4 to 7 Hz component.
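As a rough illustration of extracting the 4 to 7 Hz component from a rectified level signal, the sketch below uses a single second-order band-pass section with the widely used "audio EQ cookbook" coefficient formulas. The patent does not state which filter design is used, so the filter type, the 100 Hz level-signal sampling rate, and all function names here are assumptions for the example.

```python
import math

def bandpass_biquad(x, fs, f_lo=4.0, f_hi=7.0):
    """Second-order band-pass (0 dB peak gain) centred between f_lo and f_hi."""
    f0 = math.sqrt(f_lo * f_hi)        # geometric centre frequency
    q = f0 / (f_hi - f_lo)             # quality factor from the bandwidth
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b0, b1, b2 = alpha, 0.0, -alpha
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = (b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y

def rms(x):
    """Root-mean-square value of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def level_signal(mod_hz, fs=100, seconds=10):
    """Synthetic rectified level signal whose loudness fluctuates at mod_hz."""
    n = int(fs * seconds)
    return [abs(1.0 + math.sin(2.0 * math.pi * mod_hz * i / fs)) for i in range(n)]

if __name__ == "__main__":
    fs = 100
    for mod_hz in (0.5, 5.0):
        out = bandpass_biquad(level_signal(mod_hz, fs), fs)
        print(mod_hz, rms(out[fs:]))   # skip the first second (filter transient)
```

A level signal fluctuating at about 5 Hz passes almost unchanged, while a slow 0.5 Hz fluctuation is strongly attenuated, which is the selectivity the band-pass stage relies on.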
[0049]
When the time-axis fluctuation of speech is compared with the phonetic symbol sequence of that speech, the audio power corresponding to vowel symbols is known to tend to be larger than elsewhere. The prior application (Japanese Patent Application No. 2002-160255) showed that, for speech at normal speed, the signal waveform fluctuates along the time axis at a frequency of about 4 to 7 Hz. This embodiment focuses on that fluctuation and detects pause portions by capturing its rough periodicity.
[0050]
When a signal having a frequency of 4 to 7 Hz is extracted by digital band filtering and further subjected to absolute value processing (step ST12), a signal having a waveform similar to speech is obtained. On the other hand, the arithmetic processing (step ST13) takes the difference between these extracted signals to extract the components other than the main speech component, which are then subjected to absolute value processing (step ST15) and digital low-pass filtering processing (step ST17). The waveform of the low-frequency component of 0.5 Hz or less generated by the low-pass filtering process mainly contains components other than the speech component. In the level correction process (step ST19), the level of the speech-like waveform signal is corrected in the direction opposite to this waveform, and as a result the speech component is enhanced. The speech approximate waveform after this processing is shown in FIG.
[0051]
The level-corrected envelope waveform signal is then displayed on the display so that its suitability can be verified (step ST21). Next, as shown in FIG. 7, the signal is sliced at a predetermined slice level (threshold) (step ST23).
[0052]
Next, a minute pause section removal process is executed (step ST25). This process excludes from detection, for example, slight gaps between vowels and gaps of about the length of a breath in the sliced audio amplitude value signal. The detection time range is set to, for example, about 0.2 to 2 seconds, and pauses detected outside this range are removed. This effectively prevents the detection of pauses that are meaningless or of low reliability.
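The slicing of step ST23 and the minute pause removal of step ST25 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name, the representation of the envelope as a list of samples, and the reading of "0.2 to 2 seconds" as an inclusive duration filter are assumptions.

```python
def detect_pauses(envelope, fs, slice_level, min_len=0.2, max_len=2.0):
    """Return (start_s, end_s) for runs below the slice level.

    Only runs whose duration lies within [min_len, max_len] seconds are kept;
    shorter gaps (for example between vowels) and longer runs are discarded.
    """
    pauses = []
    start = None
    # A sentinel sample at the slice level closes any run reaching the end.
    for i, level in enumerate(list(envelope) + [slice_level]):
        if level < slice_level:
            if start is None:
                start = i                 # a candidate pause begins here
        elif start is not None:
            duration = (i - start) / fs
            if min_len <= duration <= max_len:
                pauses.append((start / fs, i / fs))
            start = None
    return pauses
```

For example, at an assumed envelope rate of 10 samples per second, a 0.1 second dip is ignored while a 0.5 second dip is reported as a pause.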
[0053]
The detected pause sections are displayed on the screen (step ST27). To check the pause detection accuracy and to optimize the slice level used in step ST23, the speech sections are actually measured and compared with the detection result as shown in FIG. 7 (step ST29). That is, the pause sections derived from the actually measured speech sections are compared with the detected pause sections, so that the pause detection accuracy can be checked and the slice level optimized.
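One way to make the comparison of step ST29 quantitative is to measure how much of the detected pause time coincides with the measured pause time. The sketch below is only an illustration: the two ratios (here called precision and recall) and the function names are assumptions, since the embodiment describes the comparison but not a specific measure.

```python
def overlap_seconds(a, b):
    """Total time covered by both interval lists (each non-overlapping)."""
    total = 0.0
    for a_start, a_end in a:
        for b_start, b_end in b:
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total

def detection_scores(detected, measured):
    """Return (precision, recall) of detected pauses against measured pauses."""
    hit = overlap_seconds(detected, measured)
    detected_len = sum(end - start for start, end in detected)
    measured_len = sum(end - start for start, end in measured)
    precision = hit / detected_len if detected_len else 0.0  # share of detections that are real
    recall = hit / measured_len if measured_len else 0.0     # share of real pauses that were found
    return precision, recall
```

Repeating this calculation for several candidate slice levels and keeping the level with the best scores would be one possible way to carry out the optimization mentioned above.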
[0054]
<< Processing of Dividing Unit 11 >>
When pauses have been detected as described above, the input audio signal is first divided, as shown in the flowchart of the division processing, at the relatively long pauses, which are the most reliable (step ST41). This process is executed by the primary dividing means 13.
[0055]
After this division, any portion that is still longer than a predetermined time, or a predetermined number of characters, that the worker can hold in short-term memory is further divided at the next longest pauses (about 0.5 to 1 second) (steps ST43 and ST45). This process is executed by the secondary dividing means 15.
[0056]
After this second division, any portion that is still longer than the predetermined time, or the predetermined number of characters, that the worker can hold in short-term memory is divided at the next longest pauses (for example, 0.2 to 0.5 seconds), or is divided mechanically by a predetermined time or number of characters (steps ST47 and ST49). This process is executed by the third dividing means 17.
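The three-stage division can be sketched as follows. The pause thresholds (1 second, 0.5 seconds, 0.2 seconds) and the 2.5 second limit follow the values given in this description; cutting at the midpoint of each pause, applying the mechanical division only after the shortest pauses have been tried, and all names are assumptions of this sketch.

```python
import math

# Minimum pause length (s) used by the primary, secondary and third stages.
STAGE_THRESHOLDS = (1.0, 0.5, 0.2)

def split_at(segment, cut_points):
    """Split one (start, end) segment at the cut points lying strictly inside it."""
    start, end = segment
    inner = sorted(c for c in cut_points if start < c < end)
    edges = [start] + inner + [end]
    return list(zip(edges[:-1], edges[1:]))

def divide(duration, pauses, max_len=2.5):
    """Divide [0, duration] so that no segment exceeds max_len seconds."""
    segments = [(0.0, duration)]
    for stage, min_pause in enumerate(STAGE_THRESHOLDS):
        # Assumption: a cut is placed at the midpoint of each qualifying pause.
        cuts = [(s + e) / 2 for s, e in pauses if e - s >= min_pause]
        refined = []
        for seg in segments:
            too_long = seg[1] - seg[0] > max_len
            # The primary stage always applies; later stages refine long parts only.
            if stage == 0 or too_long:
                refined.extend(split_at(seg, cuts))
            else:
                refined.append(seg)
        segments = refined
    result = []
    for start, end in segments:
        # Mechanical fallback: equal parts, each no longer than max_len.
        n = math.ceil((end - start) / max_len) if end - start > max_len else 1
        step = (end - start) / n
        result.extend((start + i * step, start + (i + 1) * step) for i in range(n))
    return result
```

For a 10 second signal with pauses of 1 second, 0.5 seconds and 0.25 seconds, the sketch first cuts at the longest pause, then refines only the parts that are still longer than 2.5 seconds.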
[0057]
When the dividing unit 11 performs division by the secondary dividing means 15 or the third dividing means 17, whose reliability may be a problem, the overlapping means 19 overlaps the front and rear parts of the divided speech by a predetermined time or number of characters (step ST51). For example, providing an overlapping portion of about 1 second prevents the transcription worker from mishearing or omitting speech because of the low reliability of pause detection.
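The overlap of step ST51 can be illustrated with the short sketch below, in which every segment that starts at a less reliable cut is extended backwards so that it replays the end of the preceding segment. The function name, the set of reliable cut times, and the choice to extend only the start of the later segment are assumptions of this example.

```python
def add_overlap(segments, reliable_cuts, overlap=1.0):
    """Move the start of segments that follow an unreliable cut back by `overlap` seconds.

    segments      -- consecutive (start_s, end_s) tuples
    reliable_cuts -- set of cut times (s) produced by the primary division
    """
    out = []
    for i, (start, end) in enumerate(segments):
        if i > 0 and start not in reliable_cuts:
            previous_start = segments[i - 1][0]
            # Never reach back beyond the beginning of the preceding segment.
            start = max(previous_start, start - overlap)
        out.append((start, end))
    return out
```

With an overlap of 1 second, a segment starting at 3.875 seconds after an unreliable cut would be played from 2.875 seconds, while a segment starting at a reliable cut is left unchanged.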
[0058]
《Pause detection by block cepstrum flux method》
A brief description will be given of the block cepstrum flux method as a second method for detecting a speech interval.
[0059]
In this method, a plurality of LPC cepstrum vectors in the acoustic data are compared with one another, starting from a reference frame, so that the points at which the content of the acoustic data changes are detected more stably.
[0060]
The graph of FIG. 9 shows the result of analyzing the sound of an actual television program that is considered to be an average background sound level by both methods.
[0061]
In FIG. 9, (A) is the value processed by the block cepstrum flux method (cflx analysis value), (B) is the 4 to 7 Hz component extracted from the speech envelope (by the first method), and (C) is the data obtained by slicing the waveform of (B) at an appropriate level and binarizing it.
[0062]
In the waveform of (C), the high-level ranges correspond to speech and the low-level ranges to the non-speech (pause) sections. In this case, (C) can be said to match the measured speech sections fairly well.
[0063]
Therefore, the speech sections in the audio can be grasped accurately from the waveform (C). In the waveform of (C), P1 is a short pause and P3 is a relatively long pause; P3 is the most reliable pause, and P2 is the next most reliable pause.
[0064]
The block cepstrum flux method (second method) of (A) was also compared in the same manner. Although the result is slightly worse than the first method, it can be applied in some cases.
[0065]
<Division experiment using pause detection of actual program audio>
By appropriate processing of the program audio, places where there is no speech, or breaks due to breathing and the like, are detected in the actual program audio and used as pauses for transcription. Using these pauses as a guide, the speech to be transcribed is divided into units that are easy to work with, and slowed down as necessary.
[0066]
As described above, the speech duration that makes the transcription work easy is desirably about 2.5 seconds. Since the target program has a fairly high level of background sound (music), the division method used here first detects relatively long pauses (for example, about 2 seconds or more) and divides at them.
[0067]
FIG. 10 illustrates pause portion detection from the sound of the target program (there is a fairly high level background sound (music)).
[0068]
The original audio waveform of the target program is shown at the bottom. Using the first method described above, the 4 to 7 Hz frequency component is extracted from the envelope of this waveform, and its amplitude value is shown as the “processed audio waveform”. The “processed audio waveform” was compared in magnitude with an appropriate threshold level to obtain the “detected pauses”. To evaluate the suitability of the “detected pauses”, the actually measured speech data is also displayed for reference. It can be said that at least the relatively long pauses (for example, about 2 seconds or more) are correctly detected.
[0069]
Using this pause detection result, a first division at pauses of 1 to 2 seconds and a second division at pauses of 0.5 to 1 second were performed. Any portion that was still about 2.5 seconds or longer after these divisions was further divided to about 2.5 seconds by one of the following methods.
[0070]
(1) A shorter pause (for example, 0.2 to 0.5 seconds or more) is detected and further divided using this.
(2) Divide this part equally so that it takes about 3 seconds.
[0071]
In addition, at least a portion to which the processing of (1) or (2) was applied was provided with an overlapping portion of about 1 second so as to avoid omission and inaudibility. Further, in order to indicate an overlapping portion, an overlapping display sound CH is inserted at the beginning of the portion so that it can be seen that the portion is overlapping.
[0072]
<Example of division / duplication processing>
An actual processing example is shown in the following text. PP denotes a detected pause of 1 to 2 seconds or more, SP a detected pause of 0.5 to 1 second or more, and CH the overlapping display sound. The underlined portions are the overlapping portions.
[0073]
(1) Processing example: 1-second overlap, 18 divisions
PP Nature, there are always creatures
There is a wonder and excitement that the CH creatures weave. SP
Now, go to the middle of nature and travel to the earth. SP
Seeking a new encounter, aside of
CH is the beginning of a special earth trip. PP
The US Costa Rica medium is the stage of today
CH Rice is a tropical forest that spreads over Costa Rica . SP
In this forest, many creatures
CH creatures are living with ingenuity. SP
Above all, living with unique wisdom
The white-tailed capuchin monkey lives with CH. PP
The wisdom of capuchin capuchin monkeys varies. SP
And care Tata nuts in stone, medium
Tighten CH and take out the seeds inside . SP
Furthermore, live in peace with different kinds of monkeys
There is even the wisdom of bargaining that lives in CH peace . PP
Throat Giro capuchin is, look in the life of the forest, SP wisdom
I looked at the wisdom of CH. PP
[0074]
(2) Processing example: equal division between PPs, 1-second overlap, 16 divisions
PP Nature, there are always living things
There is a wonder and excitement that the creatures weave. here we go
Now , in the midst of nature, a living place
Creatures land ball travelogue. Seeking to meet new out,
In search of a meeting, it is the beginning of a special earth trip. PP
Today's stage is spread to Central America Costa Rica
A tropical forest that stretches over mosquitoes . In this forest
In this forest, many living creatures live with ingenuity.
I'm alive . Above all, even the wisdom of the German special
The lives it also the wisdom of the special is the throat Giro capuchin monkeys. PP
The wisdom of capuchin capuchin monkeys varies. wood
And slammed the fruit of the tree in stone, take out the seeds in
Take out the seeds . Furthermore, different types of monkeys and flat
Different live monkeys and peace, but there's up to the wisdom of bargaining. PP
Throat Giro capuchin is, look in the life of the forest
I watched a lot of wisdom to show in. PP
[0075]
《Transcription experiment using divided program audio》
The programs targeted for the experiment are 78 seconds in total, and the speech part is 48 seconds (62%).
[0076]
A transcription experiment was performed using the program audio divided by the above methods. As shown in FIG. 11, the divided audio processed as described above was loaded into MIDIPLAYER (shareware software) and played back using the operation functions of this software. Specifically, each divided sound is reproduced repeatedly, and when transcription of the hatched portion is completed, a key operation advances to the next divided sound. FIG. 11 shows the processing example of (1) above.
[0077]
The experiment with this software gave the following results.
Method (1): transcription time approx. 135 seconds; number of repetitions of each divided voice about 2.8
Method (2): transcription time approx. 152 seconds; number of repetitions of each divided voice about 3.2
It was found that a division method that uses the detected pauses as much as possible is desirable.
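As a quick check of these figures, the sketch below divides each transcription time by the 48 seconds of speech stated for the target program. The resulting factors, about 2.8 and 3.2, are close to the reported repetition counts; reading them this way is an observation from the arithmetic rather than a statement made in this description, and the variable and function names are illustrative.

```python
SPEECH_SECONDS = 48.0   # speech portion of the 78-second target program
transcription_seconds = {"(1)": 135.0, "(2)": 152.0}

def real_time_factor(transcription_s, speech_s=SPEECH_SECONDS):
    """Transcription time expressed as a multiple of the speech duration."""
    return transcription_s / speech_s

# Factor per method, rounded to one decimal place.
factors = {m: round(real_time_factor(t), 1) for m, t in transcription_seconds.items()}

# Relative time saved by method (1) compared with method (2).
saving = 1.0 - transcription_seconds["(1)"] / transcription_seconds["(2)"]
```

On these numbers, method (1) needs roughly 11% less transcription time than method (2).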
[0078]
[Effects of the invention]
As described above, according to the present invention, the detected pauses are used to apply processing such as appropriate division, and speed reduction or repetition as necessary, to the program audio signal, so that transcription can be supported by making the audio easy to hear and easy to transcribe.
[0079]
In addition, since the overlapping means overlaps the front and rear parts of specific divided speech by a predetermined time or number of characters, it is possible to effectively prevent the transcription worker from mishearing or omitting part of the speech.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an example of a transcription support apparatus according to the present invention.
FIG. 2 is an explanatory diagram showing an example of a division experiment for transcription of a news program.
FIG. 3 is an explanatory diagram showing an example of division verified by transcription of a documentary program.
FIG. 4 is a flowchart showing a procedure for pause detection.
FIG. 5 is an explanatory diagram showing a program audio signal waveform and an amplitude level signal or a power level signal corresponding to the waveform.
FIG. 6 is an explanatory diagram showing the amplitude level signal or the power level signal shown in FIG. 5 with the time axis enlarged, and the extracted component values in the characteristic frequency range.
FIG. 7 is an explanatory diagram showing an envelope waveform and an actually measured pause (speech) portion of an amplitude level signal or a power level signal in a specific frequency range shown in FIG. 6.
FIG. 8 is an explanatory diagram showing a processing procedure for dividing program audio by a pause;
FIG. 9 is an explanatory diagram of speech approximation data obtained by specially processing an audio signal.
FIG. 10 is an explanatory diagram of pause portion detection.
FIG. 11 is an explanatory diagram illustrating a screen example of a computer used when a divided program sound is played back using sound playback software.
[Explanation of symbols]
1 Pause detection unit
3 Specific frequency component extraction means
5 Pause section detection means
11 Dividing unit
13 Primary dividing means
15 Secondary dividing means
17 Third dividing means
19 Overlapping means

Claims

A transcription support device that supports a task of transcribing text data from speech while listening to the speech, comprising:
A pause detection unit that detects pauses in an input audio signal; and
A dividing unit that divides the input audio signal, based on the detected pauses, by a time for which a transcription worker can hold the speech in short-term memory or by the number of characters converted from that time;
wherein the dividing unit comprises:
Primary dividing means for dividing the input audio signal at the most reliable pauses among the pauses detected by the pause detection unit;
Secondary dividing means for dividing the input audio signal at the next most reliable pauses, for any portion that, as a result of the division by the primary dividing means, is longer than a predetermined time for which short-term memory is possible or exceeds the number of characters converted from that time; and
Tertiary dividing means for dividing the input audio signal at the next most reliable pauses, or mechanically dividing it by a predetermined time or by the number of characters converted from that time, for any portion that, as a result of the division by the secondary dividing means, is longer than a predetermined time for which short-term memory is possible or exceeds the number of characters converted from that time.

The transcription support device according to claim 1,
wherein the time for which short-term memory is possible is about 2.5 seconds, or the number of characters converted from the time for which short-term memory is possible is about 25 morae.

In the transcription support apparatus according to claim 1 or 2 ,
The dividing unit is
Including overlapping means for overlapping the front and rear portions of the divided speech by a predetermined time or the number of characters converted from the time,
A transcription support device characterized by that.

The transcription support apparatus according to any one of claims 1 to 3 ,
The pause detection unit
Specific frequency component extraction means for extracting a specific frequency component indicating a fluctuation characteristic of speech at a normal speed from a level signal indicating an amplitude level or a power level of the input audio signal;
Pause section detection means for obtaining an envelope waveform of the extracted specific frequency component, setting a predetermined slice level for the obtained envelope waveform signal, and detecting a pause section;
A transcription support device characterized by comprising: