JP2018045256A

JP2018045256A - Subtitle production device and subtitle production method

Info

Publication number: JP2018045256A
Application number: JP2017247280A
Authority: JP
Inventors: 和利渕上; Kazutoshi Fuchigami; 弘幸勝見; Hiroyuki Katsumi; 進一渡辺; Shinichi Watanabe
Original assignee: Faith Inc
Current assignee: Faith Inc
Priority date: 2017-12-25
Filing date: 2017-12-25
Publication date: 2018-03-22
Anticipated expiration: 2035-08-20
Also published as: JP6485977B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique for efficiently producing subtitles.
SOLUTION: A speech recognition unit 30 recognizes target audio 10, or audio obtained by respeaking the target audio 10, and converts it into text. A text division/combination processing unit 40 divides the recognized text to create subtitle text. A keyboard correction unit 60 corrects the subtitle text. A delay unit 82 outputs a plurality of delayed audio signals obtained by delaying the target audio 10 by different predetermined times. A delay changeover switch 84 switches among the delayed audio signals output by the delay unit 82 and provides the selected one to the keyboard correction unit 60, based on an instruction from the person correcting the subtitle text.
SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique for producing subtitles from audio.

In recent years, subtitle broadcasting for the hearing impaired has been provided in television broadcasting such as terrestrial, BS, and CS broadcasting. For real-time subtitles added to live broadcasts such as news and sports coverage, the mainstream production method today is immediate keyboard input by skilled operators: while listening to the broadcast or studio audio, an operator types what is being said as it is spoken. This kind of subtitle production requires several skilled specialist operators, so reducing its cost is a challenge.

JP 2001-142482 A; JP 2006-119534 A

In the immediate input method, several operators divide the incoming audio among themselves in time sequence and take turns typing it. However, immediate keyboard input (stenographic typing) demands a high level of skill, so operators must be trained over a long period, which requires investment. Moreover, because several people type in turn, the operators must coordinate intuitively with one another, which is another reason long training is needed, and operators expect compensation that reflects their skill.

Speech recognition can be used to produce text instead of keyboard input, but its recognition rate is not 100%, and the recognition results must be corrected quickly.

Whether immediate input or speech-recognition-based transcription is used, both rely on specialized skills, so operators are in short supply, and training new operators incurs personnel development costs. As a result, high costs are currently unavoidable in subtitle production.

The present invention has been made in view of these problems, and its object is to provide a technique for efficiently producing subtitles.

To solve the above problem, a subtitle production device according to one aspect of the present invention includes: a speech recognition unit that recognizes target audio, or audio obtained by respeaking the target audio, and converts it into text; a division processing unit that divides the recognized text to generate subtitle text; a correction unit that corrects the subtitle text; a delay unit that outputs a plurality of delayed audio signals obtained by delaying the target audio by different predetermined times; and a switching unit that, according to an instruction from a corrector of the subtitle text, switches among the delayed audio signals output by the delay unit and provides the selected one to the correction unit.

Another aspect of the present invention is also a subtitle production device. This device includes: a speech recognition unit that recognizes target audio, or audio obtained by respeaking the target audio, and converts it into text; a division processing unit that divides the recognized text to generate subtitle text; a correction unit that corrects the subtitle text; a delay unit that outputs delayed audio obtained by delaying the target audio by a predetermined time; an audio playback unit that plays back an audio file in which the recognized audio is recorded; and a switching unit that, according to an instruction from a corrector of the subtitle text, switches between the delayed audio output by the delay unit and the playback audio of the audio file output by the audio playback unit, and provides the selected one to the correction unit.

Yet another aspect of the present invention is a subtitle production method. This method includes: a speech recognition step of recognizing target audio, or audio obtained by respeaking the target audio, and converting it into text; a division processing step of dividing the recognized text to generate subtitle text; a correction step of correcting the subtitle text; a delay step of outputting a plurality of delayed audio signals obtained by delaying the target audio by different predetermined times; and a switching step of, according to an instruction from a corrector of the subtitle text, switching among the delayed audio signals output in the delay step and providing the selected one to the correction step.

Yet another aspect of the present invention is also a subtitle production method. This method includes: a speech recognition step of recognizing target audio, or audio obtained by respeaking the target audio, and converting it into text; a division processing step of dividing the recognized text to generate subtitle text; a correction step of correcting the subtitle text; a delay step of outputting delayed audio obtained by delaying the target audio by a predetermined time; an audio playback step of playing back an audio file in which the recognized audio is recorded; and a switching step of, according to an instruction from a corrector of the subtitle text, switching between the delayed audio output in the delay step and the playback audio of the audio file output in the audio playback step, and providing the selected one to the correction step.

Any combination of the above constituent elements, and any conversion of the expression of the present invention among a method, a device, a system, a computer program, a data structure, a recording medium, and the like, are also effective as aspects of the present invention.

According to the present invention, subtitles can be produced efficiently.

FIG. 1 is a configuration diagram of the subtitle production device according to the first embodiment.
FIG. 2 is a configuration diagram of the subtitle production device according to the second embodiment.
FIG. 3 is a configuration diagram of the subtitle production device according to the third embodiment.
FIG. 4 is a diagram schematically illustrating how a plurality of delayed audio signals are switched and output by the delay unit and the delay changeover switch of FIG. 2.
FIG. 5 is a configuration diagram of the delay changeover switch of FIG. 2.
FIG. 6 is a diagram schematically illustrating synchronized playback of an audio file in the subtitle production device of FIG. 3.

FIG. 1 is a configuration diagram of a subtitle production device 100 according to the first embodiment.

The target audio 10 is typically audio that accompanies video, such as a television broadcast. The target audio 10 is either input to the respeak unit 20 or input directly, as live audio, to the speech recognition unit 30. Whether the respeak unit 20 is provided depends on whether respeaking by a respeaker is needed. For example, in a news broadcast where the announcer speaks accurately, the respeak unit 20 may be omitted and the announcer's live audio input directly to the speech recognition unit 30. Live audio is likewise input directly to the speech recognition unit 30 when a respeaker cannot be assigned, for example for budgetary reasons.

In the respeak unit 20, a respeaker listens to the target audio 10 through headphones or the like and clearly repeats the same content at a constant speaking rate, pausing at appropriate points. The repeated speech is input to a microphone. The respeak unit 20 outputs the respeaker's speech captured by the microphone and supplies it to the speech recognition unit 30.

The speech recognition unit 30 recognizes the respoken audio or the live audio and converts it into text. The speech recognition unit 30 consists of general-purpose speech recognition software running on a personal computer (PC), referred to as “PC1”. Because the recognition result is corrected at a later stage, the speech recognition software may be a relatively inexpensive product whose recognition rate is not high. The recognized text is input to the text division/combination processing unit 40.

The text division/combination processing unit 40 divides or combines the recognized text so that it fits within a predetermined number of subtitle characters. The text division/combination processing unit 40 may also color-code the text by speaker, for example by using different text colors for the main newscaster and the sub-newscaster.
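As a rough illustration of fitting text to a character limit, the sketch below greedily combines short fragments and splits long ones. It is a hypothetical automated stand-in: in the embodiment the division and combination positions are indicated by a person on a touch panel, and the function name `fit_to_caption_lines` and the purely character-count rule are assumptions, not part of the source.

```python
def fit_to_caption_lines(fragments: list[str], max_chars: int) -> list[str]:
    """Greedily pack recognized text fragments into caption lines.

    Hypothetical helper: neighbouring fragments are combined while the
    result stays within max_chars, and any fragment longer than
    max_chars is split at the limit.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    lines: list[str] = []
    current = ""
    for fragment in fragments:
        # Split a fragment that is too long for a single caption line.
        pieces = [fragment[i:i + max_chars]
                  for i in range(0, len(fragment), max_chars)]
        for piece in pieces:
            if len(current) + len(piece) <= max_chars:
                current += piece          # combine with the current line
            else:
                lines.append(current)     # current line is full
                current = piece
    if current:
        lines.append(current)
    return lines
```

Fragments are concatenated without a separator, which suits Japanese text; a language that separates words with spaces would need a different joining rule.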

Text division, combination, and coloring are performed on a second PC (referred to as “PC2”): the recognized text is displayed on a touch panel display or the like, and the person in charge of division indicates the division and combination positions on the touch panel. The text after division, combination, and coloring by the text division/combination processing unit 40 (the “subtitle text”) is input to the subtitle time-series management unit 50.

The respeaker in the respeak unit 20 and the person in charge of division in the text division/combination processing unit 40 may be the same person, because a skilled respeaker can divide and combine the recognized text while respeaking.

Let P1 seconds be the total time taken by the preprocessing performed by the respeak unit 20, the speech recognition unit 30, and the text division/combination processing unit 40. The preprocessing time P1 is measured in advance.

The subtitle time-series management unit 50 manages the subtitle text, adjusted to an appropriate length, in time sequence and distributes it in turn to the plurality of keyboard correction units 60.

The plurality of keyboard correction units 60 are terminals (referred to as “PC3” to “PCn”) used by the respective correctors. The delay unit 80 delays the target audio 10 by a predetermined time and outputs it. The delay unit 80 is a general analog audio delay device that can delay input audio by a specified time before outputting it. Here, the delay unit 80 delays the target audio 10 by a time slightly longer than the preprocessing time P1 described above. The delayed audio output from the delay unit 80 is input to the keyboard correction unit 60 or to headphones.
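The behaviour of a fixed delay can be modelled with a simple sample buffer. The sketch below is a digital stand-in, not the analog delay device the embodiment describes; the class `DelayLine`, the helper `delay_in_samples`, and the `margin_seconds` parameter (standing for "slightly longer than P1") are assumptions introduced for illustration.

```python
from collections import deque


def delay_in_samples(p1_seconds: float, margin_seconds: float,
                     sample_rate_hz: int) -> int:
    """Delay slightly longer than the preprocessing time P1, in samples."""
    return round((p1_seconds + margin_seconds) * sample_rate_hz)


class DelayLine:
    """Outputs each input sample delay_samples calls later (silence first)."""

    def __init__(self, delay_samples: int):
        if delay_samples < 0:
            raise ValueError("delay_samples must be non-negative")
        # Pre-fill with silence so the first outputs are zeros.
        self._buffer = deque([0.0] * delay_samples)

    def process(self, sample: float) -> float:
        self._buffer.append(sample)
        return self._buffer.popleft()
```

For instance, `DelayLine(delay_in_samples(p1, 0.5, 48000))` would model a delay of P1 plus half a second at a 48 kHz sample rate; both figures are illustrative only.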

At the keyboard correction unit 60, the corrector corrects errors in the speech recognition result. In doing so, the corrector listens again, through headphones or the like, to the delayed audio output from the delay unit 80 while correcting the subtitle text. Each corrector outputs the corrected subtitle text as soon as the correction of the assigned subtitle text is complete. The corrected subtitle texts output by the plurality of keyboard correction units 60 are input asynchronously to the transmission order control unit 70.

The transmission order control unit 70 rearranges the subtitle texts supplied asynchronously from the plurality of keyboard correction units 60 into the correct order and sends the final subtitles to the broadcast station.
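One way to restore the order of asynchronously returned subtitle texts is a reorder buffer keyed by sequence number. The sketch below assumes that each subtitle text is tagged with a sequence number when it is distributed; the source does not say how order is tracked, and the names `assign_corrector` and `SendOrderController` are hypothetical.

```python
def assign_corrector(seq: int, num_correctors: int) -> int:
    """Distribute subtitle texts to correctors in turn (round-robin)."""
    return seq % num_correctors


class SendOrderController:
    """Releases corrected subtitle texts strictly in sequence order."""

    def __init__(self) -> None:
        self._pending: dict[int, str] = {}
        self._next_seq = 0  # sequence number of the next text to send

    def submit(self, seq: int, text: str) -> list[str]:
        """Accept one corrected text; return every text now ready to send."""
        self._pending[seq] = text
        ready: list[str] = []
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready
```

A text that finishes early is held until every earlier text has arrived, so a slow correction delays the subtitles behind it; a real system might add a timeout, which this sketch omits.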

The subtitle time-series management unit 50 and the transmission order control unit 70 can run on the same server (referred to as “server 1”).

With the subtitle production device 100 of this embodiment, even if the recognition accuracy of the speech recognition software is low, the corrector can correct the subtitles while listening to the delayed live audio. In addition, although the respeaker needs to be skilled, the correctors do not. The total cost of subtitle production can therefore be kept low.

FIG. 2 is a configuration diagram of a subtitle production device 110 according to the second embodiment. Components shared with the subtitle production device 100 of the first embodiment are given the same reference numerals, and their description is omitted. The subtitle production device 110 of FIG. 2 differs from the subtitle production device 100 of FIG. 1 in having a delay unit 82 and a delay changeover switch 84.

The delay unit 82 delays the target audio 10 by a plurality of different delay times and outputs a plurality of delayed audio signals. These delayed audio signals are input to the delay changeover switch 84, which selects and outputs one of them. The selected delayed audio is input to the keyboard correction unit 60 or to headphones.

FIG. 4 is a diagram schematically illustrating how a plurality of delayed audio signals are switched and output by the delay unit 82 and the delay changeover switch 84.

Reference numeral 200 denotes one segment of the target audio 10, here A seconds long. This is a phrase unit at which the respeaker pauses when repeating. Reference numeral 250 denotes the time taken by the “preprocessing” performed by the respeak unit 20, the speech recognition unit 30, and the text division/combination processing unit 40, here B seconds.

Here, the delay unit 82 delays the live audio by three delay times D1, D2, and D3 and outputs the results. The first delay time D1 is slightly longer than the preprocessing time B. The second delay time D2 is the first delay time D1 plus the duration A of one segment of the target audio 10. The third delay time D3 is the second delay time D2 plus the duration A of one segment of the target audio 10.
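The relationship among the delay times can be written as D1 = B + margin and Dk = D1 + (k - 1)·A. The following sketch computes them; the function name `delay_times` and the `margin_seconds` parameter, which stands for "slightly longer than B", are assumptions, since the source gives no numerical margin.

```python
def delay_times(a_seconds: float, b_seconds: float,
                margin_seconds: float, count: int = 3) -> list[float]:
    """Return [D1, D2, ..., Dcount] in seconds.

    D1 is the preprocessing time B plus a small margin (assumed value);
    each later delay adds one segment duration A to the previous one.
    """
    d1 = b_seconds + margin_seconds
    return [d1 + k * a_seconds for k in range(count)]
```

With illustrative values A = 4 s, B = 3 s, and a 0.5 s margin, the delays come out as 3.5 s, 7.5 s, and 11.5 s, so each delayed signal replays the same segment one segment-length later than the previous one.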

Pressing the first switch of the delay changeover switch 84 outputs, for A seconds, the delayed audio obtained by delaying the live audio by the first delay time D1 (reference numeral 210). Similarly, pressing the second or third switch of the delay changeover switch 84 outputs, for A seconds, the delayed audio obtained by delaying the live audio by the second delay time D2 or the third delay time D3, respectively (reference numerals 220 and 230). Alternatively, the first switch of the delay changeover switch 84 may be omitted so that the first delayed audio is output automatically, without any switch being pressed, once the first delay time D1 has elapsed. In that case, the second and third delayed audio are output when the corrector subsequently presses the second and third switches.

FIG. 5 is a configuration diagram of the delay changeover switch 84. The delay changeover switch 84 receives delayed audio 1 to n as input and outputs one of them. Internal switches SW1 to SWn-1 are provided. If all of the internal switches SW1 to SWn-1 are off, delayed audio 1 is output; if only SW1 is on, delayed audio 2 is output; if only SW2 is on, delayed audio 3 is output; and if only SWn-1 is on, delayed audio n is output.
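The selection rule of the internal switches can be expressed as a small function. This is a sketch of the stated rule only; the function name `select_delayed_audio` is hypothetical, and raising an error when more than one switch is on is an assumption, because the source does not define that case.

```python
def select_delayed_audio(switch_states: list[bool]) -> int:
    """Return the 1-based index of the delayed audio to output.

    switch_states[k] is the state of internal switch SW(k+1), so a list
    of n-1 booleans describes SW1 to SWn-1.
    """
    on = [k for k, state in enumerate(switch_states) if state]
    if not on:
        return 1  # all switches off: delayed audio 1
    if len(on) > 1:
        # Not specified in the source; treated as invalid here (assumption).
        raise ValueError("at most one internal switch may be on")
    return on[0] + 2  # only SW(k+1) on: delayed audio k+2
```

For n = 3, `select_delayed_audio([False, False])` selects delayed audio 1, and turning on SW1 or SW2 alone selects delayed audio 2 or 3.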

A delay changeover switch 84 is provided for each corrector. By operating the switch, the corrector can listen to the delayed audio a second and a third time. This allows the corrector to listen again, several times, to any part that was missed when listening to the first delayed audio alone was not enough to finish correcting the subtitle.

The delay changeover switch 84 may be implemented as a specific key on the keyboard or as a hand switch separate from the keyboard. Alternatively, it may be implemented as a foot switch or pedal so that the corrector can keep both hands on the keyboard and maintain correction speed.

A configuration in which the delayed audio can be output up to three times has been described as an example, but in general the delayed audio can be made available up to n times. When a subtitle needs few corrections, the work may be finished after listening to the delayed audio once. In that case only the delayed audio with the first delay time D1 is used, and the subtitle can be issued quickly. When a subtitle needs many corrections, its accuracy can be raised by listening to the live audio repeatedly, up to n times. Subtitle speed and accuracy can thus be improved in a balanced way.

When the recognition rate of the speech recognition software is low, or when the quality of the respeaker's respoken audio is low, subtitle accuracy can be raised by listening to the live audio up to n times. In other words, subtitle accuracy can be raised in post-processing without using expensive speech recognition software or employing a skilled respeaker, which keeps the cost of subtitle production low.

FIG. 3 is a configuration diagram of a subtitle production device 120 according to the third embodiment. Components shared with the subtitle production device 100 of the first embodiment are given the same reference numerals, and their description is omitted. The subtitle production device 120 of FIG. 3 differs from the subtitle production device 100 of FIG. 1 in that the keyboard correction/audio playback control unit 60 plays back an audio file 32 saved by the speech recognition unit 30, and in that an audio mixer 90 selects and outputs either the playback audio from the audio file 32 or the delayed live audio from the delay unit 80.

In the subtitle production device 100 of the first embodiment (FIG. 1) and the subtitle production device 110 of the second embodiment (FIG. 2), the delayed live audio that the corrector hears is not synchronized with the subtitle text. It therefore contains unneeded audio before and after the assigned subtitle, and the corrector has to search for the position of the audio that corresponds to that subtitle. This causes instability: the corrector may have to wait for the relevant audio to begin, or the relevant audio may already have begun when playback starts. The corrector loses time, and subtitles are issued more slowly. In the subtitle production device 120 of the third embodiment, the speech recognition unit 30 therefore saves the audio being recognized to a file, and the keyboard correction/audio playback control unit 60 can play back that audio file in response to the corrector's instruction.

The audio file 32 is a recording of the speech repeated by the respeaker (when no respeaker is used, a recording of the live audio is used instead). Because speech recognition software temporarily saves the audio while performing recognition, that saved file can be used as the audio file 32. To synchronize playback of the audio file 32 with the subtitle text assigned to the corrector, “audio playback information” is used, which records, for each recognized word, the position (start and end) at which that word is spoken in the audio file 32, in milliseconds. With this audio playback information, the subtitle character string and playback of the audio file can be fully synchronized.
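The audio playback information can be pictured as a list of per-word start and end positions, from which the playback range for one subtitle text follows directly. The sketch below is one possible representation; the names `WordTiming` and `playback_range`, and the idea of addressing a subtitle by the indices of its first and last words, are assumptions not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    """Position of one recognized word in the audio file, in milliseconds."""
    word: str
    start_ms: int
    end_ms: int


def playback_range(timings: list[WordTiming],
                   first_word: int, last_word: int) -> tuple[int, int]:
    """Return (start_ms, end_ms) covering words first_word..last_word."""
    segment = timings[first_word:last_word + 1]
    if not segment:
        raise ValueError("subtitle text contains no words")
    return segment[0].start_ms, segment[-1].end_ms
```

Playing the audio file only between the returned start and end positions would exclude audio belonging to neighbouring subtitle texts.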

Because playback of the audio file can be synchronized with the subtitle text, no audio unrelated to the corrector's assigned subtitle text is heard before or after it. Because it is an audio file, it can easily be played back any number of times, and it can also be played at a higher speed, for example 1.5 times normal speed.

The audio mixer 90 switches between the live audio delayed by a predetermined time by the delay unit 80 and the playback audio from the audio file 32 synchronized with the subtitle text, and inputs the selected audio to the keyboard correction unit 60 or to headphones so that it is output from the corrector's headphones. Here the audio mixer 90 is located outside the keyboard correction/audio playback control unit 60, but it may instead be provided inside the keyboard correction/audio playback control unit 60.

FIG. 6 is a diagram schematically illustrating synchronized playback of the audio file 32. Reference numeral 200 denotes one segment of the target audio 10, here A seconds long. Reference numeral 250 denotes the time taken by the preprocessing performed by the respeak unit 20, the speech recognition unit 30, and the text division/combination processing unit 40, here B seconds. When the playback audio of the audio file 32 is selected with the audio mixer 90, the audio file 32 is played back after a delay of time P2 (reference numeral 241). Here, P2 = A + B, and audio fully synchronized with the text being edited is played. The audio file 32 can then be played back repeatedly (reference numerals 242 and 243).

The audio mixer 90 is a switching unit operated by a switch or pedal. The first time, the live audio delayed by the delay unit 80 is output; from the second time onward, if the corrector so instructs, the playback audio of the audio file is output. For a simple subtitle with few corrections, the work is finished after listening to the first delayed audio, whereas for a complex subtitle with many corrections, the audio file can be played back and heard any number of times. The first delayed audio is live audio, while the playback audio from the second time onward is the respeaker's respoken audio, so passages that are hard to catch in the live audio can be understood accurately from the respoken audio.

In the above description, the audio file 32 is a recording of the speech repeated by the respeaker. Instead of, or in addition to, this respoken audio file, a live audio file recording the live audio that has not passed through the respeaker may be generated, and the keyboard correction/audio playback control unit 60 may play back the live audio file and provide it to the audio mixer 90. Unlike the respoken audio file, the live audio file is not synchronized with the subtitle text. However, when the quality of the respoken audio is poor, the corrector can raise subtitle accuracy by playing back the live audio file and listening to the live audio again while editing the subtitles transcribed from the respoken audio.

The present invention has been described above based on embodiments. The embodiments are examples, and those skilled in the art will understand that various modifications are possible in the combinations of their constituent elements and processing steps, and that such modifications also fall within the scope of the present invention.

Multilingual subtitle broadcasting can also be realized in real time by combining the subtitle texts sent by the transmission order control unit 70 into sentences and automatically translating those sentences into another language with automatic translation software.

The subtitle production devices of the above embodiments generate subtitles in real time from the audio of a broadcast program, but they can also be used to generate subtitles from a recorded program.

Reference signs: 10 target audio; 20 respeak unit; 30 speech recognition unit; 40 text division/combination processing unit; 50 subtitle time-series management unit; 60 keyboard correction unit; 70 transmission order control unit; 80 delay unit; 82 delay unit; 84 delay changeover switch; 90 audio mixer; 100, 110, 120 subtitle production device.

Claims

A subtitle production device comprising:
a speech recognition unit that recognizes target audio, or respoken audio obtained by repeating the target audio, converts it into text, and records the speech-recognized target audio or respoken audio as an audio file;
a correction unit for correcting subtitle text after speech recognition;
an output unit that outputs the target audio;
an audio playback unit that plays back the audio file, in which the target audio or respoken audio is recorded, in synchronization with a character string of the subtitle text; and
a switching unit that, according to an instruction from a corrector of the subtitle text, switches between the target audio output by the output unit and the target audio or respoken audio played back from the audio file by the audio playback unit in synchronization with the character string of the subtitle text, and provides the selected audio to headphones worn by the corrector.

The subtitle production device according to claim 1, wherein the switching unit outputs the target audio to the headphones the first time and, from the second time onward, outputs the target audio or respoken audio played back from the audio file in synchronization with the character string of the subtitle text.

A subtitle production method executed by a subtitle production device, the method comprising:
a speech recognition step of recognizing target audio, or respoken audio obtained by repeating the target audio, converting it into text, and recording the speech-recognized target audio or respoken audio as an audio file;
a correction step of correcting subtitle text after speech recognition;
an output step of outputting the target audio;
an audio playback step of playing back the audio file, in which the target audio or respoken audio is recorded, in synchronization with a character string of the subtitle text; and
a switching step of, according to an instruction from a corrector of the subtitle text, switching between the target audio output in the output step and the target audio or respoken audio played back from the audio file in the audio playback step in synchronization with the character string of the subtitle text, and providing the selected audio to headphones worn by the corrector.