JP2020012855A

JP2020012855A - Device and method for generating synchronization information for text display

Info

Publication number: JP2020012855A
Application number: JP2018132830A
Authority: JP
Inventors: 鉄男石川; Tetsuo Ishikawa; 正明五十崎; Masaaki Isozaki; 浩司浦部; Koji Urabe
Original assignee: SOCKETS Inc
Current assignee: SOCKETS Inc
Priority date: 2018-07-13
Filing date: 2018-07-13
Publication date: 2020-01-23
Anticipated expiration: 2038-07-13
Also published as: JP6615952B1

Abstract

PROBLEM TO BE SOLVED: To generate synchronization information for lyric text. SOLUTION: An acoustic data file is input through an acoustic data file input section 10 and a lyric text file through a lyric text file input section 20. A vocal voice signal extraction section 30 extracts a vocal voice signal from the acoustic signal, and a phonetic symbol detection section 40 generates phonetic symbols and time information from the vocal voice signal. A voice recognition section 50 performs voice recognition on the vocal voice signal and outputs a voice recognition result text. A phonetic symbol matching section 60 matches the phonetic symbols with the voice recognition result so that the occurrence time information is associated with the recognition result text. A text matching section 90 matches the voice recognition result text, to which the time information has been allocated, with the lyric text, assigns time information to the text of each display unit line, and generates a synchronization information file. SELECTED DRAWING: Figure 2

Description

The present invention relates to a technology for generating synchronization information for text display, and more specifically to a technology for generating lyrics synchronization information between lyrics and a piece of music.

When sound data is played back, it is sometimes desirable to display related text in synchronization with the playback. For example, when music data is played, the lyrics may be displayed or highlighted in synchronization with the vocal part. Likewise, when video content is played, there is a demand to display the script (dialogue) in synchronization with the audio.

When music data and lyrics data are not distributed as content in an associated state, the lyrics cannot be displayed in synchronization with playback of the music data as they stand, and lyrics synchronization information is separately required. The lyrics synchronization format LRC is known as a de facto standard for such synchronization information (https://en.wikipedia.org/wiki/LRC_(file_format)). FIG. 1(A) shows a file conforming to the LRC format (the lyrics are copyright-free and were generated automatically by computer), in which the elapsed time from the start of the music signal (time stamp) is described in association with each display line of the lyrics text. The time stamp indicates, for example, the time at which singing of the corresponding display line begins. FIG. 1(B) shows an example of lyrics card text. Such lyrics synchronization information has conventionally been produced by hand: an operator plays and listens to the song and, at the moment each line starts to be sung, performs a predetermined operation to associate a time stamp with that line, thereby generating a synchronization information file such as an LRC file. To generate an LRC file, however, the operator basically has to listen to the entire performance, which is laborious. It is therefore desirable to generate synchronization information mechanically, without manual work.
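As a rough illustration of the line-level association described above, the following Python sketch formats and parses a single LRC-style line of the common form `[mm:ss.xx]text`. The function names are illustrative assumptions, not part of the patent, and only the basic time-tag form is handled.

```python
import re

# Assumed basic LRC time-tag form: [mm:ss.xx]text (fraction optional).
LRC_LINE = re.compile(r"\[(\d+):(\d{2})(?:\.(\d{1,3}))?\](.*)")


def format_lrc_line(seconds: float, text: str) -> str:
    """Render one display line as '[mm:ss.xx]text'."""
    centis = round(seconds * 100)
    minutes, rest = divmod(centis, 6000)
    secs, hundredths = divmod(rest, 100)
    return f"[{minutes:02d}:{secs:02d}.{hundredths:02d}]{text}"


def parse_lrc_line(line: str):
    """Return (seconds, text), or None for lines without a time tag."""
    m = LRC_LINE.match(line.strip())
    if m is None:
        return None
    minutes, secs, frac, text = m.groups()
    fraction = int(frac) / (10 ** len(frac)) if frac else 0.0
    return int(minutes) * 60 + int(secs) + fraction, text
```

Lines that carry no numeric time tag, such as metadata tags, are returned as `None` so that a caller can skip them.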

Patent Document 1 discloses that, when synchronization information is generated using a processing device with a small display, each line of the lyrics text (lyrics card) is divided into display units of one or more phrases so as to fit the size of the small display, and time stamps are then assigned to the display units.

The present invention should not be understood as limited by the above circumstances or by the problems described below; its content is defined in the claims and is described in detail below using an embodiment.

Patent Document 1: JP 2006-227082 A

The present invention has been made in view of the above circumstances, and its object is to generate synchronization information mechanically.

To achieve this object, the present invention adopts the configurations set out in the claims. Before the invention is described in detail, a supplementary explanation of the claims is given here.

According to one aspect of the present invention, to achieve the above object, a lyrics synchronization information generation device comprises: voice feature detection means for determining voice features from a voice signal and outputting voice feature symbols together with time information corresponding to the utterance times; speech recognition means for performing speech recognition on the voice signal and outputting text; first matching means for matching, in order of appearance, the voice feature symbols output from the voice feature detection means with the text output from the speech recognition means; second matching means for matching, in order of appearance, the lyrics text with the text output from the speech recognition means; text division means for dividing the text output from the speech recognition means into a plurality of line-portion texts; and display timing determination means for associating the time information with each of the line-portion texts, based on the matching result of the first matching means and the matching result of the second matching means, and thereby determining the display timing of each line-portion text.

With this configuration, the accurate time information associated with the voice feature symbols is associated with the line-portion texts through the matching between the voice feature symbols and the speech recognition result and the matching between the speech recognition result and the text, so that the display timing (time stamp) of each line-portion text can be generated automatically and reliably.
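The way the two matchings combine can be sketched as follows. This is only an illustration under strong assumptions: the first matching is taken to have already attached a time to every recognized character (`times`), and the second matching is approximated with the standard library's `difflib.SequenceMatcher` rather than the matching described in the patent. A line's start time is taken from its first matched character, which is an approximation when the opening characters of the line were misrecognized.

```python
from difflib import SequenceMatcher


def line_start_times(recognized, times, lyric_lines):
    """Map each lyric line to the time of its first matched character.

    recognized  : recognized text as one string
    times       : times[i] is the time assigned to recognized[i]
    lyric_lines : lyric text already divided into display lines
    """
    lyric = "".join(lyric_lines)
    matcher = SequenceMatcher(None, recognized, lyric, autojunk=False)

    # Carry times over from recognized characters to matching lyric characters.
    lyric_time = [None] * len(lyric)
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            lyric_time[b + k] = times[a + k]

    starts, pos = [], 0
    for line in lyric_lines:
        span = lyric_time[pos:pos + len(line)]
        starts.append(next((t for t in span if t is not None), None))
        pos += len(line)
    return starts
```

Lines for which nothing matched come back as `None`; in the patent's terms these would be candidates for the interpolation described later.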

In this configuration, voice signal extraction means for extracting the voice signal from a music signal may further be provided. A piece of music usually consists of a singing voice (vocals) and accompaniment, so the voice signal (here also called the singing voice signal or vocal signal) needs to be extracted from the music signal.

Any method may be used to extract the voice signal (vocal signal) from the music signal. As general methods, center canceling and frequency band limiting can be used for stereo vocal extraction; this exploits the fact that the vocal sound is contained in both the L channel and the R channel. Frequency band limiting can also be used for monaural vocal extraction, for example by applying a band-pass filter covering the human voice band (for example, 100 Hz to 1000 Hz). Other methods with high separation accuracy for the voice part include the multiple HPSS method, the binary masking method, and the RPCA method.
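The band-limiting approach for monaural input can be sketched with a single second-order band-pass filter. The coefficient formulas follow the widely used Audio EQ Cookbook band-pass form (constant 0 dB peak gain); choosing the geometric mean of the band edges as the center frequency and deriving Q from the bandwidth are assumptions made for illustration, and a practical extractor would likely use a steeper filter or one of the separation methods named above.

```python
import math


def bandpass(samples, sample_rate, low_hz=100.0, high_hz=1000.0):
    """Second-order band-pass filter over a list of float samples."""
    center = math.sqrt(low_hz * high_hz)      # geometric center of the band
    q = center / (high_hz - low_hz)           # assumed: Q from bandwidth
    w0 = 2.0 * math.pi * center / sample_rate
    alpha = math.sin(w0) / (2.0 * q)

    # Normalized biquad coefficients (b1 is zero for this filter type).
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0

    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out
```

A tone near the center of the band passes almost unchanged, while a tone well above the band is strongly attenuated.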

In this configuration, the voice feature detection means may output the time elapsed from the start of the music signal as the time information. The time information may also be output with a fine adjustment applied (delaying or advancing the start).

In this configuration, the voice feature detection means may output phonetic symbols as the voice feature symbols. Phonetic symbols can be extracted by formant analysis, mel-frequency cepstrum (MFCC) analysis, or the like.

In this configuration, the speech recognition means may divide the voice signal at portions where its output level is low and perform speech recognition on each divided portion.
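Dividing the signal at low-level portions might look like the sketch below, which measures frame-wise RMS and cuts wherever the level stays under a threshold for a minimum duration. The frame length, threshold, and minimum gap are illustrative assumptions; the patent does not specify values.

```python
import math


def split_at_quiet_parts(samples, sample_rate, frame_ms=20,
                         threshold=0.02, min_gap_ms=200):
    """Return (start, end) sample ranges of segments separated by quiet gaps."""
    frame = max(1, int(sample_rate * frame_ms / 1000))
    min_gap = max(1, min_gap_ms // frame_ms)   # gap length in frames

    # Mark each frame as loud or quiet by its RMS level.
    loud = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        rms = math.sqrt(sum(v * v for v in chunk) / len(chunk))
        loud.append(rms >= threshold)

    segments, start, quiet_run = [], None, 0
    for idx, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = idx
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_gap:           # gap long enough: close segment
                segments.append((start * frame, (idx - quiet_run + 1) * frame))
                start, quiet_run = None, 0
    if start is not None:                      # segment still open at the end
        end_frame = len(loud) - quiet_run
        segments.append((start * frame, min(end_frame * frame, len(samples))))
    return segments
```

Each returned range could then be passed to a recognizer separately.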

In this configuration, the voice feature detection means may determine a typical voice feature appearing in the voice signal and correct the voice signal so that the typical voice feature is detected more easily.

For example, preprocessing is applied to the vocal signal extracted from a music signal containing singing in order to improve the detection accuracy of the utterance signal (articulation analysis signal). Specifically, after utterance signals are extracted from the vocal signal, the utterance signals that appear most frequently are identified, the waveform is corrected so that those utterance signals are easier to detect, and the corrected signal is passed through the utterance signal detection engine again. This raises the detection accuracy for those utterance signals and reduces false detections. As an example of the waveform correction (pronunciation correction), the voice data may be decomposed by a bank of band-pass filters and its harmonic structure (the amount in each band) modulated to correct the pronunciation signals of vowels such as "a", "i", "u", "e", and "o", although the correction is not limited to this method.

In this configuration, the first matching means may match the voice feature symbols detected for the typical voice feature against the text output from the speech recognition means, and the display timing determination means may determine the display timing of each line-portion text using the time information assigned to those voice feature symbols.

In this configuration, the second matching means may perform matching based on the edit distance between the lyrics text and the text output from the speech recognition means, although various other matching methods may be adopted.
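For reference, the edit distance mentioned here is commonly computed as the Levenshtein distance by dynamic programming. The sketch below is a standard textbook version; the helper `closest`, which picks the candidate with the smallest distance, is a hypothetical illustration of how such a distance could drive matching and is not taken from the patent.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest(target: str, candidates):
    """Hypothetical helper: candidate with the smallest edit distance."""
    return min(candidates, key=lambda c: edit_distance(target, c))
```

Because the distance works on any sequence of characters, the same function applies whether the compared strings are kana, kanji, or translated text.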

In this configuration, the second matching means may perform matching based on the edit distance between a translation of the lyrics text into a different language and a translation of the text output from the speech recognition means into that different language. The target language may be Japanese but is not limited to it; any language may be used as long as speech recognition is possible, the text can be displayed, and voice features can be detected. That is, the speech recognition means may recognize Japanese speech and output Japanese text, and the second matching means may match the Japanese lyrics text with the Japanese text output from the speech recognition means, but the invention may equally be applied to other languages. When the target language is Japanese, the different language mentioned above may be English, but is not limited to it.

In this configuration, the text division means may divide the text into the line-portion texts based on intervals in which the voice signal is substantially zero.

In this configuration, the text division means may divide the partial text based on a permissible range for the number of characters in the partial text. The number of characters is, for example, at most 18. A division rule based on the structure of the lyrics data may also be learned. Learning information can be accumulated each time synchronization information is generated, and the synchronization information file may be updated according to a new division rule.
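A character-count limit of this kind can be illustrated with a simple greedy packer. The sketch assumes the text has already been cut into phrase units (for example at breath points) and packs consecutive phrases into lines of at most `max_chars` characters; the greedy strategy and the hard cut for oversized phrases are assumptions for illustration, not the learned division rule the patent mentions.

```python
def pack_lines(phrases, max_chars=18):
    """Greedily pack phrase units into display lines of at most max_chars."""
    lines, current = [], ""
    for phrase in phrases:
        # A single phrase longer than the limit is cut into fixed-size pieces.
        while len(phrase) > max_chars:
            if current:
                lines.append(current)
                current = ""
            lines.append(phrase[:max_chars])
            phrase = phrase[max_chars:]
        if len(current) + len(phrase) > max_chars:
            lines.append(current)     # current line is full: start a new one
            current = phrase
        else:
            current += phrase
    if current:
        lines.append(current)
    return lines
```

The default of 18 simply mirrors the example maximum given in the text.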

In this configuration, display timing interpolation means may further be provided that compares a plurality of chorus portions in the lyrics text and, based on the comparison result, interpolates the display timing of line-portion texts for which time information could not be obtained.

In this configuration, when the speech recognition result of the speech recognition means contains additional text that is not included in the lyrics text, line-portion texts may additionally be generated from it, and a display timing may be assigned to each additionally generated line-portion text.

According to another aspect of the present invention, a text display synchronization signal generation device comprises: voice feature detection means for determining voice features from a voice signal and outputting voice feature symbols together with time information corresponding to the utterance times; speech recognition means for performing speech recognition on the voice signal and outputting text; first matching means for matching, in order of appearance, the voice feature symbols output from the voice feature detection means with the text output from the speech recognition means; second matching means for matching, in order of appearance, the display target text with the text output from the speech recognition means; text division means for dividing the text output from the speech recognition means into a plurality of line-portion texts; and display timing determination means for associating the time information with each of the line-portion texts, based on the matching result of the first matching means and the matching result of the second matching means, and thereby determining the display timing of each line-portion text. If the text output from the speech recognition means is already suitably divided, the text division means may be omitted.

With this configuration, the accurate time information associated with the voice feature symbols is associated with the line-portion texts through the matching between the voice feature symbols and the speech recognition result and the matching between the speech recognition result and the text, so that the display timing (time stamp) of each line-portion text can be generated automatically and reliably.

The display target text may be the text of dialogue in video content or the lyrics text of music content.

According to yet another aspect of the present invention, a text display synchronization information generation device comprises: voice feature detection means for determining voice features from a voice signal and outputting voice feature symbols together with time information corresponding to the utterance times; speech recognition means for performing speech recognition on the voice signal and outputting text; matching means for matching, in order of appearance, the voice feature symbols output from the voice feature detection means with the text output from the speech recognition means; text division means for dividing the text output from the speech recognition means into a plurality of line-portion texts; and display timing determination means for associating the time information with each of the line-portion texts, based on the matching result of the matching means, and thereby determining the display timing of each line-portion text. If the text output from the speech recognition means is already suitably divided, the text division means may be omitted.

With this configuration, the corresponding text can be displayed in synchronization with the voice signal.

The present invention can be realized not only as a device or system but also as a method. Part of such an invention can of course be implemented as software, and a software product (computer program) used to cause a computer to execute such software is naturally also included in the technical scope of the invention.

The above and other aspects of the invention are set out in the claims and are described in detail below using an embodiment.

According to the present invention, synchronization information for synchronizing a text portion with a voice signal can be generated mechanically and reliably.

A diagram illustrating examples of the lyrics synchronization format LRC and of lyrics card text.
A block diagram illustrating a configuration example of the synchronization information generation device according to the embodiment of the present invention.
A diagram illustrating general modes of lyrics synchronization.
A diagram illustrating an operation example of the embodiment.
A diagram illustrating part of FIG. 4 in detail.
A diagram illustrating phonetic symbols.
A diagram illustrating an example of speech recognition.
A diagram illustrating a kana probability table for phonetic symbols.
A diagram illustrating matching between the kana of the text and the phonetic symbols (time stamps).
A diagram illustrating adjustment of line break positions in consideration of breath points.
A diagram illustrating the correspondence between the text of the speech recognition result and the text of the lyrics card.
A diagram illustrating text matching between the speech recognition result and the lyrics card text.
A diagram illustrating generation of the initial edit distance when matching is performed using the edit distance after translation into English.
A diagram illustrating generation of the edit distance after part of the text has been changed.
A diagram illustrating generation of the edit distance after part of the text has been changed.
A diagram illustrating generation of the edit distance after part of the text has been changed.
A diagram illustrating the state in which the edit distance has been reduced to zero by changing part of the text.
A diagram illustrating text matching (determination of replacement candidates) between the Japanese speech recognition result and the lyrics card text.
A diagram illustrating a configuration example of lyrics choruses.
A diagram illustrating an example of extracting information for interpolation from lyrics card data.
A diagram illustrating an example of interpolating time information using the information for interpolation from the lyrics card data.
A diagram illustrating an example of interpolating a vocal voice portion that is not in the lyrics card text.
A diagram illustrating a mode of interpolating a vocal voice portion that is not in the lyrics card text.
A diagram illustrating breath points together with lyrics and a waveform.
A diagram illustrating an example of extracting sections without vocals.
A diagram illustrating part of an LRC file output by the embodiment.

A synchronization information generation device according to an embodiment of the present invention is described below. This device generates synchronization information between music data and lyrics text data and, as one example, typically generates a file in the LRC format described above with reference to FIG. 1.

FIG. 2 shows the synchronization information generation device 200 of this embodiment. In this figure, the synchronization information generation device 200 includes an acoustic data file input unit 10, a lyrics text file input unit 20, a vocal voice signal extraction unit 30, a phonetic symbol detection unit 40, a speech recognition unit 50, a phonetic symbol matching unit 60, a vocal silent section detection unit 70, a display unit line division unit 80, a text matching unit 90, a display unit line synchronization information determination unit 100, a lyrics block analysis unit 110, an interpolation unit 120, a synchronization information file output unit 130, and so on. The type of output information is noted in each block where appropriate: "voice" is the vocal voice signal (voice signal), "phs" is a phonetic symbol, "text" is text, "time" is time information (time stamp), and "LRC" is synchronization information. An operation example of the embodiment of FIG. 2 is illustrated in FIG. 4 and is referred to below as appropriate. Although this example describes the generation of lyrics synchronization information, the invention is not limited to this.

The synchronization information generation device 200 is typically configured using one or more computer resources (computer systems). In this example it is configured by a single computer system 300, but it may be configured with a plurality of computer systems or with various network systems, and part of the computer system may be an information terminal such as a smartphone. The synchronization information generation device 200 is realized, for example, by installing in the computer system 300 a computer program recorded on a recording medium 301 or a computer program transmitted over a communication network (not shown). The computer system has a CPU, main memory, a bus, external memory, various input/output interfaces, and so on, and may be a personal computer, a smartphone, an information home appliance, or the like. The synchronization information file is stored in a synchronization information file storage unit 140 and may be accessible from an external device 400 such as a mobile terminal. The external device 400 may upload acoustic data and lyrics text data to the synchronization information generation device 200 and obtain the synchronization information file that the device generates in response.

As shown in FIG. 3, three modes are known for displaying lyrics in synchronization with the vocals: (A) character synchronization, (B) line synchronization, and (C) block synchronization. This embodiment is described at the level of (B). In the order (A), (B), (C), automation is difficult to achieve, and an operator has had to enter the timing directly.
(A) Character synchronization: the time is synchronized at the position of each character of the lyrics.
(B) Line synchronization: the time is synchronized at the head of each line of the lyrics text, and the times of the lyrics within the line are additionally matched approximately.
(C) Block synchronization: only the time at the head of each chorus of the lyrics (first verse, second verse, and so on) is synchronized, and the remaining times are matched approximately.

Returning to FIG. 2, the acoustic data file input unit 10 receives a file in which an acoustic signal is recorded in a format such as MP3 or MP4 (FIG. 4, S10). In this example, the acoustic signal (audio) contains a vocal signal (voice signal, voice) and an accompaniment signal. The lyrics text file input unit 20 receives a file in which lyrics text (also called lyrics card text) is recorded (FIG. 4, S11). As shown in FIG. 1(B), for example, lyrics text is often written so that the block structure of the whole song, such as the first chorus and second chorus, is apparent, and its lines are not necessarily broken at units that are easy to sing. In this embodiment, as described later, the display unit line division unit 80 optimizes the display unit lines into units that are easy to sing.

The vocal voice signal extraction unit 30 extracts the voice signal (vocal signal) from the music signal (FIG. 4, S12). Various known methods may be adopted for this extraction. Center canceling and frequency band limiting can be used for stereo vocal extraction; this exploits the fact that the vocal sound is contained in both the L channel and the R channel. Frequency band limiting can also be used for monaural vocal extraction, for example by applying a band-pass filter covering the human voice band (for example, 100 Hz to 1000 Hz). The multiple HPSS method, the binary masking method, or the RPCA method may also be adopted.

The vocal voice signal extraction unit 30 supplies the voice signal (vocal signal) extracted from the music signal to the phonetic symbol detection unit 40, the speech recognition unit 50, and the vocal silent section detection unit 70.

The phonetic symbol detection unit 40 receives the voice signal (vocal signal) extracted from the music signal and outputs phonetic symbols and time information (FIG. 4, S13). The phonetic symbols are of the kind shown in FIG. 6 and may be detected by formant analysis, mel-frequency cepstrum (MFCC) analysis, or the like. Phonetic symbol data can be output using various digital audio workstations; for example, Pro Tools from Digidesign (USA) may be used. In this embodiment, the phonetic symbols that appear most frequently in the voice signal are identified, the waveform is corrected so that the corresponding pronunciation signals are detected accurately, and the corrected signal is passed through the detection engine of the phonetic symbol detection unit 40 again. As a result, the detection accuracy for those pronunciation signals improves and false detections are reduced. Improving the detection accuracy of the utterance patterns in this way also improves the accuracy of the subsequent matching with the text. As an example of the pronunciation correction, the voice data may be decomposed by a bank of band-pass filters and its harmonic structure (the amount in each band) modulated to correct the pronunciation signals of vowels such as "a", "i", "u", "e", and "o", although the correction is not limited to this.

An operation example of the phonetic symbol detection unit 40 is shown in FIG. 5 together with the speech recognition operation example described later. In the operation example of FIG. 5, phonetic symbols and time information are extracted using a voice feature extraction engine (S131); the detected phonetic symbols are grouped (clustered) into units of similar phonetic symbols (S132); the groups of frequently appearing phonetic symbols are corrected so that they approach the waveform of a typical phonetic symbol (S133); and the appearance time information is extracted again from the corrected voice signal using the voice feature extraction engine (S134).

The phonetic symbol detection unit 40 of FIG. 2 sequentially outputs pairs of a phonetic symbol and time information (also called a time stamp), which are supplied to the phonetic symbol matching unit 60.

The speech recognition unit 50 receives the voice signal (vocal signal, voice signal) extracted from the music signal, performs speech recognition, and outputs a Japanese character string (FIG. 4, S14). As a speech recognition engine, the speech recognition unit 50 may include not only a speech recognition function but also functions such as morphological analysis, syntactic analysis, and semantic analysis. The speech recognition unit 50 recognizes the voice signal and outputs Japanese text data; this output data includes, for example, a sequence of kana (syllables) and is supplied to the phonetic symbol matching unit 60.
In this example, the speech recognition unit 50 is realized by converting the voice into a Japanese character string and matching against kana (syllables). However, the language is not limited to Japanese, and it is clear that lyrics in other languages can be matched by the same processing.

If the speech recognition engine of the speech recognition unit 50 tends to be weak at recognizing continuous utterance signals, it is preferable to divide the voice signal into segments of appropriate length at zero-crossing points (positions where the signal level is small) and to perform speech recognition segment by segment. FIG. 7 shows such preprocessing: the segments are thinned out alternately to form two thinned voice signals (thinning 1 and thinning 2) from the original voice signal, speech recognition is performed on each separately, and the recognition results are then combined as postprocessing. If necessary, the waveform may be filtered so that specific words (kana) become easier to recognize. A specific operation example is shown in FIG. 5 above. In that example, the voice signal is thinned out using zero-crossing points as delimiters to generate multiple patterns (S141), each pattern is converted to Japanese text by the speech recognition engine (S142), and the conversion results of the multiple thinning patterns are combined (S143).
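The preprocessing of step S141 can be illustrated with a minimal Python sketch. The function names, the `min_len` parameter, and the choice to replace dropped segments with silence (so that timing is preserved) are assumptions made for illustration; the description does not specify how the thinned signals of FIG. 7 are constructed.

```python
def zero_crossings(samples):
    """Indices i where the sign changes between samples[i-1] and samples[i]."""
    return [i for i in range(1, len(samples))
            if (samples[i - 1] < 0) != (samples[i] < 0)]


def split_at_zero_crossings(samples, min_len):
    """Cut at zero crossings so that each segment spans at least min_len samples."""
    segments, start = [], 0
    for i in zero_crossings(samples):
        if i - start >= min_len:
            segments.append((start, i))
            start = i
    if start < len(samples):
        segments.append((start, len(samples)))
    return segments


def thin_alternately(samples, segments):
    """Build two thinned signals: one keeps even-numbered segments, the other
    keeps odd-numbered segments. Dropped segments become silence (assumption)."""
    thinned = [[0.0] * len(samples), [0.0] * len(samples)]
    for k, (a, b) in enumerate(segments):
        thinned[k % 2][a:b] = samples[a:b]
    return thinned
```

Each thinned signal would then be passed to the recognition engine separately (S142) and the two results merged by time (S143).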

The phonetic symbol matching unit 60 refers to the kana probability table 60a and matches the phonetic symbols to the kana sequence of the speech recognition result (step S15 in FIGS. 4 and 5). The kana probability table 60a gives, for each phonetic symbol, the probability that an occurrence of that symbol corresponds to each individual kana (word). The probability may also be conditioned on the preceding symbol. As shown in FIG. 8, the table holds, for each phonetic symbol such as "wg", the probability that it corresponds to each kana such as "を", "わ", and so on. The phonetic symbol matching unit 60 matches the phonetic symbols to the kana sequence of the speech recognition result and, as a result, outputs the time stamps paired with the phonetic symbols in association with the kana sequence. FIG. 9 shows the resulting association between the kana sequence and the time stamps. The LRC column in FIG. 9 shows the time (time stamp) at which each symbol was detected, measured from the start of the music signal. Typically, time stamps are matched to kana mainly around frequently appearing phonetic symbols, but the matching is not limited to this.
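As a rough illustration of matching with the kana probability table 60a, the following Python sketch walks the symbol sequence and the kana sequence in order of appearance. The table values, the `min_prob` cutoff, and the greedy in-order strategy are all assumptions for illustration; the actual matching procedure and the contents of FIG. 8 are not reproduced here.

```python
# Hypothetical table: P(kana | phonetic symbol). Values are made up.
KANA_PROB = {
    "wg": {"を": 0.6, "わ": 0.3},
    "a": {"あ": 0.9},
    "k": {"か": 0.5, "き": 0.3},
}


def assign_timestamps(symbols, kana_seq, table, min_prob=0.2):
    """symbols: list of (phonetic_symbol, time); kana_seq: kana in recognition order.

    Each kana takes the time of the next unused symbol whose table probability
    for that kana is at least min_prob. Kana with no such symbol get None.
    """
    result, pos = [], 0
    for kana in kana_seq:
        stamp = None
        for j in range(pos, len(symbols)):
            sym, t = symbols[j]
            if table.get(sym, {}).get(kana, 0.0) >= min_prob:
                stamp, pos = t, j + 1
                break
        result.append((kana, stamp))
    return result
```

Kana left without a time stamp could be filled in later from neighbouring kana; a dynamic-programming alignment would be a more robust alternative to this greedy pass.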

Through this matching, the phonetic symbol matching unit 60 assigns time stamps to the kana of the speech recognition result text and outputs the result (FIG. 9). The time-stamped text from the phonetic symbol matching unit 60 is sent to the display unit line division unit 80 and is further supplied to the text matching unit 90.

The vocal silent section detection unit 70 detects sections in which essentially no voice signal is produced, for example preludes, interludes, postludes, and breaths (FIG. 4, S16). As shown in FIG. 24, a silent section is detected when a moving average of, for example, the amplitude of the vocal voice signal falls below a threshold. Preludes, interludes, postludes, and breaths are distinguished from the length of the section, its timing, and the like. FIG. 25 shows an example of detecting breath sections following a prelude; in the example of FIG. 25 the lyrics text is also shown in association. The vocal silent section detection unit 70 outputs each silent section and its type.
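The moving-average test described above might be sketched as follows in Python. The window size, the threshold, and in particular the classification rule (`long_sec` and the position-based labels) are illustrative assumptions, since the description only states that length and timing are used.

```python
def silent_sections(samples, rate, window, threshold):
    """Return (start_sec, end_sec) spans where the moving average of |amplitude|
    over the last `window` samples is below `threshold`."""
    spans, start, acc = [], None, 0.0
    for i, s in enumerate(samples):
        acc += abs(s)
        if i >= window:
            acc -= abs(samples[i - window])
        quiet = acc / min(i + 1, window) < threshold
        if quiet and start is None:
            start = i
        elif not quiet and start is not None:
            spans.append((start / rate, i / rate))
            start = None
    if start is not None:
        spans.append((start / rate, len(samples) / rate))
    return spans


def classify(span, total_sec, long_sec=5.0):
    """Hypothetical rule: label a silent span by its position and length."""
    start, end = span
    if start == 0:
        return "prelude"
    if end >= total_sec:
        return "postlude"
    return "interlude" if end - start >= long_sec else "breath"
```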

The display unit line division unit 80 determines the display unit lines, that is, the lines displayed or highlighted at one time during audio playback (FIG. 4, S18). The display unit line division unit 80 may determine a display unit line by joining one or more breath-delimited sections according to a desired rule. A maximum length of 18 characters is one possible rule. Display unit lines may also be determined from various parameters using existing lyrics card data and examples of display unit breaks (including results generated by this embodiment). Display unit lines may be delimited by line feed codes or the like, but are not limited to this. The model learning unit 80a learns a line break position model, which the display unit line division unit 80 may refer to. FIG. 10 shows an example of division by the display unit line division unit 80. In this example, a 27-character text is divided at breaths into display unit lines of 9, 5, 9, and 6 characters.
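One simple rule of the kind mentioned above, joining breath-delimited phrases up to a maximum character count, can be sketched in Python as follows. The greedy strategy and the function name are assumptions; a learned line break model such as that of the model learning unit 80a would replace this rule.

```python
def build_display_lines(phrases, max_chars=18):
    """Greedily join consecutive breath-delimited phrases while the joined
    line stays within max_chars. A single phrase longer than max_chars is
    kept whole on its own line."""
    lines, current = [], ""
    for phrase in phrases:
        if current and len(current) + len(phrase) > max_chars:
            lines.append(current)
            current = phrase
        else:
            current += phrase
    if current:
        lines.append(current)
    return lines
```

With a small `max_chars`, every breath-delimited phrase becomes its own display unit line; with a larger limit, short phrases are merged.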

The text matching unit 90 receives the time-stamped speech recognition text data output from the phonetic symbol matching unit 60 and the lyrics text data input from the lyrics text file input unit 20, and matches the two (FIG. 4, S19). Any matching method may be used as long as it matches one text to the other. In this example, matching is performed so that the edit distance (difference information) is at or below a predetermined threshold, but the matching is not limited to this.

In this embodiment, the speech recognition text and the lyrics text are matched by evaluating the edit distance; however, after the Japanese text is edited (partially changed), it is translated into a different language (also called an intermediate language), English in this example, and the edit distance between the translations is evaluated. In other examples, the edit distance may be evaluated on the Japanese text directly.

First, an example in which the edit distance is evaluated indirectly via translation is described. This approach has the following effects between the texts to be matched:
(1) Differences in notation, such as kanji versus katakana, are absorbed.
(2) The edit distance is evaluated taking the meaning of Japanese words into account.
(3) Texts in languages other than Japanese can also be matched.

To evaluate the edit distance indirectly via translation, the Japanese texts are converted with the same conversion (translation) engine into an intermediate language suited to difference extraction. In this embodiment, the intermediate language is English and the UNIX (trademark) diff command is used as the difference extraction engine. After the Japanese is translated into English, differences are extracted with the diff command.

Here, as in the example of FIG. 11, consider the case where there is a speech recognition result text (with time information) and a lyrics card text.

FIG. 12 shows an operation example in which the edit distance is obtained indirectly from English translations. As shown in FIG. 12, a portion is first cut out from the speech recognition result and a sentence is cut out from the lyrics card text (S191). The range of the cut-out portion (the unit on which matching is performed) may be one chorus (a block delimited by a prelude, interlude, postlude, or the like) or a smaller unit. The speech recognition result text of FIG. 11 is cut out in chorus-sized units (delimited by the silent sections of the prelude, interlude, and postlude), but the cut-out is not limited to this.

The cut-out speech recognition result text and lyrics card text are translated into the intermediate language, English in this example (S191). Both texts are translated by the same translation engine. Next, difference information is generated for the translations of the speech recognition result text and the lyrics card text using the difference information detection engine (S192), and the edit distance is obtained (S193). At this stage, the original speech recognition result text and lyrics card text (with no word replacement) are processed. The translations of the speech recognition result and the lyrics card text are as shown in FIG. 13(A); the comparison between the texts, with one word per line, is as shown in FIG. 13(B); the difference information is as shown in FIG. 13(C); and the edit distance is 12, as shown in FIG. 13(D). The example of FIG. 13(D) uses the Levenshtein distance, but the distance is not limited to this. In FIG. 13(C), a "difference information" entry such as "5,8c5,6" indicates that words 5 to 8 of the left text (file1) are changed (c) to words 5 to 6 of the right text (file2), and the entry "42" under "number of changes in file1, file2" indicates that the number of changes is 4 in the left text (file1) and 2 in the right text (file2). The edit distance is the larger of the number of changes in file1 and the number of changes in file2.

In FIG. 13(B), "<" marks a changed portion of the speech recognition text and ">" marks a changed portion of the lyrics card text. In FIGS. 13(B), (C), and (D), the difference information indicates the positions at which file1 and file2 differ. The symbol a means addition, d means deletion, and c means change. The line number in file1 appears before the symbol and the line number in file2 after it; when a difference spans multiple lines, the first and last lines of the difference are shown separated by a comma (,).
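The way change counts follow from diff headers such as "5,8c5,6" can be shown with a short Python sketch. It parses the normal-format diff headers described above and takes the larger of the two totals as the edit distance; the function names are hypothetical, and the input is assumed to be the list of header lines from a diff run over files containing one word per line.

```python
import re

# Normal-format diff header: line range in file1, a/c/d, line range in file2.
HUNK = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")


def hunk_changes(header):
    """Return (changed lines in file1, changed lines in file2) for one header."""
    m = HUNK.match(header)
    if not m:
        raise ValueError(f"not a diff hunk header: {header!r}")
    s1, e1, op, s2, e2 = m.groups()
    n1 = int(e1 or s1) - int(s1) + 1
    n2 = int(e2 or s2) - int(s2) + 1
    if op == "a":      # addition: nothing changes on the file1 side
        n1 = 0
    elif op == "d":    # deletion: nothing changes on the file2 side
        n2 = 0
    return n1, n2


def edit_distance(headers):
    """Larger of the total change counts on the file1 and file2 sides."""
    total1 = total2 = 0
    for header in headers:
        n1, n2 = hunk_changes(header)
        total1 += n1
        total2 += n2
    return max(total1, total2)
```

For the header "5,8c5,6", `hunk_changes` yields 4 changes in file1 and 2 in file2, which corresponds to the "42" entry described for FIG. 13(C).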

Next, it is determined whether the edit distance is at or below a predetermined threshold (S194). If it is, the speech recognition result text and the lyrics card text are deemed to correspond successfully: the lyrics card text is cut out for each display unit line of the speech recognition result, each cut-out portion is written to the LRC file (see FIG. 1) as a display unit line of the lyrics card text, and the time stamp of the first kana of the corresponding display unit line of the speech recognition result is written to the LRC file as the time stamp of that display unit line (S197). The next section is then matched, and processing ends when no section remains to be matched (S199). The time stamp may be advanced or delayed by a predetermined adjustment time.
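Writing display unit lines with their time stamps (S197) can be sketched in Python as below. The sketch assumes the common LRC time tag form `[mm:ss.xx]` and models the adjustment time as a signed `offset` in seconds; the exact layout of the LRC file in FIG. 1 is not reproduced here.

```python
def lrc_timestamp(seconds, offset=0.0):
    """Format a time as an LRC tag [mm:ss.xx] after applying a signed offset
    (negative offset = earlier). Times are clamped at zero."""
    t = max(0.0, seconds + offset)
    centis = round(t * 100)
    minutes, rest = divmod(centis, 6000)
    return f"[{minutes:02d}:{rest // 100:02d}.{rest % 100:02d}]"


def to_lrc(lines, offset=0.0):
    """lines: list of (start_seconds, lyrics_line) per display unit line."""
    return "\n".join(f"{lrc_timestamp(t, offset)}{text}" for t, text in lines)
```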

If the edit distance is greater than the predetermined threshold, processing proceeds to step S195, where it is determined whether matching has failed. For example, if the edit distance still exceeds the predetermined threshold after the text has been rewritten a predetermined number of times (see step S196), a notification is issued that matching has failed for the section being matched (S198). The next section is then matched, and processing ends when no section remains to be matched (S199).

If the matching failure condition is not satisfied in step S195, processing proceeds to step S196, where part of one text is replaced with part of the other text. In this example, part of the speech recognition result text is replaced with part of the lyrics card text, but the reverse is also possible. After the speech recognition result text has been changed, processing returns to step S191, difference information is generated for the translations of the (changed) speech recognition result text and the (original) lyrics card text using the difference information detection engine (S192), and the edit distance is obtained (S193). In this example, as shown in FIG. 14(A), "ねえや" in the speech recognition result was changed to "姐や" and "おさと" was changed to "お里". The changed speech recognition result text is as shown in FIG. 14(B), and its translation is as shown in FIG. 14(C). The lyrics card text is unchanged, so its translation remains as in FIG. 13(A). The difference information between the translation of the changed speech recognition result and the translation of the lyrics card text is as shown in FIG. 14(D), and the edit distance is 9, as shown in FIG. 14(E). If this edit distance exceeds the predetermined threshold in step S194, processing proceeds to steps S195 and S196 and the process is repeated. If the edit distance is at or below the predetermined threshold in step S194, processing proceeds to step S197 and the matching result is output.
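The loop through steps S194 to S196 can be summarized with the following Python sketch. It is a simplified illustration under several assumptions: texts are word lists, the replacement candidates are given as (recognized word, lyrics word) pairs applied one per pass, and the `distance` callable stands in for the translation and diff steps (S191 to S193), which are not modeled.

```python
def match_with_rewrites(recognized, lyrics, candidates, distance,
                        threshold, max_rewrites):
    """Rewrite the recognized text toward the lyrics text until the distance
    is at or below threshold, or until the rewrite budget is used up.

    Returns (success, final_recognized_words, final_distance).
    """
    words = list(recognized)
    pending = list(candidates)
    d = distance(words, lyrics)                    # S192, S193
    rewrites = 0
    while d > threshold and rewrites < max_rewrites and pending:   # S194, S195
        wrong, right = pending.pop(0)              # S196: replace part of the text
        words = [right if w == wrong else w for w in words]
        rewrites += 1
        d = distance(words, lyrics)
    return d <= threshold, words, d                # S197 on success, S198 otherwise
```

The order in which candidates are tried is left to the caller, in line with the statement that the priority and unit of changes may be chosen as appropriate.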

Here, assume that the current edit distance of 9 is greater than the threshold of step S194 and that processing again proceeds to step S196 via step S195.

FIG. 15 shows an example in which the speech recognition result text is changed further. In this example, in addition to the earlier changes of "ねえや" to "姐や" and "おさと" to "お里", "ゆき" is changed to "行き" and "便り" is changed to "たより" (FIG. 15(A)). The priority order and the units in which changes are made may be chosen as appropriate. Changes may follow the order of appearance in the text or may be based on grammatical information (nominals, declinable words, and so on). The number of changes per pass may be fixed or variable.

In this case, the changed speech recognition result text is as shown in FIG. 15(B), and its translation is as shown in FIG. 15(C). The lyrics card text is unchanged, so its translation remains as in FIG. 13(A). The difference information between the translation of the changed speech recognition result and the translation of the lyrics card text is as shown in FIG. 15(D), and the edit distance is 5, as shown in FIG. 15(E). If this edit distance exceeds the predetermined threshold of step S194, processing proceeds to step S196 and the process is repeated. If it is at or below the predetermined threshold of step S194, processing proceeds to step S197 and the matching result is output.

FIG. 16 shows an example in which the speech recognition result text is changed in yet another way. In this example, in addition to the earlier changes of "ねえや" to "姐や" and "おさと" to "お里", "ゆき" is changed to "行き" and "耐え果て" is changed to "絶えはて" (FIG. 16(A)).

In this case, the changed speech recognition result text is as shown in FIG. 16(B), and its translation is as shown in FIG. 16(C). The lyrics card text is unchanged, so its translation remains as in FIG. 13(A). The difference information between the translation of the changed speech recognition result and the translation of the lyrics card text is as shown in FIG. 16(D), and the edit distance is 5, as shown in FIG. 16(E). If this edit distance exceeds the predetermined threshold of step S194, processing proceeds to step S196 and the process is repeated. If it is at or below the predetermined threshold of step S194, processing proceeds to step S197 and the matching result is output.

FIG. 17 shows an example in which the speech recognition result text is changed in yet another way. In this example, "ねえや" is changed to "姐や", "ゆき" to "行き", "おさと" to "お里", "便り" to "たより", and "耐え果て" to "絶えはて" (FIG. 17(A)). The changed speech recognition result text is as shown in FIG. 17(B) and is identical to the lyrics card text. The edit distance between the translated speech recognition result and the translated lyrics card text is therefore zero.

As described above, when the edit distance between the translations is at or below a predetermined value, matching is judged to have succeeded, and the display unit lines of the lyrics card and their time stamps are output for the matched section. For portions where matching failed, the failure is reported. Such portions may be interpolated using the lyrics block analysis unit 110 (FIG. 2).

In the example above, the text matching unit 90 outputs the difference information and the edit distance using translation into an intermediate language; however, it may instead output the difference information and the edit distance directly from the Japanese text, without conversion to an intermediate language.

FIG. 18 shows an example in which the difference information and the edit distance are output using the Japanese text. FIG. 18(A) shows the speech recognition result text, and FIG. 18(B) shows the word sequence obtained by segmenting it into words by morphological analysis. FIG. 18(C) shows the lyrics card text, and FIG. 18(D) shows the word sequence obtained by segmenting it into words by morphological analysis. The comparison between the word-segmented speech recognition result text and lyrics card text is as shown in FIG. 18(E), the difference information is as shown in FIG. 18(F), and the word replacement candidates are as shown in FIG. 18(F). The edit distance is output from the difference information (7 in this example, because 7 words are changed in the speech recognition text and 4 in the lyrics card text), and matching is deemed successful if the edit distance is at or below a predetermined threshold.
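A word-level comparison of this kind can be approximated in Python with the standard library's `difflib` in place of the diff command. This is a sketch under the assumption that both texts have already been segmented into word lists by a morphological analyzer (not shown); `difflib` may group changes differently from diff, so counts can differ from those in the figures.

```python
from difflib import SequenceMatcher


def word_edit_distance(words1, words2):
    """Count changed words on each side and return the larger count."""
    changed1 = changed2 = 0
    matcher = SequenceMatcher(None, words1, words2, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed1 += i2 - i1   # words changed or deleted in the first text
            changed2 += j2 - j1   # words changed or added in the second text
    return max(changed1, changed2)
```

The non-equal opcodes also give the pairs of differing word runs, which could serve as word replacement candidates.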

The word replacement candidates of FIG. 18 can be used for the partial text change in step S196 of FIG. 12.

The display unit line synchronization information determination unit 100 of FIG. 2 determines the synchronization information (time stamp) of each display unit line based on the display unit lines determined by the display unit line division unit 80 and the matching result produced by the text matching unit 90 (FIG. 4, S19). The time stamp of a display unit line is basically the time stamp associated with the first kana of the recognition result text portion assigned to that display unit line, but is not limited to this.

The lyrics block analysis unit 110 of FIG. 2 receives the lyrics text from the lyrics text file input unit 20 and analyzes the lyrics blocks (FIG. 4, S20). FIG. 20 shows the result of extracting information from the lyrics card data. First, as shown in FIG. 20(A), the lyrics block number, chorus number, lyrics text, Hepburn romanization, and the distinction among prelude, interlude, and postlude are extracted. Next, as shown in FIG. 20(B), the similarity between lyrics blocks is evaluated, for example by the Levenshtein distance, and pairs of lyrics blocks with a small distance are determined. The similarity may be calculated by various methods; for example, the cosine distance may be used. In this example, the distances between the blocks are as shown in FIG. 20(B). For a lyrics block for which matching by the text matching unit 90 failed, if the Levenshtein distance to another, successfully matched lyrics block is at or below a predetermined threshold so that the blocks can be judged similar, the time information is borrowed from that block.
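The block similarity check can be illustrated with a standard Levenshtein distance in Python. The `find_donor` helper and its parameters are hypothetical names for the step of choosing a successfully matched block from which time information may be borrowed; blocks are assumed to be plain strings (for example, the romanized lyrics).

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def find_donor(failed, matched, blocks, threshold):
    """Index of the matched block closest to blocks[failed], or None if no
    matched block lies within threshold."""
    best = min(matched,
               key=lambda k: levenshtein(blocks[failed], blocks[k]),
               default=None)
    if best is not None and levenshtein(blocks[failed], blocks[best]) <= threshold:
        return best
    return None
```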

When there is a lyrics block whose time information can be borrowed, the interpolation unit 120 calculates and adds the time stamps for the corresponding display unit lines (FIG. 4, S21). In this example, if time information could not be obtained for the second chorus (lines 3 and 4), the time information may be interpolated from the first chorus (lines 1 and 2) or the fourth chorus (lines 7 and 8), as shown in FIG. 21.
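How the borrowed time information is converted into time stamps is not detailed in the description. One plausible reading, shown in the Python sketch below purely as an assumption, is to reuse the relative line offsets of the similar block and shift them to the start time of the block that failed to match; `target_start` is a hypothetical input that might come, for example, from the end of the preceding silent section.

```python
def borrow_timestamps(donor_times, target_start):
    """Shift the donor block's line start times so that its first line
    lands on target_start, preserving the spacing between lines."""
    if not donor_times:
        return []
    base = donor_times[0]
    return [target_start + (t - base) for t in donor_times]
```

For instance, if the first chorus has lines starting at 10.0 s and 14.5 s and the second chorus is assumed to start at 40.0 s, the borrowed time stamps would be 40.0 s and 44.5 s.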

When a repeated part is omitted from the lyrics card, the interpolation unit 120 may interpolate the text and time information of the display unit lines using the speech recognition text. In the example shown in FIG. 22, the repetitions of "ヘイヘイガール" ("Hey hey girl") and "ララララララ" ("La la la la la la") are omitted from the lyrics card text (or were added in the vocal performance), so the text and time information of these repeated portions are interpolated and written to the LRC file. FIG. 23 shows this operation example.

The data output from the interpolation unit 120 is output from the synchronization information file output unit 130 as an LRC file. An example of the output is shown in FIG. 26. Different LRC files may be provided according to the screen display size of the device that displays the text, so that the text is displayed in a different manner on each device. Fine adjustment of the display start time (delaying or advancing the start) may be set for each device.

This concludes the description of the embodiment.

The present invention is not limited to the embodiment described above and can be modified in various ways without departing from its gist. For example, the invention can be applied to various uses in which a voice signal is synchronized with text to be displayed. For example, video content may be synchronized with its dialogue (script). The invention is also applicable when recorded audio (or video with audio) is played back in synchronization with its speech recognition result. The invention is applicable not only to Japanese but also to various other languages. It should also be noted that those skilled in the art can make various modifications and practice various embodiments in accordance with the claims, and that such embodiments are also covered by the claims.

Reference Signs List
10 Acoustic data file input unit
20 Lyrics text file input unit
30 Vocal voice signal extraction unit
40 Phonetic symbol detection unit
50 Speech recognition unit
60 Phonetic symbol matching unit
60a Kana probability table
70 Vocal silent section detection unit
80 Display unit line division unit
80a Model learning unit
90 Text matching unit
100 Display unit line synchronization information determination unit
110 Lyrics block analysis unit
120 Interpolation unit
130 Synchronization information file output unit
140 Synchronization information file storage unit
200 Synchronization information generation device
300 Computer system
301 Recording medium
400 External device

According to one aspect of the present invention, in order to achieve the above object, a lyrics synchronization information generating apparatus comprises: voice feature detection means for determining voice features from a voice signal and sequentially outputting pairs of a voice feature symbol and time information indicating the appearance time of that voice feature; voice recognition means for performing voice recognition on the voice signal and outputting a voice recognition text; first matching means for matching, in order of appearance, the voice feature symbols output from the voice feature detection means with the voice recognition text output from the voice recognition means, and associating the time information with the voice recognition text; text division means for dividing the voice recognition text output from the voice recognition means into a plurality of line portion texts; second matching means for matching, in order of appearance, a lyrics text with the voice recognition text that was output from the voice recognition means and divided into the plurality of line portion texts by the text division means, and extracting a plurality of line portion texts from the lyrics text in correspondence with each of the plurality of line portion texts of the voice recognition text; and display timing determining means for associating the time information, which the first matching means associated with the voice recognition text, with each of the plurality of line portion texts of the lyrics text that the second matching means extracted in correspondence with each of the plurality of line portion texts of the voice recognition text, thereby determining the display timing of each of the plurality of line portion texts of the lyrics text.

In this configuration, the second matching means may perform matching based on an edit distance between a translation of the lyrics text into a different language and a translation of the text output from the voice recognition means into that different language. The target language may be Japanese but is not limited to it; any language may be used as long as voice recognition is possible, the text can be displayed, and voice features can be detected. That is, the voice recognition means may perform voice recognition on Japanese and output Japanese text, and the second matching means may match the Japanese lyrics text with the Japanese text output from the voice recognition means; however, the invention is not limited to this and may be applied to various languages. When the target language is Japanese, the different language may be English, but is not limited to English.

According to another aspect of the present invention, a text display synchronization signal generating apparatus comprises: voice feature detection means for determining voice features from a voice signal and sequentially outputting pairs of a voice feature symbol and time information indicating the appearance time of that voice feature; voice recognition means for performing voice recognition on the voice signal and outputting a voice recognition text; first matching means for matching, in order of appearance, the voice feature symbols output from the voice feature detection means with the voice recognition text output from the voice recognition means, and associating the time information with the voice recognition text; text division means for dividing the voice recognition text output from the voice recognition means into a plurality of line portion texts; second matching means for matching, in order of appearance, a display target text with the voice recognition text that was output from the voice recognition means and divided into the plurality of line portion texts by the text division means, and extracting a plurality of line portion texts from the display target text in correspondence with each of the plurality of line portion texts of the voice recognition text; and display timing determining means for associating the time information, which the first matching means associated with the voice recognition text, with each of the plurality of line portion texts of the display target text that the second matching means extracted in correspondence with each of the plurality of line portion texts of the voice recognition text, thereby determining the display timing of each of the plurality of line portion texts of the display target text. If the text output from the voice recognition means is already suitably divided, the text division means may be omitted.

According to still another aspect of the present invention, a text display synchronization information generating apparatus comprises: voice feature detection means for determining voice features from a voice signal and sequentially outputting pairs of a voice feature symbol and time information indicating the appearance time of that voice feature; voice recognition means for performing voice recognition on the voice signal and outputting a voice recognition text; matching means for matching, in order of appearance, the voice feature symbols output from the voice feature detection means with the voice recognition text output from the voice recognition means, and associating the time information with the voice recognition text; text division means for dividing the voice recognition text output from the voice recognition means into a plurality of line portion texts; and display timing determining means for determining the display timing of each of the plurality of line portion texts of the voice recognition text using the time information that the matching means associated with the voice recognition text. If the text output from the voice recognition means is already suitably divided, the text division means may be omitted.
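The display timing determination in this simplified aspect can be illustrated with a minimal Python sketch. It assumes the recognized text is already available as a list of (character, time) pairs, where the time is `None` for characters that received no time information, and that the display time of a line is the first available time inside that line. The source does not state which time stamp within a line is used, so that rule, and the function name `line_display_times`, are assumptions made only for illustration.

```python
from typing import List, Optional, Tuple

TimedChar = Tuple[str, Optional[float]]


def line_display_times(
    timed_chars: List[TimedChar], lines: List[str]
) -> List[Tuple[str, Optional[float]]]:
    """Assign a display time to each line portion text.

    timed_chars: recognized text as (character, time) pairs in order.
    lines: the same recognized text split into line portion texts.
    Assumption: a line is displayed at the first time stamp found in it.
    """
    result = []
    pos = 0
    for line in lines:
        segment = timed_chars[pos:pos + len(line)]
        # First character in the line that carries time information, if any.
        start = next((t for _, t in segment if t is not None), None)
        result.append((line, start))
        pos += len(line)
    return result


timed = [("カ", 1.0), ("ゼ", 1.4), ("ノ", None), ("フ", 2.3), ("ク", 2.8)]
print(line_display_times(timed, ["カゼノ", "フク"]))
# → [('カゼノ', 1.0), ('フク', 2.3)]
```

A line in which no character carries time information gets `None`; in the lyrics aspect such gaps would be a candidate for the interpolation mentioned in the claims.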

An operation example of the phonetic symbol detection unit 40 is shown in FIG. 5, together with an operation example of voice recognition described later. In the operation example of FIG. 5, phonetic symbols and appearance time information are extracted using a voice feature extraction engine (S131); the detected phonetic symbols are grouped (clustered) in units of similar phonetic symbols (S132); for groups of frequently appearing phonetic symbols, the voice signal is corrected so that it approaches the waveform of a typical phonetic symbol (S133); and appearance time information is extracted again from the corrected voice signal using the voice feature extraction engine (S134).

The phonetic symbol detection unit 40 of FIG. 2 sequentially outputs pairs of a phonetic symbol and appearance time information (also called a time stamp), and these pairs are supplied to the phonetic symbol matching unit 60.
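As a rough illustration of how such (phonetic symbol, time stamp) pairs could be matched in order of appearance against a recognized text, the following Python sketch aligns the two sequences with the standard library's `difflib.SequenceMatcher` and copies the time stamp onto each matched character. This is a stand-in technique, not the method of the phonetic symbol matching unit 60: it requires exact symbol equality, whereas the device refers to a kana probability table 60a. The function name `attach_times` and the sample data are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def attach_times(
    symbols: List[Tuple[str, float]], recognized: str
) -> List[Tuple[str, Optional[float]]]:
    """Attach time stamps to the characters of a recognized kana reading.

    symbols: (phonetic symbol, time in seconds) pairs in order of appearance.
    recognized: kana reading of the voice recognition text.
    Characters with no exactly matching detected symbol get None.
    """
    detected = [s for s, _ in symbols]
    target = list(recognized)
    matcher = SequenceMatcher(None, detected, target, autojunk=False)
    times: List[Optional[float]] = [None] * len(target)
    # Matching blocks are monotonic, so the order of appearance is preserved.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            times[block.b + k] = symbols[block.a + k][1]
    return list(zip(target, times))


pairs = [("カ", 1.0), ("ゼ", 1.4), ("ガ", 1.9), ("フ", 2.3), ("ク", 2.8)]
print(attach_times(pairs, "カゼノフク"))
# → [('カ', 1.0), ('ゼ', 1.4), ('ノ', None), ('フ', 2.3), ('ク', 2.8)]
```

The unmatched character (here "ノ", detected as "ガ") is left without a time stamp rather than guessed.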

The text matching unit 90 receives the voice recognition text data with time information, divided into display unit lines, that is output from the phonetic symbol matching unit 60, and also receives the lyrics text data input from the lyrics text file input unit 20, and matches the two (FIG. 4, S19). Any matching method may be used as long as it matches one text to the other. In this example, matching is performed so that the edit distance (difference information) is at or below a predetermined threshold, but the matching is not limited to this.
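One way to picture edit-distance matching with a threshold is the Python sketch below. For each recognized line it searches, from the current position in the lyrics, for the lyrics span with the smallest Levenshtein distance and accepts it only if the distance is at or below the threshold. The search strategy, the window of twice the line length, the threshold value, and the names `edit_distance`, `match_line` and `align_lines` are all assumptions for illustration; the source only states that the edit distance is kept at or below a predetermined threshold.

```python
from typing import List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def match_line(line: str, lyrics: str, start: int,
               threshold: int) -> Optional[Tuple[str, int]]:
    """Return (lyrics span, end index) closest to `line`, or None if too far."""
    best = None
    max_end = min(len(lyrics), start + 2 * len(line))
    for end in range(start + 1, max_end + 1):
        d = edit_distance(line, lyrics[start:end])
        if best is None or d < best[0]:
            best = (d, end)
    if best is None or best[0] > threshold:
        return None
    return lyrics[start:best[1]], best[1]


def align_lines(recognized_lines: List[str], lyrics: str,
                threshold: int = 2) -> List[Optional[str]]:
    """Extract one lyrics span per recognized line, in order of appearance."""
    out, pos = [], 0
    for line in recognized_lines:
        hit = match_line(line, lyrics, pos, threshold)
        if hit is None:
            out.append(None)  # left for later handling, e.g. interpolation
        else:
            out.append(hit[0])
            pos = hit[1]
    return out


lyrics = "風が吹く丘の上で君を待つ"
print(align_lines(["風が福丘の上で", "君を待つ"], lyrics))
# → ['風が吹く丘の上で', '君を待つ']
```

In the example, the misrecognized first line differs from the lyrics by one substitution and one insertion (distance 2), so it is still accepted at the assumed threshold of 2.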

Claims

1. A lyrics synchronization information generating apparatus comprising:
voice feature detection means for determining a voice feature from a voice signal and outputting a voice feature symbol together with time information corresponding to its utterance time;
voice recognition means for performing voice recognition on the voice signal and outputting a text;
first matching means for matching, in order of appearance, the voice feature symbol output from the phonetic symbol detection means with the text output from the voice recognition means;
second matching means for matching, in order of appearance, a lyrics text with the text output from the voice recognition means;
text division means for dividing the text output from the voice recognition means into a plurality of line portion texts; and
display timing determining means for associating the time information with each of the plurality of line portion texts based on the matching result of the first matching means and the matching result of the second matching means, and determining a display timing of each of the plurality of line portion texts.

2. The lyrics synchronization information generating apparatus according to claim 1, further comprising voice signal extracting means for extracting the voice signal from a music signal.

3. The lyrics synchronization information generating apparatus according to claim 2, wherein the voice feature detection means outputs, as the time information, the time elapsed from the start point of the music signal.

4. The lyrics synchronization information generating apparatus according to claim 1, wherein the voice feature detection means outputs phonetic symbols as the voice feature symbols.

5. The lyrics synchronization information generating apparatus according to claim 1, wherein the voice recognition means divides the voice signal at portions where the output of the voice signal is small and performs voice recognition on each divided portion.

6. A lyrics synchronization information generating apparatus in which the voice feature detection means determines a typical voice feature that appears in the voice signal and corrects the voice signal so that the typical voice feature is easier to detect.

7. The lyrics synchronization information generating apparatus according to claim 6, wherein the first matching means matches the voice feature symbol detected in correspondence with the typical voice feature with the text output from the voice recognition means, and the display timing determining means determines the display timing of each of the plurality of line portion texts using the time information assigned to the voice feature symbol detected in correspondence with the typical voice feature.

8. The lyrics synchronization information generating apparatus according to claim 1, wherein the second matching means performs matching based on an edit distance between the lyrics text and the text output from the voice recognition means.

9. The lyrics synchronization information generating apparatus according to claim 8, wherein the second matching means performs matching based on an edit distance between a translation of the lyrics text into a different language and a translation of the text output from the voice recognition means into the different language.

10. The lyrics synchronization information generating apparatus according to claim 9, wherein the voice recognition means performs voice recognition on Japanese and outputs a Japanese text, and the second matching means matches the Japanese lyrics text with the Japanese text output from the voice recognition means.

11. The lyrics synchronization information generating apparatus according to claim 10, wherein the different language is English.

12. The lyrics synchronization information generating apparatus according to any one of claims 1 to 11, wherein the text division means divides the text into the plurality of line portion texts based on sections in which the voice signal becomes substantially zero.

13. The lyrics synchronization information generating apparatus according to claim 12, wherein the text division means divides the text into the line portion texts based on an allowable range for the number of characters in a line portion text.

14. The lyrics synchronization information generating apparatus according to claim 1, further comprising display timing interpolating means for comparing lyrics texts among a plurality of chorus parts and interpolating, based on the comparison result, the display timing of a line portion text for which time information could not be obtained.

15. The lyrics synchronization information generating apparatus according to claim 1, wherein, when the voice recognition result of the voice recognition means includes an additional text that is not included in the lyrics text, a line portion text is additionally generated based on the additional text, and a display timing is assigned to each additionally generated line portion text.

16. A lyrics synchronization information generating method comprising:
a step in which voice feature detection means determines a voice feature from a voice signal and outputs a voice feature symbol together with time information corresponding to its utterance time;
a step in which voice recognition means performs voice recognition on the voice signal and outputs a text;
a step in which first matching means matches, in order of appearance, the voice feature symbol output from the phonetic symbol detection means with the text output from the voice recognition means;
a step in which second matching means matches, in order of appearance, a lyrics text with the text output from the voice recognition means;
a step in which text division means divides the text output from the voice recognition means into a plurality of line portion texts; and
a step in which display timing determining means associates the time information with each of the plurality of line portion texts based on the matching result of the first matching means and the matching result of the second matching means, and determines a display timing of each of the line portion texts.

17. A computer program for generating lyrics synchronization information, the program causing a computer to function as:
voice feature detection means for determining a voice feature from a voice signal and outputting a voice feature symbol together with time information corresponding to its utterance time;
voice recognition means for performing voice recognition on the voice signal and outputting a text;
first matching means for matching, in order of appearance, the voice feature symbol output from the phonetic symbol detection means with the text output from the voice recognition means;
second matching means for matching, in order of appearance, a lyrics text with the text output from the voice recognition means;
text division means for dividing the text output from the voice recognition means into a plurality of line portion texts; and
display timing determining means for associating the time information with each of the plurality of line portion texts based on the matching result of the first matching means and the matching result of the second matching means, and determining a display timing of each of the plurality of line portion texts.

18. A text display synchronization information generating apparatus comprising:
voice feature detection means for determining a voice feature from a voice signal and outputting a voice feature symbol together with time information corresponding to its utterance time;
voice recognition means for performing voice recognition on the voice signal and outputting a text;
first matching means for matching, in order of appearance, the voice feature symbol output from the phonetic symbol detection means with the text output from the voice recognition means;
second matching means for matching, in order of appearance, a display target text with the text output from the voice recognition means;
text division means for dividing the text output from the voice recognition means into a plurality of line portion texts; and
display timing determining means for associating the time information with each of the plurality of line portion texts based on the matching result of the first matching means and the matching result of the second matching means, and determining a display timing of each of the plurality of line portion texts.

19. The text display synchronization information generating apparatus according to claim 18, wherein the display target text is the text of the dialogue of video content.

20. A text display synchronization information generating method comprising:
a step in which voice feature detection means determines a voice feature from a voice signal and outputs a voice feature symbol together with time information corresponding to its utterance time;
a step in which voice recognition means performs voice recognition on the voice signal and outputs a text;
a step in which first matching means matches, in order of appearance, the voice feature symbol output from the phonetic symbol detection means with the text output from the voice recognition means;
a step in which second matching means matches, in order of appearance, a display target text with the text output from the voice recognition means;
a step in which text division means divides the text output from the voice recognition means into a plurality of line portion texts; and
a step in which display timing determining means associates the time information with each of the plurality of line portion texts based on the matching result of the first matching means and the matching result of the second matching means, and determines a display timing of each of the line portion texts.

21. A computer program for generating text display synchronization information, the program causing a computer to function as:
voice feature detection means for determining a voice feature from a voice signal and outputting a voice feature symbol together with time information corresponding to its utterance time;
voice recognition means for performing voice recognition on the voice signal and outputting a text;
first matching means for matching, in order of appearance, the voice feature symbol output from the phonetic symbol detection means with the text output from the voice recognition means;
second matching means for matching, in order of appearance, a display target text with the text output from the voice recognition means;
text division means for dividing the text output from the voice recognition means into a plurality of line portion texts; and
display timing determining means for associating the time information with each of the plurality of line portion texts based on the matching result of the first matching means and the matching result of the second matching means, and determining a display timing of each of the plurality of line portion texts.

22. A text display synchronization information generating apparatus comprising:
voice feature detection means for determining a voice feature from a voice signal and outputting a voice feature symbol together with time information corresponding to its utterance time;
voice recognition means for performing voice recognition on the voice signal and outputting a text;
matching means for matching, in order of appearance, the voice feature symbol output from the phonetic symbol detection means with the text output from the voice recognition means;
text division means for dividing the text output from the voice recognition means into a plurality of line portion texts; and
display timing determining means for associating the time information with each of the plurality of line portion texts based on the matching result of the matching means, and determining a display timing of each of the plurality of line portion texts.

23. A text display synchronization information generating method comprising:
a step in which voice feature detection means determines a voice feature from a voice signal and outputs a voice feature symbol together with time information corresponding to its utterance time;
a step in which voice recognition means performs voice recognition on the voice signal and outputs a text;
a step in which matching means matches, in order of appearance, the voice feature symbol output from the phonetic symbol detection means with the text output from the voice recognition means;
a step in which text division means divides the text output from the voice recognition means into a plurality of line portion texts; and
a step of associating the time information with each of the plurality of line portion texts based on the matching result of the matching means, and determining a display timing of each of the plurality of line portion texts.

24. A computer program for generating text display synchronization information, the program causing a computer to function as:
voice feature detection means for determining a voice feature from a voice signal and outputting a voice feature symbol together with time information corresponding to its utterance time;
voice recognition means for performing voice recognition on the voice signal and outputting a text;
matching means for matching, in order of appearance, the voice feature symbol output from the phonetic symbol detection means with the text output from the voice recognition means;
text division means for dividing the text output from the voice recognition means into a plurality of line portion texts; and
display timing determining means for associating the time information with each of the plurality of line portion texts based on the matching result of the matching means, and determining a display timing of each of the plurality of line portion texts.