JP2008180794A

JP2008180794A - Data reproducing apparatus

Info

Publication number: JP2008180794A
Application number: JP2007012746A
Authority: JP
Inventors: Takahiro Tanaka; 孝浩田中
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2007-01-23
Filing date: 2007-01-23
Publication date: 2008-08-07

Abstract

PROBLEM TO BE SOLVED: To provide a technique capable of displaying a lyrics telop in step with the flow of a music piece, even when the tempo of the piece varies, as in a live performance.
SOLUTION: In a storage section 12 of an image display device 10, lyrics data is stored divided into a plurality of blocks, and collating data indicating the phoneme string of the lyrics corresponding to the lyrics data is stored for each block. A control section 11 detects the motion of a singer's lips from image data and, from the detection result, detects the string (pattern) of vowels contained in the singer's singing voice. The control section 11 collates the detected pattern with the collating patterns stored in the storage section 12 and identifies the block being sung based on the degree of matching. At the timing at which the block is identified, the control section 11 starts generating lyrics image data representing an image of the lyrics of that block, combines the generated lyrics image data with the image data representing the captured image, and displays the result on a display section 13.
COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a technique for reproducing data.

A karaoke apparatus displays a lyrics telop on the screen and changes the color of the telop progressively in time with the accompaniment. This function guides the singer to pronounce the correct lyrics at the correct timing. Lyrics telops are displayed not only by karaoke apparatuses but also in music programs on television broadcasts and the like.

As a technique for correcting a lag between a moving image and its audio, Patent Document 1 proposes the following: an image in the moving image that indicates the generation of a sound is analyzed to detect a first sound generation timing; a second sound generation timing is detected from the audio; the lag between the audio and the moving image is measured from the first and second sound generation timings; and the audio is delayed by the measured amount.
Patent Document 1: JP 2000-196917 A

Karaoke accompaniment played by a karaoke apparatus is always reproduced at a fixed tempo. By contrast, when a live performance is given in a music program on television or the like, the tempo often differs from one performance to the next. Consequently, if lyrics telop data made for a karaoke apparatus is used in a television broadcast, a time lag arises between the lyrics telop and the actual singing video, and the result looks unnatural. In television broadcasting, an operator generally switches the lyrics telop display manually as the music progresses, which is laborious. The technique described in Patent Document 1 can correct a lag between a moving image and its audio, but it cannot correct the display timing of a lyrics telop to follow a live performance.
In addition, a live performance may rearrange the song structure, for example by playing the chorus more times than usual.
The present invention has been made in view of these circumstances, and its object is to provide a technique that can match the display timing of a lyrics telop to the video.

A data reproducing apparatus according to a preferred aspect of the present invention comprises: lyrics data storage means for storing lyrics data that represents the lyrics of a music piece and is divided into a plurality of blocks; video data acquisition means for acquiring video data including video of a singer; detection means for detecting the shape of the singer's lips from the video data acquired by the video data acquisition means; phoneme pattern detection means for detecting, from the lip shape detected by the detection means, the sequence of phonemes contained in the voice uttered by the singer; singing block identifying means for collating the phoneme sequence detected by the phoneme pattern detection means with the lyrics data of each block and identifying, based on the collation result, the block that the singer is singing; lyrics video data generating means for reading the lyrics data from the lyrics data storage means and generating lyrics video data representing video of the lyrics indicated by the read lyrics data; and output means for outputting to display means, at the timing at which the singing block identifying means identifies the block being sung, the portion of the lyrics video data generated by the lyrics video data generating means that represents video of the lyrics of the identified block.

In the above aspect, the lyrics video data generating means may start generating the lyrics video data representing video of the lyrics of the identified block at the timing at which the singing block identifying means identifies the block being sung.

Also in the above aspect, the lyrics data may be divided into a plurality of blocks and include synchronization information indicating the reproduction start timing of each block, and the output means may, for each block, output the lyrics video data representing video of the lyrics of that block when the difference between the block's reproduction start timing indicated by the synchronization information and the timing at which the block was identified by the block identifying means is within a predetermined range.
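The range check in this aspect can be sketched as follows. This is only an illustration: the function name, the use of seconds, and the symmetric tolerance are assumptions, since the text says only that the difference must fall within a predetermined range.

```python
def should_output_block(sync_start: float, identified_at: float,
                        tolerance: float) -> bool:
    """Decide whether to output the lyrics video of a block.

    sync_start    -- reproduction start timing from the synchronization
                     information, in seconds (unit is an assumption)
    identified_at -- time at which the block was identified, in seconds
    tolerance     -- hypothetical half-width of the allowed range
    """
    return abs(identified_at - sync_start) <= tolerance
```

With a tolerance of 2 seconds, for example, a block scheduled at 10.0 s and identified at 11.5 s would be output, while one identified at 13.0 s would not.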

The output means may output both the video data acquired by the video data acquisition means and the lyrics video data generated by the lyrics video data generating means.
In this aspect, the apparatus may further comprise delay means for delaying the video data acquired by the video data acquisition means by a predetermined time, and the output means may output the video data delayed by the delay means together with the lyrics video data generated by the lyrics video data generating means.

Also in the above aspect, the apparatus may comprise first audio data storage means for storing first audio data representing audio, audio data acquisition means for acquiring second audio data representing audio, and time alignment means for associating the second audio data acquired by the audio data acquisition means with the first audio data stored in the first audio data storage means in units of frames of a predetermined time length, and the output means may output to the display means, at a timing that depends on the association result produced by the time alignment means, the lyrics video data representing video of the lyrics of the identified block.
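The text does not name an algorithm for the frame-by-frame association. Dynamic time warping (DTW) is one widely used technique for this kind of alignment, and the sketch below substitutes it purely for illustration. Each frame is reduced to a single number (for example a hypothetical per-frame energy value); that representation and the function name are assumptions.

```python
def dtw_path(reference, live):
    """Align two per-frame feature sequences with dynamic time warping.

    Returns a list of (reference_index, live_index) pairs. Both inputs
    are sequences of floats, one value per fixed-length frame.
    """
    n, m = len(reference), len(live)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of the first i reference
    # frames with the first j live frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(reference[i - 1] - live[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    # Walk back from the end to recover the frame correspondence.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path
```

In the resulting path, a reference frame paired with several live frames indicates a passage that the live performance stretched, which is the kind of information the output timing could be derived from.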

According to the present invention, the display timing of a lyrics telop can be matched to the video.

Next, the best mode for carrying out the present invention will be described.
FIG. 1 is a block diagram showing the hardware configuration of a video display apparatus 10 according to an embodiment of the present invention. In the figure, a control unit 11 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), and a RAM (Random Access Memory), and controls each unit of the video display apparatus 10 via a bus BUS by reading and executing a computer program stored in the ROM or in a storage unit 12. The storage unit 12 is storage means for storing the computer program executed by the control unit 11 and the data used during its execution, and is, for example, a hard disk device. A display unit 13 includes a liquid crystal panel or the like and, under the control of the control unit 11, displays a menu screen for operating the video display apparatus 10 and moving images such as live video. An operation unit 14 outputs an operation signal corresponding to the user's operation to the control unit 11. A microphone 15 is a sound collecting device that picks up the singer's singing voice (hereinafter "input voice") and outputs an analog electrical signal representing the waveform of the input voice on the time axis. An audio processing unit 16 converts the electrical signal input from the microphone 15 into a digital signal (hereinafter "input audio signal"). Under the control of the control unit 11, the audio processing unit 16 also converts digital data into an analog signal and outputs it to a speaker 17. The speaker 17 is sound emitting means that emits sound at an intensity corresponding to the audio signal converted from digital data to an analog signal and output by the audio processing unit 16. An imaging unit 18 captures video of the singer and outputs video data representing the captured video to the control unit 11.

This embodiment describes the case where the microphone 15 and the speaker 17 are included in the video display apparatus 10. Alternatively, the audio processing unit 16 may be provided with an input terminal and an output terminal, with an external microphone connected to the input terminal via an audio cable and, likewise, an external speaker connected to the output terminal via an audio cable. This embodiment also describes the case where the audio signal input from the microphone 15 to the audio processing unit 16 and the audio signal output from the audio processing unit 16 to the speaker 17 are analog audio signals, but digital audio data may be input and output instead. In that case, the audio processing unit 16 does not need to perform A/D or D/A conversion. The same applies to the display unit 13, the operation unit 14, and the imaging unit 18: each may be built into the video display apparatus 10 or attached externally.

As illustrated, the storage unit 12 has a lyrics data storage area 121 and a collation data storage area 122. The lyrics data storage area 121 stores lyrics data representing the lyrics of the music piece to be displayed as a lyrics telop. This lyrics data is data created in advance for karaoke. FIG. 2 shows an example of the contents of the lyrics data. As shown in the figure, the lyrics data includes text data indicating the lyrics of the music piece, line feed data indicating line breaks in the lyrics, and wipe start timing data indicating the wipe start timing of each character of the lyrics. When the data is reproduced by the video display apparatus 10 and the lyrics telop is displayed on the screen, each character starts to change color from its left side when its wipe start timing arrives, and the color change is controlled so that it is complete for the whole character when the wipe start timing of the next character is reached. Wipe start timing data is also attached to the line feed data, so for the last character displayed on a line, the interval between that character's wipe start timing and the wipe start timing of the line feed data is the color change time of the character. The speed at which the color of each character changes is determined from the number of dots across the character and its color change time (the difference between the character's wipe start timing and the wipe start timing of the next character).
Each character changes color within its color change time. In the figure, the wipe start timing of the first character is wt1, that of the next character is wt2, and so on in order (wt3, ...).
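The wipe speed rule above amounts to dividing each character's width in dots by its color change time. A minimal sketch, assuming timings in seconds and a list layout in which the final entry is the wipe start timing carried by the line feed data (both are illustrative choices, not the stored format):

```python
def wipe_speeds(char_widths, wipe_starts):
    """Return the color change speed of each character in dots per second.

    char_widths -- horizontal size of each character on the line, in dots
    wipe_starts -- wipe start timing of each character, followed by the
                   wipe start timing attached to the line feed data
    """
    if len(wipe_starts) != len(char_widths) + 1:
        raise ValueError("need one timing per character plus the line feed")
    speeds = []
    for index, width in enumerate(char_widths):
        # Color change time: gap until the next wipe start timing.
        change_time = wipe_starts[index + 1] - wipe_starts[index]
        if change_time <= 0:
            raise ValueError("wipe start timings must be strictly increasing")
        speeds.append(width / change_time)
    return speeds
```

For two 24-dot characters with timings 0.0, 0.5, and 1.5, the first is wiped at 48 dots per second and the second at 24 dots per second.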

In this embodiment, the lyrics data is divided into a plurality of blocks B1, B2, ... (verse, pre-chorus, chorus, and so on) and includes block start data b1, b2, ... indicating the head of each block B1, B2, .... The wipe start timings wt1, wt11, ... at the head of each block indicate the reproduction start timing of that block. In the following description, for convenience, this information indicating the reproduction start timing of each block is referred to as "synchronization information".

The collation data storage area 122 of the storage unit 12 stores, for each block of the lyrics data, collation data indicating the sequence of vowels of the phonemes that make up the lyrics. FIG. 3 shows an example of the contents of the collation data. As shown in the figure, the collation data consists of data indicating the vowels of the phonemes that make up the lyrics. It is referred to when the control unit 11 performs the process of identifying the block being sung (described in detail later). In this embodiment, the collation data indicates the vowel sequence of the first 6 to 8 characters of each block.
The collation data is not limited to the vowel sequence of the first 6 to 8 characters of each block; any data that indicates a sequence of phonemes contained in each block may be used.
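As a rough illustration of how such collation data might be derived, the sketch below keeps only the vowels of a romanized lyric and truncates the result. Working from romaji, the function names, and the example lyrics are all assumptions; the stored format of the collation data is not specified beyond FIG. 3, and this simplification ignores characters that carry no vowel.

```python
VOWELS = "aiueo"


def vowel_pattern(romaji: str, max_chars: int = 8) -> str:
    """Return the vowel sequence at the head of a romanized lyric."""
    vowels = [ch for ch in romaji.lower() if ch in VOWELS]
    return "".join(vowels[:max_chars])


def build_collation_data(block_lyrics: dict) -> dict:
    """Map each block id to the vowel pattern of its opening characters."""
    return {block: vowel_pattern(text) for block, text in block_lyrics.items()}
```

Given the hypothetical blocks `{"B1": "sakura sakura", "B2": "yayoi no sora wa"}`, this yields `"auaaua"` for B1 and `"aoiooaa"` for B2.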

Next, the functional configuration of the video display apparatus 10 will be described with reference to FIG. 4, which is a block diagram of that configuration. In the figure, a video analysis unit 111, a karaoke lyrics generation unit 112, a video playback unit 113, and a delay unit 114 are realized by the control unit 11 of the video display apparatus 10 reading and executing a computer program stored in the ROM or the storage unit 12. The arrows in the figure schematically show the flow of data. In this embodiment the video analysis unit 111 and the karaoke lyrics generation unit 112 are realized as software, but they may instead be realized by hardware.

In FIG. 4, the video analysis unit 111 acquires video data from the imaging unit 18 and detects the shape of the singer's lips from the acquired video data. The video analysis unit 111 analyzes the video data output from the imaging unit 18, performs face detection, and detects the movement of the singer's lips (the change in lip shape) based on the detection result. Face detection may, for example, combine skin color detection with collation of the captured video against images of predetermined patterns, detecting the face portion based on the degree of coincidence. The face detection method is not limited to this; any method that can detect the singer's face portion may be used.

When movement of the singer's lips begins to be detected, the video analysis unit 111 determines that the singing of one of the blocks B1, B2, ... has started. From the moment it determines that the singing of a block has started, the video analysis unit 111 detects the sequence of phonemes contained in the voice uttered by the singer, and thereby detects the phoneme sequence at the head of each block. This phoneme sequence detection process is described below.

The video analysis unit 111 detects the movement of the singer's lips (the change in lip shape) from the image of the detected face portion and, based on the detection result, detects the sequence of phonemes contained in the voice uttered by the singer. In this embodiment, the video analysis unit 111 detects the sequence of vowels contained in that voice. The detection may be performed, for example, by storing in advance pattern data indicating the lip shapes corresponding to the vowels (a, i, u, e, o) and collating the detected lip shape with the stored pattern data. The video analysis unit 111 may also detect changes in lip shape for a predetermined time after movement of the singer's lips begins to be detected, and detect the phoneme sequence from the results obtained during that period.
Two main methods are conceivable for converting lip shapes into phonemes. The first is to photograph one person's lip shape character by character and build a database of the relationship between lip shapes and phonemes. This method converts to phonemes relatively accurately when the image shows that same person, but it lacks versatility. The second is to convert general lip shapes or lip movements into phonemes. Compared with the first method it is less accurate but highly versatile. The second method associates phonemes with typical lip shapes: for example, a widely opened mouth corresponds to "a", a horizontally elongated shape to "i", and a small, pursed shape to "u". Such changes in lip shape can be extracted from image data by computing the difference between the image data of temporally adjacent frames, because in an image of a person speaking, the difference between successive frames consists almost entirely of the lips.
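The second method can be sketched as a nearest-template lookup. Everything concrete here is an assumption made for illustration: the lip shape is reduced to a normalized (width, height) pair for the mouth opening, the template values are invented, and each input sample is taken to correspond to one sung character. Extracting those measurements from video frames is outside the sketch.

```python
# Hypothetical reference lip shapes as (width, height) of the mouth
# opening, normalized to 0..1. The values are illustrative only.
VOWEL_TEMPLATES = {
    "a": (0.7, 0.8),  # mouth opened wide
    "i": (0.9, 0.2),  # horizontally elongated
    "u": (0.3, 0.2),  # small and pursed
    "e": (0.8, 0.5),
    "o": (0.4, 0.6),
}


def classify_lip_shape(width: float, height: float) -> str:
    """Return the vowel whose template is closest to the observed shape."""
    def squared_distance(vowel: str) -> float:
        template_width, template_height = VOWEL_TEMPLATES[vowel]
        return (width - template_width) ** 2 + (height - template_height) ** 2

    return min(VOWEL_TEMPLATES, key=squared_distance)


def lip_shapes_to_pattern(shapes) -> str:
    """Convert a sequence of (width, height) samples to a vowel string."""
    return "".join(classify_lip_shape(w, h) for w, h in shapes)
```

A per-singer database, as in the first method, would correspond to replacing `VOWEL_TEMPLATES` with values measured from that singer.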

The video analysis unit 111 supplies data indicating the detected phoneme sequence to the karaoke lyrics generation unit 112.
The karaoke lyrics generation unit 112 compares the sequence of vowels (phonemes) detected by the video analysis unit 111 (hereinafter "phoneme pattern") with the collation data for each block stored in the collation data storage area 122 (the data indicating the vowel sequence of the lyrics) and, based on the degree of matching, identifies which block of the lyrics the singer is singing. Specifically, this process may identify the block with the highest degree of matching, or it may identify a block whose degree of matching is at or above a predetermined value.
In this way, the video analysis unit 111 uses the collation data shown in FIG. 3 to perform pattern matching against the lyrics (the opening of each phrase, that is, of each block), treating the first 6 to 8 characters of each block as a set. Because this embodiment collates phoneme strings of a certain length, collation accuracy can be kept high even when the phoneme pattern (phoneme sequence) analyzed from the singing video contains some errors.
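The collation step can be sketched as follows. The text does not define how the degree of matching is computed, so the position-by-position agreement ratio used here is an assumption, as are the function names and the default threshold. The sketch combines the two selection rules mentioned above by taking the best-scoring block and rejecting it if it falls below the threshold.

```python
def match_degree(detected: str, reference: str) -> float:
    """Fraction of positions at which the two vowel strings agree."""
    if not detected or not reference:
        return 0.0
    hits = sum(d == r for d, r in zip(detected, reference))
    return hits / max(len(detected), len(reference))


def identify_block(detected: str, collation_data: dict,
                   threshold: float = 0.6):
    """Return the id of the best-matching block, or None if no block
    reaches the (hypothetical) minimum degree of matching."""
    best_block, best_score = None, 0.0
    for block, reference in collation_data.items():
        score = match_degree(detected, reference)
        if score > best_score:
            best_block, best_score = block, score
    return best_block if best_score >= threshold else None
```

With collation data `{"B1": "auaaua", "B2": "aoiooaa"}`, the detected pattern `"auaaue"` (one vowel misread) still scores 5/6 against B1 and is identified as B1, which illustrates why matching a string of several vowels tolerates some detection error.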

The karaoke lyrics generation unit 112 also reads the lyrics data from the lyrics data storage area 121 and generates lyrics video data representing video of the lyrics indicated by the read lyrics data (hereinafter "lyrics video"). The karaoke lyrics generation unit 112 starts generating the lyrics video data for the identified block at the timing at which the block being sung is identified.

The karaoke lyrics generation unit 112 sequentially supplies the generated lyrics video data to the video playback unit 113. The delay unit 114 delays the video data acquired from the imaging unit 18 by roughly the processing time of the video analysis unit 111 and the karaoke lyrics generation unit 112, and supplies it to the video playback unit 113. The video playback unit 113 superimposes the lyrics video represented by the lyrics video data supplied from the karaoke lyrics generation unit 112 on the video represented by the video data supplied from the delay unit 114, and outputs the resulting composite video data to the display unit 13.
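The role of the delay unit 114 can be illustrated with a fixed-length frame buffer. This is a sketch under assumptions: the delay is expressed as a whole number of frames, and the class and method names are hypothetical rather than taken from the embodiment.

```python
from collections import deque


class FrameDelay:
    """Hold back video frames by a fixed number of frames."""

    def __init__(self, delay_frames: int):
        if delay_frames < 0:
            raise ValueError("delay_frames must be non-negative")
        self._delay = delay_frames
        self._buffer = deque()

    def push(self, frame):
        """Store the newest frame and return the frame captured
        delay_frames earlier, or None while the buffer is filling."""
        self._buffer.append(frame)
        if len(self._buffer) > self._delay:
            return self._buffer.popleft()
        return None
```

Choosing `delay_frames` to cover the analysis and lyrics generation time lets the superimposed lyrics line up with the frame in which the corresponding lip movement was captured.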

In this way, the composite video data is output to the display unit 13, and the singing voice picked up by the microphone 15 is output to the speaker 17. The video display apparatus 10 according to this embodiment can thus reproduce the music together with video in which lyrics video has been combined with the captured video in time synchronization, that is, in step with the progress of the music.

A specific example of the processing performed by the karaoke lyrics generation unit 112 will now be described with reference to FIG. 5. FIG. 5(a) shows an example of the contents of the lyrics data, FIG. 5(b) shows an example of the singing voice actually sung by the singer, and FIG. 5(c) shows an example of the lyrics telop displayed by the video display apparatus 10. As shown in FIG. 5(a), the lyrics data is divided into a plurality of blocks B1, B2, B3, B4, B5, .... Suppose, however, that the performer (singer) giving the live performance does not follow the lyrics data and instead sings block B4 twice in succession, in the order B1, B2, B3, B4, B4, B5, .... Also, as illustrated, the start time of block B1 in the live singing shown in (b) is offset by a time Δt from that in the lyrics data shown in (a).

In the example shown in FIG. 5, when the singer sings block B1, the control unit 11 analyzes the video data output from the imaging unit 18 to detect the shape of the singer's lips, and identifies the phoneme pattern of the voice uttered by the singer from the detected lip shape. The control unit 11 then collates the identified phoneme pattern with the collation data stored in the collation data storage area 122 and identifies block B1. The control unit 11 starts generating lyrics video data representing the lyrics video of the identified block B1 and starts the wipe processing of the lyrics telop corresponding to block B1. That is, the control unit 11 starts displaying the lyrics telop corresponding to block B1 and, based on the wipe start timings wt1, wt2, wt3, ... of the characters in the lyrics data (see FIG. 2), performs the wipe processing of the lyrics telop at times (wt1 + Δt), (wt2 + Δt), (wt3 + Δt), ....
As a result, even when the timing at which the singer starts and the display start timing of the lyrics telop differ (by the time Δt in the example of FIG. 5), the two can be synchronized and the display is prevented from looking unnatural.
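The timing shift in this example can be written out directly: Δt is the difference between the moment the block is identified and the block's first stored wipe start timing, and every wipe in the block is moved by that amount. A minimal sketch, with the function name and the use of seconds as assumptions:

```python
def shift_block_timings(wipe_starts, identified_at):
    """Shift a block's wipe start timings so that the first wipe begins
    when the block is identified in the live video.

    wipe_starts   -- stored timings wt1, wt2, ... for the block
    identified_at -- time at which the block was identified
    """
    delta_t = identified_at - wipe_starts[0]
    return [t + delta_t for t in wipe_starts]
```

For stored timings 1.0, 1.5, and 2.25 and a block identified at 3.0, Δt is 2.0 and the wipes run at 3.0, 3.5, and 4.25. The spacing between characters is unchanged, so this sketch corrects the start offset only, not tempo changes inside the block.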

Also in the example shown in FIG. 5, the singer sings block B4 and then sings block B4 again. In this case, as with block B1 above, the control unit 11 detects the movement of the singer's lips from the video data, identifies the phoneme pattern from the detected lip shape, compares the identified phoneme pattern with the collation patterns, and identifies block B4. The control unit 11 then generates the lyrics video data for the identified block B4 and performs the wipe processing.
Because this embodiment collates the phoneme pattern at the head of each block, even when the singer sings in an order different from the lyrics data, for example by singing block B4 twice in succession as shown in FIG. 5(b), the lyrics telop corresponding to the lyrics actually sung can be displayed as shown in FIG. 5(c), and a lyrics telop that differs from the actual live video is prevented from being displayed.

As described above, this embodiment detects the shape of the singer's lips to identify which part of the lyrics is being sung and displays the lyrics telop for that part in synchronization with the video, so an accurately timed lyrics telop can be added to live performance video. Because only the movement pattern of the singer's lips needs to be detected, the lyrics telop can be synchronized easily. Furthermore, because the timing is detected and adjusted by analyzing the video, the lyrics display portion of content created for karaoke can be applied to live or recorded video.

Incidentally, a singer often does not sing according to a reference such as a CD. For example, the singer may swap the second and third verses, change the order of the chorus and other phrases, or sing the refrain more times than in the reference, so the lyrics actually sung frequently differ from the reference. In such cases, with a conventional apparatus, the content of the lyrics screen often differed from the singing itself.
In contrast, in this embodiment, even when the singer does not sing as on the original CD, for example by singing the third verse before the second or by singing a repeated section more or fewer times, a discrepancy between the lyrics card (lyrics data) and the actual performance can be prevented.

<Modifications>
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be implemented in various other forms. Examples are given below. The following aspects may also be combined as appropriate.
(1) In the embodiment described above, the control unit 11 acquires the video data output from the photographing unit 18. However, the configuration is not limited to this. The video data may be read from storage means such as a hard disk, or a communication unit may be provided in the video display device 10 so that the video data is received via a communication network. In short, any form may be used as long as the control unit 11 acquires video data.

(2) In the embodiment described above, the control unit 11 outputs the video data on which the lyrics telop is superimposed to the display unit 13. The output destination of the video data is not limited to the display unit 13. For example, a communication unit may be provided in the video display device 10, and the video data may be output via that communication unit to another device connected via a communication network. Alternatively, the video data may be stored in storage means such as a hard disk. In short, any form may be used as long as the control unit 11 outputs video data.

(3) In the embodiment described above, the delay unit 114 delays the video data acquired from the photographing unit 18 by approximately the processing time of the video analysis unit 111 and the karaoke lyrics generation unit 112. However, the delay applied by the delay unit 114 is not limited to this and may, for example, include additional time for setting the wipe start timing of the lyrics slightly earlier than the timing at which they are actually sung. In this way, the display and wipe of the lyrics telop start a predetermined time before the singer starts singing, so the viewer can follow the timing of the lyrics more easily.
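The delay described in modification (3) can be sketched as a simple frame buffer. This is an illustrative sketch only: the class name `FrameDelay`, the frame-count parameters, and the use of a FIFO queue are assumptions made for the example and are not taken from the embodiment.

```python
from collections import deque


class FrameDelay:
    """Delay a video stream by a fixed number of frames.

    `processing_frames` stands for the time needed for video analysis
    and lyrics generation; `lead_frames` is the optional extra delay
    that lets the lyrics wipe start slightly before the singing does.
    Both names are hypothetical.
    """

    def __init__(self, processing_frames, lead_frames=0):
        self.delay = processing_frames + lead_frames
        self.buffer = deque()

    def push(self, frame):
        """Store `frame`; return the frame from `delay` frames ago.

        Returns None while the buffer is still filling.
        """
        self.buffer.append(frame)
        if len(self.buffer) > self.delay:
            return self.buffer.popleft()
        return None


# Two frames of processing time plus one frame of lead time.
delay_unit = FrameDelay(processing_frames=2, lead_frames=1)
outputs = [delay_unit.push(frame_no) for frame_no in range(5)]
```

With the extra lead time, a telop generated at the moment a block is identified is combined with video frames from slightly before that moment, so it appears on screen just ahead of the singing.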

(4) In the embodiment described above, the lyrics being sung were identified by analyzing the video and detecting the movement of the singer's lips. In addition, audio data representing the singer's singing voice may be analyzed, and the lyrics being sung may be identified according to the analysis result. Specifically, for example, audio data that records a performance including a model vocal and accompaniment for the piece, and to which time codes indicating the playback time are attached (hereinafter "first audio data"), is stored in the storage unit 12 in advance. The control unit 11 then performs time alignment processing that associates the first audio data with the audio data input from the microphone 15 (hereinafter "second audio data") in units of frames of a predetermined time length, detects the temporal shift between the first audio data and the second audio data according to the association result, and corrects the generation timing of the lyrics video data so that the detected shift is eliminated. In this way, the display timing of the lyrics telop is corrected using the audio analysis result in addition to the video analysis result, so the correction can be made more accurate.
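The time alignment in modification (4) can be illustrated with dynamic time warping (DTW). Note that the text above does not name an alignment algorithm, so DTW is an assumption here, as are the function names `dtw_path` and `shift_seconds` and the use of one scalar feature per frame (a real system would more likely compare spectral feature vectors).

```python
def dtw_path(ref, live):
    """Align two 1-D per-frame feature sequences with DTW.

    Returns the warping path as a list of (ref_index, live_index) pairs.
    """
    n, m = len(ref), len(live)
    inf = float("inf")
    # cost[i][j]: minimal cost of aligning ref[:i] with live[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(ref[i - 1] - live[j - 1])
            cost[i][j] = dist + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def shift_seconds(path, frame_sec):
    """Time shift of the live audio relative to the reference,
    measured at the most recent aligned frame pair."""
    ref_idx, live_idx = path[-1]
    return (live_idx - ref_idx) * frame_sec


reference = [0, 1, 2, 3, 4]        # first audio data (with time codes)
performance = [0, 0, 0, 1, 2, 3, 4]  # second audio data, two frames late
path = dtw_path(reference, performance)
shift = shift_seconds(path, frame_sec=0.5)
```

The resulting shift is the amount by which the generation timing of the lyrics video data would be moved so that the telop follows the live performance rather than the reference recording.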

(5) In the embodiment described above, a video display device is used as the data reproducing apparatus according to the present invention. However, the data reproducing apparatus is not limited to a video display device. Various devices, such as a dedicated computer installed at a television broadcasting station, a personal computer, or a mobile communication terminal, can be used as the data reproducing apparatus according to the present invention.

(6) In the embodiment described above, the case where the movement of the singer's lips is synchronized with the lyrics telop was described as an example. However, the detection target is not limited to singing, and the playing motion of a musical instrument may be detected instead. In this case, for example, data representing the musical score of the piece is stored in place of the lyrics data, and the control unit 11 detects the performer's playing motion from the video data (for example, the position of the hands on the keyboard when a keyboard instrument is played), detects a pattern of phoneme combinations based on the detection result, collates the detected phoneme pattern with the matching patterns, and identifies the part being played based on the collation result. In this case, the image of the score represented by the score data can be synchronized with the video of the performer. Thus, "voice" as used in the present invention includes various sounds, such as the human voice and the sound of a musical instrument being played.

(7) In the embodiment described above, the data reproducing apparatus outputs video data in which lyrics video data representing the lyrics video is superimposed on the acquired video data (representing the captured video). However, the configuration is not limited to this, and the data reproducing apparatus may output only the lyrics video data. Specifically, for example, the apparatus may output the lyrics video data to a connected external display device at a timing matched to the video, and the external display device may combine the video data representing the captured video with the lyrics video data and output the result.

(8) In the embodiment described above, the control unit 11 detects the movement of the singer's lips and, from the detection result, detects the sequence (combination) of vowels in the voice uttered by the singer. However, detection is not limited to vowels. For example, a sequence (combination) of phonemes including consonants may be detected. In short, it suffices to detect a pattern of phoneme combinations.

(9) In the embodiment described above, the control unit 11 starts generating the lyrics video data representing the video of the lyrics of the identified block at the timing when the block being sung is identified, thereby synchronizing the display of the lyrics telop with the display of the video represented by the video data. The manner of synchronizing the lyrics telop with the video is not limited to this. For example, the lyrics video data may be generated and stored in advance, and at the timing when the block being sung is identified, the lyrics video data representing the video of the lyrics of the identified block may be read out and output to the display means. In short, it suffices to identify the block being sung and, at the timing when that block is identified, output the lyrics video data representing the video of its lyrics to the display means.

(10) In the embodiment described above, the control unit 11 may be configured to output, for each block, the lyrics video data representing the video of the lyrics of that block to the display unit 13 when the deviation between the reproduction start timing of the block indicated by synchronization information (information indicating the reproduction start timing of each block) and the timing at which the block being sung is identified is within a predetermined range.
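The check in modification (10) amounts to a tolerance test on two timestamps. The sketch below assumes times in seconds; the names `should_output`, `start_times`, and `tolerance`, as well as the sample values, are hypothetical.

```python
def should_output(block, identified_at, start_times, tolerance):
    """Return True when the block was identified close enough to the
    reproduction start timing given by the synchronization information.

    `start_times` maps each block to its reproduction start timing and
    `tolerance` is the predetermined range, both in seconds.
    """
    return abs(identified_at - start_times[block]) <= tolerance


# Hypothetical synchronization information for two blocks.
start_times = {"B1": 10.0, "B2": 25.0}

accepted = should_output("B1", identified_at=11.5,
                         start_times=start_times, tolerance=2.0)
rejected = should_output("B2", identified_at=10.0,
                         start_times=start_times, tolerance=2.0)
```

A test of this kind would suppress an identification that lands far from where the synchronization information expects the block, at the cost of also suppressing blocks that the singer genuinely sings out of order.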

(11) In the embodiment described above, the video display device 10 implements all of the functions of the embodiment. Alternatively, a plurality of devices connected via a communication network may share these functions, and a system comprising those devices may implement the video display device 10 of the embodiment. For example, a dedicated computer having functions such as detecting lip movement and generating lyrics telop images, and a terminal device having a display unit and a speaker, may be configured as a system connected via a network.

(12) The program executed by the control unit 11 of the video display device 10 described above may be provided stored on a recording medium such as a magnetic tape, a magnetic disk, a flexible disk, an optical recording medium, a magneto-optical recording medium, a RAM, or a ROM. The program may also be downloaded to the video display device 10 via a network such as the Internet.

Block diagram showing an example of the configuration of the video display device.
Diagram showing an example of the content of the lyrics data.
Diagram showing an example of the content of the collation data.
Block diagram showing an example of the functional configuration of the video display device.
Diagram for explaining the display timing of the lyrics telop.

Explanation of symbols

10 ... video display device, 11 ... control unit, 12 ... storage unit, 13 ... display unit, 14 ... operation unit, 15 ... microphone, 16 ... audio processing unit, 17 ... speaker, 18 ... photographing unit, 111 ... video analysis unit, 112 ... karaoke lyrics generation unit, 113 ... video playback unit, 121 ... lyrics data storage area, 122 ... collation data storage area.

Claims

Lyrics data storage means for storing lyrics data representing the lyrics of a song and divided into a plurality of blocks;
Video data acquisition means for acquiring video data including the video of the singer;
Detection means for detecting the shape of the lips of the singer from the video data acquired by the video data acquisition means;
Phoneme pattern detection means for detecting the arrangement of phonemes included in the voice uttered by the singer from the shape of the lips detected by the detection means;
In-singing block identifying means for collating the phoneme pattern detected by the phoneme pattern detection means with the lyrics data for each block, and identifying the block that the singer is singing based on the collation result;
Lyrics video data generating means for reading out the lyrics data from the lyrics data storage means and generating lyrics video data representing a video of lyrics indicated by the read lyrics data;
Output means for outputting, to display means, at the timing when the block being sung is identified by the in-singing block identifying means, the lyric video data representing the video of the lyrics of the identified block among the lyric video data generated by the lyric video data generating means; a data reproducing apparatus comprising the above means.

The data reproducing apparatus according to claim 1, wherein the lyric video data generating means starts generating the lyric video data representing the video of the lyrics of the identified block at the timing when the block being sung is identified by the in-singing block identifying means.

In the data reproducing device according to claim 1 or 2,
The lyrics data is divided into a plurality of blocks, and includes synchronization information indicating a reproduction start timing of each divided block,
The output means, for each of the blocks, when the deviation between the reproduction start timing of each block indicated by the synchronization information and the timing at which the block is specified by the block specifying means is within a predetermined range, A data reproduction apparatus characterized by outputting lyric video data representing a lyric video of the block.

The data reproduction apparatus according to any one of claims 1 to 3,
The data output apparatus characterized in that the output means outputs the video data acquired by the video data acquisition means and the lyrics video data generated by the lyrics video data generation means.

The data reproducing apparatus according to claim 4, wherein
Delay means for delaying the video data acquired by the video data acquisition means by a predetermined time;
The data output device characterized in that the output means outputs the video data delayed by the delay means and the lyrics video data generated by the lyrics video data generation means.

The data reproducing apparatus according to any one of claims 1 to 5,
First voice data storage means for storing first voice data representing voice;
Voice data acquisition means for acquiring second voice data representing voice;
Time alignment means for associating the second sound data acquired by the sound data acquisition means with the first sound data stored in the first sound data storage means in units of frames of a predetermined time length;
The data reproducing apparatus characterized in that the output means outputs, to the display means, the lyric video data representing the video of the lyrics of the identified block among the lyric video data generated by the lyric video data generating means, at a timing according to the result of the association by the time alignment means.