JP4877811B2

JP4877811B2 - Specific section extraction device, music recording / playback device, music distribution system

Info

Publication number: JP4877811B2
Application number: JP2007104946A
Authority: JP
Inventors: 悟松本; 友二山本; 達雄古賀
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 2007-04-12
Filing date: 2007-04-12
Publication date: 2012-02-15
Anticipated expiration: 2027-04-12
Also published as: JP2008262043A

Description

The present invention relates to an apparatus for extracting a specific section, such as a chorus section, from music data.

In recent years, many recording devices have been developed that are equipped with a large-capacity recording medium such as an HDD and can record and reproduce a wide variety of music content.

Music content can be searched using, as keywords, the music information contained in the content (for example, in its header area). However, music content that carries no music information, such as music aired on FM radio, may also be recorded on the recording device. In that case a keyword search is not possible, and the user has to search by playing back the content items one after another. To shorten the search time, a recording device that can automatically play back a part of each piece of music content in sequence is therefore desirable.

This requires a technique for automatically extracting a part of the music content. For example, the automatic music division method disclosed in Patent Document 1 can be used. This method subdivides music content into short sections, calculates their mutual similarity, and extracts the points at which the similarity changes greatly between adjacent sections as division points of the music. In this way, a part of the music content can be extracted automatically.
Patent Document 1: JP 2006-163063 A

However, even if a part of the music content can be extracted automatically using a method such as that described in Patent Document 1, the extracted part may not be very characteristic (for example, the A melody or B melody). A user who hears such a part may not recognize it as the piece he or she is looking for. A recording device is therefore desired that can sequentially play back not just some part of each piece of music content, but the chorus part of each piece.

The present invention has been made in view of this problem, and its object is to provide a specific section extraction device that can efficiently extract a specific section, such as the chorus part, from music content.

To achieve the above object, the specific section extraction device of the present invention comprises: a cut point detection unit that detects, as a cut point, a time point at which the audio signal level of music data, or the amount of change thereof, is equal to or greater than a predetermined value; and an extraction unit that takes the section around the time point at which the audio signal level of the music data, or the amount of change in the audio signal level, is at its maximum within the music data as a provisional specific section candidate, determines, based on the time information of the cut points, whether a section similar to the specific section candidate exists in the music data, and, when such a similar section exists, extracts the specific section candidate as the specific section.

The "chorus" part of a piece of music content is likely to occur more than once within that content. By exploiting this property, the specific section extraction device described above can efficiently extract a specific section such as the chorus part.

The audio signal is often loud in the chorus part. It is therefore considered more efficient to start the search from the loud parts of the audio signal than to search for similar sections in order from the beginning of the piece.

For this reason, the extraction unit takes the section around the time point at which the audio signal level of the music data, or the amount of change in the audio signal level, is at its maximum within the piece as the provisional specific section candidate.

The audio signal level may also become high because noise was mixed in when the content was recorded. In that case, extracting the chorus part solely on the basis of the audio signal level risks extracting the noise instead. If the specific section candidate is extracted as the specific section only after confirming that another section similar to it exists, a specific section such as the chorus part can be extracted correctly even when noise is present.

Preferably, when no section similar to the specific section candidate exists, the extraction unit takes the section around the time point at which the audio signal level of the music data, or the amount of change in the audio signal level, is second largest within the piece as a new provisional specific section candidate, and, when a section similar to that candidate exists, extracts that candidate as the specific section.

Rather than ending the chorus extraction process when no section similar to the section with the highest audio signal level exists, the section with the second-highest audio signal level is taken as the chorus candidate. This increases the probability that the chorus part is detected. Moreover, even if the section with the second-highest audio signal level is not the actual chorus, a loud passage is likely to have left an impression on the user. Such a specific section extraction device is therefore very useful, particularly in applications where a part of each piece of music content is automatically played back in sequence.

According to the specific section extraction device of the present invention, a specific section such as the chorus part can be efficiently extracted from music content.

The present invention is described below with reference to the drawings illustrating its embodiment. FIG. 1 is a configuration diagram of a recording/reproducing apparatus according to an embodiment of the present invention. This recording/reproducing apparatus records a broadcast signal, such as an FM radio broadcast, on an HDD (hard disk drive) 10. In parallel, it detects the music data portions contained in the broadcast signal. It also extracts the chorus section from the music data and records the chorus section data on the HDD 10.

The tuner unit 1 tunes to and receives a broadcast signal, such as an FM radio broadcast, and demodulates it into an audio signal. The A/D converter 2 converts the analog audio signal selected by the tuner unit 1 into a digital signal.

The MP3 (MPEG Audio Layer-3) codec 3 has an encoder function, which encodes digital audio data, generates compressed encoded data, and outputs it paired with time information, and a decoder function, which decodes the encoded data. The D/A converter 4 converts the digital audio data decoded by the MP3 codec 3 into an analog signal. This analog signal is input to the speaker 5 through an amplifier (not shown).

To detect the audio signal level, the DSP 7 calculates from the audio signal the audio power, that is, the square of the amplitude of the audio signal. To detect the amount of change in the audio signal level, the DSP 7 also calculates the amount of change in the audio power. The DSP 7 defines the timing at which the amount of change in the audio power becomes equal to or greater than a predetermined value as a cut point and detects it. The DSP 7 further calculates a frequency-domain feature quantity, for example MFCC, only in the vicinity of each cut point. It then calculates the likelihood between the calculated MFCC and MFCC calculated from sample audio signals stored in advance in the external memory 11. Here, the likelihood is the likelihood that the signal is music.

The CPU 8 controls the operation of the recording/reproducing apparatus via the bus 6. It also detects music sections in the broadcast signal using the time information of the cut points detected by the DSP 7, and extracts the chorus section from the detected music data.

The HDD 10 is a large-capacity recording device that records the encoded data and time information via the HDD interface unit 9, which implements, for example, an ATA interface. The memory 11 holds the loaded execution program, temporarily stores data produced by arithmetic processing, and delays the audio data immediately after A/D conversion for a fixed period. The MP3 codec 3, DSP 7, CPU 8, HDD interface unit 9, and memory 11 exchange data with one another via the bus 6.

(Description of the process for detecting the music data portion contained in the broadcast signal)
This recording/reproducing apparatus ultimately extracts the chorus section of a piece of music from the broadcast signal. First, the procedure for detecting the music data portion in the broadcast signal is described.

FIG. 2 is a functional block diagram of the recording/reproducing apparatus of this embodiment. The audio signal selected by the tuner unit 1 is input to the A/D converter 2 and converted to digital form. It is then input to the MP3 codec 3 together with time information, compression-encoded into MP3 data, and, paired with the time information, recorded continuously on the HDD 10 via the HDD interface unit 9 throughout the recording period.

The digital audio data from the A/D converter 2 is stored in the delay memory 11a, which delays it by the time required for processing in the DSP 7. At the same time, the audio power calculation unit 71 in the DSP 7 calculates the audio power corresponding to the audio signal level, that is, the square of the amplitude of the audio signal.

To identify the sections of the audio signal that correspond to music, the cut point detection unit 72 in the DSP 7 detects, as cut points, the timings at which the audio signal level changes sharply, that is, the timings at which the amount of change in the audio power value exceeds a predetermined value, and issues a detection output. At the same time, the time information of each cut point and its amount of change are stored in the temporary storage memory 11C.

FIG. 3 is a waveform diagram explaining the operation of the cut point detection unit 72. FIG. 3(a) shows the change in the audio power value, and FIG. 3(b) shows the change in the amount of change (differential value). As shown in FIG. 3, based on the audio power value calculated by the audio power calculation unit 71, the cut point detection unit 72 detects as cut points the times Tm and Tm+1 at which the differential value reaches a local maximum that exceeds a predetermined threshold. The detection result is input to the frequency feature calculation unit 73.
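The power calculation and cut point detection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the frame-based averaging, the function names `audio_power` and `detect_cut_points`, and the use of a simple frame-to-frame difference as the "differential value" are all assumptions made for the example.

```python
def audio_power(samples, frame_len):
    """Mean squared amplitude per non-overlapping frame (framing is an assumption)."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(s * s for s in frame) / frame_len)
    return powers


def detect_cut_points(powers, threshold):
    """Return (frame_index, change) pairs where the frame-to-frame change in
    power is a local maximum that exceeds the threshold."""
    diffs = [powers[k] - powers[k - 1] for k in range(1, len(powers))]
    cuts = []
    for k, d in enumerate(diffs):
        left = diffs[k - 1] if k > 0 else float("-inf")
        right = diffs[k + 1] if k + 1 < len(diffs) else float("-inf")
        if d > threshold and d >= left and d > right:
            # diffs[k] is the change going into frame k + 1
            cuts.append((k + 1, d))
    return cuts
```

Each returned pair holds the frame index and the amount of change, which corresponds to the time information and change amount that the text says are stored for each cut point.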

The frequency feature calculation unit 73 receives the audio data output from the delay memory 11a after a predetermined delay. In synchronization with the output of the cut point detection unit 72, it calculates a frequency feature quantity such as MFCC only for the short period from slightly before the cut point to slightly after it, and inputs the result to the likelihood calculation unit 74.

This embodiment exploits the fact that music and speech have different frequency features: the frequency features of typical music and those of speech are stored in advance in the external memory 11b as reference data for comparison. The likelihood calculation unit 74 in the DSP calculates the likelihood between the feature quantities around the cut point, input from the frequency feature calculation unit 73, and the reference data, and inputs the result to the cut point determination unit 81 in the CPU 8.

Instead of obtaining the likelihood by comparison with reference data as described above, the likelihood, that is, the probability that the signal is music, may be obtained by substituting the frequency feature quantity into a preset evaluation function. Comparison with the reference data stored in the external memory 11b is therefore not strictly required.

Next, the cut point determination unit 81 determines, based on the likelihood output, whether the audio signal at the cut point is music or speech. The determination result is stored in the temporary storage memory 11C in association with the time information and amount of change obtained from the cut point detection unit 72.

FIG. 4 shows the table in the temporary storage memory 11C in which these determination results are stored in association with the cut points.

The time length determination unit 83 relies on the empirical finding that a piece of music continues for at least a predetermined time, for example 100 seconds. When the interval between sampling points judged to be speech is less than 100 seconds, the section between them is not regarded as music, even if a sampling point within it was judged to be music. The unit therefore measures the interval between sampling points judged to be speech, that is, non-music, and determines intervals of 100 seconds or more to be music sections.
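A minimal sketch of this duration rule is shown below. The table layout (a list of `(time, label)` pairs), the label strings, and the function name `find_music_sections` are assumptions for illustration, as is the choice to use the speech-labelled points themselves as section boundaries.

```python
def find_music_sections(points, min_len=100.0):
    """points: (time_in_seconds, label) pairs sorted by time, where label is
    "music" or "speech". Returns (start, end) pairs for gaps between
    consecutive speech-labelled points that last at least min_len seconds."""
    speech_times = [t for t, label in points if label == "speech"]
    sections = []
    for a, b in zip(speech_times, speech_times[1:]):
        if b - a >= min_len:
            sections.append((a, b))
    return sections
```

With this rule, a point labelled "music" that lies between two speech points less than 100 seconds apart contributes no section, which is the kind of case the text describes for T6 in FIG. 5.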

The music section detection unit 82, which receives the determination output of the time length determination unit 83, rewrites the table in the temporary storage memory 11C into a table organized by piece of music (the final table).

FIG. 5 shows the final table as rewritten in the temporary storage memory 11C. In this final table, T6 was initially judged to be music, but because the interval between the sampling points T5 and T7 before and after it, both judged to be speech, was short, it was not regarded as music and has been deleted from the table.

When the recording operation ends, this final table is supplied via the music section detection unit 82 to the HDD interface unit 9 and stored on the HDD 10.

The final table is recorded on the HDD 10 with the intermediate cut points and their amounts of change retained, in addition to the start and end points of each piece of music. These are used to reproduce the chorus part during playback.

In response to an edit/playback operation, only the encoded data corresponding to the music sections specified in the final table is read sequentially from the HDD 10 and input to the MP3 codec 3. The MP3 codec 3 decodes the encoded data, which is then converted into an audio signal by the D/A converter 4 and output from the speaker 5. In this way, only the music can be detected in an audio signal that also contains conversation and the like, and the music can be extracted and played back.

The maximum point detection unit 84 detects the times at which the audio power is high within a music section. Specifically, it uses the audio power values calculated by the audio power calculation unit 71 and the time information of the music section extracted by the music section detection unit 82 to identify the times within the music section at which the audio power is high. It then records the times of several such points (for example, the ten time points at which the audio signal is loudest) on the HDD 10.
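Selecting the loudest time points within a music section could look like the sketch below. The data layout (`(time, power)` pairs) and the function name `loudest_times` are assumptions; the default of ten follows the example in the text.

```python
import heapq


def loudest_times(power_by_time, section, n=10):
    """power_by_time: (time, power) pairs. Returns the times of the n highest
    power values inside the (start, end) section, loudest first."""
    start, end = section
    inside = [(p, t) for t, p in power_by_time if start <= t <= end]
    return [t for p, t in heapq.nlargest(n, inside)]
```

Returning the times in descending order of power is convenient for the later step in which the loudest time is tried first and the second loudest is used as a fallback.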

Through the operations described so far, the following are recorded on the HDD 10: (1) the broadcast signal data encoded by the MP3 codec 3; (2) the start time, end time, and cut point times of each music section detected by the music section detection unit 82; and (3) the times at which the audio power is high within each music section detected by the music section detection unit 82.

In the above example, one piece of music data contains only about three cut points (FIGS. 4 and 5), but this is purely for the purpose of explanation. For the chorus section extraction described below, a larger number of cut points is preferable, so in practice the threshold is preferably set so that on the order of several tens of cut points are detected in one piece of music.

(Description of the process for extracting the chorus section from the music data portion)
Next, the process for extracting the chorus section from the music data portion is described.

The chorus section detection unit 85 first reads from the HDD 10 the cut point times within the music section and the times at which the audio power is high within the music section. Using this information, it checks whether the music section contains several passages with closely similar melodies (based on the observation that the chorus part usually appears more than once in a piece). Specifically, it estimates whether a section with a similar melody exists by checking whether there are sections whose combinations of cut point intervals are similar. If the existence of a section with a similar melody is detected, that section is detected as the chorus section.

The specific procedure performed by the chorus section detection unit 85 is described with reference to FIG. 6. First, the time at which the audio power is greatest within the music section is obtained from the HDD 10 (here this time is called $T_{\max}$). Next, the time information of several cut points around $T_{\max}$ is obtained from the HDD 10. In this example, information on the five cut points $c_4$, $c_5$, $c_6$, $c_7$, and $c_8$ is obtained. The music data between $c_4$ and $c_8$ ("section A" in FIG. 6) is then provisionally assumed to be the chorus candidate section. A section with high audio power is assumed to be the chorus candidate because, within music content, a passage with high audio power is likely to be the chorus part.

Next, the device checks whether a section similar to section A exists in the music content. Specifically, it first calculates the time intervals of the group of cut points in section A. Here, $I = \{i_1, i_2, i_3, i_4\}$ is calculated, where $i_n = T(c_{n+4}) - T(c_{n+3})$ and $T(c_n)$ is the time of cut point $c_n$.

The device then checks whether the music content contains a group of cut points whose combination of intervals is similar to $I$. In the example of FIG. 6, the values

$$D_n = (i_1 - i_n)^2 + (i_2 - i_{n+1})^2 + (i_3 - i_{n+2})^2 + (i_4 - i_{n+3})^2 \quad (n = 5, 6, 7, \ldots)$$

where $i_n = T(c_{n+4}) - T(c_{n+3})$, are calculated one after another. If a $D_n$ smaller than a predetermined value exists, the section from cut point $c_{n+3}$ to cut point $c_{n+7}$ is judged to be similar to section A. In the example of FIG. 6, the section from $c_{12}$ to $c_{16}$ ("section B") is judged to be similar to section A. When a section (section B) similar to the chorus candidate section is found in this way, the chorus candidate section (section A) is detected as the true chorus section.
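The interval comparison can be sketched as below: a squared-difference distance between the candidate's interval pattern and each later window of the same length. The function names, the 0-based indexing, and the default values of `span` and `max_dist` are assumptions for illustration; only later windows are searched, mirroring the range n = 5, 6, 7, ... in the text.

```python
def intervals(times):
    """Differences between consecutive cut point times."""
    return [b - a for a, b in zip(times, times[1:])]


def find_similar_sections(cut_times, cand_start, span=4, max_dist=1.0):
    """cut_times: sorted cut point times. The candidate section starts at cut
    index cand_start and covers `span` intervals. Returns (n, distance) for
    every later window whose squared interval distance is below max_dist,
    where n is the 0-based index of the first cut point of that window."""
    gaps = intervals(cut_times)
    ref = gaps[cand_start:cand_start + span]
    matches = []
    for n in range(cand_start + span, len(gaps) - span + 1):
        dist = sum((r - g) ** 2 for r, g in zip(ref, gaps[n:n + span]))
        if dist < max_dist:
            matches.append((n, dist))
    return matches
```

Because only interval patterns are compared, two passages match when their cut points are spaced alike, regardless of where in the piece they occur.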

Chorus section detection by the above procedure has the following advantages.
(1): The audio signal is often loud in the chorus part. Therefore, starting the search from the loud parts of the audio signal, as described above, detects the chorus part more efficiently than searching for similar sections in order from the beginning of the piece.
(2): The audio signal level may become high because noise was mixed in when the content was recorded. In that case, extracting the chorus part solely on the basis of the audio signal level risks extracting a part that is not actually the chorus. If, as described above, a section is identified as the chorus part only after confirming that another section with a similar combination of cut point intervals exists, the chorus part can be detected correctly even when noise is present.

In the above example, if no $D_n$ smaller than the predetermined value exists, it is determined that no section similar to section A exists, that is, that section A is not the chorus section. In this case, the chorus section detection unit 85 obtains from the HDD 10 the time at which the audio power is second largest within the music section, and then performs chorus section detection by the same procedure as described above.
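The overall loop, including the fallback to the next-loudest time, might be sketched as follows. How the candidate window is placed around the peak time (here, roughly centered using `bisect`), the function names, and the default parameters are assumptions; the text only says that several cut points around the peak are used.

```python
import bisect


def candidate_start(cut_times, t_peak, span):
    """Index of the first cut point of a window of `span` intervals placed
    roughly around t_peak (the centering rule is an assumption)."""
    pos = bisect.bisect_left(cut_times, t_peak)
    first = pos - span // 2
    return max(0, min(first, len(cut_times) - 1 - span))


def has_similar_later(gaps, first, span, max_dist):
    """True if some later window has a squared interval distance below max_dist."""
    ref = gaps[first:first + span]
    for n in range(first + span, len(gaps) - span + 1):
        dist = sum((r - g) ** 2 for r, g in zip(ref, gaps[n:n + span]))
        if dist < max_dist:
            return True
    return False


def extract_chorus(cut_times, peak_times, span=4, max_dist=1.0):
    """peak_times: times of high audio power, loudest first. Returns the
    (start, end) times of the first candidate that has a similar later
    section, or None if no candidate qualifies."""
    if len(cut_times) < span + 1:
        return None
    gaps = [b - a for a, b in zip(cut_times, cut_times[1:])]
    for t_peak in peak_times:
        first = candidate_start(cut_times, t_peak, span)
        if has_similar_later(gaps, first, span, max_dist):
            return cut_times[first], cut_times[first + span]
    return None
```

Passing the peak times in descending order of power reproduces the behavior described above: the loudest candidate is tested first, and the second loudest is tried only when no similar section is found.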

The chorus section extraction unit 86 extracts the chorus section from the music data decoded by the MP3 codec 3, based on information such as the time information of the chorus section detected by the chorus section detection unit 85. The audio signal of the chorus section extracted by the chorus section extraction unit 86 is output from a speaker or the like. Alternatively, it may be recorded on the HDD 10 as chorus section audio data.

By the above procedure, the chorus section can be extracted from the music data.

(Application to music recording/playback devices and music distribution systems)
Recording the chorus section audio data extracted by the above procedure on the HDD enables a variety of uses and applications.
(1): The user can efficiently search for a desired piece by playing back the chorus parts in sequence.
(2): A music recording/playback device can be realized in which pieces are listed in thumbnail form and clicking the icon of a piece plays back its chorus part.
(3): A music distribution system can be realized in which a piece selected by clicking an icon or the like can be purchased. Normally, a user who purchases a piece through a music distribution system has to pay a set price. A business model can be built in which a user who already holds the chorus part data can purchase the piece at a lower price than usual.

In this embodiment the functions are divided between the DSP 7 and the CPU 8, but the invention is not limited to this arrangement. Both sets of functions may be implemented by the CPU 8 alone, or everything, including the functions of the A/D converter 2, the MP3 codec 3, and the D/A converter 4, may be processed in software by the CPU 8.

Furthermore, the present invention is not limited to the above embodiment, and it goes without saying that various modifications, substitutions, and the like are possible within the scope of the claims.

For example, in the above embodiment a section with a similar combination of cut point intervals is extracted as the specific section, but a section with a similar distribution of cut point times may be extracted as the specific section instead.

In the above description, a time point at which the audio signal changes greatly is defined as a cut point, but a time point at which the magnitude of the audio signal exceeds a predetermined value may be used as a cut point instead.

When a piece of music has relatively little dynamic variation, the number of detected cut points may be very small. As a result, the cut-point-based determination of whether a similar section exists may not work. Specifically, there are two possibilities: (1) the cut points are too few for the similarity determination to be performed at all, and (2) because the cut points serving as samples are few, a chorus section is detected erroneously. In such cases, the predetermined threshold for recognizing a cut point may be set lower.

That is, processing such as the following may be performed: (1) when the number of cut points in the music data is smaller than a predetermined number, the threshold for recognizing a cut point is lowered; (2) when no chorus section can be detected, cut points are detected again with a lowered threshold, instead of taking the section before and after the second-largest audio power as the next chorus section candidate.
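Case (1) could be sketched as a loop that lowers the threshold until enough cut points are found. The reduction factor, the lower bound on the threshold, and the change-based detector used here are assumptions for illustration; the description does not specify how far or in what steps the threshold is lowered.

```python
def detect_cuts(levels, threshold):
    """Indices where the level changes from the previous frame by at least threshold."""
    return [i for i in range(1, len(levels))
            if abs(levels[i] - levels[i - 1]) >= threshold]


def detect_with_lowering(levels, threshold, min_cuts, factor=0.5, min_threshold=1.0):
    """Lower the threshold step by step while too few cut points are found.

    factor and min_threshold are hypothetical parameters: the threshold is
    multiplied by factor each round and is never lowered below min_threshold.
    Returns the cut points and the threshold that produced them.
    """
    cuts = detect_cuts(levels, threshold)
    while len(cuts) < min_cuts and threshold * factor >= min_threshold:
        threshold *= factor
        cuts = detect_cuts(levels, threshold)
    return cuts, threshold


# Hypothetical frame levels of a piece with little dynamic variation.
levels = [0, 2, 3, 7, 7, 9]
cuts, used_threshold = detect_with_lowering(levels, 8.0, min_cuts=3)
```

The same routine could serve case (2) by calling it again with a smaller starting threshold after a failed chorus search; the lower bound keeps the loop from terminating only when every frame has become a cut point.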

A configuration diagram of the recording/reproducing apparatus according to an embodiment of the present invention.
A functional block diagram of the recording/reproducing apparatus according to the present embodiment.
A waveform diagram for explaining the operation of the cut point detection unit 72.
A table held in the temporary storage memory 11C.
A diagram showing the final table after rewriting in the temporary storage memory 11C.
A diagram for explaining the process of identifying sections in which the combination of detected cut point intervals is similar.

Explanation of symbols

1 Tuner unit
2 A/D converter
3 MP3 codec
4 D/A converter
5 Speaker
6 Bus
7 DSP
8 CPU
9 HDD interface unit
10 HDD
11 Memory

Claims

A specific section extraction device comprising:
a cut point detection unit that detects, as a cut point, a time point at which the audio signal level of music data, or the amount of change in that level, is equal to or greater than a predetermined value; and
an extraction unit that sets, as a provisional specific section candidate, the section before and after the time point at which the audio signal level of the music data, or the amount of change in that level, is largest within the music data, determines, based on the time information of the cut points, whether a section similar to the specific section candidate exists in the music data, and, when such a similar section exists, extracts the specific section candidate as a specific section.

The specific section extraction device according to claim 1, wherein the extraction unit determines that a section similar to the specific section candidate exists in the music data, and extracts the specific section candidate as the specific section, when a combination of cut point intervals similar to the combination of cut point intervals in the specific section candidate exists in the music data, or when a distribution of cut point times similar to the distribution of cut point times in the specific section candidate exists in the music data.

The specific section extraction device according to claim 1 or 2, wherein the extraction unit:
when no section similar to the specific section candidate exists, sets, as a new provisional specific section candidate, the section before and after the time point at which the audio signal level of the music data, or the amount of change in that level, is second largest within the music data; and
when a section similar to that specific section candidate exists, extracts that specific section candidate as the specific section.

A music recording/reproducing apparatus comprising:
the specific section extraction device according to any one of claims 1 to 3; and
a recording unit that records, as specific section data, the music data of the specific section extracted by the extraction unit.

The music recording/reproducing apparatus according to claim 4, further comprising:
a display unit that displays a list of music information related to the specific section data recorded in the recording unit;
a selection unit for selecting a piece of music from the list displayed on the display unit; and
a reproducing unit that reproduces the specific section data of the music selected by the selection unit.