KR20050084039A

KR20050084039A - Summarizing digital audio data

Info

Publication number: KR20050084039A
Application number: KR1020057009715A
Authority: KR
Inventors: Changsheng Xu
Original assignee: Agency for Science, Technology and Research
Priority date: 2005-05-27
Filing date: 2002-11-28
Publication date: 2005-08-26

Abstract

An embodiment is related to automatic summarization for digital audio raw data (12), more specifically, for identifying pure music and vocal music (40,60) from digital audio data by extracting distinctive features from music frames (73,74,75,76), designing a classifier and determining the classification parameters (20) using adaptive learning/training algorithm (36), and identifying music into pure music or vocal music according to the classifier. For pure music, temporal, spectral and cepstral features are calculated to characterise the musical content, and an adaptive clustering method is used to structure the musical content according to calculated features. The summary (22,24,26,48,52,70,72) is created according to clustered result and domain-based music knowledge (50,150). For vocal music, voice related features are extracted and used to structure the musical content, and similarly, the music summary is created in terms of structured content and heuristic rules related to music genres.

Description

Digital audio data summarizing method {SUMMARIZING DIGITAL AUDIO DATA}

The present invention relates to data analysis, such as audio data indexing and classification. More specifically, it relates to a method of automatically summarizing raw digital music data for applications such as content-based music retrieval and web-based online music distribution.

With the rapid development of computer networks and multimedia technology, digital multimedia data collections have grown quickly. This growth has created a need for concise, informative summaries of vast multimedia collections that best capture the essential elements of the original content for large-scale information organisation and processing. To date, many techniques have been proposed and developed for automatically generating text, speech and video summaries. Music summarization, by contrast, aims to determine the most common and salient theme of a given piece of music, one that can serve as a representative of the music and be readily recognized by a listener. Compared with text, speech and video summarization, music summarization poses a particular challenge because raw digital music data is a featureless collection of bytes, available only as highly unstructured monolithic sound files.

United States Patent No. 6,225,548, issued to International Business Machines Corporation on 1 May 2001, relates to music summarization. It discloses a summarization system for the MIDI data format that uses the repetitive nature of Musical Instrument Digital Interface (MIDI) compositions to automatically recognize the main melody theme segment of a given piece of music. A detection engine uses algorithms that model melody recognition and music summarization as various string processing problems and solves them. The system recognizes the maximal-length segments that repeat non-trivially in each track of a musical piece in MIDI format. These segments are the basic units of a music composition and are candidates for the melody of the piece. However, MIDI format data is not sampled raw audio data, that is, actual audio sound. Instead, it contains synthesizer instructions, or MIDI notes, for reproducing the audio data. Specifically, a synthesizer generates the actual sound from the instructions in the MIDI data. Compared with actual audio sound, MIDI data may not provide a common playback experience or an unlimited sound palette for both instruments and sound effects. On the other hand, MIDI data is a structured format, which makes it easy to create a summary according to that structure. MIDI summarization is therefore not practical for real-time playback applications, and there is a need to create a music summary from real raw digital audio data.

The publication by Beth Logan and Stephen Chu entitled "Music Summarization Using Key Phrases" (IEEE International Conference on Audio, Speech and Signal Processing, Orlando, USA, 2000, vol. 2, pp. 749-752) discloses a method of summarizing music by parameterizing each song using "Mel-cepstral" features, which are known from speech recognition applications. Applying these speech recognition features together with various clustering techniques can discover the song structure of a piece of music with vocals. The discovered structure is then used to summarize the key phrases. This summarization method suits certain genres with vocals, such as rock or folk music, but is difficult to apply to pure music or instrumental genres such as classical or jazz. "Mel-cepstral" features may not uniquely reflect the characteristics of music content, especially pure music such as instrumental music. The summarization quality of this method is therefore unsatisfactory, particularly for applications that require music summarization across all music genres.

Accordingly, there is a need for automatic music summarization of raw digital music data that is applicable to music indexing of all music genres, for use in, for example, content-based music retrieval and web-based music distribution for real-time playback applications.

These and other features, objects and advantages of embodiments of the present invention will be better understood and more readily apparent to those skilled in the art from the detailed description below, taken in conjunction with the drawings.

FIG. 1 is a block diagram of a system used to generate an audio file summary according to one embodiment of the invention;

FIG. 2 is a flowchart illustrating a method of generating an audio file summary according to one embodiment of the invention;

FIG. 3 is a flowchart of a training process for producing the classification parameters of the classifier of FIGS. 1 and 2 according to one embodiment of the invention;

FIG. 4 is a more detailed flowchart of the pure music summarization of FIG. 2 according to one embodiment of the invention;

FIG. 5 is a block diagram of the vocal music summarization of FIG. 2 according to one embodiment of the invention;

FIG. 6 is a graph showing the segmentation of raw audio data into overlapping frames according to one embodiment of the invention; and

FIG. 7 shows a two-dimensional representation of the distance matrix of the frames of FIG. 6 according to one embodiment of the invention.

Embodiments of the present invention provide automatic summarization of digital audio data, such as raw musical data, which is inherently highly structured. One embodiment provides summaries for audio files such as pure and/or vocal music, for example classical, jazz, pop, rock or instrumental music. Another feature of an embodiment is the use of an adaptive training algorithm to design a classifier that distinguishes pure music from vocal music. A further feature of an embodiment is to create music summaries for pure and vocal music by structuring the music content with an adaptive clustering algorithm and applying domain-based music knowledge. One embodiment provides automatic summarization of raw digital audio data that identifies pure music and vocal music in the digital audio data by extracting distinctive features from music frames, determining classification parameters using an adaptive learning/training algorithm, and identifying the music as pure music or vocal music according to the classifier. For pure music, temporal, spectral and cepstral features are calculated to characterise the music content, and an adaptive clustering method is used to structure the music content according to the calculated features. The summary is created according to the clustered result and domain-based music knowledge. For vocal music, voice-related features are extracted and used to structure the music content, and the music summary is similarly created in terms of the structured content and heuristic rules related to music genres.

According to one aspect of the invention, there is provided a method of summarizing digital audio data, comprising: analyzing the audio data to identify a representation of the audio data having at least one calculated feature characteristic of the audio data; classifying the audio data, on the basis of the representation, into a category selected from at least two categories; and generating an acoustic signal representing a summarization of the digital audio data, wherein the summarization depends on the selected category.

In other implementations, the analyzing may further comprise segmenting the audio data into segment frames and overlapping the frames, and/or the classifying may further comprise classifying the frames into a category by collecting training data from each frame and determining classification parameters by using a training calculation.

According to another aspect of the invention, there is provided an apparatus for summarizing digital audio data, comprising: a feature extractor for receiving audio data and analyzing it to identify a representation of the audio data having at least one calculated feature characteristic of the audio data; a classifier, in communication with the feature extractor, for classifying the audio data into a category selected from at least two categories on the basis of the representation received from the feature extractor; and a summarizer, in communication with the classifier, for generating an acoustic signal representing a summarization of the digital audio data, wherein the summarization depends on the category selected by the classifier.

In other implementations, the apparatus may further comprise a segmenter, in communication with the feature extractor, for receiving an audio file, segmenting the audio data into segment frames, and overlapping the frames for the feature extractor. The apparatus may further comprise a classification parameter generator in communication with the classifier, wherein the classifier classifies each of the frames into a category by collecting training data from each frame, and the classification parameters are determined by using a training calculation in the classification parameter generator.

There is also provided a computer program product comprising a computer usable medium having computer readable program code means embodied in the medium for summarizing digital audio data, the computer program product comprising: computer readable program code means for analyzing the audio data to identify a representation of the audio data having at least one calculated feature characteristic of the audio data; computer readable program code for classifying the audio data, on the basis of the representation, into a category selected from at least two categories; and computer readable program code for generating an acoustic signal representing a summarization of the digital audio data, wherein the summarization depends on the selected category.

FIG. 1 is a block diagram illustrating components and/or modules of a system 100 used to generate an audio summary according to one embodiment of the invention. The system may receive an audio file, such as music content 12, at a segmenter 114. The music sequence 12 is segmented into frames, and features are extracted from each frame at a feature extractor 116. A classifier 118, based on classification parameters supplied by a classification parameter generator 120, classifies the feature-extracted frames into categories, for example a pure music sequence 140 or a vocal music sequence 160. Pure music is defined as music content without a singing voice, and vocal music as music content with a singing voice. The audio summary is created in one of the music summarizers 122, 124, each of which performs a summarization designed specifically for the category into which the audio content was classified by the classifier 118, and may be calculated with the aid of information on specific categories of audio content held in an audio knowledge module or look-up table 150. Although two summarizers are shown in FIG. 1, it will be appreciated that a single summarizer may suffice if, for example, all the audio files contain only one type of music content, such as pure music or vocal music. FIG. 1 shows two summarizers that may be implemented for two general types of music, namely a pure music summarizer 122 and a vocal music summarizer 124. The system then provides an audio sequence summary, such as a music summary 26.
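The routing described for FIG. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `summarize_audio`, the category labels `"pure"` and `"vocal"`, and the toy classifier and summarizers are hypothetical stand-ins, not components defined by the patent.

```python
from typing import Callable, Dict, List

Frames = List[List[float]]  # one feature vector per frame


def summarize_audio(
    frames: Frames,
    classify: Callable[[Frames], str],
    summarizers: Dict[str, Callable[[Frames], Frames]],
) -> Frames:
    """Route feature-extracted frames to the summarizer for their category."""
    category = classify(frames)            # role played by classifier 118
    return summarizers[category](frames)   # role played by summarizer 122 or 124


# Toy stand-ins (assumptions for illustration only).
def toy_classifier(frames: Frames) -> str:
    # Pretend the first feature of the first frame signals a singing voice.
    return "vocal" if frames and frames[0][0] > 0.5 else "pure"


toy_summarizers = {
    "pure": lambda frames: frames[:1],    # keep the first frame
    "vocal": lambda frames: frames[-1:],  # keep the last frame
}
```

Keeping the summarizers in a dictionary keyed by category mirrors the remark that any number of categories may be used: supporting a further category only requires another entry.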

The embodiment shown in FIG. 1 and the methods discussed herein may be implemented in and/or on computer architectures that are well known in the art. The functionality of the described embodiments may be implemented in either hardware or software. In the software sense, components of the system may be a process, a program or a portion thereof that usually performs a particular function or related functions. In the hardware sense, a component is a functional hardware unit designed for use with other components. For example, a component may be implemented using discrete electrical components, or may form part of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist, and those skilled in the art will appreciate that the system may also be implemented as a combination of hardware and software components.

Personal computers and servers are examples of computer architectures on which embodiments may be implemented. Such architectures comprise components and/or modules such as a central processing unit (CPU) with a microprocessor, random access memory (RAM) and read only memory (ROM) for temporary and permanent storage of information, and mass storage devices such as a hard drive, diskette or CD-ROM. They also contain a bus for interconnecting the components and for controlling information and communication between them. In addition, user input and output interfaces are usually provided, such as a keyboard, mouse or microphone for user input and a display, printer or speakers for output. Generally, each input/output interface is connected to the bus by a controller and implemented with controller software. Any number of input/output devices may of course be implemented in such systems. The computer system is typically controlled and managed by operating system software resident on the CPU. A number of operating systems are commonly available and well known. Embodiments of the invention may thus be implemented in and/or on such computer architectures.

FIG. 2 illustrates a block diagram of the components of a system and/or method 10 used to automatically create an audio summary, such as a music summary, according to one embodiment of the invention. The embodiment starts by receiving incoming audio data. The incoming audio data, such as an audio file 12, may comprise, for example, a music sequence or content. The music content is first segmented into frames at segmentation step 14. Then, at feature extraction step 16, features such as linear prediction coefficients, zero crossing rates and mel-frequency cepstral coefficients are extracted and calculated together to form a feature vector for each frame that represents the characteristics of the music content. The feature vector of each frame of the whole music sequence is passed through a classifier 18, which classifies the music into categories such as pure or vocal music. It will be appreciated that any number of categories may be used. The classification parameters 20 of the classifier 18 are determined by the training/classification process shown in FIG. 3. Once the audio is classified into audio categories, for example music categories such as pure music 40 or vocal music 60, each category is summarized, ending with the audio summary 26. For example, the pure music summarization step 22 is shown in detail in FIG. 4. Likewise, the vocal music summarization step 24 is shown in detail in FIG. 5.
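As a concrete illustration of segmentation step 14 and one of the features named for step 16, the sketch below splits a sample sequence into fixed-length overlapping frames and computes a zero crossing rate per frame. The function names, the hop-size rule and the sign convention (a sample of exactly zero counts as non-negative) are assumptions made for this example; linear prediction and mel-frequency cepstral coefficients are omitted for brevity.

```python
from typing import List, Sequence


def segment(samples: Sequence[float], frame_len: int, overlap: float = 0.5) -> List[List[float]]:
    """Split samples into fixed-length frames; `overlap` is the shared fraction."""
    hop = max(1, int(frame_len * (1.0 - overlap)))  # step between frame starts
    return [
        list(samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# Usage: an alternating signal crosses zero between every pair of samples.
frames = segment([1, -1, 1, -1, 1, -1, 1, -1], frame_len=4, overlap=0.5)
rates = [zero_crossing_rate(f) for f in frames]
```

In a fuller feature extractor, each frame's zero crossing rate would be concatenated with the other coefficients to form that frame's feature vector.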

FIG. 3 illustrates a conceptual diagram of a training/classification parameter process 38 of one embodiment for producing the classification parameters 20 of the classifier 18 (shown in FIG. 2). The classifier 18 is provided in order to identify the music content as belonging to different categories, such as pure music or vocal music. The classification parameters 20 for the classifier 18 are determined by the training process 38. The training process analyzes musical training sample data to find an optimal way of classifying musical frames into classes such as vocal 60 or non-vocal 40. The training audio 30 should be sufficient to be statistically significant; for example, the training data should originate from various sources and include various genres of music. The training sample audio data may also be segmented into fixed-length, overlapping frames as discussed for the segmentation of FIG. 2. Features such as linear prediction coefficients, zero crossing rates and mel-frequency cepstral coefficients are extracted from each frame. The features chosen for each frame are those that best characterise a classification; for example, the features chosen for the vocal class are those that best characterise vocal content. The calculated features are clustered by a training algorithm 36, such as a hidden Markov model, a neural network or a support vector machine, to produce the classification parameters 20. Any such training algorithm may be used, although some may be better suited to a particular application. For example, the support vector machine training algorithm may produce good classification results, but its training time is long compared with other training algorithms. The training process needs to be performed only once, but may be performed any number of times. The derived classification parameters are used to identify the different classifications of audio content, for example non-vocal (pure) music and vocal music.
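The idea of deriving classification parameters from labelled training frames can be shown with a deliberately simple stand-in. The sketch below is not a hidden Markov model, neural network or support vector machine; it swaps in a nearest-centroid rule so that the example stays self-contained, and every name in it (`train_centroids`, `classify`, the labels) is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def train_centroids(labelled: Sequence[Tuple[Vector, str]]) -> Dict[str, Vector]:
    """Return one mean feature vector per label (the 'classification parameters')."""
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = {}
    for vec, label in labelled:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, x in enumerate(vec):
            acc[k] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def classify(vec: Vector, centroids: Dict[str, Vector]) -> str:
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sqdist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sqdist(vec, centroids[label]))


# Usage with made-up two-dimensional feature vectors.
training = [
    ([0.1, 0.2], "pure"), ([0.2, 0.1], "pure"),
    ([0.8, 0.9], "vocal"), ([0.9, 0.8], "vocal"),
]
params = train_centroids(training)
```

As in the text, training runs once to produce the parameters, which are then reused for every frame that needs a label.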

FIG. 4 illustrates a conceptual block diagram of an embodiment of pure music summarization, and FIG. 5 one of vocal music summarization. The aim of the summarization is to analyze given audio data, such as a music sequence, and extract the important frames that reflect the salient theme of the music. Based on the calculated features of each frame, an adaptive clustering method is used to group the music frames and obtain the structure of the music content. Since adjacent frames overlap, the length of the overlap must be determined for frame grouping. At the initial stage it is difficult to determine the overlap length exactly; it can be adjusted adaptively if the clustering result is not ideal for frame grouping. An example of the general clustering algorithm is described below:

(1) As shown in FIG. 6, at segmenter 114 or segmentation steps 42, 62, segment the music signal into N fixed-length frames 73, 74, 75, 76 with, for example, 50% overlap 77, 78, 79, and label each frame with a number i (i = 1, 2, ..., N); the initial set of clusters is all the frames. The segmentation process at steps 42, 62 may also follow the same procedure as the segmentation performed elsewhere, such as segmentation steps 14, 32 described above and shown in FIGS. 2 and 3;

(2) For each frame, calculate the features at feature extraction steps 44, 64 specific to the particular category of audio file, for example the linear prediction coefficients, zero crossing rates and mel-frequency cepstral coefficients, to form a feature vector:

where LPC_i denotes the linear prediction coefficients, ZCR_i the zero crossing rates, and MFCC_i the mel-frequency cepstral coefficients.
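One way to write the feature vector of step (2) that is consistent with these definitions is given below. The symbol V_i for the feature vector of frame i and the equation number are assumptions, since the formula itself is not present in this text.

$$
V_i = (\mathrm{LPC}_i,\ \mathrm{ZCR}_i,\ \mathrm{MFCC}_i), \qquad i = 1, 2, \ldots, N \tag{1}
$$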

(3) Calculate the distances between every pair of music frames i and j using, for example, the Mahalanobis distance:

where R is the covariance matrix of the feature vector. R^-1 is symmetric and positive semi-definite or positive definite, so it can be diagonalized as R^-1 = P^T Λ P, where Λ is a diagonal matrix and P is an orthogonal matrix. Equation (2) can then be simplified in terms of the Euclidean distance as follows:

Since Λ and P can be computed directly from R^-1, the complexity of computing the vector distance can be reduced from O(n²) to O(n).
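The complexity claim can be followed with a short derivation. The formulas below are a sketch under assumptions rather than the patent's own equations: V_i is an assumed symbol for the n-dimensional feature vector of frame i, and the standard squared form of the Mahalanobis distance is used.

$$
D_M(V_i, V_j) = (V_i - V_j)^{T} R^{-1} (V_i - V_j), \qquad i \neq j
$$

Substituting $R^{-1} = P^{T} \Lambda P$ and writing $y_i = P V_i$ for the transformed vectors gives

$$
D_M(V_i, V_j) = \bigl[P (V_i - V_j)\bigr]^{T} \Lambda \bigl[P (V_i - V_j)\bigr] = \sum_{k=1}^{n} \lambda_k \,(y_{i,k} - y_{j,k})^{2}
$$

where $\lambda_k$ is the k-th diagonal entry of $\Lambda$. Evaluating the full quadratic form needs on the order of $n^2$ multiplications per pair of frames, whereas the weighted Euclidean sum needs on the order of $n$ once each feature vector has been transformed by $P$ a single time.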

(4) Embed the calculated distances in a two-dimensional representation 80, as shown in FIG. 7. The matrix S 80 contains the similarity metric calculated for all frame combinations, indexed by frames i and j such that the (i, j)th element of S is D(i, j).

(5) For each row of the two-dimensional matrix S, if the distance between any two frames is less than a pre-defined threshold, for example 1.0 in this embodiment, group those frames into the same cluster.

(6) If the final clustering result is not ideal, adjust the overlap length of the frames and repeat steps (2) to (5), as shown by arrow 45 in FIG. 4 and arrow 65 in FIG. 5. In this embodiment, for example, an ideal result means that the number of clusters after clustering is much smaller than the initial number of clusters. If the result is not ideal, the overlap may be adjusted, for example by changing the overlap length from 50% to 40%.
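Steps (3) to (5) can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `pairwise_mahalanobis` and `threshold_clusters` are hypothetical, the pseudo-inverse is used so that a singular covariance matrix does not fail, D(i, j) is taken as the quadratic form without a square root, and "grouped into the same cluster" is interpreted as merging any two frames whose distance falls below the threshold (connected components).

```python
import numpy as np

def pairwise_mahalanobis(features):
    """Return the matrix S with S[i, j] = D(i, j) for a (num_frames, dim) array, dim >= 2."""
    R = np.atleast_2d(np.cov(features, rowvar=False))  # covariance of the feature vectors
    R_inv = np.linalg.pinv(R)                          # assumption: pseudo-inverse for robustness
    # Diagonalize the symmetric matrix: R_inv = eigvecs @ diag(eigvals) @ eigvecs.T
    eigvals, eigvecs = np.linalg.eigh(R_inv)
    eigvals = np.clip(eigvals, 0.0, None)              # remove tiny negative round-off
    # Transform every vector once; each pairwise distance is then a plain
    # squared Euclidean distance costing O(dim) instead of O(dim^2).
    transformed = (features @ eigvecs) * np.sqrt(eigvals)
    diff = transformed[:, None, :] - transformed[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def threshold_clusters(S, threshold=1.0):
    """Group frames whose pairwise distance is below the threshold (union-find)."""
    n = S.shape[0]
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if S[i, j] < threshold:
                parent[find(j)] = find(i)

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

The one-off transformation by P and the square root of Λ is what makes each subsequent pairwise distance linear in the feature dimension, which is one reading of the O(n²) to O(n) reduction stated in step (3).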

With reference to clustering for the specific categories, FIG. 4 shows the summarization process for pure (non-vocal) music, and FIG. 5 shows the summarization process for vocal music. In FIG. 4, the pure music content (40) is first segmented (42), for example into the fixed-length, overlapping frames described above, and feature extraction (44) is then performed on each frame as described above. The extracted features may include the amplitude envelope, power spectrum, mel-frequency cepstral coefficients, and the like, which may characterize the pure music content in the temporal, spectral, and cepstral domains. It will be appreciated that other features may be extracted to characterize pure music content, and that the features are not limited to those listed here. Based on the calculated features, an adaptive clustering algorithm (46) is applied to group the frames and obtain the structure of the music content. The segmentation and adaptive clustering algorithm may be the same as described above. For example, if the clustering result is not ideal at the decision step (47, 69) after the first pass, the segmentation step (42, 62) and the feature extraction step (44, 64) are repeated with frames having a different overlap relationship. This process is repeated at the querying step (47, 69), as shown by arrows 45 and 65, until the desired clustering result is achieved. After clustering, frames with similar features are grouped into the same clusters, which represent the structure of the music content. Summary generation (48) is then performed based on this structure and on domain-based music knowledge (50). According to music knowledge, the most distinctive or expressive musical themes should recur repeatedly throughout the whole music work.
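The repeat-until-acceptable loop described above can be sketched as follows. This is only an illustration: the names `segment` and `adaptive_clustering`, the schedule of overlap values, and the `max_ratio` stopping criterion (a stand-in for "the number of clusters is much smaller than the initial number") are all assumptions, and the feature extractor and clustering routine are passed in as callables rather than fixed.

```python
import numpy as np

def segment(signal, frame_len, overlap):
    """Split a 1-D signal into fixed-length frames with the given fractional overlap."""
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.stack([signal[s:s + frame_len] for s in starts])

def adaptive_clustering(signal, frame_len, extract, cluster,
                        overlaps=(0.5, 0.4, 0.3, 0.2), max_ratio=0.25):
    """Re-segment with a different overlap until the clustering looks acceptable.

    extract: callable mapping frames -> per-frame features (hypothetical)
    cluster: callable mapping features -> one cluster label per frame (hypothetical)
    """
    frames = labels = overlap = None
    for overlap in overlaps:
        frames = segment(signal, frame_len, overlap)
        labels = cluster(extract(frames))
        # Assumed acceptance test: far fewer clusters than frames.
        if len(set(labels)) <= max_ratio * len(frames):
            break
    return frames, labels, overlap
```

If no overlap in the schedule satisfies the criterion, the sketch simply returns the result of the last attempt; a real system would need its own policy for that case.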

The summary (52) should be long enough to represent the most distinctive or expressive passage of the whole piece of music. Typically, for a three- to four-minute piece of music, 30 seconds is an appropriate summary length. An example of generating the summary of a music work is as follows:

(1) Identify the cluster containing the largest number of frames. The labels of these frames are f_1, f_2, ..., f_n, where f_1 < f_2 < ... < f_n;

(2) From these frames, select the frame with the smallest label f_i that satisfies the following rule:

for m = 1 to k,

frame (f_i + m) and frame (f_j + m) belong to the same cluster, where i, j ∈ [1, n], i < j, and k is a number that determines the length of the summary;

(3) Frames (f_i + 1), (f_i + 2), ..., (f_i + k) form the final summary of the music.
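The three steps above can be sketched as follows. This is one possible reading of the rule, not a definitive implementation: `select_summary` is a hypothetical name, frames are identified by 0-based index, `labels[t]` is the cluster assigned to frame t, and the rule is taken to mean that some later frame f_j of the largest cluster must be followed by the same cluster sequence as f_i for all k steps.

```python
from collections import Counter

def select_summary(labels, k):
    """Return the frame indices f_i+1 .. f_i+k of the summary, or None if no match exists."""
    # Step (1): the cluster with the most frames, and its members in ascending order.
    largest = Counter(labels).most_common(1)[0][0]
    members = [t for t, lab in enumerate(labels) if lab == largest]
    n_frames = len(labels)

    # Step (2): smallest f_i for which some later f_j is followed by the same clusters.
    for a, fi in enumerate(members):
        for fj in members[a + 1:]:
            if fj + k >= n_frames:
                break  # later members would also run past the end of the music
            if all(labels[fi + m] == labels[fj + m] for m in range(1, k + 1)):
                # Step (3): the k frames after f_i form the summary.
                return list(range(fi + 1, fi + k + 1))
    return None
```

For example, with `labels = [0, 1, 2, 0, 1, 2, 3, 0]` and `k = 2`, frames 0 and 3 both belong to the largest cluster and are each followed by clusters 1 and 2, so the sketch returns frames 1 and 2.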

FIG. 5 illustrates a conceptual block diagram of vocal music summarization according to one embodiment. The vocal music content (60) is first segmented (62) into fixed-length, overlapping frames, which may be done in the same manner as described above. Feature extraction (64) is then performed on each frame. The extracted features include linear prediction coefficients, zero crossing rate, mel-frequency cepstral coefficients, and the like, which can characterize vocal music content. As described above for non-vocal music, other features may of course be extracted to characterize vocal music content, and the features are not limited to those listed here. Based on the calculated features, the vocal frames (66) are located and the non-vocal frames are discarded. An adaptive clustering algorithm (68) is applied to group these vocal frames and obtain the structure of the vocal music content. The segmentation and adaptive clustering algorithm may be the same as above; for example, if the clustering result is not ideal, the segmentation step (62) and the feature extraction step (64) are repeated with frames having a different overlap relationship. The process is repeated, as shown by decision step 69 and branch 65 in FIG. 5, until the desired clustering result is achieved. Finally, the music summary (70) is generated based on the clustered results and on music knowledge (50) related to vocal music.

The summarization process (72) for vocal music is similar to that for pure music, but there are several differences, which may be stored as music knowledge (50), for example in the music knowledge module or look-up table (150) of FIG. 1. The first difference is feature extraction. For pure music, power-related features such as the amplitude envelope and power spectrum are used because they may better represent the characteristics of pure music content. The amplitude envelope is calculated in the time domain, while the spectral power is calculated in the frequency domain. For vocal music, voice-related features such as linear prediction coefficients, zero crossing rate, and mel-frequency cepstral coefficients are used because they may better represent the characteristics of vocal music content.
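Three of the per-frame features named above can be sketched with NumPy as follows. The definitions are common textbook choices and are assumptions here, since the text does not fix them: the amplitude envelope is taken as the peak absolute sample of the frame, the power spectrum as the squared magnitude of a Hann-windowed real FFT, and the zero crossing rate as the fraction of adjacent sample pairs whose sign differs. Linear prediction and mel-frequency cepstral coefficients are omitted to keep the sketch short.

```python
import numpy as np

def amplitude_envelope(frame):
    """Time-domain feature: peak absolute amplitude within the frame."""
    return float(np.max(np.abs(frame)))

def power_spectrum(frame):
    """Frequency-domain feature: squared magnitude of the windowed real FFT."""
    windowed = frame * np.hanning(len(frame))  # Hann window reduces spectral leakage
    return np.abs(np.fft.rfft(windowed)) ** 2

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))
```

A frame of pure music would then be described by the first two values, while a vocal frame would use the zero crossing rate together with the omitted voice-related coefficients.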

Another difference between the pure music and vocal music summarization processes is summary generation. For pure music, the summary is still pure music. For vocal music, however, the summary should start with a vocal part, and it is desirable for the music title to be sung in the summary. There are also other rules related to music genre, which may be stored as music knowledge (50). For example, in pop and rock music, the main melody part is repeated in the same way without major variation. Pop and rock music typically follow a similar scheme or pattern, for example the ABAB format, where A represents a verse and B represents a refrain. The main theme (refrain) part occurs most frequently, followed by the verse, bridge, and so on. Jazz music, however, is usually improvised by the musicians, so most of the parts vary, which makes it difficult to determine the main melody part. Because jazz music usually has no refrain, the main part of jazz music is the verse.

In essence, an embodiment of the present invention stems from the realization that a representation of music information that includes characteristic relative difference values provides a relatively concise and characteristic means of representing, indexing, and/or retrieving music information. It has also been found that these relative difference values provide a relatively uncomplicated structural representation of unstructured monolithic musical raw digital data.

The foregoing has disclosed a method, a system, and a computer program product for providing summarization of digital audio raw data. Only some embodiments have been described. However, it will be apparent to those skilled in the art, in view of this specification, that various changes and/or modifications may be made without departing from the scope of the present invention.

Claims

A method of summarizing digital audio data, comprising:

analyzing the audio data to identify a representation of the audio data having at least one calculated feature characteristic of the audio data;

classifying the audio data, based on the representation, into a predetermined category selected from two or more categories; and

generating an acoustic signal representing a summarization of the digital audio data,

wherein the summarization depends on the selected predetermined category.

The method of claim 1,

wherein the analyzing further comprises segmenting the audio data into segment frames and overlapping the frames.

The method of claim 2,

wherein the classifying further comprises classifying the frames into a predetermined category by collecting training data from each frame, and determining classification parameters by using a training calculation.

The method according to any one of claims 1 to 3,

wherein the calculated feature comprises perceptual and subjective features related to music content.

The method of claim 3,

wherein the training calculation comprises a statistical learning algorithm, the statistical learning algorithm being a Hidden Markov Model, a Neural Network, or a Support Vector Machine.

The method according to any one of claims 1 to 5,

wherein the type of the acoustic signal is music.

The method according to any one of claims 1 to 6,

wherein the type of the acoustic signal is vocal music or pure music.

The method according to any one of claims 1 to 7,

wherein the calculated feature is an amplitude envelope, a power spectrum, or a mel-frequency cepstral coefficient (MFCC).

The method according to any one of claims 1 to 8,

wherein the summarization is generated based on heuristic rules and clustered results related to pure or vocal music.

The method according to any one of claims 1 to 9,

wherein the calculated feature is associated with pure or vocal music content and is a linear prediction coefficient, a zero crossing rate, or an MFCC.

An apparatus for summarizing digital audio data, comprising:

a feature extractor for receiving audio data and analyzing the audio data to identify a representation of the audio data having one or more calculated feature characteristics of the audio data;

a classifier in communication with the feature extractor, for classifying the audio data, based on the representation received from the feature extractor, into a predetermined category selected from two or more categories; and

a summarizer in communication with the classifier, for generating an acoustic signal representing a summarization of the digital audio data,

wherein the summarization depends on the category selected by the classifier.

The apparatus of claim 11,

further comprising a segmenter in communication with the feature extractor, the segmenter receiving an audio file, segmenting the audio data into segment frames, and overlapping the frames for the feature extractor.

The apparatus of claim 12,

further comprising a classification parameter generator in communication with the classifier, wherein the classifier classifies each of the frames into a predetermined category by collecting training data from each frame, and the classification parameters are determined by using a training calculation in the classification parameter generator.

The apparatus according to any one of claims 11 to 13,

wherein the calculated feature comprises perceptual and subjective features related to music content.

The apparatus according to any one of claims 11 to 14,

wherein the training calculation comprises a statistical learning algorithm, the statistical learning algorithm being a Hidden Markov Model, a Neural Network, or a Support Vector Machine.

The apparatus according to any one of claims 11 to 15,

wherein the acoustic signal is music.

The apparatus according to any one of claims 11 to 16,

wherein the acoustic signal is vocal music or pure music.

The apparatus according to any one of claims 11 to 17,

wherein the calculated feature is an amplitude envelope, a power spectrum, or an MFCC.

The apparatus according to any one of claims 11 to 18,

wherein the summarizer generates the summarization based on heuristic rules and clustered results related to pure or vocal music.

The apparatus according to any one of claims 11 to 19,

wherein the calculated feature is associated with pure or vocal music content and is a linear prediction coefficient, a zero crossing rate, or a mel-frequency cepstral coefficient.

A computer program product for summarizing digital audio data, comprising a computer usable medium having computer readable program code means embodied in the medium for causing the summarization of digital audio data, the computer program product comprising:

computer readable program code means for analyzing the audio data and identifying a representation of the audio data having one or more calculated feature characteristics of the audio data;

computer readable program code means for classifying the audio data, based on the representation, into a predetermined category selected from two or more categories; and

computer readable program code means for generating an acoustic signal representing a summarization of the digital audio data,

wherein the summarization depends on the selected category.

The computer program product of claim 21,

wherein the analyzing further comprises segmenting the audio data into segment frames and overlapping the frames.

The computer program product of claim 22,

wherein the classifying comprises classifying the frames into a predetermined category by collecting training data from each frame, and determining classification parameters by using a training calculation.

The computer program product according to any one of claims 21 to 23, wherein

The computer program product according to any one of claims 21 to 24,

wherein the training calculation comprises a statistical learning algorithm, the statistical learning algorithm being a Hidden Markov Model, a Neural Network, or a Support Vector Machine.

The computer program product according to any one of claims 21 to 25,

wherein the acoustic signal is music.

The computer program product according to any one of claims 21 to 26,

wherein the type of the acoustic signal is vocal music or pure music.

The computer program product according to any one of claims 21 to 27,

wherein the calculated feature is an amplitude envelope, a power spectrum, or an MFCC.

The computer program product according to any one of claims 21 to 28, wherein

the summarization is generated based on heuristic rules and clustered results related to pure or vocal music.

The computer program product according to any one of claims 21 to 29, wherein