KR102447554B1

KR102447554B1 - Method and apparatus for identifying audio based on audio fingerprint matching

Info

Publication number: KR102447554B1
Application number: KR1020200154553A
Authority: KR
Inventors: 이정환; 방경식; 유정수
Original assignee: 주식회사 샵캐스트
Priority date: 2020-11-18
Filing date: 2020-11-18
Publication date: 2022-09-27
Also published as: KR20220067849A

Abstract

An audio fingerprint-based sound source recognition method is disclosed. The method includes: receiving a query sound source; normalizing the query sound source to a preset sampling rate; increasing the speed of the normalized query sound source by a factor of r; reducing the time-frequency histogram of the sped-up query sound source by a factor of 1/r; generating an audio fingerprint of the query sound source from the reduced histogram; and comparing the generated audio fingerprint with the audio fingerprints of pre-stored reference sound sources to obtain information about the query sound source.

Description

Sound source recognition method and device based on audio fingerprint matching {METHOD AND APPARATUS FOR IDENTIFYING AUDIO BASED ON AUDIO FINGERPRINT MATCHING}

The present disclosure relates to a method and apparatus for recognizing a sound source based on audio fingerprint matching and, more particularly, to a sound source recognition method and apparatus that can reduce the amount of computation required for audio fingerprint matching.

Many sound sources are used in TV and radio broadcast content, films, and the like, and the need for automatic sound source recognition is growing. For example, music copyright holders want evidence of how often their music is broadcast on radio or television so that they can determine how much in royalties they can claim.

Audio fingerprint comparison is used as a sound source recognition technique. An audio fingerprint is data that describes the characteristics of audio data; it reflects properties inherent to the signal itself and may be information representing features such as frequency and amplitude. Unlike text-based metadata, this feature information reflects the characteristics of the content signal itself.

Comparing audio fingerprints in this way is used, for example, to determine whether a sound source has been used without permission or to search for a sound source.

However, identifying a sound source requires comparing the fingerprint of the sound source in question against the fingerprints of millions of sound sources stored in a database, which imposes a considerable load in terms of both time and computation. There has therefore been a demand to speed up this task.

The present disclosure addresses this need, and its object is to provide a sound source recognition method and apparatus that can reduce the amount of computation required for audio fingerprint matching.

To solve the above problem, an audio fingerprint-based sound source recognition method according to an embodiment of the present disclosure includes: receiving a query sound source; normalizing the query sound source to a preset sampling rate; increasing the speed of the normalized query sound source by a factor of r; reducing the time-frequency histogram of the sped-up query sound source by a factor of 1/r; generating an audio fingerprint of the query sound source based on the reduced histogram; and comparing the generated audio fingerprint of the query sound source with the audio fingerprints of pre-stored reference sound sources to obtain information about the query sound source.

In this case, the increasing step may include downsampling along the time axis of the time-frequency histogram of the query sound source.

Meanwhile, the frequency range of the time-frequency histogram of the normalized query sound source is half the preset sampling rate, and the increasing step may increase the speed while maintaining the frequency range of that histogram.

In this case, the preset sampling rate may be 32 kHz, the frequency range of the time-frequency histogram of the normalized query sound source may be 16 kHz, and the frequency range of the histogram reduced in the reducing step may be 8 kHz.

Meanwhile, the audio fingerprints of the pre-stored reference sound sources may be generated in a manner corresponding to the generation of the audio fingerprint of the query sound source.

Meanwhile, a non-transitory computer-readable medium according to an embodiment of the present disclosure contains computer-executable instructions that, when executed by a processor, cause the processor to perform the steps of the method described above.

The problems to be solved by the present disclosure are not limited to those mentioned above, and problems not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention pertains from this specification and the accompanying drawings.

According to the sound source recognition method and apparatus described above, the number of matching operations in the comparison between audio fingerprints can be reduced, and as a result the sound source recognition speed can be improved.

The effects of the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those of ordinary skill in the art from the description below.

FIG. 1 is a diagram illustrating a sound source recognition system according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a method for increasing the speed of audio fingerprint matching;
FIG. 3 is a diagram illustrating the signal change when the OLA algorithm and the downsampling algorithm are applied to increase the speed of a sound source;
FIG. 4 is a diagram comparing the conventional method with the method according to the present disclosure; and
FIG. 5 is a flowchart illustrating a sound source recognition method according to an embodiment of the present disclosure.

Hereinafter, specific embodiments of the present disclosure are described in detail with reference to the drawings. The idea of the present disclosure is not, however, limited to the embodiments presented. A person skilled in the art who understands the idea of the present disclosure could readily propose other embodiments, whether regressive ones or ones falling within the scope of the idea of the present disclosure, by adding, changing, or deleting components within the scope of the same idea, and these too fall within the scope of the idea of the present disclosure.

Among the terms used in the present disclosure, terms defined in general dictionaries may be interpreted as having the same or similar meaning as they have in the context of the related art and, unless explicitly defined in the present disclosure, are not to be interpreted in an idealized or excessively formal sense. Where a term is not specifically defined, it may be interpreted on the basis of the overall content of this specification and common technical knowledge in the art.

The same reference numbers or symbols in the accompanying drawings denote parts or components that perform substantially the same function. For convenience of description and understanding, the same reference numbers or symbols are used across different embodiments. That is, even if components having the same reference number are shown in several drawings, those drawings do not necessarily represent a single embodiment.

In addition, terms including ordinal numbers such as "first" and "second" may be used in the present disclosure to distinguish between components. These ordinals are used to distinguish identical or similar components from one another, and their use should not be construed as limiting the meaning of the terms. For example, the order of use or arrangement of a component combined with such an ordinal should not be limited by its number. Where necessary, the ordinals may be used interchangeably.

In the present disclosure, a singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "include" or "consist of" are intended to specify the presence of a feature, number, step, operation, component, part, or combination thereof, and should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

FIG. 1 is a block diagram of a system for performing a sound source recognition method based on audio fingerprint matching according to an embodiment of the present disclosure.

Referring to FIG. 1, the system 1000 may include a content receiving unit 100, an audio fingerprint generation unit 200, a reference database 300, an audio fingerprint comparison unit 400, and a result output unit 500. Some of these components may be omitted, and other components apparent to those skilled in the art may be included.

First, the content receiving unit 100 may receive reference sound sources. The audio fingerprint generation unit 200 may then generate audio fingerprints from the reference sound sources and store them in the reference database 300 in advance. In this case, the reference database 300 may store information about each audio fingerprint, such as a song title and an artist name, in association with that fingerprint.

The content receiving unit 100 may also receive a query sound source (the sound source to be recognized) and pass it to the audio fingerprint generation unit 200, which may generate an audio fingerprint from it and pass that fingerprint to the audio fingerprint comparison unit 400.

The audio fingerprint comparison unit 400 may then compare the fingerprint of the query sound source with the fingerprints in the reference database 300 and search the reference database 300 for a fingerprint that matches the fingerprint of the query sound source. When a match is found, information about the sound source corresponding to the matched fingerprint may be sent to the result output unit 500.

For example, to identify the music playing in a broadcast, the audio fingerprint comparison unit 400 must search the millions of fingerprints stored in the reference database 300 for one that matches the fingerprint of the query sound source, which imposes a considerable load in terms of both time and computation. The present disclosure proposes a method for speeding up this search.

One way to speed up the search is to extract the fingerprint after shortening the sound source, that is, after increasing its playback speed, which reduces the amount of search computation. Specifically, if the speed of the sound source is increased by a factor of r (r > 1), the amount of matching computation falls to 1/r². This is explained with reference to FIG. 2.

Referring to FIG. 2, whether two sound sources match can be determined by computing a similarity image for them. Let the horizontal axis of the similarity image be the fingerprints generated from the reference sound source (r₁, r₂, ..., rₙ) and the vertical axis be the fingerprints generated from the query sound source (q₁, q₂, ..., qₘ). Regions of high similarity appear dark, and if a straight line segment connecting these dark regions is detected, that is, if there is a linearly related correspondence, the two sound sources are judged to match. This requires n × m similarity computations. If the speeds of the reference and query sound sources are both doubled, only n/2 × m/2 similarity computations are needed, so the number of matching operations falls to 1/4 and the computation becomes much faster.
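The similarity image of FIG. 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each fingerprint frame is a 32-bit integer compared by Hamming similarity, and the names `hamming_similarity`, `similarity_image`, `best_diagonal`, and the `threshold` parameter are hypothetical.

```python
def hamming_similarity(a, b, bits=32):
    """Fraction of matching bits between two fingerprint frames (assumed bit vectors)."""
    mask = (1 << bits) - 1
    differing = bin((a ^ b) & mask).count("1")
    return 1.0 - differing / bits


def similarity_image(reference, query, bits=32):
    """Rows are query frames q_j, columns are reference frames r_i (n x m comparisons)."""
    return [[hamming_similarity(q, r, bits) for r in reference] for q in query]


def best_diagonal(image, threshold=0.9):
    """Return (offset, run_length) of the longest diagonal run at or above threshold.

    A long diagonal run corresponds to the straight line segment that
    indicates a match between query and reference.
    """
    rows, cols = len(image), len(image[0])
    best = (0, 0)
    for offset in range(-(rows - 1), cols):
        run = 0
        for j in range(rows):
            i = j + offset
            if 0 <= i < cols and image[j][i] >= threshold:
                run += 1
                if run > best[1]:
                    best = (offset, run)
            else:
                run = 0
    return best
```

For a query cut from frames 3 to 6 of a reference, `best_diagonal` would report offset 3 with a run of 4. The image itself has `len(query) * len(reference)` entries, which is the n × m cost the text refers to.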

Meanwhile, algorithms for increasing the speed of a sound source can be broadly divided into methods that recombine samples and methods that process the signal frame by frame.

The representative sample recombination method is downsampling, and the representative frame-by-frame methods are the OLA (Overlap and Add) and SOLA (Synchronous Overlap and Add) algorithms. The basic procedure of SOLA is the same as that of OLA, but SOLA differs in that it determines the position at which frames are overlapped by comparing all candidate positions.
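For readers unfamiliar with OLA, a basic overlap-add speed-up can be sketched as below. This is a generic textbook-style sketch, not taken from the patent; the function name `ola_speed_up` and the `frame` and `synth_hop` parameters are assumptions chosen for illustration.

```python
import math


def ola_speed_up(samples, r, frame=256, synth_hop=128):
    """Speed up a signal by r using plain overlap-add (no SOLA alignment search).

    Frames are read every synth_hop * r samples and written every
    synth_hop samples, so the output is roughly 1/r as long.
    """
    analysis_hop = int(round(synth_hop * r))
    # Periodic Hann window: with 50% overlap the shifted windows sum to 1.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    n_frames = max(0, (len(samples) - frame) // analysis_hop + 1)
    if n_frames == 0:
        return []
    out = [0.0] * (synth_hop * (n_frames - 1) + frame)
    for k in range(n_frames):
        a = k * analysis_hop  # read position in the input
        s = k * synth_hop     # write position in the output
        for n in range(frame):
            out[s + n] += samples[a + n] * window[n]
    return out
```

SOLA would additionally search for the write position that best aligns each new frame with the output already produced, which is the difference the paragraph above describes.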

FIG. 3 is a diagram illustrating the signal change when the OLA algorithm and the downsampling algorithm are applied to increase speed.

Referring to FIG. 3, two identical sound sources have high similarity at the original speed. When OLA is applied, the speed changes while the spectrum is preserved, but from the standpoint of similarity comparison OLA inevitably alters the signal. It is therefore difficult to apply the OLA or SOLA algorithm to audio fingerprint extraction.

Meanwhile, (time-axis) downsampling expands the spectrum, so the frequency range available for feature extraction shrinks. A method that uses downsampling to increase the speed of the sound source while solving the problem of the reduced feature-extraction frequency range is described below with reference to FIG. 4.
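The spectrum expansion caused by time-axis downsampling can be checked with a small experiment. The sketch below assumes the simplest form of downsampling (keeping every r-th sample with no anti-aliasing filter, while still interpreting the result at the original sample rate); the helper names are hypothetical and the naive DFT is only meant for short test signals.

```python
import cmath
import math


def tone(freq_hz, sample_rate, n):
    """Generate n samples of a sine wave."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


def speed_up_by_decimation(samples, r):
    """Keep every r-th sample; played back at the same rate this is r times faster."""
    return samples[::r]


def dominant_frequency(samples, sample_rate):
    """Frequency of the strongest DFT bin (naive O(N^2) DFT, fine for short signals)."""
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):
        acc = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                  for t, s in enumerate(samples))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n
```

A 1 kHz tone sampled at 32 kHz shows up at 2 kHz after `speed_up_by_decimation(x, 2)`: every frequency is multiplied by r, so a fixed analysis band covers only 1/r of the original content.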

FIG. 4 is a diagram comparing the conventional method with the method according to the present disclosure. In the conventional method, the sample rate of the sound source is first normalized to a predetermined sampling frequency. Normalization can be performed by sample-rate conversion to that frequency. It is needed because input sound sources may use various sampling frequencies, such as 8 kHz, 11 kHz, 16 kHz, 22 kHz, and 44 kHz, and these must be brought to one specific rate. Sound sources are usually normalized to 16 kHz or 11.5 kHz, because higher sampling frequencies include high-frequency audio components that increase the amount of signal to be processed, which can slow down audio fingerprint generation.
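As an illustration of the normalization step, the sketch below converts a signal to a target rate by linear interpolation. This is an assumption made for brevity: the patent does not say which resampling method is used, and a production resampler would normally apply a proper low-pass filter. The name `resample_linear` is hypothetical.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample to dst_rate by linear interpolation (illustrative, no anti-alias filter)."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

With this helper, inputs at 8 kHz, 22 kHz, or 44.1 kHz can all be brought to one common rate before the histogram is computed.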

A fingerprint may then be generated based on the histogram corresponding to the normalized sound source.

In contrast, according to the present disclosure, the sound source is normalized to a rate r times higher than in the conventional method. FIG. 4 shows an example of normalizing to 32 kHz, twice the conventional 16 kHz. The waveform is then downsampled by a factor of 1/r, which increases the speed of the sound source by a factor of r. If the histogram is then reduced by a factor of 1/r, the amount of fingerprint data can be reduced to 1/r while the conventional fingerprint extraction frequency range is preserved.
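The relationship between the rates in this scheme can be written out as simple arithmetic. The sketch below only restates the numbers given in the text (a conventional 16 kHz rate and r = 2); the function name `rate_plan` and the dictionary keys are hypothetical.

```python
def rate_plan(base_rate_hz, r):
    """Rates and histogram ranges for normalizing at r times a conventional rate."""
    normalized = base_rate_hz * r
    return {
        # Sampling rate used for normalization in the proposed method.
        "normalized_rate_hz": normalized,
        # Histogram range is half the sampling rate and is kept while speeding up.
        "histogram_range_hz": normalized // 2,
        # Range after the histogram is reduced by 1/r.
        "reduced_range_hz": normalized // 2 // r,
        # Range the conventional method would have extracted features from.
        "conventional_range_hz": base_rate_hz // 2,
    }
```

For `rate_plan(16000, 2)` the reduced range equals the conventional range (8 kHz), which is the sense in which the extraction range is preserved.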

FIG. 5 is a flowchart illustrating a fingerprint generation method according to an embodiment of the present disclosure.

Referring to FIG. 5, a query sound source is first received (S510). The query sound source may be any sound source that is to be recognized, and it may be received from various external devices. Step S510 may be performed by the content receiving unit 100. The content receiving unit 100 may be an input device capable of receiving a sound source from an external device; for example, it may be implemented as a communication port of any of various standards, a wireless communication device, or the like.

Next, the query sound source is normalized to a preset sampling rate (S520). This sampling rate is higher than the sampling rate applied in normalization for fingerprint extraction in the prior art. For example, the sampling rate may be 32 kHz.

Next, the speed of the normalized query sound source is increased by a factor of r (S530). As a way of increasing the speed, a downsampling algorithm may be applied along the time axis of the time-frequency histogram of the normalized query sound source. The frequency range of the histogram is maintained even though the speed is increased. For example, if the sampling rate used for normalization is 32 kHz, the frequency range of the histogram is half that value, 16 kHz, and it remains 16 kHz after the speed is increased.

Next, the time-frequency histogram of the sped-up query sound source is reduced by a factor of 1/r (S540). In the example above, where the frequency range of the histogram is 16 kHz, this step reduces the frequency range to 8 kHz. In the reduced histogram, the feature extraction range for fingerprint generation is not smaller than for the original sound source, yet because the speed has been increased, the amount of data is reduced to 1/r.
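The text does not specify how the histogram is reduced in S540. One possible reading, sketched below purely as an assumption, is that the frequency axis is compressed by averaging each group of r adjacent bins, so that a 16 kHz range maps onto 8 kHz. The name `shrink_frequency_axis` and the `histogram[t][f]` layout (time frames of frequency bins) are hypothetical.

```python
def shrink_frequency_axis(histogram, r):
    """Compress the frequency axis by averaging groups of r adjacent bins.

    histogram[t][f] holds the value for time frame t and frequency bin f.
    Trailing bins that do not fill a complete group are dropped.
    """
    shrunk = []
    for frame in histogram:
        usable = len(frame) - len(frame) % r
        shrunk.append([sum(frame[f:f + r]) / r for f in range(0, usable, r)])
    return shrunk
```

Other interpolation schemes would serve the same purpose; the point is only that content spread over r times the bandwidth by the speed-up is mapped back to its original position on the frequency axis.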

Next, an audio fingerprint of the query sound source is generated based on the reduced histogram (S550). Steps S520 to S550 may be performed by the audio fingerprint generation unit 200.

Then, the generated audio fingerprint of the query sound source is compared with the audio fingerprints of pre-stored reference sound sources to obtain information about the query sound source (S560). The audio fingerprints of the reference sound sources may be stored in the reference database 300. They may likewise have been generated in the same way as the fingerprint of the query sound source, that is, through normalization, speed increase, and histogram reduction. The data volume of the reference fingerprints is therefore also reduced to 1/r, and as a result the amount of computation for fingerprint matching is reduced to 1/r².
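The 1/r² figure follows directly from the n × m comparison count used for the similarity image of FIG. 2, where n and m are the numbers of reference and query fingerprint frames. Writing the cost before and after the speed-up as C_orig and C_r (symbols introduced here only for illustration):

```latex
\begin{aligned}
C_{\text{orig}} &= n \cdot m, \\
C_{r} &= \frac{n}{r} \cdot \frac{m}{r} = \frac{n\,m}{r^{2}}, \\
\frac{C_{r}}{C_{\text{orig}}} &= \frac{1}{r^{2}}
\qquad \left(r = 2 \;\Rightarrow\; \tfrac{1}{4}\right).
\end{aligned}
```

The reduction is quadratic only because both sides are shortened; if the reference fingerprints were left at the original speed, the count would be n · m / r.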

The information obtained in step S560 may be output by the result output unit 500, in whatever form suits the purpose. For example, if a copyright association needs it to calculate royalties, it will be output in a form suited to that purpose; if a general user wants to know the title of a sound source, information such as the song title and artist name may be output; and if the purpose is to play back the original sound source corresponding to the query sound source, that sound source itself may be output. The result output unit 500 may be an output device such as a display or speaker, or it may be implemented as a communication device that transmits the information to another external output device.

According to the embodiments described above, the speed of sound source recognition based on fingerprint matching can be greatly improved compared with the prior art.

A term such as "unit" as used in this document refers to a component that performs at least one function or operation, and such a component may be implemented in hardware, in software, or in a combination of the two. In addition, except where each needs to be implemented as separate specific hardware, the "units" may be integrated into at least one module or chip and implemented by at least one processor.

The various embodiments described above may be implemented in software, hardware, or a combination thereof. In a hardware implementation, the embodiments described in the present disclosure may be implemented using at least one of ASICs (application specific integrated circuits), DSPs (digital signal processors), DSPDs (digital signal processing devices), PLDs (programmable logic devices), FPGAs (field programmable gate arrays), processors, controllers, micro-controllers, microprocessors, and other electrical units for performing functions.

Various embodiments of the present disclosure may be implemented as software including instructions stored in a machine-readable (for example, computer-readable) storage medium. The machine is a device that can call the stored instructions from the storage medium and operate according to the called instructions.

When such an instruction is executed by a processor, the processor may perform the function corresponding to the instruction either directly or by using other components under its control. The instructions may include code generated or executed by a compiler or interpreter.

The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" means only that the storage medium does not include a signal and is tangible; it does not distinguish between data being stored semi-permanently and data being stored temporarily on the storage medium.

According to an embodiment, the methods according to the various embodiments disclosed in this document may be provided as part of a computer program product. A computer program product may be traded between a seller and a buyer as a commodity. It may be distributed in the form of a machine-readable storage medium (for example, a compact disc read only memory (CD-ROM)) or online through an application store (for example, Play Store or App Store). In the case of online distribution, at least part of the computer program product may be at least temporarily stored, or temporarily generated, in a storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

Although preferred embodiments of the present disclosure have been illustrated and described above, the present disclosure is not limited to the specific embodiments described. Various modifications may of course be made by those of ordinary skill in the art to which the disclosure pertains without departing from the gist of the disclosure as claimed in the claims, and such modifications should not be understood separately from the technical idea or outlook of the present disclosure.

100: content receiving unit
200: audio fingerprint generation unit
300: reference database
400: audio fingerprint comparison unit
500: result output unit

Claims

A sound source recognition method performed by an audio fingerprint-based sound source recognition system, the method comprising:
receiving a query sound source;
normalizing the query sound source to a sampling rate set to r times a specific sampling rate;
increasing the speed of the normalized query sound source by a factor of r;
reducing the time-frequency histogram of the sped-up query sound source by a factor of 1/r;
generating an audio fingerprint of the query sound source based on the reduced histogram; and
comparing the generated audio fingerprint of the query sound source with the audio fingerprints of pre-stored reference sound sources to obtain information on the query sound source,
wherein the increasing step comprises downsampling with respect to the time axis of the time-frequency histogram of the query sound source,
wherein the frequency range of the time-frequency histogram of the normalized query sound source is half of the sampling rate set in the normalizing step, and
wherein the increasing step increases the speed while maintaining the frequency range of the time-frequency histogram of the normalized query sound source.

delete

The method of claim 1,
wherein the sampling rate set in the normalizing step is 32 kHz, the frequency range of the time-frequency histogram of the normalized query sound source is 16 kHz, and the frequency range of the histogram reduced in the reducing step is 8 kHz.

The method of claim 1,
wherein the audio fingerprints of the pre-stored reference sound sources are generated in a manner corresponding to the generation of the audio fingerprint of the query sound source.

A non-transitory computer-readable medium containing computer-executable instructions,
wherein the computer-executable instructions, when executed by a processor, cause the processor to perform the steps of the method as recited in claim 1, 4 or 5.