KR20130061504A - Method for searching content using audio features - Google Patents

Method for searching content using audio features Download PDF

Info

Publication number
KR20130061504A
KR20130061504A KR1020110127848A
Authority
KR
South Korea
Prior art keywords
content
audio
audio signal
sample
feature
Prior art date
Application number
KR1020110127848A
Other languages
Korean (ko)
Inventor
정혁
오원근
제성관
나상일
이근동
Original Assignee
한국전자통신연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국전자통신연구원 filed Critical 한국전자통신연구원
Priority to KR1020110127848A priority Critical patent/KR20130061504A/en
Publication of KR20130061504A publication Critical patent/KR20130061504A/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Analysis (AREA)
  • Multimedia (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Mathematical Optimization (AREA)
  • Signal Processing (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PURPOSE: A content search method using audio features is provided to search content using indexed audio features, thereby quickly finding the specific content containing a content sample in a mass content database. CONSTITUTION: A pre-processing part converts an audio signal of an input content sample into a mono signal (S102, S103). A filter part extracts only the audio signal of a specific frequency band from the converted audio signal (S104). A position extraction part outputs the location of the acoustic power maximum value in the extracted audio signal (S105). A feature extraction part extracts a feature value from at least two specific sections based on the maximum value location (S106). A search part searches a database for content including a feature value matching the extracted feature value and outputs that content (S107). [Reference numerals] (AA) Start; (BB) No; (CC) Yes; (DD) End; (S101) Input content sample; (S102) Is the content sample audio content?; (S102') Extract audio signal; (S103) Convert audio signal to mono signal; (S104) Extract specific frequency band; (S105) Extract the maximum value location within the specific frequency band; (S106) Extract feature value in the specific section based on the extracted location; (S107) Search and output media using the feature value

Description

Method for searching content using audio features

The present invention relates to a content retrieval method and, more particularly, to a content retrieval method that extracts an audio feature using signal magnitude information of a specific frequency band of the audio signal included in content, and searches for the content using the extracted audio feature.

When a user has only part of a piece of content among the countless audio and video items on the Internet, a technique is needed for finding the content that contains that part. In general, a video includes an audio signal synchronized with the video signal. Since audio features are easier to compute and smaller in size than video features, the audio signal is commonly used as the means for searching for the video.

To search for content using audio features, the features must be robust to audio signal transformations such as resampling, lossy compression (e.g., MP3), and equalization, and they must allow real-time search through a simple process.

Conventionally, audio features are extracted using the spectral flatness of each subband of the audio signal, and audio is searched using distortion discriminant analysis (DDA). However, this conventional search method is not robust to distortions applied to the audio signal, and searching for an audio file takes a long time.

The present invention was devised to solve the above problems. An object of the present invention is to provide a content retrieval method using audio features that indexes the audio feature at a specific position of the audio signal in content and performs content retrieval using the indexed audio feature.

It is also an object of the present invention to provide a content retrieval method using audio features that is robust to distortion by placing the feature values least sensitive to distortion in the most significant bits.

To this end, a content retrieval method using audio features according to the present invention comprises: receiving a content sample; pre-processing the audio signal of the content sample; obtaining the sound power of a specific frequency band of the audio signal; extracting a reference position in the specific frequency band based on the sound power; extracting an audio feature in at least two sections based on the reference position; and retrieving content including the content sample from a database using the audio feature.

The content retrieval method using audio features according to the present invention indexes the audio feature at a specific position of the audio signal in content and performs content retrieval using the indexed audio feature, so that the specific content containing a content sample can be found quickly in a mass content database.

In addition, when retrieving content from a large database, the content retrieval method according to the present invention does not use features of the entire content; it uses a minimal set of audio features at specific positions of the audio signal in the content, which improves search efficiency.

FIG. 1 is a diagram showing the configuration of a content retrieval apparatus using audio features according to the present invention;
FIG. 2 is a diagram illustrating the configuration of the filter unit shown in FIG. 1;
FIG. 3 is a conceptual diagram illustrating the maximum value extraction section and the movement interval of the feature extractor; and
FIG. 4 is a flowchart illustrating a content retrieval method using audio features according to the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The configuration of the present invention and the operation and effect thereof will be clearly understood through the following detailed description.

Prior to the detailed description of the present invention, the same components are denoted by the same reference numerals even when shown in different drawings, and detailed descriptions of well-known configurations are omitted when they may obscure the gist of the present invention.

FIG. 1 is a block diagram of a content retrieval apparatus using an audio feature according to an embodiment of the present invention, FIG. 2 is a block diagram of the filter unit shown in FIG. 1, and FIG. 3 is a conceptual diagram illustrating the maximum value extraction section and the movement interval of the feature extractor.

Referring to FIG. 1, the media search apparatus according to the present invention includes a preprocessor 110, a filter unit 120, a position extractor 130, a feature extractor 140, a searcher 150, and a database (hereinafter, DB) 160.

The preprocessing unit 110 receives a content sample and performs preprocessing on the audio signal included in the content. When the content sample is input, the preprocessor 110 checks whether the content sample is audio content. The content sample may be an audio sample or a video sample.

If the content sample is audio content, the preprocessor 110 converts the audio signal of the content sample into a mono signal. To do so, the preprocessor 110 takes the average of all channel signals.

If the input content sample is not audio content, the preprocessor 110 extracts an audio signal from the content sample and converts the extracted audio signal into a mono signal. For example, when the content sample is a video sample, the preprocessor 110 extracts an audio signal from the video sample and converts the extracted audio signal into a mono signal.
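The channel-averaging conversion described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name `to_mono` and the channels-first array layout are assumptions for the example.

```python
import numpy as np

def to_mono(audio: np.ndarray) -> np.ndarray:
    """Convert a multi-channel signal (channels x samples) to mono
    by averaging all channel signals, as the preprocessor does."""
    if audio.ndim == 1:
        return audio  # already a mono signal
    return audio.mean(axis=0)

# A stereo sample: left and right channels are averaged sample by sample.
stereo = np.array([[1.0, 2.0, 3.0],
                   [3.0, 4.0, 5.0]])
mono = to_mono(stereo)
print(mono)  # [2. 3. 4.]
```

Averaging (rather than picking one channel) keeps energy contributed by either channel, so a feature computed on the mono signal is less sensitive to how the source was mixed.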

The filter unit 120 extracts only the audio signal of a specific frequency band from the audio signal preprocessed by the preprocessor 110 and obtains the acoustic power of that band. The filter unit 120 includes a band pass filter 121 and a low pass filter 122. The band pass filter 121 extracts the 300 to 2000 Hz frequency band from the audio signal, and the low pass filter 122 is a one-pole low pass filter with a time constant of 10 ms.
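The filter unit's two stages can be sketched as follows. The patent does not specify the band pass filter's design, so an FFT mask stands in for filter 121 here; the one-pole smoother with a 10 ms time constant follows the description of filter 122. All function names and the sample rate are assumptions for the example.

```python
import numpy as np

def band_limit(x, fs, lo=300.0, hi=2000.0):
    """Crude 300-2000 Hz band pass via FFT masking (a sketch standing in
    for band pass filter 121; a real IIR/FIR design would also work)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))

def envelope(x, fs, tau=0.010):
    """Acoustic power: square the signal, then smooth it with a one-pole
    low pass filter whose time constant is tau (10 ms), like filter 122."""
    a = np.exp(-1.0 / (fs * tau))  # pole giving the desired time constant
    y = np.empty_like(x)
    acc = 0.0
    for n, v in enumerate(x * x):
        acc = a * acc + (1.0 - a) * v  # one-pole recursion
        y[n] = acc
    return y

fs = 8000
t = np.arange(fs) / fs
# One in-band tone (1 kHz) plus one out-of-band tone (50 Hz).
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 50 * t)
power = envelope(band_limit(x, fs), fs)
print(power.shape)  # (8000,)
```

After the band pass stage removes the 50 Hz component, the smoothed power settles near 0.5, the mean power of a unit-amplitude sine.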

The position extractor 130 extracts the temporal position at which the sound power (signal magnitude) of the specific frequency band extracted by the filter unit 120 reaches its maximum. This temporal position becomes the reference position for detecting audio features. As shown in FIG. 3, the position extractor 130 sets the reference position extraction section to a block of arbitrary length and detects the positions (a and b) of maximum acoustic power in each block while moving the block. The movement interval of the block is set smaller than the block, which is the section for extracting the reference position. For example, when the reference position extraction section is 10 seconds, the movement interval is set to 5 seconds, about half of the extraction section. That is, the position extractor 130 detects the maximum acoustic power position in each block while overlapping the reference position extraction sections by 50%. In the present invention, extraction of the maximum acoustic power position is described as an example, but the invention is not limited thereto and may instead be implemented to extract the minimum acoustic power position.
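The overlapping-block search described above can be sketched as follows. This is an illustrative sketch over a power sequence in samples; the function name and the toy numbers are assumptions, and the patent's 10 s / 5 s figures would correspond to `block` and `hop` in samples.

```python
def max_power_positions(power, block, hop):
    """Slide a window of `block` samples forward by `hop` samples
    (hop < block, e.g. 50% overlap) and record the index of the
    maximum power within each block, as the position extractor does."""
    positions = []
    start = 0
    while start + block <= len(power):
        seg = power[start:start + block]
        # index of the block's maximum, expressed in absolute samples
        positions.append(start + max(range(len(seg)), key=seg.__getitem__))
        start += hop
    return positions

power = [0, 1, 5, 2, 0, 0, 3, 9, 1, 0]
# Block of 4 samples moved by 2 samples: 50% overlap between blocks.
print(max_power_positions(power, block=4, hop=2))  # [2, 2, 7, 7]
```

The 50% overlap means a peak near a block boundary is still seen whole by the neighboring block, which is why the same reference position can be reported twice.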

The feature extractor 140 obtains the power spectrum of the audio signal in at least two specific sections based on the position extracted by the position extractor 130. The feature extractor 140 divides the power spectrum obtained in each specific section into at least two subbands and sums the spectrum within each subband to obtain the subband power. The subbands are set in proportion to the critical bandwidth, in consideration of human hearing characteristics. In the present invention, the case of 16 subbands and a power spectrum computed in 2 sections is described as an example, but the present invention is not limited thereto; the number of subbands and power spectrum sections can be set in various ways depending on the system implementation.

The feature extractor 140 takes the 4192 samples starting from the maximum sound power position extracted by the position extractor 130 as the first section for obtaining the power spectrum, and the 4192 samples starting from the 4193th sample after the maximum value position as the second section. The subbands are the 16 regions into which the 300 Hz to 2000 Hz range, which contains most of the important acoustic information, is divided based on the critical band.
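The subband power computation for one 4192-sample section can be sketched as follows. The patent does not give the exact critical-band edges, so logarithmically spaced band edges are used here as an assumption; the function name and sample rate are likewise illustrative.

```python
import numpy as np

SECTION = 4192   # samples per section, per the description
N_BANDS = 16

def subband_powers(section, fs, lo=300.0, hi=2000.0, n_bands=N_BANDS):
    """Power spectrum of one section, summed within n_bands subbands
    spanning lo..hi Hz. Log-spaced edges stand in for the critical-band
    split, which the patent describes only qualitatively."""
    spec = np.abs(np.fft.rfft(section)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(section), d=1.0 / fs)
    edges = np.geomspace(lo, hi, n_bands + 1)         # assumed spacing
    return np.array([
        spec[(freqs >= edges[b]) & (freqs < edges[b + 1])].sum()
        for b in range(n_bands)
    ])

fs = 8000
t = np.arange(SECTION) / fs
tone = np.sin(2 * np.pi * 440 * t)
powers = subband_powers(tone, fs)
print(powers.argmax())  # index of the band containing 440 Hz
```

Summing the spectrum per band reduces 2096 FFT bins to 16 numbers per section, which is what makes the subsequent 16-bit feature value so compact.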

At this time, the subband powers of the first section, ordered from low frequency to high frequency (i = 1, 2, ..., 16), and the subband powers of the second section are used to define a feature value represented by 16 bits, whose k-th bit (k = 1, 2, ..., 16) is defined as in Equations 1 and 2 below. Equation 1 defines bits 1 (the most significant bit) through 8 of the feature value, and Equation 2 defines bits 9 through 16 (the least significant bit).

[The subband power symbols and Equations 1 and 2 appear only as images (Figure pat00001 through pat00020) in the original publication.]

As mentioned earlier, the feature value consists of 16 bits. For two signals with the same content, only the lower bit values change when the band-pass-filtered signal is only partially distorted, which is advantageous for indexing and processing.

In other words, since the most significant bits of the feature value compare sound power differences between neighboring frames, those bits remain unchanged for audio signals with the same content unless the distortion is very severe. The most significant bits of the feature value are therefore unlikely to change, and even when some of the lower bits differ, the signals are most likely audio of quite similar content. When indexing the feature value, search efficiency can be increased by comparing bits sequentially from the most significant bit to the least significant bit.

At least one feature value may be extracted based on the position of the maximum value of each block, and the bits are ordered so that the values least easily altered by distortion occupy the most significant bit positions.
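The MSB-robust packing and MSB-first comparison can be sketched as follows. Since Equations 1 and 2 appear only as images in this publication, the bit definitions below are an assumption in their spirit (difference comparisons, robust bits first), not the patent's exact formulas; all names are illustrative.

```python
def feature_bits(p1, p2):
    """Pack a 16-bit feature value from two lists of subband powers.
    As a stand-in for the image-only Equations 1 and 2: the 8 most
    significant bits compare powers across the two sections (robust
    differences); the 8 least significant bits compare neighboring
    subbands within the first section (finer, more distortion-prone)."""
    bits = 0
    for i in range(8):                     # bits 1..8 (MSB side)
        bits = (bits << 1) | (1 if p1[i] > p2[i] else 0)
    for i in range(8):                     # bits 9..16 (LSB side)
        bits = (bits << 1) | (1 if p1[i] > p1[i + 1] else 0)
    return bits

def msb_first_distance(a, b, width=16):
    """Compare two feature values bit by bit from the most significant
    bit down; return the index of the first differing bit (width if
    identical). Larger values mean a closer match, so a search can
    reject candidates as soon as a high bit disagrees."""
    for k in range(width):
        mask = 1 << (width - 1 - k)
        if (a & mask) != (b & mask):
            return k
    return width

p1 = [9, 1, 8, 2, 7, 3, 6, 4, 5]
p2 = [5, 5, 5, 5, 5, 5, 5, 5, 5]
f = feature_bits(p1, p2)
print(f"{f:016b}")  # 1010101010101010
```

Because mismatches in high bits are detected first, most non-matching database entries are discarded after examining only a few bits, which is the search-efficiency gain the description attributes to MSB-to-LSB comparison.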

The searcher 150 searches for content in the DB 160, described later, using the extracted feature value. The searcher 150 compares the extracted feature value with the feature values stored in the DB 160 and outputs the identifier (ID) of the content linked to the matching feature value, together with its position, as the search result. In other words, the searcher 150 searches for the content that includes the content sample input to the content search apparatus.

The DB 160 lists feature values in order, each associated with a content ID and position. That is, the DB 160 stores, in table form, the ID and position of content such as video and audio paired with the feature value at a specific position of the audio signal in that content.
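The table just described maps each feature value to the (content ID, position) pairs at which it occurs, which a hash table models directly. This is a minimal sketch; the class and method names are assumptions, as the patent only describes the table's contents.

```python
from collections import defaultdict

class FeatureDB:
    """Minimal sketch of DB 160: a 16-bit feature value keys a list of
    (content ID, position) pairs stored when the content was indexed."""
    def __init__(self):
        self.table = defaultdict(list)

    def add(self, feature, content_id, position):
        """Index one feature occurrence: pair it with the content ID
        and the position (in seconds) within that content."""
        self.table[feature].append((content_id, position))

    def search(self, feature):
        """Return the (content ID, position) pairs whose stored feature
        value matches, as the searcher 150 outputs."""
        return self.table.get(feature, [])

db = FeatureDB()
db.add(0xA5A5, "video_01", 12.0)
db.add(0xA5A5, "audio_07", 3.5)
db.add(0x1234, "video_02", 0.0)
print(db.search(0xA5A5))  # [('video_01', 12.0), ('audio_07', 3.5)]
```

An exact-match lookup like this is O(1) per query feature value, which is what lets a sample be located in a mass database without scanning every item's full-length features.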

4 is a flowchart illustrating a content retrieval method using audio features according to the present invention.

Referring to FIG. 4, the preprocessor 110 of the content retrieval apparatus receives a content sample (S101). The preprocessor 110 receives an audio sample or a video sample.

The preprocessor 110 checks whether the received content sample is audio content (S102).

If the content sample is audio content, the preprocessor 110 converts the audio signal of the content sample into a mono signal and transmits it to the filter unit 120 (S103). That is, when the audio sample is input, the preprocessor 110 converts the input audio sample into a mono signal.

On the other hand, if the content sample is not audio content, the preprocessor 110 extracts an audio signal from the content sample (S102 '). For example, when the preprocessor 110 receives a video sample, the preprocessor 110 extracts an audio signal from the video sample. The preprocessing unit 110 converts the extracted audio signal into a mono signal (S103).

The filter unit 120 extracts only an audio signal of a specific frequency band from the audio signal converted into a mono signal (S104). The filter unit 120 performs band pass filtering on the audio signal and then performs low pass filtering to extract only the audio signal of a specific frequency band. The filter unit 120 obtains a sound power (signal magnitude) for the extracted audio signal of a specific frequency band.

Subsequently, the position extractor 130 detects the position of maximum acoustic power in the specific-frequency-band audio signal extracted by the filter unit 120 and outputs the detected maximum position (S105).

When the position of maximum sound power is extracted, the feature extractor 140 extracts a feature value (audio feature) in at least two specific sections based on the maximum value position output from the position extractor 130 (S106). The feature extractor 140 obtains the power spectrum of the audio signal in at least two specific sections based on the extracted position, divides the power spectrum obtained in each section into a plurality of subbands, and obtains the subband power for each subband. The feature extractor 140 then uses the subband powers to extract the feature value at the specific position. The feature value consists of 16 bits, and its most significant bits are insensitive to distortion.

When the feature value is extracted, the searcher 150 searches the DB 160 for content having a feature value that matches the extracted feature value, and outputs the ID and position of that content as the search result (S107). That is, the searcher 150 retrieves the content that includes the content sample from the DB 160.

The embodiments disclosed in the specification of the present invention are not intended to limit the present invention. The scope of the present invention should be construed according to the following claims, and all the techniques within the scope of equivalents should be construed as being included in the scope of the present invention.

110: preprocessing unit 120: filter unit
130: location extraction unit 140: feature extraction unit
150: search unit 160: DB

Claims (10)

Receiving a content sample;
Preprocessing an audio signal of the content sample;
Obtaining sound power for a specific frequency band of the audio signal;
Extracting a reference position in the specific frequency band based on the sound power;
Extracting an audio feature in at least two sections based on the reference position; and
Searching for content including the content sample in a database using the audio feature.
The method of claim 1, wherein the audio signal preprocessing step comprises:
Confirming whether the content sample is audio content;
Extracting an audio signal from the content sample if the content sample is not audio content;
And converting the audio signal extracted from the content sample into a mono signal.
The method of claim 2,
And converting the audio signal of the content sample into a mono signal if the content sample is audio content.
The method of claim 1, wherein the obtaining of the specific frequency band sound power comprises:
And an audio signal of the specific frequency band is extracted by sequentially performing band pass filtering and low pass filtering on the audio signal.
The method of claim 1, wherein the extracting of the reference position comprises:
Setting a block serving as the position extraction section, and extracting a reference position from each block while moving the block within the specific frequency band.
The method of claim 5,
wherein the block is moved at an interval smaller than the position extraction section.
The method of claim 5,
wherein the extracting of the reference position comprises detecting the position of maximum or minimum sound power in each of the blocks.
The method of claim 1, wherein the extracting audio features comprises:
Obtaining a power spectrum of an audio signal in at least two specific sections based on the reference position;
Dividing the power spectrum into a plurality of subbands and obtaining subband power of each subband;
And extracting feature values of an audio signal using subband power in the specific section.
The method of claim 8,
wherein the feature value consists of 16 bits.
The method of claim 8,
wherein a plurality of the feature values are extractable based on the reference position of each block.
KR1020110127848A 2011-12-01 2011-12-01 Method for searching content using audio features KR20130061504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020110127848A KR20130061504A (en) 2011-12-01 2011-12-01 Method for searching content using audio features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020110127848A KR20130061504A (en) 2011-12-01 2011-12-01 Method for searching content using audio features

Publications (1)

Publication Number Publication Date
KR20130061504A true KR20130061504A (en) 2013-06-11

Family

ID=48859608

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020110127848A KR20130061504A (en) 2011-12-01 2011-12-01 Method for searching content using audio features

Country Status (1)

Country Link
KR (1) KR20130061504A (en)

Similar Documents

Publication Publication Date Title
KR20120064582A (en) Method of searching multi-media contents and apparatus for the same
TWI480855B (en) Extraction and matching of characteristic fingerprints from audio signals
EP2458584A2 (en) Audio visual signature, method of deriving a signature, and method of comparing audio-visual data
Anguera et al. Mask: Robust local features for audio fingerprinting
CN106802960B (en) Fragmented audio retrieval method based on audio fingerprints
US10089994B1 (en) Acoustic fingerprint extraction and matching
CN105989836B (en) Voice acquisition method and device and terminal equipment
CN106708990B (en) Music piece extraction method and equipment
JP2006505821A (en) Multimedia content with fingerprint information
CN111640411B (en) Audio synthesis method, device and computer readable storage medium
CN109644283B (en) Audio fingerprinting based on audio energy characteristics
US20190213214A1 (en) Audio matching
US8543228B2 (en) Coded domain audio analysis
US8108164B2 (en) Determination of a common fundamental frequency of harmonic signals
CN103294696A (en) Audio and video content retrieval method and system
KR20130061504A (en) Method for searching content using audio features
KR101661666B1 (en) Hybrid audio fingerprinting apparatus and method
US9215350B2 (en) Sound processing method, sound processing system, video processing method, video processing system, sound processing device, and method and program for controlling same
US9183840B2 (en) Apparatus and method for measuring quality of audio
Wang et al. Audio fingerprint based on spectral flux for audio retrieval
KR101303256B1 (en) Apparatus and Method for real-time detecting and decoding of morse signal
Mapelli et al. Audio hashing technique for automatic song identification
CN108268572B (en) Song synchronization method and system
CN110910899B (en) Real-time audio signal consistency comparison detection method
Chickanbanjar Comparative analysis between audio fingerprinting algorithms

Legal Events

Date Code Title Description
WITN Withdrawal due to no request for examination