KR20130061504A - Method for searching content using audio features - Google Patents

Method for searching content using audio features Download PDF

Info

Publication number
KR20130061504A
KR20130061504A KR1020110127848A
Authority
KR
South Korea
Prior art keywords
content
audio
audio signal
sample
feature
Prior art date
Application number
KR1020110127848A
Other languages
Korean (ko)
Inventor
정혁
오원근
제성관
나상일
이근동
Original Assignee
한국전자통신연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국전자통신연구원 filed Critical 한국전자통신연구원
Priority to KR1020110127848A priority Critical patent/KR20130061504A/en
Publication of KR20130061504A publication Critical patent/KR20130061504A/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Analysis (AREA)
  • Multimedia (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Mathematical Optimization (AREA)
  • Signal Processing (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PURPOSE: A content search method using audio features is provided to search content using indexed audio features, thereby quickly finding the specific content containing a content sample in a mass content database. CONSTITUTION: A pre-processing part converts an audio signal of an input content sample into a mono signal (S102, S103). A filter part extracts only the audio signal of a specific frequency band from the converted audio signal (S104). A position extraction part outputs the location of the acoustic power maximum value in the extracted audio signal (S105). A feature extraction part extracts a feature value from at least two specific sections based on the maximum value location (S106). A search part searches a database for content including a feature value matching the extracted feature value and outputs that content (S107). [Reference numerals] (AA) Start; (BB) No; (CC) Yes; (DD) End; (S101) Input content sample; (S102) Is the content sample audio content?; (S102') Extract audio signal; (S103) Convert audio signal to mono signal; (S104) Extract specific frequency band; (S105) Extract the maximum value location within the specific frequency band; (S106) Extract feature value in the specific section based on the extracted location; (S107) Search and output media using the feature value

Description

Method for searching content using audio features

The present invention relates to a content retrieval method and, more particularly, to a content retrieval method that extracts an audio feature using signal magnitude information of a specific frequency band of the audio signal included in content, and searches for the content using the extracted audio feature.

When a user has only part of a piece of content among the countless audio and video items on the Internet, a technique is needed for finding the content that contains that part. In general, a video includes an audio signal synchronized with the video signal. Since audio features are easier to compute and smaller in size than video features, the audio signal is commonly used as the means for searching for the video.

To search for content using audio features, the features must be robust to audio signal transformations such as resampling, lossy compression (e.g., MP3), and equalization, and they must allow real-time search through a simple process.

Conventionally, audio features are extracted using the spectral flatness of each subband of the audio signal, and audio is searched using distortion discriminant analysis (DDA). However, this conventional search method is not robust to distortions applied to the audio signal, and searching for an audio file takes a long time.

The present invention was devised to solve the above problems. An object of the present invention is to provide a content retrieval method using audio features that indexes the audio feature at a specific position of the audio signal in content and performs content retrieval using the indexed audio feature.

It is also an object of the present invention to provide a content retrieval method using audio features that is robust to distortion by placing the feature values least sensitive to distortion in the most significant bits.

To this end, a content retrieval method using audio features according to the present invention comprises: receiving a content sample; pre-processing the audio signal of the content sample; obtaining the sound power of a specific frequency band of the audio signal; extracting a reference position in the specific frequency band based on the sound power; extracting an audio feature in at least two sections based on the reference position; and retrieving content including the content sample from a database using the audio feature.

The content retrieval method using audio features according to the present invention indexes the audio feature at a specific position of the audio signal in content and performs content retrieval using the indexed audio feature, so that the specific content containing a content sample can be found quickly in a mass content database.

In addition, when retrieving content from a large database, the content retrieval method according to the present invention does not use features of the entire content; it uses a minimal set of audio features at specific positions of the audio signal in the content, which improves search efficiency.

FIG. 1 is a diagram showing the configuration of a content retrieval apparatus using audio features according to the present invention;
FIG. 2 is a diagram illustrating the configuration of the filter unit shown in FIG. 1;
FIG. 3 is a conceptual diagram illustrating the maximum value extraction section and the movement interval of the feature extractor; and
FIG. 4 is a flowchart illustrating a content retrieval method using audio features according to the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The configuration of the present invention and the operation and effect thereof will be clearly understood through the following detailed description.

Prior to the detailed description of the present invention, the same components are denoted by the same reference numerals even when shown in different drawings, and detailed descriptions of well-known configurations are omitted when they may obscure the gist of the present invention.

FIG. 1 is a block diagram of a content retrieval apparatus using an audio feature according to an embodiment of the present invention, FIG. 2 is a block diagram of the filter unit shown in FIG. 1, and FIG. 3 is a conceptual diagram illustrating the maximum value extraction section and the movement interval of the feature extractor.

Referring to FIG. 1, the media search apparatus according to the present invention includes a preprocessor 110, a filter unit 120, a position extractor 130, a feature extractor 140, a searcher 150, and a database (hereinafter, DB) 160.

The preprocessing unit 110 receives a content sample and performs preprocessing on the audio signal included in the content. When the content sample is input, the preprocessor 110 checks whether the content sample is audio content. The content sample may be an audio sample or a video sample.

If the content sample is audio content, the preprocessor 110 converts the audio signal of the content sample into a mono signal. To do so, the preprocessor 110 takes the average of all channel signals.

If the input content sample is not audio content, the preprocessor 110 extracts an audio signal from the content sample and converts the extracted audio signal into a mono signal. For example, when the content sample is a video sample, the preprocessor 110 extracts an audio signal from the video sample and converts the extracted audio signal into a mono signal.
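The channel-averaging conversion described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name `to_mono` and the channels-first array layout are assumptions for the example.

```python
import numpy as np

def to_mono(audio: np.ndarray) -> np.ndarray:
    """Convert a multi-channel signal (channels x samples) to mono
    by averaging all channel signals, as the preprocessor does."""
    if audio.ndim == 1:
        return audio  # already a mono signal
    return audio.mean(axis=0)

# A stereo sample: left and right channels are averaged sample by sample.
stereo = np.array([[1.0, 2.0, 3.0],
                   [3.0, 4.0, 5.0]])
mono = to_mono(stereo)
print(mono)  # [2. 3. 4.]
```

Averaging (rather than picking one channel) keeps energy contributed by either channel, so a feature computed on the mono signal is less sensitive to how the source was mixed.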

The filter unit 120 extracts only the audio signal of a specific frequency band from the audio signal preprocessed by the preprocessor 110 and obtains the acoustic power of that band. The filter unit 120 includes a band pass filter 121 and a low pass filter 122. The band pass filter 121 extracts the 300 to 2000 Hz frequency band from the audio signal, and the low pass filter 122 is a one-pole low pass filter with a time constant of 10 ms.
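The filter unit's two stages can be sketched as follows. The patent does not specify the band pass filter's design, so an FFT mask stands in for filter 121 here; the one-pole smoother with a 10 ms time constant follows the description of filter 122. All function names and the sample rate are assumptions for the example.

```python
import numpy as np

def band_limit(x, fs, lo=300.0, hi=2000.0):
    """Crude 300-2000 Hz band pass via FFT masking (a sketch standing in
    for band pass filter 121; a real IIR/FIR design would also work)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))

def envelope(x, fs, tau=0.010):
    """Acoustic power: square the signal, then smooth it with a one-pole
    low pass filter whose time constant is tau (10 ms), like filter 122."""
    a = np.exp(-1.0 / (fs * tau))  # pole giving the desired time constant
    y = np.empty_like(x)
    acc = 0.0
    for n, v in enumerate(x * x):
        acc = a * acc + (1.0 - a) * v  # one-pole recursion
        y[n] = acc
    return y

fs = 8000
t = np.arange(fs) / fs
# One in-band tone (1 kHz) plus one out-of-band tone (50 Hz).
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 50 * t)
power = envelope(band_limit(x, fs), fs)
print(power.shape)  # (8000,)
```

After the band pass stage removes the 50 Hz component, the smoothed power settles near 0.5, the mean power of a unit-amplitude sine.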

The position extractor 130 extracts the temporal position at which the sound power (signal magnitude) of the specific frequency band extracted by the filter unit 120 reaches its maximum. This temporal position becomes the reference position for detecting audio features. As shown in FIG. 3, the position extractor 130 sets the reference position extraction section to a block of arbitrary length and detects the positions (a and b) of maximum acoustic power in each block while moving the block. The movement interval of the block is set smaller than the block, which is the section for extracting the reference position. For example, when the reference position extraction section is 10 seconds, the movement interval is set to 5 seconds, about half of the extraction section. That is, the position extractor 130 detects the maximum acoustic power position in each block while overlapping the reference position extraction sections by 50%. In the present invention, extraction of the maximum acoustic power position is described as an example, but the invention is not limited thereto and may instead be implemented to extract the minimum acoustic power position.
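The overlapping-block search described above can be sketched as follows. This is an illustrative sketch over a power sequence in samples; the function name and the toy numbers are assumptions, and the patent's 10 s / 5 s figures would correspond to `block` and `hop` in samples.

```python
def max_power_positions(power, block, hop):
    """Slide a window of `block` samples forward by `hop` samples
    (hop < block, e.g. 50% overlap) and record the index of the
    maximum power within each block, as the position extractor does."""
    positions = []
    start = 0
    while start + block <= len(power):
        seg = power[start:start + block]
        # index of the block's maximum, expressed in absolute samples
        positions.append(start + max(range(len(seg)), key=seg.__getitem__))
        start += hop
    return positions

power = [0, 1, 5, 2, 0, 0, 3, 9, 1, 0]
# Block of 4 samples moved by 2 samples: 50% overlap between blocks.
print(max_power_positions(power, block=4, hop=2))  # [2, 2, 7, 7]
```

The 50% overlap means a peak near a block boundary is still seen whole by the neighboring block, which is why the same reference position can be reported twice.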

The feature extractor 140 obtains the power spectrum of the audio signal in at least two specific sections based on the position extracted by the position extractor 130. The feature extractor 140 divides the power spectrum obtained in each specific section into at least two subbands and sums the spectrum within each subband to obtain the subband power. The subbands are set in proportion to the critical bandwidth, in consideration of human hearing characteristics. In the present invention, the case of 16 subbands and a power spectrum computed in 2 sections is described as an example, but the present invention is not limited thereto; the number of subbands and power spectrum sections can be set in various ways depending on the system implementation.

The feature extractor 140 takes the 4192 samples starting from the maximum sound power position extracted by the position extractor 130 as the first section for obtaining the power spectrum, and the 4192 samples starting from the 4193th sample after the maximum value position as the second section. The subbands are the 16 regions into which the 300 Hz to 2000 Hz range, which contains most of the important acoustic information, is divided based on the critical band.
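The subband power computation for one 4192-sample section can be sketched as follows. The patent does not give the exact critical-band edges, so logarithmically spaced band edges are used here as an assumption; the function name and sample rate are likewise illustrative.

```python
import numpy as np

SECTION = 4192   # samples per section, per the description
N_BANDS = 16

def subband_powers(section, fs, lo=300.0, hi=2000.0, n_bands=N_BANDS):
    """Power spectrum of one section, summed within n_bands subbands
    spanning lo..hi Hz. Log-spaced edges stand in for the critical-band
    split, which the patent describes only qualitatively."""
    spec = np.abs(np.fft.rfft(section)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(section), d=1.0 / fs)
    edges = np.geomspace(lo, hi, n_bands + 1)         # assumed spacing
    return np.array([
        spec[(freqs >= edges[b]) & (freqs < edges[b + 1])].sum()
        for b in range(n_bands)
    ])

fs = 8000
t = np.arange(SECTION) / fs
tone = np.sin(2 * np.pi * 440 * t)
powers = subband_powers(tone, fs)
print(powers.argmax())  # index of the band containing 440 Hz
```

Summing the spectrum per band reduces 2096 FFT bins to 16 numbers per section, which is what makes the subsequent 16-bit feature value so compact.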

At this time, the subband powers of the first section, ordered from low frequency to high frequency (i = 1, 2, ..., 16), and the subband powers of the second section are used to define a feature value represented by 16 bits, whose k-th bit (k = 1, 2, ..., 16) is defined as in Equations 1 and 2 below. Equation 1 defines bits 1 (the most significant bit) through 8 of the feature value, and Equation 2 defines bits 9 through 16 (the least significant bit).

[The subband power symbols and Equations 1 and 2 appear only as images (Figure pat00001 through pat00020) in the original publication.]

As mentioned earlier, the feature value consists of 16 bits. For two signals with the same content, only the lower bit values change when the band-pass-filtered signal is only partially distorted, which is advantageous for indexing and processing.

In other words, since the most significant bits of the feature value compare sound power differences between neighboring frames, those bits remain unchanged for audio signals with the same content unless the distortion is very severe. The most significant bits of the feature value are therefore unlikely to change, and even when some of the lower bits differ, the signals are most likely audio of quite similar content. When indexing the feature value, search efficiency can be increased by comparing bits sequentially from the most significant bit to the least significant bit.

At least one feature value may be extracted based on the position of the maximum value of each block, and the bits are ordered so that the values least easily altered by distortion occupy the most significant bit positions.
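The MSB-robust packing and MSB-first comparison can be sketched as follows. Since Equations 1 and 2 appear only as images in this publication, the bit definitions below are an assumption in their spirit (difference comparisons, robust bits first), not the patent's exact formulas; all names are illustrative.

```python
def feature_bits(p1, p2):
    """Pack a 16-bit feature value from two lists of subband powers.
    As a stand-in for the image-only Equations 1 and 2: the 8 most
    significant bits compare powers across the two sections (robust
    differences); the 8 least significant bits compare neighboring
    subbands within the first section (finer, more distortion-prone)."""
    bits = 0
    for i in range(8):                     # bits 1..8 (MSB side)
        bits = (bits << 1) | (1 if p1[i] > p2[i] else 0)
    for i in range(8):                     # bits 9..16 (LSB side)
        bits = (bits << 1) | (1 if p1[i] > p1[i + 1] else 0)
    return bits

def msb_first_distance(a, b, width=16):
    """Compare two feature values bit by bit from the most significant
    bit down; return the index of the first differing bit (width if
    identical). Larger values mean a closer match, so a search can
    reject candidates as soon as a high bit disagrees."""
    for k in range(width):
        mask = 1 << (width - 1 - k)
        if (a & mask) != (b & mask):
            return k
    return width

p1 = [9, 1, 8, 2, 7, 3, 6, 4, 5]
p2 = [5, 5, 5, 5, 5, 5, 5, 5, 5]
f = feature_bits(p1, p2)
print(f"{f:016b}")  # 1010101010101010
```

Because mismatches in high bits are detected first, most non-matching database entries are discarded after examining only a few bits, which is the search-efficiency gain the description attributes to MSB-to-LSB comparison.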

The searcher 150 searches for content in the DB 160, described later, using the extracted feature value. The searcher 150 compares the extracted feature value with the feature values stored in the DB 160 and outputs the identifier (ID) of the content linked to the matching feature value, together with its position, as the search result. In other words, the searcher 150 searches for the content that includes the content sample input to the content search apparatus.

The DB 160 lists feature values in order, each associated with a content ID and position. That is, the DB 160 stores, in table form, the ID and position of content such as video and audio paired with the feature value at a specific position of the audio signal in that content.
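The table just described maps each feature value to the (content ID, position) pairs at which it occurs, which a hash table models directly. This is a minimal sketch; the class and method names are assumptions, as the patent only describes the table's contents.

```python
from collections import defaultdict

class FeatureDB:
    """Minimal sketch of DB 160: a 16-bit feature value keys a list of
    (content ID, position) pairs stored when the content was indexed."""
    def __init__(self):
        self.table = defaultdict(list)

    def add(self, feature, content_id, position):
        """Index one feature occurrence: pair it with the content ID
        and the position (in seconds) within that content."""
        self.table[feature].append((content_id, position))

    def search(self, feature):
        """Return the (content ID, position) pairs whose stored feature
        value matches, as the searcher 150 outputs."""
        return self.table.get(feature, [])

db = FeatureDB()
db.add(0xA5A5, "video_01", 12.0)
db.add(0xA5A5, "audio_07", 3.5)
db.add(0x1234, "video_02", 0.0)
print(db.search(0xA5A5))  # [('video_01', 12.0), ('audio_07', 3.5)]
```

An exact-match lookup like this is O(1) per query feature value, which is what lets a sample be located in a mass database without scanning every item's full-length features.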

4 is a flowchart illustrating a content retrieval method using audio features according to the present invention.

Referring to FIG. 4, the preprocessor 110 of the content retrieval apparatus receives a content sample (S101). The preprocessor 110 receives an audio sample or a video sample.

The preprocessor 110 checks whether the received content sample is audio content (S102).

If the content sample is audio content, the preprocessor 110 converts the audio signal of the content sample into a mono signal and transmits it to the filter unit 120 (S103). That is, when the audio sample is input, the preprocessor 110 converts the input audio sample into a mono signal.

On the other hand, if the content sample is not audio content, the preprocessor 110 extracts an audio signal from the content sample (S102 '). For example, when the preprocessor 110 receives a video sample, the preprocessor 110 extracts an audio signal from the video sample. The preprocessing unit 110 converts the extracted audio signal into a mono signal (S103).

The filter unit 120 extracts only an audio signal of a specific frequency band from the audio signal converted into a mono signal (S104). The filter unit 120 performs band pass filtering on the audio signal and then performs low pass filtering to extract only the audio signal of a specific frequency band. The filter unit 120 obtains a sound power (signal magnitude) for the extracted audio signal of a specific frequency band.

Subsequently, the position extractor 130 detects the position of maximum acoustic power in the specific-frequency-band audio signal extracted by the filter unit 120 and outputs the detected maximum position (S105).

When the position of maximum sound power is extracted, the feature extractor 140 extracts a feature value (audio feature) in at least two specific sections based on the maximum value position output from the position extractor 130 (S106). The feature extractor 140 obtains the power spectrum of the audio signal in at least two specific sections based on the extracted position, divides the power spectrum obtained in each section into a plurality of subbands, and obtains the subband power for each subband. The feature extractor 140 then uses the subband powers to extract the feature value at the specific position. The feature value consists of 16 bits, and its most significant bits are insensitive to distortion.

When the feature value is extracted, the searcher 150 searches the DB 160 for content having a feature value that matches the extracted feature value, and outputs the ID and position of that content as the search result (S107). That is, the searcher 150 retrieves the content that includes the content sample from the DB 160.

The embodiments disclosed in the specification of the present invention are not intended to limit the present invention. The scope of the present invention should be construed according to the following claims, and all the techniques within the scope of equivalents should be construed as being included in the scope of the present invention.

110: preprocessing unit 120: filter unit
130: location extraction unit 140: feature extraction unit
150: search unit 160: DB

Claims (10)

Receiving a content sample;
Preprocessing an audio signal of the content sample;
Obtaining sound power for a specific frequency band of the audio signal;
Extracting a reference position in the specific frequency band based on the sound power;
Extracting an audio feature in at least two sections based on the reference position; and
Searching for content including the content sample in a database using the audio feature.
The method of claim 1, wherein the audio signal preprocessing step comprises:
Confirming whether the content sample is audio content;
Extracting an audio signal from the content sample if the content sample is not audio content;
And converting the audio signal extracted from the content sample into a mono signal.
The method of claim 2,
And converting the audio signal of the content sample into a mono signal if the content sample is audio content.
The method of claim 1, wherein the obtaining of the specific frequency band sound power comprises:
And an audio signal of the specific frequency band is extracted by sequentially performing band pass filtering and low pass filtering on the audio signal.
The method of claim 1, wherein the extracting of the reference position comprises:
Setting a block serving as the position extraction section, and extracting a reference position from each block while moving the block within the specific frequency band.
The method of claim 5,
wherein the block is moved at an interval smaller than the position extraction section.
The method of claim 5,
wherein the extracting of the reference position comprises detecting the position of maximum or minimum sound power in each of the blocks.
The method of claim 1, wherein the extracting audio features comprises:
Obtaining a power spectrum of an audio signal in at least two specific sections based on the reference position;
Dividing the power spectrum into a plurality of subbands and obtaining subband power of each subband;
And extracting feature values of an audio signal using subband power in the specific section.
The method of claim 8,
wherein the feature value consists of 16 bits.
The method of claim 8,
wherein a plurality of the feature values are extractable based on the reference position of each block.
KR1020110127848A 2011-12-01 2011-12-01 Method for searching content using audio features KR20130061504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020110127848A KR20130061504A (en) 2011-12-01 2011-12-01 Method for searching content using audio features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020110127848A KR20130061504A (en) 2011-12-01 2011-12-01 Method for searching content using audio features

Publications (1)

Publication Number Publication Date
KR20130061504A true KR20130061504A (en) 2013-06-11

Family

ID=48859608

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020110127848A KR20130061504A (en) 2011-12-01 2011-12-01 Method for searching content using audio features

Country Status (1)

Country Link
KR (1) KR20130061504A (en)

Similar Documents

Publication Publication Date Title
KR20120064582A (en) Method of searching multi-media contents and apparatus for the same
TWI480855B (en) Extraction and matching of characteristic fingerprints from audio signals
EP2458584A2 (en) Audio visual signature, method of deriving a signature, and method of comparing audio-visual data
Anguera et al. Mask: Robust local features for audio fingerprinting
CN106802960B (en) Fragmented audio retrieval method based on audio fingerprints
US10089994B1 (en) Acoustic fingerprint extraction and matching
CN105989836B (en) Voice acquisition method and device and terminal equipment
CN106708990B (en) Music piece extraction method and equipment
JP2006505821A (en) Multimedia content with fingerprint information
CN111640411B (en) Audio synthesis method, device and computer readable storage medium
CN109644283B (en) Audio fingerprinting based on audio energy characteristics
US20190213214A1 (en) Audio matching
US8543228B2 (en) Coded domain audio analysis
US8108164B2 (en) Determination of a common fundamental frequency of harmonic signals
CN103294696A (en) Audio and video content retrieval method and system
KR20130061504A (en) Method for searching content using audio features
KR101661666B1 (en) Hybrid audio fingerprinting apparatus and method
US9215350B2 (en) Sound processing method, sound processing system, video processing method, video processing system, sound processing device, and method and program for controlling same
US9183840B2 (en) Apparatus and method for measuring quality of audio
Wang et al. Audio fingerprint based on spectral flux for audio retrieval
KR101303256B1 (en) Apparatus and Method for real-time detecting and decoding of morse signal
Mapelli et al. Audio hashing technique for automatic song identification
CN108268572B (en) Song synchronization method and system
CN110910899B (en) Real-time audio signal consistency comparison detection method
Chickanbanjar Comparative analysis between audio fingerprinting algorithms

Legal Events

Date Code Title Description
WITN Withdrawal due to no request for examination