EP2357645A1

EP2357645A1 - Music detecting apparatus and music detecting method

Info

Publication number: EP2357645A1
Application number: EP10172348A
Authority: EP
Inventors: Tatsuya Uehara
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2009-12-28
Filing date: 2010-08-10
Publication date: 2011-08-17

Abstract

According to one embodiment, a music detecting apparatus comprises: an input module (201); a decomposing module (202) ; an inverse quantization module (203; 502); an estimating module (204; 503); and determining module (205). The input module (201) is configured to receive an input of an audio signal subjected to middle-side (MS) stereo encoding. The decomposing module (202) is configured to decompose the audio signal input in the input module (201) on a component-by-component basis. The inverse quantization module (203; 502) is configured to perform, with respect to each component of the audio signal, inverse quantization on a difference signal between a plurality of channels included in the audio signal for generating inverse quantization data of the difference signal. The estimating module (204; 503) is configured to estimate sound volume of the difference signal based on the inverse quantization data of the difference signal generated by the inverse quantization module (203; 502). The determining module (205) is configured to determine, based on whether the sound volume of the difference signal estimated by the estimating module (204) is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.

Description

Embodiments described herein relate generally to a music detecting apparatus and a music detecting method.

BACKGROUND

With the development of computer technology in recent years, there have been advances in reproducing apparatuses that receive video signals or audio signals and then reproduce the same. Besides, usually, computers are equipped with a function of reproducing video signals and audio signals as part of the standard specification.
In such computers or reproducing apparatuses, music programs are sometimes obtained as part of various video signal contents. With regard to obtaining music programs, for example, there exists a demand for separating the music portions from the other portions included in a music program.
As a technology to respond to such a demand, for example, Japanese Patent Application Publication (KOKAI) No. 2008-298976 discloses a technology for detecting music on the basis of input audio signals. By implementing that technology, it becomes possible to identify the music portion included in the contents.
In digital broadcasting, transmission is performed in an encoded format such as the moving picture experts group (MPEG) audio format. Hence, in the case of implementing the technology disclosed in Japanese Patent Application Publication (KOKAI) No. 2008-298976 , there arises a need to perform decoding. However, since decoding is not required while recording the digital broadcast, a decode circuit is often not available for use at the time of recording. For that reason, in order to identify music portions at the time of recording, decoding may be performed using software. However, that option is associated with a greater processing load.
It is therefore an object of the present invention to provide a music detecting apparatus and a music detecting method that enable achieving reduction in the processing load associated with identification of music portions.

SUMMARY

To overcome the problems and achieve the object mentioned above, in general, according to an embodiment, a music detecting apparatus comprises: an input module; a decomposing module; an inverse quantization module; an estimating module; and a determining module. The input module is configured to receive an input of an audio signal subjected to middle-side (MS) stereo encoding. The decomposing module is configured to decompose the audio signal input in the input module on a component-by-component basis. The inverse quantization module configured to perform, with respect to each component of the audio signal, inverse quantization on a difference signal between a plurality of channels included in the audio signal for generating inverse quantization data of the difference signal. The estimating module is configured to estimate sound volume of the difference signal based on the inverse quantization data of the difference signal generated by the inverse quantization module. The determining module is configured to determine, based on whether the sound volume of the difference signal estimated by the estimating module is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.
According to another embodiment, a music detecting apparatus comprises: an input module, a decomposing module; an inverse quantization module; and a determining module. The input module is configured to receive an input of an audio signal subjected to intensity stereo encoding. The decomposing module is configured to decompose the audio signal input in the input module on a component-by-component basis. The inverse quantization module is configured to perform, with respect to each component of the audio signal in a decomposed form, inverse quantization on the component for generating inverse quantization data of the component. The determining module is configured to determine, based on whether a signal ratio of each channel comprised in the inverse quantization data generated on a component-by-component basis for the audio signal by the inverse quantization module lies within a predetermined range, whether each component of the audio signal represents a music portion.
According to still another embodiment, a music detecting apparatus comprises: an input module; a decomposing module; an encoding identifying module; an inverse quantization module; an estimating module; an MS stereo determining module; an intensity stereo determining module; and a stereo encoding nonuse determining module. The input module is configured to receive an input of an audio signal. The decomposing module is configured to decompose the audio signal input in the input module on a component-by-component basis. The encoding identifying module is configured to identify, with respect to each component of the audio signal in a decomposed form, an implemented method from among a middle-side (MS) stereo encoding method, an intensity stereo encoding method, and stereo encoding nonuse. The inverse quantization module is configured to perform, with respect to each component of the audio signal, inverse quantization on a difference signal between a plurality of channels included in the audio signal for generating inverse quantization data of the difference signal in a case when the encoding identifying module identifies implementation of the MS stereo encoding method, configured to perform, with respect to each component of the audio signal in a decomposed form, inverse quantization on the component for generating inverse quantization data of the component in a case when the encoding identifying module identifies implementation of the intensity stereo encoding method, and configured to perform, with respect to each component of the audio signal in a decomposed form, inverse quantization on each of a plurality of channels included in the audio signal for generating inverse quantization data of each of the plurality of channels in a case when the encoding identifying module identifies the stereo encoding nonuse. The estimating module is configured to estimate, in the case when the encoding identifying module identifies implementation of the MS stereo encoding method, sound volume of the difference signal based on the inverse quantization data of the difference signal generated by the inverse quantization module, and configured to estimate, in the case when the encoding identifying module identifies the stereo encoding nonuse, sound volume on a channel-by-channel basis based on the inverse quantization data of each of the plurality of channels generated by the inverse quantization module. The MS stereo determining module is configured to determine, based on whether the sound volume of the difference signal estimated by the estimating module is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion. The intensity stereo determining module is configured to determine, based on whether a signal ratio of each channel comprised in the inverse quantization data generated on a component-by-component basis for the audio signal by the inverse quantization module in the case when the encoding identifying module identifies implementation of the intensity stereo encoding method lies within a predetermined range, whether each component of the audio signal represents a music portion. The stereo encoding nonuse determining module is configured to determine, based on whether a difference in sound volume of the plurality of channels estimated by the estimating module in the case when the encoding identifying module identifies the stereo encoding nonuse is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.
According to still another embodiment, a music detecting apparatus comprises: an input module; a decomposing module; and a determining module. The input module is configured to receive an input of an audio signal. The decomposing module is configured to decompose the audio signal input in the input module on a component-by-component basis. The determining module configured to determine, based on whether either one of a middle-side (MS) stereo encoding method and an intensity stereo encoding method is implemented for stereo encoding, whether each component of the audio signal represents a music portion.
According to still another embodiment, a music detecting method is executed in a music detecting apparatus. The music detecting method comprises: receiving, by an input module, an input of an audio signal subjected to middle-side (MS) stereo encoding; decomposing, by a decomposing module, the audio signal input at the receiving on a component-by-component basis; performing, by an inverse quantization module, with respect to each component of the audio signal, inverse quantization on a difference signal between a plurality of channels comprised in the audio signal for generating inverse quantization data of the difference signal; estimating, by an estimating module, sound volume of the difference signal based on the inverse quantization data of the difference signal generated at the performing; and determining, by a determining module, based on whether the sound volume of the difference signal estimated at the estimating is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.
According to still another embodiment, a music detecting method executed in a music detecting apparatus, the music detecting method comprising: receiving, by an input module, an input of an audio signal subjected to intensity stereo encoding; decomposing, by a decomposing module, the audio signal input at the receiving on a component-by-component basis; performing, by an inverse quantization module, with respect to each component of the audio signal in a decomposed form, inverse quantization on the component for generating inverse quantization data of the component; and determining, by a determining module, based on whether a signal ratio of each channel comprised in the inverse quantization data generated on a component-by-component basis for the audio signal at the performing lies within a predetermined range, whether each component of the audio signal represents a music portion.
According to still another embodiment, a music detecting method executed in a music detecting apparatus, the music detecting method comprising: receiving, by an input module, an input of an audio signal; decomposing, by a decomposing module, the audio signal input at the receiving on a component-by-component basis; identifying, by an encoding identifying module, with respect to each component of the audio signal in a decomposed form, an implemented method from among a middle-side (MS) stereo encoding method, an intensity stereo encoding method, and stereo encoding nonuse; first-performing, by an inverse quantization module, with respect to each component of the audio signal, inverse quantization on a difference signal between a plurality of channels comprised in the audio signal for generating inverse quantization data of the difference signal in a case when implementation of the MS stereo encoding method is identified at the identifying, second-performing, with respect to each component of the audio signal in a decomposed form, inverse quantization on the component for generating inverse quantization data of the component in a case when implementation of the intensity stereo encoding method is identified at the identifying, and third-performing, with respect to each component of the audio signal in a decomposed form, inverse quantization of each of a plurality of channels comprised in the audio signal for generating inverse quantization data of each of the plurality of channels in a case when the stereo encoding nonuse is identified at the identifying; first-estimating, by an estimating module, in the case when implementation of the MS stereo encoding method is identified at the identifying, sound volume of the difference signal based on the inverse quantization data of the difference signal generated at the first-performing, and second-estimating, in the case when the stereo encoding nonuse is identified at the identifying, sound volume on a channel-by-channel basis based on the inverse quantization data of each of the plurality of channels generated at the third-performing; first-determining, by an MS stereo determining module, based on whether the sound volume of the difference signal estimated at the first-estimating is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion; second-determining, by an intensity stereo determining module, based on whether a signal ratio of each channel comprised in the inverse quantization data generated on a component-by-component basis for the audio signal at the second-performing in the case when implementation of the intensity stereo encoding method is identified at the identifying lies within a predetermined range, whether each component of the audio signal represents a music portion; and third-determining, by a stereo encoding nonuse determining module, based on whether a difference in sound volume of the channels estimated at the second-estimating in the case when the stereo encoding nonuse is identified at the identifying is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.
According to still another embodiment, a music detecting method executed in a music detecting apparatus, the music detecting method comprising: receiving, by an input module, an input of an audio signal; decomposing, by a decomposing module, the audio signal input at the receiving on a component-by-component basis; and determining, by a determining module, based on whether either one of a middle-side (MS) stereo encoding method and an intensity stereo encoding method is implemented for stereo encoding, whether each component of the audio signal represents a music portion.
According to an embodiment, it becomes possible to reduce the processing load while determining whether an audio signal represents music.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

A general architecture that implements the various features of the invention will now be described with reference to the drawings. The drawings and the associated descriptions are provided to illustrate embodiments of the invention and not to limit the scope of the invention.
FIG. 1 is an exemplary schematic diagram of a configuration of a digital television display device according to a first embodiment;

FIG. 2 is an exemplary block diagram of a configuration of an audio-signal-music detecting module in the first embodiment;
FIG. 3 illustrates an exemplary configuration of a conventional audio decoder;
FIG. 4 is an exemplary flowchart for explaining a sequence of operations in the music determining processing performed for an audio signal by the audio-signal-music detecting module in the first embodiment;
FIG. 5 is an exemplary block diagram of a configuration of an audio-signal-music detecting module according to a first modification of the first embodiment;
FIG. 6 is an exemplary block diagram of a configuration of an audio-signal-music detecting module according to a second modification of the first embodiment;
FIG. 7 is an exemplary block diagram of a configuration of an audio-signal-music detecting module according to a third modification of the first embodiment;
FIG. 8 is an exemplary block diagram of a configuration of an audio-signal-music detecting module according to a second embodiment;
FIG. 9 is an exemplary flowchart for explaining a sequence of operations in the music determining processing performed for an audio signal by the audio-signal-music detecting module in the second embodiment;
FIG. 10 is an exemplary block diagram of a configuration of an audio-signal-music detecting module according to a third embodiment; and
FIG. 11 is an exemplary flowchart for explaining a sequence of operations in the music determining processing performed with respect to each section by the audio-signal-music detecting module in the third embodiment.

DETAILED DESCRIPTION

Various embodiments of a music detecting apparatus and a music detecting method will be described hereinafter with reference to the accompanying drawings.
Described below with reference to the drawings are embodiments. FIG. 1 is an exemplary schematic diagram of a digital television display device to which is applied a first embodiment. However, application of the embodiments described below is not limited to the digital television display device. That is, the embodiments can also be applied to, for example, a personal computer, a video camera, or a handheld terminal that can reproduce video programs or video contents. Alternatively, the embodiments can also be market as programs that can be fed to handheld devices, personal computers, or video-game terminals.
A digital television display device 1 illustrated in FIG. 1 comprises a tuner module (TV tuner) 10 that is configured to receive, for example, a satellite digital television broadcast provided via broadcasting satellites or communications satellites, or receive a terrestrial digital broadcast and an analog television broadcast provided using ground waves (space waves), or receive video contents provided via cable networks. The output of the tuner module 10 is fed to a video-type analog-to-digital converter (hereinafter, "video ADC") 14 and an audio-type (sound/music) analog-to-digital converter (hereinafter, "audio ADC") 16. Besides, an input signal from an external input terminal (aux) 12 is also fed to the video ADC 14 and to the audio ADC 16.
The video stream digitized by the video ADC 14 and the audio signal digitized by the audio ADC 16 are fed to an MPEG encoder 20. Besides, a digital stream (in, for example, the MPEG2-TS format, where TS stands for transport stream) from an external digital input terminal 18 is fed to the MPEG encoder 20 via an interface (I/F) 19 of, for example, IEEE 1394 standard (or high-definition multimedia interface (HDMI) standard).
When a television broadcast signal supplied to the tuner module 10 is a digital signal in, for example, the MPEG2-TS format ; the television broadcast signal (i.e. , the digital stream from the tuner module 10) is fed without change to the MPEG encoder 20. Then, except for the case of outputting the input MPEG2-TS signal without change (i.e. , except for the case of pass-through), the MPEG encoder 20 encodes the input stream into the MPEG2-PS format (where PS stands for program system) or into the MPEG4-AVC format (where AVC stands for advanced video coding). In the present embodiment, while recording a digitally encoded stream, it is assumed that feature detection and indexing is performed based on contents information and the processing is performed with respect to the encoded stream or the pass-through stream from among the abovementioned streams.
The stream data processed by the MPEG encoder 20 is temporarily buffered in a high-speed memory such as a synchronous dynamic random access memory (SDRAM) 22.
The stream data that has been buffered in the SDRAM 22 and that has been subjected to certain processing is then transferred, depending on the contents thereof, to a hard disk drive (HDD) 104, a disk drive unit 24, or a memory slot 26 at a predetermined timing.
The HDD 104 comprises a readable-writable recording medium.
The disk drive unit 24 is a disk-shaped recording medium that can record data (streams) in an optical disk 102 and can reproduce data (streams) already stored in the optical disk 102.
The memory slot 26 is used to insert a card memory 106 having a capacity of, for example, about two gigabytes (2 GB) .
Upon being reproduced from the optical disk 102, the HDD 104, or the card memory 106 via the disk drive unit 24, the HDD 104, or the memory slot 26, respectively; the stream data is transferred to an MPEG decoder 30 via the SDRAM 22.
Depending on the stream transferred thereto, the MPEG decoder 30 can decode the MPEG2-TS format, the MPEG2-PS format, or the MPEG4-AVC format.
The video data (in the MPEG2-TS format or the MPEG2-PS format) decoded by the MPEG decoder 30 is converted into an analog video signal either of standard picture quality or of high-definition picture quality by a video-type digital-to-analog converter (hereinafter, "video DAC") 32. The analog video signal is then supplied to a video output terminal 36. By connecting the video output terminal 36 to a display device (monitor device/display module) 52, the video picture can be displayed on the display device 52.
Meanwhile, the audio data decoded by the MPEG decoder 30 is converted into an analog audio signal by an audio-type (sound/music) digital-to-analog converter (hereinafter, "audio DAC") 34. The analog audio signal is then supplied to an audio (sound) output terminal 38. By connecting the audio output terminal 38 to a speaker (that is either embedded in the display device 52 or disposed independently), the sound/music can be reproduced.
If the data supplied to the MPEG decoder is in the MPEG2-TS format, then that data is supplied without modification to a digital output terminal 39 via an interface (I/F) 37 of, for example, IEEE 1394 standard (or HDMI standard).
The digital television display device 1 illustrated in FIG. 1 is controlled by a main control block 40, which functions as a stream parser and comprises a microprocessor unit (MPU) (not illustrated) or a central processing unit (CPU) (not illustrated) . To the main control block 40 are attached an electronically erasable and programmable read only memory (EEPROM) 42 for storing firmware or various control parameters, a work random access memory (RAM) 44, and a timer 46. The main control block 40 controls streams in between the SDRAM 22 and the MPEG decoder 30, and is used in video recording or video reproduction.
The main control block 40 comprises an audio-signal-music detecting module 100 that identifies music portions included in an audio signal.
FIG. 2 is an exemplary block diagram of a configuration of the audio-signal-music detecting module 100 according to the first embodiment. As illustrated in FIG. 2, the audio-signal-music detecting module 100 comprises a stream input module 201, a frame decomposing module 202, an inverse quantization module 203, a power estimating module 204, a middle-side (MS) stereo music determining module 205, an LR music determining module 206, an intensity stereo (IS) music determining module 207, and an indexing module 208.
In the present embodiment, it is assumed that an audio signal from which music portions are to be detected is a bit stream that is encoded frame by frame in an MPEG audio format. Hence, the encoded audio signal needs to be decoded frame by frame. The encoded bit stream contains audio information as well as information regarding bit allocation or scale factor that is necessary for decoding.
In order to encode the audio signal in an MPEG audio format, the following two encoding methods are implemented either combinedly or independently. As a specific example, in the case of the MPEG audio layer-3 format or the MPEG-2 AAC format (where AAC stands for advanced audio coding), the two encoding methods can be used according to the situation.
The first encoding method is called an intensity stereo (IS) encoding mode, which is a stereo encoding mode making use of the correlation between the left channel and the right channel of a stereo. In the intensity stereo encoding mode, encoding is performed using the sum signal of the left channel and the right channel and using the ratio of the signal of the left channel and the signal of the right channel.
The other encoding method is called a middle-side (MS) stereo encoding mode, which is a stereo encoding mode making use of the phase difference between the left channel and the right channel. In the middle-side stereo encoding mode, encoding is performed using the sum signal of the left channel and the right channel and using the difference signal between the left channel and the right channel.
In the audio-signal-music detecting module 100 according to the present embodiment, the music portions are identified on the basis of the difference between the signal of the left channel and the signal of the right channel.
More specifically, in an input audio signal, the portions other than the music portions are collected from a main microphone such as a center microphone. Therefore, not much difference occurs between the magnitude of the signal of the left channel and the magnitude of the signal of the right channel.
In contrast, regarding the music portions in an input audio signal; with an aim to reproduce the difference in the directions or the difference in the sense of distance of a sound entering the left ear and the right ear of a human being, it has become mainstream to collect the music portions separately from a left microphone and a right microphone resembling the left ear and the right ear, respectively, of a human being and to output the separately collected signals via separate (left and right) channels. Hence, there occurs a large difference between the magnitude of the signal of the left channel and the magnitude of the signal of the right channel. In the audio-signal-music detecting module 100 according to the present embodiment, attention is focused on that particular point for determining the existence of music portions.
With respect to an audio signal encoded by one of the abovementioned encoding modes, the audio-signal-music detecting module 100 according to the present embodiment detects music portions on a frame-by-frame basis. Meanwhile, if an attempt is made to identify music portions at the time of recording using a conventional apparatus, then decoding of the audio signals should be followed by the operation of determining the difference between the left channel and the right channel. That results in an increase in the processing load.
FIG. 3 illustrates an exemplary configuration of a conventional audio decoder 300. In the example illustrated in FIG. 3, the audio decoder 300 comprises a stream input module 301, a frame decomposing module 302, an inverse quantization module 303, a stereo restoring module 304, and a sub-band synthesis module 305.
The stream input module 301 performs input processing of an audio signal. The frame decomposing module 302 then decomposes the input audio signal on a frame-by-frame basis and extracts a decoding signal for each frame. Based on the extracted decoding signals, the inverse quantization module 303 obtains a spectrum by performing inverse quantization of the audio signal that has been decomposed on a frame-by-frame basis. If encoding is underway using a stereo correlation; then, at this stage, channel-wise spectrums are not sought. Thus, the stereo restoring module 304 performs restoration of the channel-wise spectrums.
Subsequently, the sub-band synthesis module 305 converts the audio signal into a time domain audio signal using a frequency domain spectrum.
The existence of music portions can be determined for the audio signal output by the sub-band synthesis module 305. However, in many encoding modes, not only the sub-band synthesis is associated with a greater processing load but the processing also takes a considerable amount of time. Therefore, elimination of the sub-band synthesis operation can lead to an increase in the processing speed. Hence, the audio-signal-music detecting module 100 according to the first embodiment is configured as illustrated in FIG. 2.
That is, in the audio-signal-music detecting module 100 according to the first embodiment; when stereo encoding is underway, notice is taken of the fact that the necessary information (difference between the left channel and the right channel) is obtained at a stage prior to performing stereo signal restoration. Thus, the processing is terminated at that stage as an attempt to increase the processing speed.
The stream input module 201 receives input of an audio signal. In the present embodiment, the input audio signal is assumed to be bit stream that is encoded in the MPEG-1 audio layer-3 (MP3) format. The bit stream is encoded using the middle-side stereo encoding mode, the intensity stereo encoding mode, or a stereo encoding nonuse mode.
The frame decomposing module 202 comprises a stereo-encoding-mode identifying module 210 and decomposes the audio signal that has been input in the stream input module 201 on a frame-by-frame basis. Besides, for each frame, the frame decomposing module 202 extracts decoding parameters such as the quantization spectrum and the scale factor by implementing Huffman decoding.
Moreover, in the frame header of each frame is stored the stereo encoding mode, which the frame decomposing module 202 extracts. The extracted encoding mode represents the encoding mode with which the corresponding frame is encoded.
In the audio-signal-music detecting module 100 according to the first embodiment, an audio signal is decomposed on a frame-by-frame basis. However, the decomposition is not limited to the frame-by-frame decomposition and can be performed on the basis of any other constituent element of the audio signal.
With respect to each frame of the decomposed audio signal, the stereo-encoding-mode identifying module 210 refers to the extracted stereo encoding mode and identifies whether the encoding is performed using the middle-side stereo encoding mode, the intensity stereo encoding mode, or the stereo encoding nonuse mode.
Then, according to the identified stereo encoding mode, the frame decomposing module 202 decides on the quantization spectrum to be sent to the inverse quantization module 203.
Upon receiving that information from the frame decomposing module 202, the inverse quantization module 203 performs inverse quantization on the quantization spectrum based on the scale factor and generates an original linear scale spectrum (in other words, generates inverse quantization data). The generated spectrum is assumed to have a mixture of various signals.
More particularly, when the stereo encoding mode is the middle-side stereo encoding mode; the inverse quantization module 203 performs, on a frame-by-fame basis of the audio signal, inverse quantization on the quantization spectrum corresponding to the sum signal of the left channel and the right channel and the difference signal between the left channel and the right channel. When the stereo encoding mode is the intensity stereo encoding mode; the inverse quantization module 203 does not perform inverse quantization of the sum signal of the left channel and the right channel but performs, on a frame-by-fame basis of the audio signal, inverse quantization only on the signal representing the ratio of the left channel and the right channel. Meanwhile, since the specific method implemented for inverse quantization is commonly known, the detailed description thereof is not given.
Except for the case when the audio encoding mode is the intensity stereo encoding mode, the power estimating module 204 calculates an estimate value of the power (sound volume) from the linear scale spectrum. At the time of estimating the power, the power estimating module 204 can perform an accurate calculation or can perform a simplified calculation using the sum of MDCT coefficients. Moreover, instead of performing calculation for all frequency bands, the power can be calculated for only specific frequency bands.
For example, when the audio encoding mode is the middle-side stereo encoding mode, the power estimating module 204 estimates the sound volume of the difference signal from the inverse quantization data thereof as well as estimates the sound volume of the sum signal from the inverse quantization data thereof.
Explained below is an example of the power estimation method. Equation (1) represents an equation with which the power estimating module 204 calculates a power estimate value P.
$P = \sum_{i = 0}^{N} |Si|$
In Equation (1), Si (e.g., in MPEG audio, i=0, ..., 31) represents the spectrum obtained by inverse quantization, that is, represents the MDCT coefficients.
During the calculation of the power estimate value P, in order to perform calculation for all frequency bands, N can be set to the maximum value (e.g., in MPEG audio, N=31) ; and in order increase the calculation speed, N can be set to a value smaller than the maximum value (e.g., N=20) by ignoring the high frequency components. Then, the power estimating module 204 can calculate the power estimate value of each signal, that is, calculate the sound volume estimate value of each signal using Equation (1).
Explained below is a configuration for determining the existence of music. Herein, the configuration for determining the existence of music is different for each stereo encoding mode.
For the middle-side stereo encoding mode, the MS stereo music determining module 205 determines whether the ratio value between the sound volume estimate value of the difference signal and the sound volume estimate value of the sum signal calculated by the power estimating module 204 is greater than a predetermined threshold value. If the ratio value is greater than the predetermined threshold value, then the MS stereo music determining module 205 determines that the corresponding frame in the audio signal represents a music portion.
For the intensity stereo encoding mode; the IS music determining module 207 determines, based on the inverse quantization data that represents the signal ratio of each channel and that is generated by the inverse quantization module 203, whether the signal ratio of each channel lies within a predetermined range. If the signal ratio of each channel is not within the predetermined range, then the IS music determining module 207 determines that the corresponding frame in the audio signal represents a music portion.
As the stereo encoding nonuse mode other than the middle-side stereo encoding mode or the intensity stereo encoding mode; for a frame in which the left channel and the right channel are encoded independently (in other words, when stereo encoding is not implemented), the LR music determining module 206 determines whether the frame represents a music portion. As the determining method, the LR music determining module 206 determines whether the absolute value obtained by deleting the sum signal (L+R) of each channel from the difference signal (L-R) of each channel is equal to or greater than a predetermined threshold value and accordingly determines the existence of a music portion. Meanwhile, instead of the abovementioned determining method, the LR music determining module 206 can also implement the conventionally-proposed methods for determining the existence of music.
Based on the determination result of each music determining module and based on the corresponding time information, the indexing module 208 stores therein the start time and the end time of music as indexing information that can be put to use at the time of music reproduction. In the indexing module 208, the indexing information is recorded in a predetermined memory area such as a read-in area, a header information recording area, or a table of contents (TOC) that is unique to the recording medium being used. The indexing information in the indexing module 208 also enables the digital television display device 1 to generate the data set of music extracted from the recorded contents.
At the time of outputting the audio signal, depending on whether the audio signal represents music portions, the main control block 40 illustrated in FIG. 1 performs processing suitable for the music portions. As the processing suitable for the music portions, for example, the audio signal determined to represent the music portions is subjected to surround reproduction in the digital television display device 1 under the control of the main control block 40.
Explained below is the music determining processing performed for an audio signal by the audio-signal-music detecting module 100 of the digital television display device 1 according to the present embodiment. FIG. 4 is an exemplary flowchart for explaining a sequence of operations in the music determining processing performed by the audio-signal-music detecting module 100 according to the present embodiment.
Firstly, the stream input module 201 receives an input of an audio signal (S401). Then, the frame decomposing module 202 decomposes the input audio signal on a frame-by-frame basis and, for each frame, extracts decoding parameters such as the quantization spectrum and the scale factor by implementing Huffman decoding as well as extracts the stereo encoding mode (S402) . For example, in the case of the MPEG-1 audio layer-3 format, the frame decomposing module 202 extracts the stereo encoding mode from a mode extension of each frame header.
Subsequently, based on the extracted stereo encoding mode, the stereo-encoding-mode identifying module 210 determines whether the data is stereo data (S403). If the data is not stereo data, that is, if the data is monaural data (No at S403), then the stereo-encoding-mode identifying module 210 assumes that it is not possible to determine the existence of music (S404) and ends the operation.
On the other hand, if the data is stereo data (Yes at S403), then the stereo-encoding-mode identifying module 210 determines whether the stereo encoding mode is the middle-side stereo encoding mode (S405).
If the stereo-encoding-mode identifying module 210 determines that the stereo encoding mode is the middle-side stereo encoding mode (Yes at S405), then the inverse quantization module 203 performs inverse quantization on the quantization spectrum corresponding to the sum signal and the difference signal (S406).
Subsequently, from each linear scale spectrum corresponding to the sum signal and the difference signal obtained by inverse quantization, the power estimating module 204 calculates the power estimate value, that is, calculates the sound volume estimate value of each of the sum signal and the difference signal (5407).
Then, the MS stereo music determining module 205 determines whether the ratio value between the power estimate value of the difference signal and the power estimate value of the sum signal is greater than a predetermined threshold value α (S408). If the ratio value is greater than the threshold value α (Yes at S408), then the MS stereo music determining module 205 determines that the corresponding frame in the audio signal represents a music portion (S409).
On the other hand, if the ratio value between the power estimate value of the difference signal and the power estimate value of the sum signal is smaller than the predetermined threshold value α (No at S408), then the MS stereo music determining module 205 determines that the corresponding frame in the audio signal does not represent a music portion (S416).
Meanwhile, if the stereo-encoding-mode identifying module 210 determines that the stereo encoding mode is not the middle-side stereo encoding mode (No at S405), then it determines whether the stereo encoding mode is the intensity stereo encoding mode (S410). If the stereo encoding mode is the intensity stereo encoding mode (Yes at S410), then the inverse quantization module 203 performs inverse quantization only on the signal representing the ratio of the left channel and the right channel (S411). Subsequently, based on the inverse quantization data representing the signal ratio of the left channel and the right channel, the IS music determining module 207 determines whether the signal ratio of each channel lies within a predetermined range, that is, whether the signal ratio of each channel is greater than a predetermined threshold value γ1 but smaller than a predetermined threshold value γ2 (S412).
If the signal ratio of each channel is not within the predetermined range (No at S412), then the IS music determining module 207 determines that the corresponding frame in the audio signal represents a music portion (S4Q9).
On the other hand, if the signal ratio of each channel is within the predetermined range (Yes at S412), then the IS music determining module 207 determines that the corresponding frame in the audio signal does not represent a music portion (S416).
Meanwhile, if the stereo-encoding-mode identifying module 210 determines that the stereo encoding mode is not the intensity stereo encoding mode, that is, determines that stereo encoding is not implemented (No at S410), then the inverse quantization module 203 independently performs inverse quantization on the signal of the left channel and inverse quantization of the signal of the right channel (S413) and, from the inverse quantization data calculated for each signal, calculates the power estimate value of each of the left channel and the right channel (S414).
Subsequently, the LR music determining module 206 determines whether Abs (L-R)/(L+R) is greater than a predetermined threshold value β(S415). Herein, L and R represent the power estimate value of the left channel and the right channel, respectively, calculated at S414. If Abs(L-R)/(L+R) is smaller than the threshold value β (No at S415), then the LR music determining module 206 determines that the corresponding frame in the audio signal does not represent a music portion (S416).
On the other hand, if Abs (L-R)/(L+R) is greater than the threshold value β(Yes at S415), then the LR music determining module 206 determines that the corresponding frame in the audio signal represents a music portion (S409).
By performing the abovementioned operation, it becomes possible to determine the existence of music portions on a frame-by-frame basis in an audio signal. Meanwhile, the threshold values α, β, γ1, and γ2 are set suitably according to the criteria for determining the existence of music. Moreover, the abovementioned operations are performed with respect to each frame in an audio signal. Furthermore, the sequence of operations can be changed as appropriate.
In the audio-signal-music detecting module 100 according to the present embodiment; when the stereo encoding mode is the middle-side stereo encoding mode, inverse quantization is performed with respect to the quantization spectrum corresponding to the sum signal and the difference signal and, if the ratio value between the difference signal and the sum signal is greater than a predetermined threshold value, then music is determined to be existing. However, the present embodiment is not limited to that case and, for examples, it is also possible to determine the existence of music depending on whether only the difference signal exceeds a predetermined threshold value.
Moreover, in the audio-signal-music detecting module 100 according to the present embodiment; although the description is given with reference to a two-channel stereo with left and right channels, the description is also applicable to a multi-channel stereo. In that case, for example, regarding the front left channel and the front right channel as the target channels for processing, the operations can be performed in the same manner as those in the first embodiment. Moreover, regarding the front left channel and the front right channel as the target channels for processing, the other channels are excluded from the stage of inverse quantization onward. Meanwhile, instead of the front left channel and the front right channel, other two channels can also be considered as the target channels for processing. Moreover, more than two channels can also be considered as the target channels for processing.
The audio-signal-music detecting module 100 in the digital television display device 1 according to the present embodiment focuses on the parameters regarding the stereo included in an encoded acoustic signal and, without performing decoding to the end, performs partial processing until the parameters are extracted. Hence, the audio-signal-music detecting module 100 can perform high-speed processing without help from any dedicated hardware.
Moreover, by making use of the stereo correlation, the amount of calculation can be reduced in the digital television display device 1 according to the present embodiment.
The first embodiment is not limited to the description given above and can be modified to various modification examples as explained below.
In the first embodiment described above, a plurality of encoding modes can be implemented with respect to an input audio signal. In contrast, in a first modification of the first embodiment, the music determining module is configured to determine the existence of music only when the middle-side stereo encoding mode is implemented.
FIG. 5 is an exemplary block diagram of a configuration of an audio-signal-music detecting module 500 according to the first modification of the first embodiment. As illustrated in FIG. 5, the audio-signal-music detecting module 500 comprises the stream input module 201, a frame decomposing module 501, an inverse quantization module 502, a power estimating module 503, the MS stereo music determining module 205, and the indexing module 208. In the following description, the constituent elements identical to those in the first embodiment are referred to by the same reference numerals and the explanation thereof is not repeated.
The stream input module 201 according to the present modification example receives input of an audio signal encoded using only the middle-side stereo encoding mode.
The frame decomposing module 501 decomposes the audio signal input in the stream input module 201 on a frame-by-frame basis.
Then, for each frame in the audio signal, the inverse quantization module 502 performs inverse quantization on the quantization spectrum corresponding to the sum signal of the left channel and the right channel and the difference signal between the left channel and the right channel.
The power estimating module 503 calculates, from the linear scale spectrum, the power (sound volume) estimate value of the sum signal of the left channel and the right channel and the power (sound volume) estimate value of the difference signal between the left channel and the right channel. Then, the MS stereo music determining module 205 determines whether the audio signal represents music portions.
In the present modification example, the existence of music portions can be adequately determined only when an audio signal is encoded using the middle-side stereo encoding mode. Moreover, at the time of determining the existence of music portions, operations such as sub-band synthesis are not performed thereby achieving reduction in the processing load.
In a second modification of the first embodiment, the music determining module is configured to determine the existence of music only when the stereo encoding nonuse mode is implemented without implanting the middle-side stereo encoding mode and the intensity stereo encoding mode.
FIG. 6 is an exemplary block diagram of a configuration of an audio-signal-music detecting module 600 according to the second modification of the first embodiment. As illustrated in FIG. 6, the audio-signal-music detecting module 600 comprises the stream input module 201, a frame decomposing module 601, an inverse quantization module 602, a power estimating module 603, the LR music determining module 206, and the indexing module 208. In the following description, the constituent elements identical to those in the first embodiment are referred to by the same reference numerals and the explanation thereof is not repeated.
The stream input module 201 according to the present modification example receives input of an audio signal in which the left channel and the right channel are encoded independently.
The frame decomposing module 601 decomposes the audio signal input in the stream input module 201 on a frame-by-frame basis.
Then, for each frame in the audio signal, the inverse quantization module 602 independently performs inverse quantization on the signal of the left channel and inverse quantization of the signal of the right channel.
The power estimating module 603 calculates, from the linear scale spectrum, the power (sound volume) estimate value of the left channel and of the right channel. Then, the LR music determining module 206 determines whether the audio signal represents music portions.
In the present modification, the existence of music portions can be adequately determined for an audio signal in which the left channel and the right channel are encoded independently. Moreover, at the time of determining the existence of music portions, operations such as sub-band synthesis are not performed thereby achieving reduction in the processing load.
In a third modification of the first embodiment, the music determining module is configured to determine the existence of music only when the intensity stereo encoding mode is implemented.
FIG. 7 is an exemplary block diagram of a configuration of an audio-signal-music detecting module 700 according to the third modification of the first embodiment. As illustrated in FIG. 7, the audio-signal-music detecting module 700 comprises the stream input module 201, a frame decomposing module 701, an inverse quantization module 702, the IS music determining module 207, and the indexing module 208. In the following description, the constituent elements identical to those in the first embodiment are referred to by the same reference numerals and the explanation thereof is not repeated.
The stream input module 201 according to the present modification receives an input of an audio signal encoded using only the intensity stereo encoding mode.
The frame decomposing module 701 decomposes the audio signal input in the stream input module 201 on a frame-by-frame basis.
Then, for each frame in the audio signal, the inverse quantization module 702 performs inverse quantization only on the signal representing the ratio of the left channel and the right channel. Then, based on whether the signal ratio lies within a predetermined range, the IS music determining module 207 determines whether audio signal represents music portions.
In the present modification, the existence of music portions can be adequately determined only when an audio signal is encoded using the intensity stereo encoding mode. Moreover, at the time of determining the existence of music portions, operations such as sub-band synthesis are not performed thereby achieving reduction in the processing load.
As described in the three modification examples of the first embodiment, the existence of music portions can be adequately determined when an audio signal is encoded using one of the encoding modes. Moreover, it is also possible to combine the three modifications and determine the existence of music portions when an audio signal is encoded using two of the encoding modes.
In the first embodiment, inverse quantization is performed in order to determine the existence of music portions. However, the music determining method is not limited to the first embodiment or the modification examples thereof. In a second embodiment of the present invention, an example is described when the existence of music portions is determined based on the stereo encoding mode.
FIG. 8 is an exemplary block diagram of a configuration of an audio-signal-music detecting module 800 according to the second embodiment. As illustrated in FIG. 8, the audio-signal-music detecting module 800 comprises the stream input module 201, a frame decomposing module 801, a music determining module 802, and the indexing module 208. In the following description, the constituent elements identical to those in the first embodiment are referred to by the same reference numerals and the explanation thereof is not repeated.
The stream input module 201 receives input of an audio signal in an identical manner to the first embodiment.
The frame decomposing module 801 decomposes the audio signal input in the stream input module 201 on a frame-by-frame basis and extracts only the stereo encoding mode stored in each frame header. The information other than the stereo encoding mode is not required for determining the existence of music. Hence, the frame decomposing module 801 need not perform payload analysis other than performing header analysis. That enables achieving reduction in the processing load.
Then, the frame decomposing module 801 outputs the extracted stereo encoding mode to the music determining module 802.
Meanwhile, at the output source of audio signals according to the present embodiment; it is assumed that an audio signal is encoded using an encoding mode, such as in the MPEG-1 audio layer-3 format, performing joint stereo encoding and including both the middle-side stereo encoding mode and the intensity stereo encoding mode.
In that encoding mode; at the time of encoding, mode selection is performed with the purpose of reducing the amount of encoding. The middle-side stereo has the property that higher the correlation between the left channel and the right channel, higher is the encoding efficiency. Hence, when a small difference exists between the left channel and the right channel, the middle-side stereo encoding mode is selected. On the contrary, when a large difference exists between the left channel and the right channel, the intensity stereo encoding mode is likely to be selected.
When an audio signal represents music, there exists a large difference between the left channel and the right channel. That is, when an audio signal represents music, it is more likely that encoding at the output source of that audio signal is performed by selecting the intensity stereo encoding mode. Therefore, in the music determining module 802 of the audio-signal-music detecting module 800 according to the second embodiment, an audio signal is determined to represent music when encoding is performed using the intensity stereo encoding mode.
In this way, based on whether the stereo encoding mode input by the frame decomposing module 801 is the intensity stereo encoding mode, the music determining module 802 determines the existence of a music portion on a frame-by-frame basis. When the stereo encoding mode is not the intensity stereo encoding mode, the music determining module 802 determines that the corresponding frame does not represent a music portion.
Explained below is the music determining processing performed for an audio signal by the audio-signal-music detecting module 800 according to the present embodiment. FIG. 9 is an exemplary flowchart for explaining a sequence of operations in the music determining processing performed by the audio-signal-music detecting module 800 according to the present embodiment.
Firstly, the stream input module 201 performs input processing of an audio signal (S901). Then, the frame decomposing module 801 decomposes the input audio signal on a frame-by-frame basis and, for each frame, extracts the stereo encoding mode (S902). The extracted stereo encoding mode is then output to the music determining module 802.
Subsequently, based on the extracted stereo encoding mode, the music determining module 802 determines whether the data is stereo data (S903). If the data is not stereo data, that is, if the data is monaural data (No at S903), then the music determining module 802 assumes that it is not possible to determine the existence of music (S904) and ends the operation.
On the other hand, if the data is stereo data (Yes at S903), then the music determining module 802 determines whether the stereo encoding mode is the middle-side stereo encoding mode (S905).
If the stereo encoding mode is the middle-side stereo encoding mode (Yes at S905), then the music determining module 802 determines that the corresponding frame does not represent a music portion (S906).
On the other hand, if the stereo encoding mode is not the middle-side stereo encoding mode (No at S905), then the music determining module 802 determines whether the stereo encoding mode is the intensity stereo encoding mode (S907). If the stereo encoding mode is the intensity stereo encoding mode (Yes at S907), then the music determining module 802 determines that the corresponding frame represents a music portion (S908).
If the stereo encoding mode is not the intensity stereo encoding mode (No at S907), then the music determining module 802 assumes that it is not possible to determine the existence of music (S904) and ends the operation.
In the abovementioned operation, if the stereo encoding mode is the intensity stereo encoding mode, then a music portion is determined to be existing; if the stereo encoding mode is the middle-side stereo encoding mode, then a non-music portion is determined to be existing; and if the stereo encoding mode is neither the intensity stereo encoding mode nor the middle-side stereo encoding mode, then it is assumed impossible to determine the existence of music. Although dependent on the capacity of the encoder, it is generally unlikely that the existence of music cannot be determined at the time of stereo encoding. Thus, the abovementioned music determining method poses no particular problem.
Meanwhile, in the music determining method implemented by the audio-signal-music detecting module 800 according to the second embodiment; although the amount of calculation is extremely small, dependency on the encoder could lead to reduction in the accuracy.
Hence, the detecting method can be switched depending on the CPU resources available at the time. That is, when the resources are extremely limited, then the method according to the second embodiment can be implemented; while when it is affordable to use a certain amount of resources, then the method according to the first embodiment can be implemented. In this way, the method can be switched depending on the resources. Moreover, when there are fewer restrictions on the resources; then it is possible to perform decoding and, as in the past, perform the music detecting operation with a high degree of accuracy using musical scale information.
In the abovementioned embodiments, the existence of music is determined on a frame-by-frame basis. However, if the existence of music is determined on a frame-by-frame basis; then, depending on the timing such as that of silence, it often becomes difficult to determine the existence of music.
Moreover, at the time of detecting music portions in a music program, a music section tends to continue for a certain length. Therefore, with reference to an audio-signal-music detecting module 1000 according to a third embodiment, an example is described when the existence of a music portion is determined on a section-by-section basis, where each section includes a plurality of frames.
FIG. 10 is an exemplary block diagram of a configuration of the audio-signal-music detecting module 1000 according to the third embodiment. In comparison with the audio-signal-music detecting module 100 according to the first embodiment, the audio-signal-music detecting module 1000 has a different configuration in which the indexing module 208 is replaced by an indexing module 1001 having different functions. Apart from that, in the following description, the constituent elements identical to those in the first embodiment are referred to by the same reference numerals and the explanation thereof is not repeated.
Besides performing the same operation as that performed by the indexing module 208, the indexing module 1001 also comprises a section determining module 1010.
The section determining module 1010 determines the existence of music with respect to each given section composed of a plurality of frames. At that time, in order to accurately determine the existence of a music section, the section determining module 1010 obtains the density of music determination frames in a given section. That is, if the number of frames determined to represent music in a given section is greater than a predetermined number, then the section determining module 1010 according to the present embodiment determines that the corresponding section represents a music section.
Explained below is the music determining processing performed with respect to each section by the audio-signal-music detecting module 1000 according to the present embodiment. FIG. 11 is an exemplary flowchart for explaining a sequence of operations in the music determining processing performed by the audio-signal-music detecting module 1000.
Firstly, the digital television display device 1 records contents to be subjected to determination (S1101).
Then, for each frame of frames 1 to N in an audio signal included in the contents recorded by the audio-signal-music detecting module 1000, it is determined whether music exists (S1102). Herein, with reference to FIG. 11, the frames 1 to N are extracted when an audio signal included in the contents to be processed is decomposed frame by frame. As the music determining method performed for each frame of the frames 1 to N, the method according to the first embodiment is implemented. Hence, that explanation is not repeated.
Subsequently, the section determining module 1010 initializes a variable S to "1" (S1103). Herein, the variable S is set to the first frame in a section. The last frame in that section is S+K, where K is equal to one number less than the number of frames in the section. The value of K can be any value. In the present embodiment, K is assumed to be equal to 15.
The section determining module 1010 determines whether S+K is equal to or smaller than N (S1104). If S+K is equal to or smaller than N (Yes at S1104); then the section determining module 1010 sets, in Y, the number of frames determined to represent music in the section between S and S+K (S1105).
Then, the section determining module 1010 calculates the value obtained by dividing Y by K (i.e., Y/K), that is, calculates the percentage of frames detected to represent music. More particularly, the section determining module 1010 obtains the percentage of frames detected to represent music (i.e., Y/K) and determines whether that percentage is greater than a threshold value α_f (S1106).
If the percentage of frames detected to represent music is greater than the threshold value α_f (Yes at S1106), then the section determining module 1010 determines that the section between S and S+K represents music (S1107). On the other hand, if the percentage of frames detected to represent music is equal to or smaller than the threshold value α_f (No at S1106), then the section determining module 1010 determines that the section between S and S+K does not represent music (S1109).
Upon completion of S1107 or S1109, the section determining module 1010 increments the variable S by 1 (S1108) and returns to S1104 for restarting the operations.
By performing the abovementioned operations, the existence of music can be determined by shifting the window one frame at a time.
In the present embodiment, with respect to the contents recorded in advance, the existence of music is first determined for all frames and then determined on a section-by-section basis. However, the method is not limited to that manner. That is, alternatively, the music determining modules 205, 206, and 207 can be used to determine the existence of music on a frame-by-frame basis in accordance with the input of an audio signal. Still alternatively, the music determining processing can be implemented as an asynchronous processing in such a way that the existence of music sections is determined when the number of undetermined sections exceeds K. Meanwhile, in the present embodiment, although the method according to the first embodiment is implemented as the music determining method for each frame, it is also possible to implement the method according to the modification examples of the first embodiment or the method according to the second embodiment.
Moreover, in the present embodiment, the operations can be performed without having to decode music signals. Hence, even in the case of using large-scale integration (LSI) for television in which a hardware decoder cannot be used while recording or even in the case when the number of decoders falls short due to the recording of multiple programs, it becomes possible to detect the music sections. For example, by detecting the music sections, it becomes possible to view and listen to only the singing sections in a recorded music program.
Meanwhile, a music detecting program that is executed in the audio-signal-music detecting module according to the embodiments can be stored in a computer-readable recording medium such as a flexible disk (FD), a compact disk read only memory (CD-ROM), a compact disk recordable (CD-R), or a digital versatile disk (DVD).
Alternatively, the music detecting program that is executed in the audio-signal-music detecting module according to the embodiments can be stored in a computer connected over a network such as the Internet and can be downloaded via the network for distribution. Moreover, the music detecting program that is executed in the audio-signal-music detecting module according to the embodiments can be distributed over a network such as the Internet.
Still alternatively, the music detecting program according to the embodiments can be stored in advance in a read only memory (ROM) for distribution.
Meanwhile, the music detecting program that is executed in the audio-signal-music detecting module according to the embodiments contains modules for each constituent element (the stream input module, the inverse quantization module, the power estimating module, the music determining modules, and the indexing module). In practice, hardware such as a CPU (processor) retrieves the music detecting program from the memory medium and runs the same such that the music detecting program is loaded in a corresponding main memory. As a result, the modules for the stream input module, the inverse quantization module, the power estimating module, the music determining modules, and the indexing module are generated in the main memory.
Moreover, the various modules of the systems described herein can be implemented as software applications, hardware and/or software modules, or components on one or more computers, such as servers. While the various modules are illustrated separately, they may share some or all of the same underlying logic or code.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.

Claims

A music detecting apparatus comprising:
an input module (201) configured to receive an input of an audio signal subjected to middle-side (MS) stereo encoding;

a decomposing module (202) configured to decompose the audio signal input in the input module (201) on a component-by-component basis;

an inverse quantization module (203; 502) configured to perform, with respect to each component of the audio signal, inverse quantization on a difference signal between a plurality of channels included in the audio signal for generating inverse quantization data of the difference signal;

an estimating module (204; 503) configured to estimate sound volume of the difference signal based on the inverse quantization data of the difference signal generated by the inverse quantization module (203; 502); and

a determining module (205) configured to determine, based on whether the sound volume of the difference signal estimated by the estimating module (204) is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.
The music detecting apparatus of Claim 1, wherein
the inverse quantization module (203, 502) is further configured to perform, with respect to each component of the audio signal in a decomposed form, inverse quantization on a sum signal of the channels for generating inverse quantization data of the sum signal,
the estimating module (204) is further configured to estimate sound volume of the sum signal based on the inverse quantization data of the sum signal generated by the inverse quantization module (203, 502), and
the determining module (205) is configured to determine, based on whether a value of a ratio between the difference signal and the sum signal is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.
A music detecting apparatus comprising:
an input module (201) configured to receive an input of an audio signal subjected to intensity stereo encoding;

a decomposing module (202; 701) configured to decompose the audio signal input in the input module (201) on a component-by-component basis;

an inverse quantization module (203; 702) configured to perform, with respect to each component of the audio signal in a decomposed form, inverse quantization on the component for generating inverse quantization data of the components; and

a determining module (207) configured to determine, based on whether a signal ratio of each channel comprised in the inverse quantization data generated on a component-by-component basis for the audio signal by the inverse quantization module (203; 702) lies within a predetermined range, whether each component of the audio signal represents a music portion.
A music detecting apparatus comprising:
an input module (201) configured to receive an input of an audio signal;

a decomposing module (202) configured to decompose the audio signal input in the input module (201) on a component-by-component basis;

an encoding identifying module (210) configured to identify, with respect to each component of the audio signal in a decomposed form, an implemented method from among a middle-side (MS) stereo encoding method, an intensity stereo encoding method, and stereo encoding nonuse;

an inverse quantization module (203) configured to perform, with respect to each component of the audio signal, inverse quantization on a difference signal between a plurality of channels included in the audio signal for generating inverse quantization data of the difference signal in a case when the
encoding identifying module (210) identifies implementation of the MS stereo encoding method, configured to perform, with respect to each component of the audio signal in a decomposed form, inverse quantization on the component for generating inverse quantization data of the component in a case when the encoding identifying module (210) identifies implementation of the intensity stereo encoding method, and configured to perform, with respect to each component of the audio signal in a decomposed form, inverse quantization on each of a plurality of channels included in the audio signal for generating inverse quantization data of each of the plurality of channels in a case when the encoding identifying module (210) identifies the stereo encoding nonuse;

an estimating module (204) configured to estimate, in the case when the encoding identifying module (210) identifies implementation of the MS stereo encoding method, sound volume of the difference signal based on the inverse quantization data of the difference signal generated by the inverse quantization module (203), and configured to estimate, in the case when the encoding identifying module (210) identifies the stereo encoding nonuse, sound volume on a channel-by-channel basis based on the inverse quantization data of each of the plurality of channels generated by the inverse quantization module (203);

an MS stereo determining module (205) configured to determine, based on whether the sound volume of the difference signal estimated by the estimating module (204) is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion;

an intensity stereo determining module (207) configured to determine, based on whether a signal ratio of each channel comprised in the inverse quantization data generated on a component-by-component basis for the audio signal by the inverse quantization module (203) in the case when the encoding identifying module (210) identifies implementation of the intensity stereo encoding method lies within a predetermined range, whether each component of the audio signal represents a music portion; and

a stereo encoding nonuse determining module (206) configured to determine, based on whether a difference in sound volume of the plurality of channels estimated by the estimating module (204) in the case when the encoding identifying module (210) identifies the stereo encoding nonuse is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.
A music detecting apparatus comprising:
an input module (201) configured to receive an input of an audio signal;

a decomposing module (202) configured to decompose the audio signal input in the input module (201) on a component-by-component basis; and

a determining module (802) configured to determine, based on whether either one of a middle-side (MS) stereo encoding method and an intensity stereo encoding method is implemented for stereo encoding, whether each component of the audio signal represents a music portion.
The music detecting apparatus of any one of Claims 1 to 5, further comprising a section determining module (1010) configured to determine that a section represents a music portion when, in the section composed of a plurality of components, percentage of the components determined to represent a music portion is equal to or greater than a predetermined threshold value.
The music detecting apparatus of any one of Claims 1 to 6, further comprising a processing module (40) configured to perform, with respect to the audio signal determined to comprise a component or a section representing a music portion, processing suitable for a music portion.
A music detecting method executed in a music detecting apparatus, the music detecting method comprising:
receiving, by an input module (201), an input of an audio signal subjected to middle-side (MS) stereo encoding;

decomposing, by a decomposing module (202), the audio signal input at the receiving on a component-by-component basis;

performing, by an inverse quantization module (203; 502), with respect to each component of the audio signal, inverse quantization on a difference signal between a plurality of channels comprised in the audio signal for generating inverse quantization data of the difference signal;

estimating, by an estimating module (204; 503), sound volume of the difference signal based on the inverse quantization data of the difference signal generated at the performing; and

determining, by a determining module (205), based on whether the sound volume of the difference signal estimated at the estimating is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.
A music detecting method executed in a music detecting apparatus, the music detecting method comprising:
receiving, by an input module (201), an input of an audio signal subjected to intensity stereo encoding;

decomposing, by a decomposing module (202, 701), the audio signal input at the receiving on a component-by-component basis;

performing, by an inverse quantization module (203, 702), with respect to each component of the audio signal in a decomposed form, inverse quantization on the component for generating inverse quantization data of the component; and

determining, by a determining module (207), based on whether a signal ratio of each channel comprised in the inverse quantization data generated on a component-by-component basis for the audio signal at the performing lies within a predetermined range, whether each component of the audio signal represents a music portion.
A music detecting method executed in a music detecting apparatus, the music detecting method comprising:
receiving, by an input module (201), an input of an audio signal;

decomposing, by a decomposing module (202), the audio signal input at the receiving on a component-by-component basis;

identifying, by an encoding identifying module (210), with respect to each component of the audio signal in a decomposed form, an implemented method from among a middle-side (MS) stereo encoding method, an intensity stereo encoding method, and stereo encoding nonuse;
(i) first-performing, by an inverse quantization module (203), with respect to each component of the audio signal, inverse quantization on a difference signal between a plurality of channels comprised in the audio signal for generating inverse quantization data of the difference signal in a case when implementation of the MS stereo encoding method is identified at the identifying, (ii) second-performing, with respect to each component of the audio signal in a decomposed form, inverse quantization on the component for generating inverse quantization data of the component in a case when implementation of the intensity stereo encoding method is identified at the identifying, and (iii) third-performing, with respect to each component of the audio signal in a decomposed form, inverse quantization of each of a plurality of channels comprised in the audio signal for generating inverse quantization data of each of the plurality of channels in a case when the stereo encoding nonuse is identified at the identifying;

first-estimating, by an estimating module (204), in the case when implementation of the MS stereo encoding method is identified at the identifying, sound volume of the difference signal based on the inverse quantization data of the difference signal generated at the first-performing, and second-estimating, in the case when the stereo encoding nonuse is identified at the identifying, sound volume on a channel-by-channel basis based on the inverse quantization data of each of the plurality of channels generated at the third-performing;
first-determining, by an MS stereo determining module (205), based on whether the sound volume of the difference signal estimated at the first-estimating is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion;

second-determining, by an intensity stereo determining module (207), based on whether a signal ratio of each channel comprised in the inverse quantization data generated on a component-by-component basis for the audio signal at the second-performing in the case when implementation of the intensity stereo encoding method is identified at the identifying lies within a predetermined range, whether each component of the audio signal represents a music portion; and

third-determining, by a stereo encoding nonuse determining module (206), based on whether a difference in sound volume of the channels estimated at the second-estimating in the case when the stereo encoding nonuse is identified at the identifying is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.
A music detecting method executed in a music detecting apparatus, the music detecting method comprising:
receiving, by an input module (201), an input of an audio signal;

decomposing, by a decomposing module (202), the audio signal input at the receiving on a component-by-component basis; and

determining, by a determining module (802), based on whether either one of a middle-side (MS) stereo encoding method and an intensity stereo encoding method is implemented for stereo encoding, whether each component of the audio signal represents a music portion.