EP2357645A1 - Music detecting apparatus and method - Google Patents

Music detecting apparatus and method

Info

Publication number
EP2357645A1
Authority
EP
European Patent Office
Prior art keywords
module
audio signal
component
inverse quantization
music
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP10172348A
Other languages
German (de)
English (en)
Inventor
Tatsuya Uehara
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corp filed Critical Toshiba Corp
Publication of EP2357645A1 publication Critical patent/EP2357645A1/fr
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/046 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for differentiation between music and non-music signals, based on the identification of musical parameters, e.g. based on tempo detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Definitions

  • Embodiments described herein relate generally to a music detecting apparatus and a music detecting method.
  • music programs are sometimes obtained as part of various video signal contents.
  • when obtaining music programs, for example, there exists a demand for separating the music portions from the other portions included in a music program.
  • Japanese Patent Application Publication (KOKAI) No. 2008-298976 discloses a technology for detecting music on the basis of input audio signals. By implementing that technology, it becomes possible to identify the music portion included in the contents.
  • a music detecting apparatus comprises: an input module; a decomposing module; an inverse quantization module; an estimating module; and a determining module.
  • the input module is configured to receive an input of an audio signal subjected to middle-side (MS) stereo encoding.
  • the decomposing module is configured to decompose the audio signal input in the input module on a component-by-component basis.
  • the inverse quantization module is configured to perform, with respect to each component of the audio signal, inverse quantization on a difference signal between a plurality of channels included in the audio signal for generating inverse quantization data of the difference signal.
  • the estimating module is configured to estimate sound volume of the difference signal based on the inverse quantization data of the difference signal generated by the inverse quantization module.
  • the determining module is configured to determine, based on whether the sound volume of the difference signal estimated by the estimating module is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.
  • a music detecting apparatus comprises: an input module; a decomposing module; an inverse quantization module; and a determining module.
  • the input module is configured to receive an input of an audio signal subjected to intensity stereo encoding.
  • the decomposing module is configured to decompose the audio signal input in the input module on a component-by-component basis.
  • the inverse quantization module is configured to perform, with respect to each component of the audio signal in a decomposed form, inverse quantization on the component for generating inverse quantization data of the component.
  • the determining module is configured to determine, based on whether a signal ratio of each channel comprised in the inverse quantization data generated on a component-by-component basis for the audio signal by the inverse quantization module lies within a predetermined range, whether each component of the audio signal represents a music portion.
  • a music detecting apparatus comprises: an input module; a decomposing module; an encoding identifying module; an inverse quantization module; an estimating module; an MS stereo determining module; an intensity stereo determining module; and a stereo encoding nonuse determining module.
  • the input module is configured to receive an input of an audio signal.
  • the decomposing module is configured to decompose the audio signal input in the input module on a component-by-component basis.
  • the encoding identifying module is configured to identify, with respect to each component of the audio signal in a decomposed form, an implemented method from among a middle-side (MS) stereo encoding method, an intensity stereo encoding method, and stereo encoding nonuse.
  • MS middle-side
  • the inverse quantization module is configured to perform, with respect to each component of the audio signal, inverse quantization on a difference signal between a plurality of channels included in the audio signal for generating inverse quantization data of the difference signal in a case when the encoding identifying module identifies implementation of the MS stereo encoding method, configured to perform, with respect to each component of the audio signal in a decomposed form, inverse quantization on the component for generating inverse quantization data of the component in a case when the encoding identifying module identifies implementation of the intensity stereo encoding method, and configured to perform, with respect to each component of the audio signal in a decomposed form, inverse quantization on each of a plurality of channels included in the audio signal for generating inverse quantization data of each of the plurality of channels in a case when the encoding identifying module identifies the stereo encoding nonuse.
  • the estimating module is configured to estimate, in the case when the encoding identifying module identifies implementation of the MS stereo encoding method, sound volume of the difference signal based on the inverse quantization data of the difference signal generated by the inverse quantization module, and configured to estimate, in the case when the encoding identifying module identifies the stereo encoding nonuse, sound volume on a channel-by-channel basis based on the inverse quantization data of each of the plurality of channels generated by the inverse quantization module.
  • the MS stereo determining module is configured to determine, based on whether the sound volume of the difference signal estimated by the estimating module is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.
  • the intensity stereo determining module is configured to determine, based on whether a signal ratio of each channel comprised in the inverse quantization data generated on a component-by-component basis for the audio signal by the inverse quantization module in the case when the encoding identifying module identifies implementation of the intensity stereo encoding method lies within a predetermined range, whether each component of the audio signal represents a music portion.
  • the stereo encoding nonuse determining module is configured to determine, based on whether a difference in sound volume of the plurality of channels estimated by the estimating module in the case when the encoding identifying module identifies the stereo encoding nonuse is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.
  • a music detecting apparatus comprises: an input module; a decomposing module; and a determining module.
  • the input module is configured to receive an input of an audio signal.
  • the decomposing module is configured to decompose the audio signal input in the input module on a component-by-component basis.
  • the determining module is configured to determine, based on whether either one of a middle-side (MS) stereo encoding method and an intensity stereo encoding method is implemented for stereo encoding, whether each component of the audio signal represents a music portion.
  • MS middle-side
  • a music detecting method is executed in a music detecting apparatus.
  • the music detecting method comprises: receiving, by an input module, an input of an audio signal subjected to middle-side (MS) stereo encoding; decomposing, by a decomposing module, the audio signal input at the receiving on a component-by-component basis; performing, by an inverse quantization module, with respect to each component of the audio signal, inverse quantization on a difference signal between a plurality of channels comprised in the audio signal for generating inverse quantization data of the difference signal; estimating, by an estimating module, sound volume of the difference signal based on the inverse quantization data of the difference signal generated at the performing; and determining, by a determining module, based on whether the sound volume of the difference signal estimated at the estimating is greater than a predetermined threshold value, whether each component of the audio signal represents a music portion.
  • MS middle-side
  • a music detecting method executed in a music detecting apparatus comprising: receiving, by an input module, an input of an audio signal subjected to intensity stereo encoding; decomposing, by a decomposing module, the audio signal input at the receiving on a component-by-component basis; performing, by an inverse quantization module, with respect to each component of the audio signal in a decomposed form, inverse quantization on the component for generating inverse quantization data of the component; and determining, by a determining module, based on whether a signal ratio of each channel comprised in the inverse quantization data generated on a component-by-component basis for the audio signal at the performing lies within a predetermined range, whether each component of the audio signal represents a music portion.
  • a music detecting method executed in a music detecting apparatus comprising: receiving, by an input module, an input of an audio signal; decomposing, by a decomposing module, the audio signal input at the receiving on a component-by-component basis; identifying, by an encoding identifying module, with respect to each component of the audio signal in a decomposed form, an implemented method from among a middle-side (MS) stereo encoding method, an intensity stereo encoding method, and stereo encoding nonuse; first-performing, by an inverse quantization module, with respect to each component of the audio signal, inverse quantization on a difference signal between a plurality of channels comprised in the audio signal for generating inverse quantization data of the difference signal in a case when implementation of the MS stereo encoding method is identified at the identifying, second-performing, with respect to each component of the audio signal in a decomposed form, inverse quantization on the component for generating inverse quantization data of the component in a case
  • a music detecting method executed in a music detecting apparatus comprising: receiving, by an input module, an input of an audio signal; decomposing, by a decomposing module, the audio signal input at the receiving on a component-by-component basis; and determining, by a determining module, based on whether either one of a middle-side (MS) stereo encoding method and an intensity stereo encoding method is implemented for stereo encoding, whether each component of the audio signal represents a music portion.
  • MS middle-side
  • FIG. 1 is an exemplary schematic diagram of a configuration of a digital television display device according to a first embodiment
  • FIG. 1 is an exemplary schematic diagram of a digital television display device to which is applied a first embodiment.
  • application of the embodiments described below is not limited to the digital television display device. That is, the embodiments can also be applied to, for example, a personal computer, a video camera, or a handheld terminal that can reproduce video programs or video contents. Alternatively, the embodiments can also be marketed as programs that can be fed to handheld devices, personal computers, or video-game terminals.
  • a digital television display device 1 illustrated in FIG. 1 comprises a tuner module (TV tuner) 10 that is configured to receive, for example, a satellite digital television broadcast provided via broadcasting satellites or communications satellites, or receive a terrestrial digital broadcast and an analog television broadcast provided using ground waves (space waves), or receive video contents provided via cable networks.
  • the output of the tuner module 10 is fed to a video-type analog-to-digital converter (hereinafter, "video ADC") 14 and an audio-type (sound/music) analog-to-digital converter (hereinafter, "audio ADC”) 16.
  • an input signal from an external input terminal (aux) 12 is also fed to the video ADC 14 and to the audio ADC 16.
  • the video stream digitized by the video ADC 14 and the audio signal digitized by the audio ADC 16 are fed to an MPEG encoder 20.
  • alternatively, a digital stream in, for example, the MPEG2-TS format (where TS stands for transport stream) can be input via an interface (I/F) conforming to, for example, the IEEE 1394 or high-definition multimedia interface (HDMI) standard.
  • when the television broadcast signal supplied to the tuner module 10 is a digital signal in, for example, the MPEG2-TS format, the television broadcast signal (i.e., the digital stream from the tuner module 10) is fed without change to the MPEG encoder 20. Then, except for the case of outputting the input MPEG2-TS signal without change (i.e., except for the case of pass-through), the MPEG encoder 20 encodes the input stream into the MPEG2-PS format (where PS stands for program stream) or into the MPEG4-AVC format (where AVC stands for advanced video coding).
  • the stream data processed by the MPEG encoder 20 is temporarily buffered in a high-speed memory such as a synchronous dynamic random access memory (SDRAM) 22.
  • the stream data that has been buffered in the SDRAM 22 and that has been subjected to certain processing is then transferred, depending on the contents thereof, to a hard disk drive (HDD) 104, a disk drive unit 24, or a memory slot 26 at a predetermined timing.
  • the HDD 104 comprises a readable-writable recording medium.
  • the disk drive unit 24 can record data (streams) on a disk-shaped recording medium such as the optical disk 102 and can reproduce data (streams) already stored in the optical disk 102.
  • the memory slot 26 is used to insert a card memory 106 having a capacity of, for example, about two gigabytes (2 GB).
  • the stream data is transferred to an MPEG decoder 30 via the SDRAM 22.
  • the MPEG decoder 30 can decode the MPEG2-TS format, the MPEG2-PS format, or the MPEG4-AVC format.
  • the video data (in the MPEG2-TS format or the MPEG2-PS format) decoded by the MPEG decoder 30 is converted into an analog video signal either of standard picture quality or of high-definition picture quality by a video-type digital-to-analog converter (hereinafter, "video DAC") 32.
  • the analog video signal is then supplied to a video output terminal 36.
  • by connecting the video output terminal 36 to a display device 52 (a monitor or display module), the video can be displayed.
  • the audio data decoded by the MPEG decoder 30 is converted into an analog audio signal by an audio-type (sound/music) digital-to-analog converter (hereinafter, "audio DAC") 34.
  • the analog audio signal is then supplied to an audio (sound) output terminal 38.
  • by connecting the audio output terminal 38 to a speaker (that is either embedded in the display device 52 or disposed independently), the sound/music can be reproduced.
  • if the data supplied to the MPEG decoder 30 is in the MPEG2-TS format, then that data is supplied without modification to a digital output terminal 39 via an interface (I/F) 37 of, for example, the IEEE 1394 standard (or the HDMI standard).
  • the digital television display device 1 illustrated in FIG. 1 is controlled by a main control block 40, which functions as a stream parser and comprises a microprocessor unit (MPU) (not illustrated) or a central processing unit (CPU) (not illustrated).
  • the main control block 40 also makes use of an electronically erasable and programmable read only memory (EEPROM), a work random access memory (RAM), and a timer 46.
  • the main control block 40 comprises an audio-signal-music detecting module 100 that identifies music portions included in an audio signal.
  • FIG. 2 is an exemplary block diagram of a configuration of the audio-signal-music detecting module 100 according to the first embodiment.
  • the audio-signal-music detecting module 100 comprises a stream input module 201, a frame decomposing module 202, an inverse quantization module 203, a power estimating module 204, a middle-side (MS) stereo music determining module 205, an LR music determining module 206, an intensity stereo (IS) music determining module 207, and an indexing module 208.
  • an audio signal from which music portions are to be detected is a bit stream that is encoded frame by frame in an MPEG audio format.
  • the encoded audio signal needs to be decoded frame by frame.
  • the encoded bit stream contains audio information as well as information regarding bit allocation or scale factor that is necessary for decoding.
  • the following two encoding methods are implemented either in combination or independently.
  • the two encoding methods can be used according to the situation.
  • the first encoding method is called an intensity stereo (IS) encoding mode, which is a stereo encoding mode making use of the correlation between the left channel and the right channel of a stereo.
  • in the intensity stereo encoding mode, encoding is performed using the sum signal of the left channel and the right channel and using the ratio of the signal of the left channel to the signal of the right channel.
  • the other encoding method is called a middle-side (MS) stereo encoding mode, which is a stereo encoding mode making use of the phase difference between the left channel and the right channel.
  • in the middle-side stereo encoding mode, encoding is performed using the sum signal of the left channel and the right channel and using the difference signal between the left channel and the right channel.
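  • as a rough illustration of the two representations described above (and not the actual MPEG bitstream syntax), the sketch below uses the simplest textbook formulation: a sum/difference pair for the middle-side mode and a summed signal plus a single left/right ratio for the intensity mode; the band-wise handling and scaling of a real encoder are omitted.

```python
# Minimal sketch (not the MPEG bitstream syntax): textbook mid/side and
# intensity-style representations of a stereo pair of sample sequences.

def ms_encode(left, right):
    """Middle-side: keep the sum (mid) and the difference (side) of L and R."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Inverse of ms_encode: restore the left and right channels."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

def intensity_encode(left, right, eps=1e-12):
    """Intensity-style: keep the summed signal and a single left/right ratio."""
    summed = [l + r for l, r in zip(left, right)]
    ratio = sum(abs(l) for l in left) / (sum(abs(r) for r in right) + eps)
    return summed, ratio
```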
  • the music portions are identified on the basis of the difference between the signal of the left channel and the signal of the right channel.
  • the sound in the portions other than the music portions is mainly collected by a main microphone such as a center microphone. Therefore, not much difference occurs between the magnitude of the signal of the left channel and the magnitude of the signal of the right channel.
  • the audio-signal-music detecting module 100 detects music portions on a frame-by-frame basis. Meanwhile, if an attempt is made to identify music portions at the time of recording using a conventional apparatus, then the audio signals must first be decoded and the difference between the left channel and the right channel must then be determined. That results in an increase in the processing load.
  • FIG. 3 illustrates an exemplary configuration of a conventional audio decoder 300.
  • the audio decoder 300 comprises a stream input module 301, a frame decomposing module 302, an inverse quantization module 303, a stereo restoring module 304, and a sub-band synthesis module 305.
  • the stream input module 301 performs input processing of an audio signal.
  • the frame decomposing module 302 then decomposes the input audio signal on a frame-by-frame basis and extracts a decoding signal for each frame.
  • the inverse quantization module 303 obtains a spectrum by performing inverse quantization of the audio signal that has been decomposed on a frame-by-frame basis. If encoding makes use of stereo correlation, then the channel-wise spectrums are not yet obtained at this stage. Thus, the stereo restoring module 304 performs restoration of the channel-wise spectrums.
  • the sub-band synthesis module 305 converts the audio signal into a time domain audio signal using a frequency domain spectrum.
  • the audio-signal-music detecting module 100 is configured as illustrated in FIG. 2 .
  • in the audio-signal-music detecting module 100, when stereo encoding is in use, advantage is taken of the fact that the necessary information (the difference between the left channel and the right channel) is obtained at a stage prior to performing stereo signal restoration. Thus, the processing is terminated at that stage so as to increase the processing speed.
  • the stream input module 201 receives input of an audio signal.
  • the input audio signal is assumed to be a bit stream that is encoded in the MPEG-1 audio layer-3 (MP3) format.
  • the bit stream is encoded using the middle-side stereo encoding mode, the intensity stereo encoding mode, or a stereo encoding nonuse mode.
  • the frame decomposing module 202 comprises a stereo-encoding-mode identifying module 210 and decomposes the audio signal that has been input in the stream input module 201 on a frame-by-frame basis. In addition, for each frame, the frame decomposing module 202 extracts decoding parameters such as the quantization spectrum and the scale factor by implementing Huffman decoding.
  • moreover, for each frame, the frame decomposing module 202 extracts the stereo encoding mode.
  • the extracted encoding mode represents the encoding mode with which the corresponding frame is encoded.
  • an audio signal is decomposed on a frame-by-frame basis.
  • the decomposition is not limited to the frame-by-frame decomposition and can be performed on the basis of any other constituent element of the audio signal.
  • the stereo-encoding-mode identifying module 210 refers to the extracted stereo encoding mode and identifies whether the encoding is performed using the middle-side stereo encoding mode, the intensity stereo encoding mode, or the stereo encoding nonuse mode.
  • the frame decomposing module 202 decides on the quantization spectrum to be sent to the inverse quantization module 203.
  • upon receiving that information from the frame decomposing module 202, the inverse quantization module 203 performs inverse quantization on the quantization spectrum based on the scale factor and generates an original linear scale spectrum (in other words, generates inverse quantization data).
  • the generated spectrum is assumed to have a mixture of various signals.
  • when the stereo encoding mode is the middle-side stereo encoding mode, the inverse quantization module 203 performs, on a frame-by-frame basis of the audio signal, inverse quantization on the quantization spectrum corresponding to the sum signal of the left channel and the right channel and the difference signal between the left channel and the right channel.
  • when the stereo encoding mode is the intensity stereo encoding mode, the inverse quantization module 203 does not perform inverse quantization of the sum signal of the left channel and the right channel but performs, on a frame-by-frame basis of the audio signal, inverse quantization only on the signal representing the ratio of the left channel and the right channel.
  • since the specific method implemented for inverse quantization is commonly known, a detailed description thereof is not given.
  • the power estimating module 204 calculates an estimate value of the power (sound volume) from the linear scale spectrum. At the time of estimating the power, the power estimating module 204 can perform an accurate calculation or can perform a simplified calculation using the sum of MDCT coefficients. Moreover, instead of performing calculation for all frequency bands, the power can be calculated for only specific frequency bands.
  • the power estimating module 204 estimates the sound volume of the difference signal from the inverse quantization data thereof as well as estimates the sound volume of the sum signal from the inverse quantization data thereof.
  • Equation (1) represents an equation with which the power estimating module 204 calculates a power estimate value P.
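  • Equation (1) itself is not reproduced in this text, so the sketch below only assumes a plain sum of squared spectrum coefficients for the accurate variant and a sum of absolute MDCT coefficients for the simplified variant mentioned above; the optional band argument mirrors the option of using only specific frequency bands.

```python
def estimate_power(spectrum, band=None):
    """Estimate the power (sound volume) of a linear scale spectrum.

    `spectrum` is a sequence of inverse-quantized (e.g. MDCT) coefficients and
    `band` optionally restricts the estimate to an index range (start, end),
    mirroring the option of using only specific frequency bands.
    """
    if band is not None:
        start, end = band
        spectrum = spectrum[start:end]
    return sum(x * x for x in spectrum)

def estimate_power_simplified(spectrum):
    """Simplified variant: sum of the absolute MDCT coefficients."""
    return sum(abs(x) for x in spectrum)
```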
  • the configuration for determining the existence of music is different for each stereo encoding mode.
  • the MS stereo music determining module 205 determines whether the ratio value between the sound volume estimate value of the difference signal and the sound volume estimate value of the sum signal calculated by the power estimating module 204 is greater than a predetermined threshold value. If the ratio value is greater than the predetermined threshold value, then the MS stereo music determining module 205 determines that the corresponding frame in the audio signal represents a music portion.
  • the IS music determining module 207 determines, based on the inverse quantization data that represents the signal ratio of each channel and that is generated by the inverse quantization module 203, whether the signal ratio of each channel lies within a predetermined range. If the signal ratio of each channel is not within the predetermined range, then the IS music determining module 207 determines that the corresponding frame in the audio signal represents a music portion.
  • the LR music determining module 206 determines whether the frame represents a music portion. As the determining method, the LR music determining module 206 determines whether the absolute value obtained by dividing the difference signal (L-R) between the channels by the sum signal (L+R) of the channels is equal to or greater than a predetermined threshold value and accordingly determines the existence of a music portion. Meanwhile, instead of the abovementioned determining method, the LR music determining module 206 can also implement the conventionally-proposed methods for determining the existence of music.
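  • the three decision rules can be summarized as in the sketch below; the function names and thresholds are placeholders (the original threshold symbols are not legible in this text), and the intensity-stereo rule follows the preceding paragraphs in treating a signal ratio outside the predetermined range as music.

```python
def ms_frame_is_music(diff_power, sum_power, threshold, eps=1e-12):
    """MS stereo rule: music if the difference/sum volume ratio exceeds the threshold."""
    return diff_power / (sum_power + eps) > threshold

def is_frame_is_music(channel_ratio, low, high):
    """Intensity stereo rule: music if the per-channel signal ratio is outside (low, high)."""
    return not (low < channel_ratio < high)

def lr_frame_is_music(left_power, right_power, threshold, eps=1e-12):
    """No stereo coding: music if |L - R| / (L + R) exceeds the threshold."""
    return abs(left_power - right_power) / (left_power + right_power + eps) > threshold
```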
  • the indexing module 208 stores therein the start time and the end time of music as indexing information that can be put to use at the time of music reproduction.
  • the indexing information is recorded in a predetermined memory area such as a lead-in area, a header information recording area, or a table of contents (TOC) that is unique to the recording medium being used.
  • the indexing information in the indexing module 208 also enables the digital television display device 1 to generate the data set of music extracted from the recorded contents.
  • the main control block 40 illustrated in FIG. 1 performs processing suitable for the music portions.
  • as the processing suitable for the music portions, for example, the audio signal determined to represent the music portions is subjected to surround reproduction in the digital television display device 1 under the control of the main control block 40.
  • FIG. 4 is an exemplary flowchart for explaining a sequence of operations in the music determining processing performed by the audio-signal-music detecting module 100 according to the present embodiment.
  • the stream input module 201 receives an input of an audio signal (S401). Then, the frame decomposing module 202 decomposes the input audio signal on a frame-by-frame basis and, for each frame, extracts decoding parameters such as the quantization spectrum and the scale factor by implementing Huffman decoding as well as extracts the stereo encoding mode (S402). For example, in the case of the MPEG-1 audio layer-3 format, the frame decomposing module 202 extracts the stereo encoding mode from a mode extension of each frame header.
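  • for reference, in the MPEG-1 audio layer-3 frame header the channel mode and the mode extension occupy four bits of the fourth byte; the sketch below shows one way to map them onto the cases used here, under the usual layer-3 interpretation that, in joint stereo, one mode-extension bit enables intensity stereo and the other enables MS stereo.

```python
def stereo_mode_from_mp3_header(header: bytes) -> str:
    """Classify a 4-byte MP3 frame header into the modes used above.

    Returns 'mono', 'ms', 'intensity', 'ms+intensity' or 'lr' (plain stereo,
    dual channel, or joint stereo with both joint flags off).
    """
    b3 = header[3]
    channel_mode = (b3 >> 6) & 0x3     # 00 stereo, 01 joint stereo, 10 dual, 11 mono
    mode_extension = (b3 >> 4) & 0x3   # layer III: bit 0 = intensity, bit 1 = MS

    if channel_mode == 0b11:
        return "mono"
    if channel_mode != 0b01:           # not joint stereo: channels coded independently
        return "lr"
    intensity_on = bool(mode_extension & 0b01)
    ms_on = bool(mode_extension & 0b10)
    if ms_on and intensity_on:
        return "ms+intensity"
    if ms_on:
        return "ms"
    if intensity_on:
        return "intensity"
    return "lr"

# Example: 0xFF 0xFB 0x90 0x64 -> joint stereo with MS stereo on, intensity off.
print(stereo_mode_from_mp3_header(bytes([0xFF, 0xFB, 0x90, 0x64])))  # 'ms'
```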
  • the stereo-encoding-mode identifying module 210 determines whether the data is stereo data (S403). If the data is not stereo data, that is, if the data is monaural data (No at S403), then the stereo-encoding-mode identifying module 210 assumes that it is not possible to determine the existence of music (S404) and ends the operation.
  • the stereo-encoding-mode identifying module 210 determines whether the stereo encoding mode is the middle-side stereo encoding mode (S405).
  • if the stereo-encoding-mode identifying module 210 determines that the stereo encoding mode is the middle-side stereo encoding mode (Yes at S405), then the inverse quantization module 203 performs inverse quantization on the quantization spectrum corresponding to the sum signal and the difference signal (S406).
  • the power estimating module 204 calculates the power estimate value, that is, calculates the sound volume estimate value of each of the sum signal and the difference signal (S407).
  • the MS stereo music determining module 205 determines whether the ratio value between the power estimate value of the difference signal and the power estimate value of the sum signal is greater than a predetermined threshold value (S408). If the ratio value is greater than that threshold value (Yes at S408), then the MS stereo music determining module 205 determines that the corresponding frame in the audio signal represents a music portion (S409).
  • on the other hand, if the ratio value is not greater than that threshold value (No at S408), then the MS stereo music determining module 205 determines that the corresponding frame in the audio signal does not represent a music portion (S416).
  • if the stereo-encoding-mode identifying module 210 determines that the stereo encoding mode is not the middle-side stereo encoding mode (No at S405), then it determines whether the stereo encoding mode is the intensity stereo encoding mode (S410). If the stereo encoding mode is the intensity stereo encoding mode (Yes at S410), then the inverse quantization module 203 performs inverse quantization only on the signal representing the ratio of the left channel and the right channel (S411).
  • the IS music determining module 207 determines whether the signal ratio of each channel lies within a predetermined range, that is, whether the signal ratio of each channel is greater than a first predetermined threshold value but smaller than a second predetermined threshold value (S412).
  • the IS music determining module 207 determines that the corresponding frame in the audio signal represents a music portion (S409).
  • the IS music determining module 207 determines that the corresponding frame in the audio signal does not represent a music portion (S416).
  • if the stereo-encoding-mode identifying module 210 determines that the stereo encoding mode is not the intensity stereo encoding mode, that is, determines that stereo encoding is not implemented (No at S410), then the inverse quantization module 203 independently performs inverse quantization on the signal of the left channel and inverse quantization on the signal of the right channel (S413) and, from the inverse quantization data calculated for each signal, the power estimate value of each of the left channel and the right channel is calculated (S414).
  • the LR music determining module 206 determines whether Abs(L-R)/(L+R) is greater than a predetermined threshold value (S415).
  • L and R represent the power estimate value of the left channel and the right channel, respectively, calculated at S414. If Abs(L-R)/(L+R) is smaller than the threshold value (No at S415), then the LR music determining module 206 determines that the corresponding frame in the audio signal does not represent a music portion (S416).
  • on the other hand, if Abs(L-R)/(L+R) is greater than the threshold value (Yes at S415), then the LR music determining module 206 determines that the corresponding frame in the audio signal represents a music portion (S409).
  • the respective threshold values described above are set suitably according to the criteria for determining the existence of music.
  • the abovementioned operations are performed with respect to each frame in an audio signal. Furthermore, the sequence of operations can be changed as appropriate.
  • in the audio-signal-music detecting module 100, when the stereo encoding mode is the middle-side stereo encoding mode, inverse quantization is performed with respect to the quantization spectrum corresponding to the sum signal and the difference signal and, if the ratio value between the difference signal and the sum signal is greater than a predetermined threshold value, then music is determined to be existing.
  • however, the present embodiment is not limited to that case and, for example, it is also possible to determine the existence of music depending on whether only the difference signal exceeds a predetermined threshold value.
  • although the description is given with reference to a two-channel stereo with left and right channels, the description is also applicable to a multi-channel stereo.
  • the operations can be performed in the same manner as those in the first embodiment.
  • by considering only the front left channel and the front right channel as the target channels for processing, the other channels can be excluded from the stage of inverse quantization onward.
  • instead of the front left channel and the front right channel, another pair of channels can also be considered as the target channels for processing.
  • more than two channels can also be considered as the target channels for processing.
  • the audio-signal-music detecting module 100 in the digital television display device 1 focuses on the parameters regarding the stereo included in an encoded acoustic signal and, without performing decoding to the end, performs partial processing until the parameters are extracted. Hence, the audio-signal-music detecting module 100 can perform high-speed processing without help from any dedicated hardware.
  • the amount of calculation can be reduced in the digital television display device 1 according to the present embodiment.
  • the first embodiment is not limited to the description given above and can be modified to various modification examples as explained below.
  • in a first modification of the first embodiment, the music determining module is configured to determine the existence of music only when the middle-side stereo encoding mode is implemented.
  • FIG. 5 is an exemplary block diagram of a configuration of an audio-signal-music detecting module 500 according to the first modification of the first embodiment.
  • the audio-signal-music detecting module 500 comprises the stream input module 201, a frame decomposing module 501, an inverse quantization module 502, a power estimating module 503, the MS stereo music determining module 205, and the indexing module 208.
  • the constituent elements identical to those in the first embodiment are referred to by the same reference numerals and the explanation thereof is not repeated.
  • the stream input module 201 receives input of an audio signal encoded using only the middle-side stereo encoding mode.
  • the frame decomposing module 501 decomposes the audio signal input in the stream input module 201 on a frame-by-frame basis.
  • the inverse quantization module 502 performs inverse quantization on the quantization spectrum corresponding to the sum signal of the left channel and the right channel and the difference signal between the left channel and the right channel.
  • the power estimating module 503 calculates, from the linear scale spectrum, the power (sound volume) estimate value of the sum signal of the left channel and the right channel and the power (sound volume) estimate value of the difference signal between the left channel and the right channel. Then, the MS stereo music determining module 205 determines whether the audio signal represents music portions.
  • the existence of music portions can be adequately determined only when an audio signal is encoded using the middle-side stereo encoding mode. Moreover, at the time of determining the existence of music portions, operations such as sub-band synthesis are not performed thereby achieving reduction in the processing load.
  • in a second modification of the first embodiment, the music determining module is configured to determine the existence of music only when the stereo encoding nonuse mode is implemented without implementing the middle-side stereo encoding mode and the intensity stereo encoding mode.
  • FIG. 6 is an exemplary block diagram of a configuration of an audio-signal-music detecting module 600 according to the second modification of the first embodiment.
  • the audio-signal-music detecting module 600 comprises the stream input module 201, a frame decomposing module 601, an inverse quantization module 602, a power estimating module 603, the LR music determining module 206, and the indexing module 208.
  • the constituent elements identical to those in the first embodiment are referred to by the same reference numerals and the explanation thereof is not repeated.
  • the stream input module 201 receives input of an audio signal in which the left channel and the right channel are encoded independently.
  • the frame decomposing module 601 decomposes the audio signal input in the stream input module 201 on a frame-by-frame basis.
  • the inverse quantization module 602 independently performs inverse quantization on the signal of the left channel and inverse quantization of the signal of the right channel.
  • the power estimating module 603 calculates, from the linear scale spectrum, the power (sound volume) estimate value of the left channel and of the right channel. Then, the LR music determining module 206 determines whether the audio signal represents music portions.
  • the existence of music portions can be adequately determined for an audio signal in which the left channel and the right channel are encoded independently. Moreover, at the time of determining the existence of music portions, operations such as sub-band synthesis are not performed thereby achieving reduction in the processing load.
  • in a third modification of the first embodiment, the music determining module is configured to determine the existence of music only when the intensity stereo encoding mode is implemented.
  • FIG. 7 is an exemplary block diagram of a configuration of an audio-signal-music detecting module 700 according to the third modification of the first embodiment.
  • the audio-signal-music detecting module 700 comprises the stream input module 201, a frame decomposing module 701, an inverse quantization module 702, the IS music determining module 207, and the indexing module 208.
  • the constituent elements identical to those in the first embodiment are referred to by the same reference numerals and the explanation thereof is not repeated.
  • the stream input module 201 receives an input of an audio signal encoded using only the intensity stereo encoding mode.
  • the frame decomposing module 701 decomposes the audio signal input in the stream input module 201 on a frame-by-frame basis.
  • the inverse quantization module 702 performs inverse quantization only on the signal representing the ratio of the left channel and the right channel. Then, based on whether the signal ratio lies within a predetermined range, the IS music determining module 207 determines whether the audio signal represents music portions.
  • the existence of music portions can be adequately determined only when an audio signal is encoded using the intensity stereo encoding mode. Moreover, at the time of determining the existence of music portions, operations such as sub-band synthesis are not performed thereby achieving reduction in the processing load.
  • the existence of music portions can be adequately determined when an audio signal is encoded using one of the encoding modes. Moreover, it is also possible to combine the three modifications and determine the existence of music portions when an audio signal is encoded using two of the encoding modes.
  • in the first embodiment and the modification examples thereof, inverse quantization is performed in order to determine the existence of music portions.
  • however, the music determining method is not limited to the first embodiment or the modification examples thereof.
  • in a second embodiment, an example is described in which the existence of music portions is determined based on the stereo encoding mode.
  • FIG. 8 is an exemplary block diagram of a configuration of an audio-signal-music detecting module 800 according to the second embodiment.
  • the audio-signal-music detecting module 800 comprises the stream input module 201, a frame decomposing module 801, a music determining module 802, and the indexing module 208.
  • the constituent elements identical to those in the first embodiment are referred to by the same reference numerals and the explanation thereof is not repeated.
  • the stream input module 201 receives input of an audio signal in an identical manner to the first embodiment.
  • the frame decomposing module 801 decomposes the audio signal input in the stream input module 201 on a frame-by-frame basis and extracts only the stereo encoding mode stored in each frame header.
  • the information other than the stereo encoding mode is not required for determining the existence of music.
  • hence, the frame decomposing module 801 need only perform header analysis and need not perform payload analysis. That enables reduction in the processing load.
  • the frame decomposing module 801 outputs the extracted stereo encoding mode to the music determining module 802.
  • it is assumed that an audio signal is encoded using an encoding method, such as the MPEG-1 audio layer-3 format, that performs joint stereo encoding and that includes both the middle-side stereo encoding mode and the intensity stereo encoding mode.
  • in such an encoding method, at the time of encoding, mode selection is performed with the purpose of reducing the amount of code.
  • the middle-side stereo encoding mode has the property that the higher the correlation between the left channel and the right channel, the higher the encoding efficiency. Hence, when only a small difference exists between the left channel and the right channel, the middle-side stereo encoding mode is selected. In contrast, when a large difference exists between the left channel and the right channel, the intensity stereo encoding mode is likely to be selected.
  • hence, in the second embodiment, an audio signal is determined to represent music when encoding is performed using the intensity stereo encoding mode.
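  • the property described in the preceding bullets can be checked with a small numerical experiment (not taken from the patent): for two highly correlated channels the side (difference) signal carries almost no energy, so middle-side coding is cheap, whereas weakly correlated channels leave a large side signal.

```python
import random

def side_to_mid_energy(left, right):
    """Ratio of side-signal energy to mid-signal energy for a stereo pair."""
    mid_e = sum(((l + r) / 2.0) ** 2 for l, r in zip(left, right))
    side_e = sum(((l - r) / 2.0) ** 2 for l, r in zip(left, right))
    return side_e / (mid_e + 1e-12)

random.seed(0)
common = [random.gauss(0.0, 1.0) for _ in range(4096)]

# Highly correlated channels (e.g. centre-panned speech): the side signal is
# tiny, so middle-side coding is efficient and tends to be selected.
speech_like_l = [c + random.gauss(0.0, 0.05) for c in common]
speech_like_r = [c + random.gauss(0.0, 0.05) for c in common]

# Weakly correlated channels (a wide stereo image, e.g. music): large side signal.
music_like_l = [random.gauss(0.0, 1.0) for _ in range(4096)]
music_like_r = [random.gauss(0.0, 1.0) for _ in range(4096)]

print(side_to_mid_energy(speech_like_l, speech_like_r))  # close to 0
print(side_to_mid_energy(music_like_l, music_like_r))    # close to 1
```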
  • the music determining module 802 determines the existence of a music portion on a frame-by-frame basis.
  • when encoding is performed using the middle-side stereo encoding mode, the music determining module 802 determines that the corresponding frame does not represent a music portion.
  • FIG. 9 is an exemplary flowchart for explaining a sequence of operations in the music determining processing performed by the audio-signal-music detecting module 800 according to the present embodiment.
  • the stream input module 201 performs input processing of an audio signal (S901). Then, the frame decomposing module 801 decomposes the input audio signal on a frame-by-frame basis and, for each frame, extracts the stereo encoding mode (S902). The extracted stereo encoding mode is then output to the music determining module 802.
  • the music determining module 802 determines whether the data is stereo data (S903). If the data is not stereo data, that is, if the data is monaural data (No at S903), then the music determining module 802 assumes that it is not possible to determine the existence of music (S904) and ends the operation.
  • the music determining module 802 determines whether the stereo encoding mode is the middle-side stereo encoding mode (S905).
  • if the stereo encoding mode is the middle-side stereo encoding mode (Yes at S905), then the music determining module 802 determines that the corresponding frame does not represent a music portion (S906).
  • the music determining module 802 determines whether the stereo encoding mode is the intensity stereo encoding mode (S907). If the stereo encoding mode is the intensity stereo encoding mode (Yes at S907), then the music determining module 802 determines that the corresponding frame represents a music portion (S908).
  • if the stereo encoding mode is not the intensity stereo encoding mode (No at S907), then the music determining module 802 assumes that it is not possible to determine the existence of music (S904) and ends the operation.
  • in this way, according to the second embodiment, when the stereo encoding mode is the intensity stereo encoding mode, a music portion is determined to be existing; when the stereo encoding mode is the middle-side stereo encoding mode, a non-music portion is determined to be existing; and when the stereo encoding mode is neither the intensity stereo encoding mode nor the middle-side stereo encoding mode, it is assumed that the existence of music cannot be determined.
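  • a minimal sketch of this mode-only rule is given below; treating a frame flagged for both middle-side and intensity stereo in the same way as the middle-side case follows the order of the checks in FIG. 9 and is an assumption.

```python
def classify_frame_by_mode(mode: str):
    """Second-embodiment rule: True = music, False = non-music, None = cannot decide."""
    if mode == "mono":
        return None                     # S903/S904: determination not possible
    if mode in ("ms", "ms+intensity"):  # S905/S906: middle-side stereo -> non-music
        return False
    if mode == "intensity":             # S907/S908: intensity stereo -> music
        return True
    return None                         # neither joint mode: cannot decide (S904)
```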
  • the detecting method can be switched depending on the CPU resources available at the time. That is, when the resources are extremely limited, then the method according to the second embodiment can be implemented; while when it is affordable to use a certain amount of resources, then the method according to the first embodiment can be implemented. In this way, the method can be switched depending on the resources. Moreover, when there are fewer restrictions on the resources, it is possible to perform decoding and, as in the past, perform the music detecting operation with a high degree of accuracy using musical scale information.
  • in the first and second embodiments, the existence of music is determined on a frame-by-frame basis. However, when the existence of music is determined on a frame-by-frame basis, then, depending on the timing such as that of silence, it often becomes difficult to determine the existence of music.
  • in an audio-signal-music detecting module 1000 according to a third embodiment, an example is described in which the existence of a music portion is determined on a section-by-section basis, where each section includes a plurality of frames.
  • FIG. 10 is an exemplary block diagram of a configuration of the audio-signal-music detecting module 1000 according to the third embodiment.
  • the audio-signal-music detecting module 1000 has a different configuration in which the indexing module 208 is replaced by an indexing module 1001 having different functions.
  • the constituent elements identical to those in the first embodiment are referred to by the same reference numerals and the explanation thereof is not repeated.
  • the indexing module 1001 also comprises a section determining module 1010.
  • the section determining module 1010 determines the existence of music with respect to each given section composed of a plurality of frames. At that time, in order to accurately determine the existence of a music section, the section determining module 1010 obtains the density of music determination frames in a given section. That is, if the number of frames determined to represent music in a given section is greater than a predetermined number, then the section determining module 1010 according to the present embodiment determines that the corresponding section represents a music section.
  • FIG. 11 is an exemplary flowchart for explaining a sequence of operations in the music determining processing performed by the audio-signal-music detecting module 1000.
  • the digital television display device 1 records contents to be subjected to determination (S1101).
  • the frames 1 to N are extracted when an audio signal included in the contents to be processed is decomposed frame by frame.
  • as the music determining method performed for each of the frames 1 to N, the method according to the first embodiment is implemented. Hence, that explanation is not repeated.
  • the section determining module 1010 initializes a variable S to "1" (S1103).
  • the variable S is set to the first frame in a section.
  • the last frame in that section is S+K, where K is equal to one less than the number of frames in the section.
  • the value of K can be any value. In the present embodiment, K is assumed to be equal to 15.
  • the section determining module 1010 determines whether S+K is equal to or smaller than N (S1104). If S+K is equal to or smaller than N (Yes at S1104); then the section determining module 1010 sets, in Y, the number of frames determined to represent music in the section between S and S+K (S1105).
  • the section determining module 1010 calculates the value obtained by dividing Y by K (i.e., Y/K), that is, calculates the percentage of frames detected to represent music. More particularly, the section determining module 1010 obtains the percentage of frames detected to represent music (i.e., Y/K) and determines whether that percentage is greater than a predetermined threshold value (S1106).
  • if the percentage of frames detected to represent music is greater than the threshold value (Yes at S1106), then the section determining module 1010 determines that the section between S and S+K represents music (S1107). On the other hand, if the percentage of frames detected to represent music is equal to or smaller than the threshold value (No at S1106), then the section determining module 1010 determines that the section between S and S+K does not represent music (S1109).
  • the section determining module 1010 increments the variable S by 1 (S1108) and returns to S1104 for restarting the operations.
  • the existence of music can be determined by shifting the window one frame at a time.
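  • a sketch of this section-level decision is given below, with K set to 15 as in the text; the density threshold name and its default value are placeholders for the threshold checked at S1106.

```python
def detect_music_sections(frame_is_music, k=15, density_threshold=0.5):
    """Slide a window of K+1 frames over the per-frame results and mark a
    section as music when the fraction of music frames exceeds the threshold.

    `frame_is_music` is a list of booleans, one per frame (frames 1 to N).
    Returns one boolean per window position (the window is shifted one frame
    at a time, as described above).
    """
    n = len(frame_is_music)
    sections = []
    s = 0                                                       # 0-based counterpart of S = 1 (S1103)
    while s + k < n:                                            # S1104: loop while S + K <= N
        y = sum(1 for f in frame_is_music[s:s + k + 1] if f)    # S1105
        sections.append(y / k > density_threshold)              # S1106-S1107/S1109
        s += 1                                                  # S1108
    return sections

# Example: an 8-frame burst of music inside 40 frames.
frames = [False] * 16 + [True] * 8 + [False] * 16
print(detect_music_sections(frames))
```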
  • the existence of music is first determined for all frames and then determined on a section-by-section basis.
  • the method is not limited to that manner. That is, alternatively, the music determining modules 205, 206, and 207 can be used to determine the existence of music on a frame-by-frame basis in accordance with the input of an audio signal. Still alternatively, the music determining processing can be implemented as an asynchronous processing in such a way that the existence of music sections is determined when the number of undetermined sections exceeds K.
  • although the method according to the first embodiment is implemented as the music determining method for each frame, it is also possible to implement the method according to the modification examples of the first embodiment or the method according to the second embodiment.
  • the operations can be performed without having to decode music signals.
  • meanwhile, the audio-signal-music detecting module according to the embodiments can also be implemented as hardware such as a large-scale integration (LSI) circuit.
  • a music detecting program that is executed in the audio-signal-music detecting module can be stored in a computer-readable recording medium such as a flexible disk (FD), a compact disk read only memory (CD-ROM), a compact disk recordable (CD-R), or a digital versatile disk (DVD).
  • the music detecting program that is executed in the audio-signal-music detecting module according to the embodiments can be stored in a computer connected over a network such as the Internet and can be downloaded via the network for distribution.
  • the music detecting program that is executed in the audio-signal-music detecting module according to the embodiments can be distributed over a network such as the Internet.
  • the music detecting program according to the embodiments can be stored in advance in a read only memory (ROM) for distribution.
  • the music detecting program that is executed in the audio-signal-music detecting module contains modules for each constituent element (the stream input module, the inverse quantization module, the power estimating module, the music determining modules, and the indexing module).
  • hardware such as a CPU (processor) retrieves the music detecting program from the memory medium and runs the same such that the music detecting program is loaded in a corresponding main memory.
  • the modules for the stream input module, the inverse quantization module, the power estimating module, the music determining modules, and the indexing module are generated in the main memory.
  • modules of the systems described herein can be implemented as software applications, hardware and/or software modules, or components on one or more computers, such as servers. While the various modules are illustrated separately, they may share some or all of the same underlying logic or code.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
EP10172348A 2009-12-28 2010-08-10 Appareil et procédé de détection de musique Withdrawn EP2357645A1 (fr)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2009298263 2009-12-28

Publications (1)

Publication Number Publication Date
EP2357645A1 true EP2357645A1 (fr) 2011-08-17

Family

ID=43639948

Family Applications (1)

Application Number Title Priority Date Filing Date
EP10172348A Withdrawn EP2357645A1 (fr) 2009-12-28 2010-08-10 Appareil et procédé de détection de musique

Country Status (1)

Country Link
EP (1) EP2357645A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0517233A1 (fr) * 1991-06-06 1992-12-09 Matsushita Electric Industrial Co., Ltd. Appareil de discrimination musique voix
US6148086A (en) * 1997-05-16 2000-11-14 Aureal Semiconductor, Inc. Method and apparatus for replacing a voice with an original lead singer's voice on a karaoke machine
US20080255860A1 (en) * 2007-04-11 2008-10-16 Kabushiki Kaisha Toshiba Audio decoding apparatus and decoding method
JP2008298976A (ja) 2007-05-30 2008-12-11 Toshiba Corp 音楽検出装置及び音楽検出方法

Similar Documents

Publication Publication Date Title
AU2006228821B2 (en) Device and method for producing a data flow and for producing a multi-channel representation
KR100333999B1 (ko) Audio signal processing apparatus and high-speed audio signal reproduction method
EP1667110B1 (fr) Error reconstruction of audio stream information
US9058803B2 (en) Multichannel audio stream compression
JP5179881B2 (ja) Parametric joint coding of audio sources
US8359113B2 (en) Method and an apparatus for processing an audio signal
TWI644308B (zh) Decoding device and method, and program
JP5006315B2 (ja) Method and apparatus for encoding and decoding an audio signal
JP5455647B2 (ja) Audio decoder
US20060031075A1 (en) Method and apparatus to recover a high frequency component of audio data
US20110002393A1 (en) Audio encoding device, audio encoding method, and video transmission device
KR20080063155A (ko) Apparatus and method for encoding and decoding a multi-object audio signal composed of various channels, including side-information bitstream conversion
JP2011509428A (ja) Audio signal processing method and apparatus
US20100228552A1 (en) Audio decoding apparatus and audio decoding method
JP6728154B2 (ja) Encoding and decoding of audio signals
US9153241B2 (en) Signal processing apparatus
EP2610867B1 (fr) Audio reproduction device and audio reproduction method
JP4809234B2 (ja) Audio encoding apparatus, decoding apparatus, method, and program
KR100891666B1 (ko) Method and apparatus for processing a mix signal
RU2383941C2 (ru) Method and device for encoding and decoding audio signals
JP4743228B2 (ja) Digital audio signal analysis method, apparatus therefor, and video/audio recording apparatus
US20150104158A1 (en) Digital signal reproduction device
JP2006146247A (ja) Audio decoding apparatus
EP2357645A1 (fr) Music detecting apparatus and method
JP2005519489A (ja) Recording and reproduction of a plurality of programs

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20100810

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME RS

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20120218