MX2008004572A - Neural network classifier for separating audio sources from a monophonic audio signal - Google Patents

Neural network classifier for separating audio sources from a monophonic audio signal

Info

Publication number
MX2008004572A
MX2008004572A MX/A/2008/004572A
Authority
MX
Mexico
Prior art keywords
audio
sources
frame
classifier
monophonic
Prior art date
Application number
MX/A/2008/004572A
Other languages
Spanish (es)
Inventor
Dmitry V. Shmunk
Original Assignee
Dts Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dts Inc filed Critical Dts Inc
Publication of MX2008004572A publication Critical patent/MX2008004572A/en

Abstract

A neural network classifier provides the ability to separate and categorize multiple arbitrary and previously unknown audio sources down-mixed to a single monophonic audio signal. This is accomplished by breaking the monophonic audio signal into baseline frames (possibly overlapping), windowing the frames, extracting a number of descriptive features in each frame, and employing a pre-trained nonlinear neural network as a classifier. Each neural network output manifests the presence of a pre-determined type of audio source in each baseline frame of the monophonic audio signal. The neural network classifier is well suited to address widely changing parameters of the signal and sources, time and frequency domain overlapping of the sources, and reverberation and occlusions in real-life signals. The classifier outputs can be used as a front-end to create multiple audio channels for a source separation algorithm (e.g., ICA) or as parameters in a post-processing algorithm (e.g. categorize music, track sources, generate audio indexes for the purposes of navigation, re-mixing, security and surveillance, telephone and wireless communications, and teleconferencing).

Description

NEURAL NETWORK CLASSIFIER FOR SEPARATING AUDIO SOURCES FROM A MONOPHONIC AUDIO SIGNAL

DESCRIPTION OF THE INVENTION

This invention relates to the separation of multiple unknown audio sources that have been down-mixed into a single monophonic audio signal.

Techniques exist to extract sources from stereo or multichannel audio signals. Independent component analysis (ICA) is the most widely known and researched method. However, ICA can only extract a number of sources equal to or less than the number of channels in the input signal. Therefore, it cannot be used for monophonic signal separation. The extraction of audio sources from a monophonic signal can be useful to extract speech signal characteristics, synthesize a multichannel signal representation, categorize music, track sources, generate an additional channel for ICA, generate audio indexes for navigation purposes (browsing), remixing (consumer and pro), security and surveillance, telephone and wireless communication, and teleconferencing.

Extraction of speech signal features (such as automatic speaker detection, automatic speech recognition, and speech/music detectors) is well developed. The extraction of information about arbitrary musical instruments from a monophonic signal is far less investigated due to the difficulties posed by the problem, which include widely changing parameters of the signal and sources, time- and frequency-domain overlap of the sources, and reverberation and occlusions in real-life signals. Known techniques include equalization and direct extraction of parameters.

An equalizer can be applied to the signal to extract sources that occupy a known frequency range. For example, most of the energy of a speech signal is present in the range of 200 Hz-4 kHz. Bass guitar sounds are usually limited to frequencies below 1 kHz. By filtering out the entire out-of-band signal, the selected source can be extracted, or its energy can be amplified relative to the other sources. However, equalization is not effective for extracting overlapping sources.

A method of direct parameter extraction is described in "Audio Content Analysis for Online Audiovisual Data Segmentation and Classification" by Tong Zhang and Jay Kuo (IEEE Transactions on Speech and Audio Processing, Volume 9 No. 4, May 2001). Simple audio features such as the energy function, the average zero-crossing rate, the fundamental frequency, and the spectral peak tracks are extracted. The signal is then divided into categories (silence, with music components, without music components) and subcategories. The inclusion of a fragment in a certain category is decided by direct comparison of a feature against a set of thresholds. Prior knowledge of the sources is required.

A method of categorizing musical genres is described in "Musical Genre Classification of Audio Signals" by George Tzanetakis and Perry Cook (IEEE Transactions on Speech and Audio Processing, Volume 10 No. 5, July 2002). Characteristics such as instrumentation, rhythmic structure and harmonic content are extracted from the signal and input to a pre-trained statistical pattern recognition classifier. "Acoustic Segmentation for Audio Browsers" by Don Kimber and Lynn Wilcox employs Hidden Markov Models for audio segmentation and classification.

The present invention provides the ability to separate and categorize multiple arbitrary and previously unknown audio sources down-mixed into a single monophonic audio signal.
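As a minimal illustration of the equalization approach mentioned above (not part of the claimed method), a band-pass filter can crudely emphasize a source known to occupy a given band; the cutoff frequencies and filter order below are assumed values chosen only for the example.

```python
# Sketch of source "extraction" by equalization, assuming the target source
# (e.g., speech) occupies a known band such as 200 Hz - 4 kHz.
import numpy as np
from scipy.signal import butter, sosfilt

def extract_band(mono, sample_rate, f_lo=200.0, f_hi=4000.0, order=6):
    """Band-pass the monophonic signal to emphasize one source's frequency band."""
    sos = butter(order, [f_lo, f_hi], btype="bandpass", fs=sample_rate, output="sos")
    return sosfilt(sos, mono)

# Example usage (mono is a 1-D float array at 44.1 kHz):
# speech_emphasized = extract_band(mono, 44100.0)
```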
This is achieved by breaking the monophonic audio signal into baseline frames (possibly overlapping), windowing the frames, extracting a number of descriptive features in each frame, and employing a pre-trained non-linear neural network as a classifier. Each neural network output indicates the presence of a predetermined type of audio source in each baseline frame of the monophonic audio signal. The neural network typically has as many outputs as there are types of audio sources the system is trained to discriminate. The neural network classifier is well suited to address widely changing parameters of the signal and sources, time- and frequency-domain overlap of the sources, and reverberation and occlusions in real-life signals. The classifier outputs can be used as a front end to create multiple audio channels for a source separation algorithm (for example, ICA) or as parameters in a post-processing algorithm (for example, to categorize music, track sources, generate audio indexes for navigation purposes, remixing, security and surveillance, telephone and wireless communication, and teleconferencing).

In a first embodiment, the monophonic audio signal is sub-band filtered; the number of sub-bands and the variation or uniformity of the sub-bands are application dependent. Each sub-band is then framed and features are extracted. The same or different feature combinations can be extracted from different sub-bands, and some sub-bands may have no features extracted. Each sub-band feature can form a separate input to the classifier, or similar features can be "merged" across the sub-bands. The classifier may include a single output node for each predetermined audio source, to improve the robustness of classifying each particular audio source. Alternatively, the classifier may include an output node per sub-band for each predetermined audio source, to improve the separation of multiple frequency-overlapped sources.

In a second embodiment, one or more of the features, e.g., tonal components or TNR, is extracted at multiple time-frequency resolutions and then scaled to the baseline frame size. Preferably this is done in parallel, but it can be done sequentially. The features at each resolution can be input to the classifier, or they can be merged to form a single input. This multiresolution procedure addresses the non-stationarity of natural signals. Most signals can only be considered quasi-stationary over short time intervals. Some signals change faster, some slower; for example, for speech, with its rapidly varying signal parameters, shorter time frames will result in better separation of signal energy. For string instruments, which are more stationary, longer frames provide a higher frequency resolution with no decrease in signal energy separation.

In a third embodiment, the monophonic audio signal is sub-band filtered and one or more of the features in one or more sub-bands is extracted at multiple time-frequency resolutions and then scaled to the baseline frame size. The combination of sub-band filtering and multiresolution extraction can further improve the capability of the classifier.

In a fourth embodiment, the values at the neural network output nodes are low-pass filtered to reduce the noise, i.e., the frame-to-frame variation of the classification. Without low-pass filtering, the system operates on small portions of the signal (baseline frames) without knowledge of past or future inputs.
Low-pass filtering decreases the number of false results, assuming that a source typically lasts longer than a baseline frame.

These and other features and advantages of the invention will be apparent to those skilled in the art from the following detailed description of the preferred embodiments, taken in conjunction with the accompanying drawings, in which:

BRIEF DESCRIPTION OF THE DRAWINGS

FIGURE 1 is a block diagram for the separation of multiple unknown audio sources down-mixed into a single monophonic audio signal using a neural network classifier in accordance with the present invention;
FIGURE 2 is a diagram illustrating the sub-band filtering of the input signal;
FIGURE 3 is a diagram illustrating the framing and windowing of the input signal;
FIGURE 4 is a flow diagram for extracting multiresolution tonal component and TNR features;
FIGURE 5 is a flow diagram for estimating the noise level;
FIGURE 6 is a flow diagram for extracting a Cepstrum peak feature;
FIGURE 7 is a block diagram of a typical neural network classifier;
FIGURES 8a-8c are plots of the audio sources that form a monophonic signal and the measures produced by the neural network classifier;
FIGURE 9 is a block diagram of a system for using the produced measures to remix the monophonic signal into a plurality of audio channels; and
FIGURE 10 is a block diagram of a system for using the produced measures to augment a standard post-processing task performed on the monophonic signal.

The present invention provides the ability to separate and categorize multiple arbitrary and previously unknown audio sources down-mixed into a single monophonic audio signal. As shown in Figure 1, a plurality of audio sources 10, e.g., voice, string and percussion, have been down-mixed (step 12) into a single channel 14 of monophonic audio. The monophonic signal may be a conventional monophonic mix or it may be one channel of a stereo or multichannel signal. In the most general case, there is no prior information regarding the particular types of audio sources in the specific mix, the signals themselves, how many different signals are included, or the mixing coefficients. Only the types of audio sources that may be included in a specific mix are known. For example, the application may be to classify the predominant source or sources in a musical mix. The classifier will know that possible sources include male voice, female voice, string, percussion, etc. The classifier will not know which of these sources or how many are included in the specific mix, nor anything about the specific sources or how they were mixed.

The process for separating and categorizing the multiple arbitrary and previously unknown audio sources begins by breaking the monophonic audio signal into a sequence of baseline frames (possibly overlapping) (step 16), windowing the frames (step 18), extracting a number of descriptive features in each frame (step 20), and employing a pre-trained non-linear neural network as a classifier (step 22). Each output of the neural network indicates the presence of a predetermined type of audio source in each baseline frame of the monophonic audio signal. The neural network typically has as many outputs as there are types of audio sources the system is trained to discriminate.
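A minimal sketch of the frame/window/extract/classify loop of steps 16-22 is given below; `extract_features` and the `classifier.predict` method are hypothetical placeholders (not names from the patent), and the 50% overlap is one of the values the description allows.

```python
# Sketch of the baseline processing loop (steps 16-22), assuming 50% frame
# overlap and hypothetical extract_features() / classifier.predict() helpers.
import numpy as np

def classify_mono_signal(mono, frame_size, classifier, extract_features):
    hop = frame_size // 2                       # 50% overlap (step 16)
    window = np.hanning(frame_size)             # analysis window (step 18)
    measures = []
    for start in range(0, len(mono) - frame_size + 1, hop):
        frame = mono[start:start + frame_size] * window
        features = extract_features(frame)                  # step 20
        measures.append(classifier.predict(features))       # step 22: one value per source type
    return np.array(measures)                   # shape: frames x source types
```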
The performance of the neural network classifier, particularly for separating and classifying "overlapping sources", can be improved in a number of ways, including sub-band filtering of the monophonic signal, extraction of multiresolution features, and low-pass filtering of the classification values.

In a first improved embodiment, the monophonic audio signal can be sub-band filtered (step 24). This is typically, although not necessarily, done before framing. The number of sub-bands and the variation or uniformity of the sub-bands are application dependent. Each sub-band is then framed and features are extracted. The same or different combinations of features can be extracted from the different sub-bands. Some sub-bands may have no features extracted. Each sub-band's features can form a separate input to the classifier, or similar features can be "merged" across the sub-bands (step 26). The classifier can include a single output node for each predetermined audio source, in which case extracting features from multiple sub-bands improves the robustness of classifying each particular audio source. Alternatively, the classifier may include an output node per sub-band for each predetermined audio source, in which case extracting features from multiple sub-bands improves the separation of multiple frequency-overlapped sources.

In a second improved embodiment, one or more of the features is extracted at multiple time-frequency resolutions and then scaled to the baseline frame size. As shown, the monophonic signal is initially segmented into baseline frames, windowed, and the features are extracted. If one or more of the features is to be extracted at multiple resolutions (step 28), the frame size is decreased (increased) (step 30) and the process is repeated. The frame size is suitably decreased (increased) as a multiple of the baseline frame size, adjusted for overlapping and windowing. As a result, there will be multiple instances of each feature over the equivalent of one baseline frame. These features must then be scaled to the baseline frame size, either independently or together (step 32). Features extracted at smaller frame sizes are averaged, and features extracted at larger frame sizes are interpolated to the baseline frame size. In most cases, the algorithm will extract multiresolution features by decreasing or increasing from the baseline frame. In addition, it may be desirable to merge the features extracted at each resolution to form a single input to the classifier (step 26). If the multiresolution features are not merged, the scaling to the baseline frame (step 32) can be done within the loop and the features input to the classifier at each pass. More preferably, the multiresolution extraction is performed in parallel.

In a third improved embodiment, the values at the neural network output nodes are post-processed using, for example, a moving-average low-pass filter (step 34) to reduce noise, i.e., the frame-to-frame variation of the classification.
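A sketch of the moving-average low-pass filtering of the output-node values (step 34) follows; the window length of nine frames is an assumed parameter, not a value from the patent.

```python
# Moving-average smoothing of per-frame classifier outputs (step 34).
# 'measures' is a frames x sources array; 'span' (in frames) is an assumed window length.
import numpy as np

def smooth_outputs(measures, span=9):
    kernel = np.ones(span) / span
    # Filter each source's confidence track independently along the time axis.
    return np.apply_along_axis(lambda m: np.convolve(m, kernel, mode="same"), 0, measures)
```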
Sub-band Filtering

As shown in Figure 2, a sub-band filter 40 divides the frequency spectrum of the monophonic audio signal into N sub-bands 42 of uniform or varying width. For illustration purposes, possible frequency spectra H(f) are shown for voice 44, string 46 and percussion 48. By extracting features in sub-bands where source overlap is low, the classifier can do a better job of classifying the predominant source in the frame. Furthermore, by extracting features in different sub-bands, the classifier may be able to classify the predominant source in each of the sub-bands. In those sub-bands where the signal separation is good, the confidence of the classification can be very strong, for example, nearly 1. In those sub-bands where the signals overlap, the classifier may be less confident that one source predominates; for example, two or more sources may have similar output values. The equivalent function can also be provided using a frequency transform instead of the sub-band filter.

Framing and Windowing

As shown in Figures 3a-3c, the monophonic signal 50 (or each sub-band of the signal) is decomposed into a sequence 52 of baseline frames. The signal is suitably decomposed into overlapping frames, preferably with an overlap of 50% or more. Each frame is windowed to reduce the effects of discontinuities at the frame boundaries and to improve the frequency separation. Well-known analysis windows 54 include Raised Cosine, Hamming, Hanning and Chebyshev, etc. The windowed signal 56 for each baseline frame is then passed on for feature extraction.

Feature Extraction

Feature extraction is the process of computing a compact numerical representation that can be used to characterize a baseline frame of audio. The idea is to identify a number of features which, alone or in combination with other features, at a single or multiple resolutions, and in a single or multiple spectral bands, effectively differentiate between different audio sources. Examples of features that are useful for separating sources in a monophonic audio signal include: the total number of tonal components in a frame; the tone-to-noise ratio (TNR); and the Cepstrum peak amplitude. In addition to these features, one or any combination of the 17 low-level audio descriptors described in the MPEG-7 specification may be suitable features in different applications.
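A minimal sketch of splitting the monophonic signal into N sub-bands (step 24) with an FIR filterbank is given below. The uniform band edges, filter length and use of `scipy.signal.firwin` are assumptions, since the description leaves the number and spacing of sub-bands application dependent.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def subband_filter(mono, sample_rate, n_bands=4, numtaps=255):
    """Split the mono signal into n_bands sub-bands (uniform widths assumed here)."""
    edges = np.linspace(0.0, sample_rate / 2.0, n_bands + 1)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0.0:                              # lowest band: low-pass
            taps = firwin(numtaps, hi, fs=sample_rate)
        elif hi >= sample_rate / 2.0:              # highest band: high-pass
            taps = firwin(numtaps, lo, fs=sample_rate, pass_zero=False)
        else:                                      # interior bands: band-pass
            taps = firwin(numtaps, [lo, hi], fs=sample_rate, pass_zero=False)
        bands.append(lfilter(taps, [1.0], mono))
    return bands                                   # each band is then framed and windowed
```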
The tonal component, TNR and Cepstrum peak features will now be described in detail. In addition, the tonal component and TNR features are extracted at multiple time-frequency resolutions and scaled to the baseline frame. The steps to calculate the "low-level descriptors" are available in the supporting documentation for MPEG-7 audio. (See, for example, the International Standard ISO/IEC 15938 "Multimedia Content Description Interface", or http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm.)

Tonal Components

A tonal component is essentially a tone that is relatively strong when compared to the average signal. The feature that is extracted is the number of tonal components at a given time-frequency resolution. The procedure for estimating the number of tonal components at a single time-frequency resolution in each frame is illustrated in Figure 4 and includes the following steps:
1. Frame the monophonic input signal (step 16).
2. Window the data that fall in the frame (step 18).
3. Apply a frequency transform to the windowed signal (step 60), such as an FFT, MDCT, etc. The length of the transform should be equal to the number of audio samples in the frame, that is, the frame size. Making the transform longer than the frame will lower the time resolution without improving the frequency resolution. Having a transform length smaller than the frame length will lower the frequency resolution.
4. Calculate the magnitude of the spectral lines (step 62). For an FFT, the magnitude A = sqrt(Re*Re + Im*Im), where Re and Im are the real and imaginary components of a spectral line produced by the transform.
5. Estimate the noise level for all frequencies (step 64). (See Figure 5.)
6. Count the number of components sufficiently above the noise level, for example, more than a predefined fixed threshold above the noise level (step 66). These components are considered "tonal components" and the count is output to the NN classifier (step 68).

Real-life audio signals may contain both stationary fragments with tonal components in them (such as string instruments) and non-stationary fragments that also have tonal components in them (such as voiced speech fragments). To efficiently capture the tonal components in all situations, the signal has to be analyzed at several levels of time-frequency resolution. Practically useful results can be extracted with frames that vary from approximately 5 ms to 200 ms. Note that these frames are preferably overlapping, and many frames of a given length can fall within a single baseline frame. To estimate the number of tonal components at multiple time-frequency resolutions, the above procedure is modified as follows:
1. Decrease the frame size, for example, by a factor of 2 (ignoring the overlap) (step 70).
2. Repeat steps 16, 18, 60, 62, 64 and 66 for the new frame size. A frequency transform with length equal to the frame length should be performed to obtain an optimal time-frequency trade-off.
3. Scale the tonal component count to the baseline frame size and output it to the NN classifier (step 72). As shown, a cumulative number of tonal components at each time-frequency resolution is passed individually to the classifier. In a simpler implementation, the numbers of tonal components at all resolutions can be extracted and added together to form a single value.
4. Repeat until the smallest desired frame size has been analyzed (step 74). A short sketch of this procedure follows below.
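The following is a sketch of the single-resolution counting (steps 60-66) and the multiresolution loop (steps 70-74), assuming non-overlapping sub-frames for simplicity. The 6 dB value is an assumed choice for the "predefined fixed threshold", and `estimate_noise` stands in for the Figure 5 procedure sketched later.

```python
import numpy as np

def count_tonal_components(windowed_frame, estimate_noise, threshold_db=6.0):
    """Count spectral lines sufficiently above the noise floor (steps 60-66)."""
    spectrum = np.fft.rfft(windowed_frame)             # step 60: transform length = frame size
    magnitude = np.abs(spectrum)                       # step 62: sqrt(Re^2 + Im^2)
    noise = estimate_noise(magnitude)                  # step 64: per-line noise estimate (Figure 5)
    threshold = noise * 10.0 ** (threshold_db / 20.0)  # fixed threshold above the noise level
    return int(np.sum(magnitude > threshold))          # step 66: number of tonal components

def multiresolution_tonal_counts(baseline_frame, estimate_noise, sizes=(4096, 2048, 1024)):
    """Cumulative tonal-component count per resolution, scaled to the baseline frame (steps 70-74)."""
    counts = []
    for size in sizes:                                 # step 70: successively smaller frame sizes
        window = np.hanning(size)
        sub = [count_tonal_components(baseline_frame[s:s + size] * window, estimate_noise)
               for s in range(0, len(baseline_frame) - size + 1, size)]
        counts.append(int(np.sum(sub)))                # step 72: sum within the baseline frame
    return counts                                      # one classifier input per resolution
```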
To illustrate the extraction of multiresolution tonal components, consider the following example. The baseline frame size is 4096 samples. The tonal components are extracted at transform lengths of 1024, 2048 and 4096 (without overlapping, for simplicity). Typical results might be:
In the 4096-point transform: 5 components.
In the 2048-point transforms (a total of two transforms in one baseline frame): 15 components, 7 components.
In the 1024-point transforms (a total of 4 transforms in one baseline frame): 3, 10, 17, 4 components.
The numbers passed to the NN inputs will be 5, 22 (= 15 + 7) and 34 (= 3 + 10 + 17 + 4), one at each resolution. Alternatively, the values could be added, 61 = 5 + 22 + 34, and entered as a single value. The algorithm for calculating the time-frequency multiresolutions when increasing the frame size is analogous.

Tone-to-Noise Ratio (TNR)

The tone-to-noise ratio is a measure of the ratio of the total energy in the tonal components to the noise level, and can also be a very relevant feature for discriminating various types of sources. For example, different types of string instruments have different TNR levels. The procedure for the tone-to-noise ratio is similar to the estimation of the number of tonal components described above. Instead of counting the number of tonal components (step 66), the method calculates the ratio of the cumulative energy in the tonal components to the noise level (step 76) and outputs the ratio to the NN classifier (step 78). Measuring TNR at various time-frequency resolutions is also advantageous for providing more robust performance with real-life signals. The frame size is decreased (step 70) and the procedure is repeated for a number of smaller frame sizes. The results from the smaller frames are scaled by averaging over a period of time equal to the baseline frame (step 78). As with the tonal components, the averaged ratio can be output to the classifier at each resolution or can be merged into a single value. Also, the different resolutions for the tonal components and TNR are suitably calculated in parallel.

To illustrate the multiresolution TNR extraction, consider the following example. The baseline frame size is 4096 samples. The TNRs are extracted at transform lengths of 1024, 2048 and 4096 (without overlap, for simplicity). Typical results might be:
In the 4096-point transform: a ratio of 40 dB.
In the 2048-point transforms (a total of 2 transforms in one baseline frame): ratios of 28 dB, 20 dB.
In the 1024-point transforms (a total of 4 transforms in one baseline frame): ratios of 20 dB, 20 dB, 16 dB and 12 dB.
The ratios passed to the NN inputs will be 40 dB, 24 dB and 17 dB, one at each resolution. Alternatively, the values could be averaged (average = 27 dB) and entered as a single value. The algorithm for calculating the time-frequency multiresolutions when increasing the frame size is analogous.

Noise Level Estimation

The noise level used to estimate the tonal components and TNR is a measure of the ambient or undesired portion of the signal. For example, if one were trying to classify or separate the musical instruments in a live acoustic musical performance, the noise level would represent the average acoustic level of the room when the musicians are not playing. A number of algorithms can be used to estimate the noise level in a frame. In one implementation, a low-pass FIR filter can be applied over the amplitudes of the spectral lines.
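A sketch of the per-frame TNR computation (steps 76-78) and its multiresolution averaging, mirroring the worked example above, follows. Non-overlapping sub-frames, the 6 dB marking threshold, and the exact energy definitions (sums of squared magnitudes) are assumptions; `estimate_noise` again stands in for the Figure 5 procedure.

```python
import numpy as np

def tone_to_noise_ratio_db(windowed_frame, estimate_noise, threshold_db=6.0):
    """Ratio of cumulative tonal-component energy to noise energy, in dB (steps 76-78)."""
    magnitude = np.abs(np.fft.rfft(windowed_frame))
    noise = estimate_noise(magnitude)
    tonal = magnitude > noise * 10.0 ** (threshold_db / 20.0)   # same marking as for counting
    tonal_energy = np.sum(magnitude[tonal] ** 2)
    noise_energy = np.sum(noise ** 2) + 1e-12
    return 10.0 * np.log10(tonal_energy / noise_energy + 1e-12)

def multiresolution_tnr_db(baseline_frame, estimate_noise, sizes=(4096, 2048, 1024)):
    """Average the TNRs of the smaller frames over one baseline frame (steps 70, 78)."""
    ratios = []
    for size in sizes:
        window = np.hanning(size)
        sub = [tone_to_noise_ratio_db(baseline_frame[s:s + size] * window, estimate_noise)
               for s in range(0, len(baseline_frame) - size + 1, size)]
        ratios.append(float(np.mean(sub)))      # e.g., 40 dB, 24 dB, 17 dB as in the example
    return ratios
```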
The result of such filtering will be slightly greater than the actual noise level, since it includes the energy of both noise and tonal components. If desired, this can be compensated for by lowering the threshold value. A simple estimate of the noise level is found by applying an FIR filter:

N_i = Σ_{k=-L/2}^{L/2} C_k · A_{i+k}

Where: N_i is the estimated noise level for the i-th spectral line; A are the magnitudes of the spectral lines after the frequency transform; C_k are the FIR filter coefficients; and L is the filter length.

As shown in Figure 5, a more accurate algorithm refines the initial low-pass FIR estimate (step 80) given above to get closer to the actual noise level, by marking components that lie sufficiently above the noise level, for example, 3 dB above the FIR result at each frequency (step 82). Once the components are marked, a counter is initialized, for example, J = 0 (step 84), and the marked components (magnitudes 86) are replaced with the last FIR results (step 88). This step effectively removes the energy of the tonal components from the noise level calculation. The low-pass FIR is reapplied (step 90), the components that lie sufficiently above the noise level are marked (step 92), the counter is incremented (step 94), and the newly marked components are replaced with the last FIR results (step 88). This process is repeated for a desired number of iterations, for example, 3 (step 96). A higher number of iterations results in slightly better accuracy. It is important to note that the noise level estimate itself can be used as a feature to describe and separate the audio sources.

Cepstrum Peak

Cepstrum analysis is normally used in applications related to speech processing. Several characteristics of the cepstrum can be used as parameters for processing. The cepstrum is also descriptive for other types of highly harmonic signals. A cepstrum is the result of taking the inverse Fourier transform of the decibel spectrum, as if it were a signal. The method of extracting a Cepstrum peak is as follows:
1. Separate the input signal into a sequence of frames (step 16).
2. Window the signal in each frame (step 18).
3. Calculate the cepstrum: a. calculate a frequency transform of the windowed signal, for example, an FFT (step 100); b. calculate the logarithmic amplitude of the spectral line magnitudes (step 102); and c. calculate the inverse transform of the logarithmic amplitudes (step 104).
4. The Cepstrum peak is the value and position of the maximum value in the cepstrum (step 106).

Neural Network Classifier

Many known types of neural networks are suitable to operate as classifiers. The current state of the art in neural network architectures and training algorithms makes a feed-forward network (a layered network in which each layer only receives inputs from the previous layers) a very good candidate. Existing training algorithms provide stable results and good generalization. As shown in Figure 7, a feed-forward network 110 includes an input layer 112, one or more hidden layers 114, and an output layer 116. The neurons in the input layer receive the full set of extracted features 118 with respective weights. An off-line supervised training algorithm tunes the weights with which the features are passed to each of the neurons. The hidden layers include neurons with non-linear activation functions.
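A sketch of the iterative noise-level refinement (steps 80-96) and the Cepstrum peak extraction (steps 100-106) follows. The 3 dB mark threshold and 3 iterations follow the description, while the FIR length, the moving-average filter coefficients, and the skipping of the zeroth cepstral bin are assumptions.

```python
import numpy as np

def estimate_noise(magnitude, fir_len=15, mark_db=3.0, iterations=3):
    """Iteratively refined low-pass FIR estimate of the per-line noise level (Figure 5)."""
    kernel = np.ones(fir_len) / fir_len                   # simple low-pass FIR (step 80)
    work = magnitude.astype(float).copy()
    noise = np.convolve(work, kernel, mode="same")
    for _ in range(iterations):                           # steps 82-96
        mark = work > noise * 10.0 ** (mark_db / 20.0)    # mark lines well above the estimate
        work[mark] = noise[mark]                          # replace marked lines with FIR result
        noise = np.convolve(work, kernel, mode="same")    # re-apply the low-pass FIR
    return noise

def cepstrum_peak(windowed_frame):
    """Value and position of the maximum of the cepstrum (steps 100-106)."""
    spectrum = np.abs(np.fft.rfft(windowed_frame))        # step 100
    log_mag = 20.0 * np.log10(spectrum + 1e-12)           # step 102: decibel spectrum
    cepstrum = np.fft.irfft(log_mag)                      # step 104
    # Skipping the zeroth bin is an assumption; the text simply takes the cepstrum maximum.
    pos = int(np.argmax(cepstrum[1:])) + 1
    return cepstrum[pos], pos                              # step 106
```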
Multiple layers of neurons with non-linear transfer functions allow the network to learn both linear and non-linear relationships between the input and output signals. The number of neurons in the output layer is equal to the number of types of sources that the classifier can recognize. Each of the network outputs signals the presence of a certain type of source 120, and its value in [0, 1] indicates the confidence that the input signal includes that audio source. If sub-band filtering is used, the number of output neurons can be equal to the number of sources multiplied by the number of sub-bands. In this case, the output of a neuron indicates the presence of a particular source in a particular sub-band. The output neuron values can be used "as is", thresholded to retain only the values above a certain level, or thresholded to retain just the most predominant source.

The network must be pre-trained on a sufficiently representative set of signals. For example, for a system capable of recognizing four different source types (male voice, female voice, percussion instruments and string instruments), all of these source types should be presented in the training set in sufficient variety. It is not necessary to exhaustively present all possible types of sources, due to the generalization capability of the neural network. Each recording must be passed through the feature extraction part of the algorithm. The extracted features are then arbitrarily divided into two data sets: training and validation. One of the well-known supervised training algorithms is then used to train the network (for example, the Levenberg-Marquardt algorithm). The strength of the classifier depends strongly on the set of extracted features. If the features together differentiate the different sources, the classifier will perform well. The use of multiresolution and sub-band filtering to augment the standard audio features provides a much stronger feature set for differentiating and properly classifying the audio sources in the monophonic signal.

In an exemplary embodiment, a 5-3-3 feed-forward network architecture (5 neurons in the input layer, 3 neurons in the hidden layer, and 3 neurons in the output layer) with tansig (hyperbolic tangent) activation functions in all layers performed well for the classification of three types of sources: voice, percussion and string. In the feed-forward architecture used, each neuron of a given layer is connected to every neuron of the preceding layer (except for the input layer). Each neuron in the input layer received the full set of extracted features. The features presented to the network included multiresolution tonal components, multiresolution TNR, and the Cepstrum peak, which were pre-normalized to fit in the range [-1, 1]. The first network output indicated the presence of a voice source in the signal. The second output indicated the presence of string instruments. Finally, the third output was trained to indicate the presence of percussion instruments. In each layer, a 'tansig' activation function was used. A computationally effective formula for the output of the k-th neuron in the j-th layer is given by:

A_{j,k} = tansig( Σ_i W_{j,k,i} · A_{j-1,i} )

Where: A_{j,k} is the output of the k-th neuron in the j-th layer, and W_{j,k,i} are the weights of that neuron (established during training).
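A sketch of the forward pass of such a fully connected feed-forward network with tansig activations in every layer (e.g., the 5-3-3 architecture) is given below. The weight matrices would come from supervised training (e.g., Levenberg-Marquardt) and are not shown; a bias term, common in practice, is omitted to mirror the formula above.

```python
import numpy as np

def tansig(x):
    # tansig(x) = 2 / (1 + exp(-2x)) - 1, which is equivalent to tanh(x)
    return np.tanh(x)

def forward(features, weights):
    """Forward pass of a layered feed-forward classifier with tansig in every layer.

    'features' are the pre-normalized multiresolution tonal, TNR and Cepstrum-peak
    values in [-1, 1]; weights[j] has shape (neurons in layer j, inputs to layer j).
    """
    activation = np.asarray(features, dtype=float)
    for W in weights:                            # one weight matrix per layer
        activation = tansig(W @ activation)      # A_{j,k} = tansig(sum_i W_{j,k,i} * A_{j-1,i})
    return activation                            # one confidence value per source type
```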
For the input layer the formula is:

A_{1,k} = tansig( Σ_j W_{1,k,j} · F_j )

Where: F_j is the j-th extracted feature and W_{1,k,j} are the weights of that neuron (established during training).

To test a simple classifier, a large audio file was concatenated from three different types of audio signals. The blue lines represent the actual presence of voice 130 (German speech), a percussion instrument 132 (cymbal), and a string instrument 134 (acoustic guitar). The file is approximately 800 frames in length, in which the first 370 frames are voice, the next 100 frames are percussion and the last 350 frames are string. Sudden interruptions in the blue lines correspond to periods of silence in the input signal. The green lines represent the predictions of voice 140, percussion 142 and string 144 given by the classifier. The output values have been filtered to reduce the noise. The distance of the network output from 0 or 1 is a measure of how confident the classifier is that the input signal includes that particular audio source. Although the audio file represents a monophonic signal in which no two of the audio sources are present at the same time, it is adequate and simple for demonstrating the capability of the classifier. As shown in Figure 8c, the classifier identified the string instrument with great confidence and without errors. As shown in Figures 8a and 8b, the performance on the voice and percussion signals was satisfactory, although there was some overlap. The use of multiresolution tonal components would distinguish more effectively between the percussion instruments and the voice fragments (in fact, the unvoiced fragments of speech).

The classifier outputs can be used as a front end to create multiple audio channels for a source separation algorithm (for example, ICA) or as parameters in a post-processing algorithm (for example, to categorize music, track sources, generate audio indexes for navigation purposes, remixing, security and surveillance, telephone and wireless communication, and teleconferencing).

As shown in Figure 9, the classifier is used as a front end to a blind source separation (BSS) algorithm 150 such as ICA, which requires as many input channels as sources it is trying to separate. Assume the BSS algorithm is to separate the voice, percussion and string sources from a monophonic signal, which it cannot do on its own. The NN classifier can be configured with output neurons 152 for voice, percussion and string. The neuron values are used as weights to mix (154) each frame of the monophonic audio signal in audio channel 156 into three separate audio channels, one each for voice 158, percussion 160 and string 162. The weights may be the actual neuron values, or thresholded values identifying the single dominant source per frame. This procedure can also be refined using sub-band filtering to produce many more input channels for the BSS. The BSS uses powerful algorithms to further refine the initial source separation provided by the NN classifier.

As shown in Figure 10, the NN output layer neurons 170 can be used in a post-processor 172 that operates on the monophonic audio signal in the audio channel 174.

Tracking - the algorithm can be applied to individual channels obtained with other algorithms (for example, BSS) that work on a frame-by-frame basis. With the help of the algorithm's output, linking of neighboring frames can be made possible, more stable or simpler.
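A sketch of the Figure 9 front end follows: per-frame classifier outputs are used as weights (or thresholded to pick the dominant source) to route each frame of the mono signal into per-source channels for a downstream BSS/ICA stage. Non-overlapping frames and simple gain weighting are simplifying assumptions.

```python
import numpy as np

def remix_to_channels(mono, measures, frame_size, dominant_only=False):
    """Mix each mono frame into one channel per source, weighted by classifier outputs.

    'measures' is a frames x sources array of neural network output values in [0, 1].
    """
    n_frames, n_sources = measures.shape
    channels = np.zeros((n_sources, n_frames * frame_size))
    for f in range(n_frames):
        frame = mono[f * frame_size:(f + 1) * frame_size]
        weights = measures[f]
        if dominant_only:                        # threshold to the most prominent source
            weights = (weights == weights.max()).astype(float)
        for s in range(n_sources):
            channels[s, f * frame_size:(f + 1) * frame_size] = weights[s] * frame
    return channels                              # input channels for a BSS/ICA stage
```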
Audio Identification and Audio Search Engines - patterns extracted from the signal types, and possibly their durations, can be used as an index into a database (or as a key for a hash table).
Coding/decoding - information on the type of the signal allows an encoder or decoder to fine-tune a psychoacoustic model, bit allocation or other coding parameters.

Front end for source separation - algorithms such as ICA require at least as many input channels as sources. The algorithm can be used to create multiple audio channels from the single channel, or to increase the number of individual input channels available.

Remixing - the individually separated channels can be remixed back into a monophonic representation (or a representation with a reduced number of channels) with a post-processing algorithm (such as an equalizer) in between.

Security and surveillance - the outputs of the algorithm can be used as parameters in a post-processing algorithm to improve the intelligibility of the recorded audio.

Telephone and wireless communication, and teleconferencing - the algorithm can be used to separate individual speakers/sources, and a post-processing algorithm can assign them individual virtual positions in a stereo or multichannel environment. Only a small number of channels (or possibly just a single channel) needs to be transmitted.

While several illustrative embodiments of the invention have been shown and described, numerous variations and alternative embodiments will occur to those skilled in the art. Such variations and alternative embodiments are contemplated, and can be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (27)

CLAIMS

1. A method for separating audio sources from a monophonic audio signal, characterized in that it comprises: (a) providing a monophonic audio signal comprising a down-mix of a plurality of unknown audio sources; (b) separating the audio signal into a sequence of baseline frames; (c) windowing each frame; (d) extracting from each baseline frame a plurality of audio features that tend to distinguish audio sources; and (e) applying the audio features to a neural network (NN) classifier trained on a representative set of audio sources with the audio features, the neural network classifier producing at least one measure of an audio source included in each baseline frame of the monophonic audio signal.

2. The method according to claim 1, characterized in that the plurality of unknown audio sources are selected from a set of musical sources comprising at least voice, strings and percussion.

3. The method according to claim 1, further characterized in that it comprises: repeating steps (b) to (d) for a different frame size to extract features at multiple resolutions; and scaling the audio features extracted at the different resolutions to the baseline frame.

4. The method according to claim 3, further characterized in that it comprises applying the scaled features at each resolution to the NN classifier.

5. The method according to claim 3, further characterized in that it comprises merging the scaled features at each resolution into a single feature that is applied to the NN classifier.

6. The method according to claim 1, further characterized in that it comprises filtering the frames into a plurality of frequency sub-bands and extracting the audio features from the sub-bands.

7. The method according to claim 1, further characterized in that it comprises low-pass filtering the classifier outputs.

8. The method according to claim 1, characterized in that one or more audio features are selected from a set comprising tonal components, tone-to-noise ratio (TNR) and Cepstrum peak.

9. The method according to claim 8, characterized in that the tonal components are extracted by: (f) applying a frequency transform to the windowed signal for each frame; (g) calculating the magnitude of the spectral lines of the frequency transform; (h) estimating a noise level; (i) identifying as tonal components the spectral components that exceed the noise level by a threshold amount; and (j) outputting the number of tonal components as the tonal component feature.

10. The method according to claim 9, characterized in that the length of the frequency transform is equal to the number of audio samples in the frame for a certain time-frequency resolution.

11. The method according to claim 10, further characterized in that it comprises: repeating steps (f) to (i) for different frame and transform lengths; and outputting a cumulative number of tonal components at each time-frequency resolution.

12. The method according to claim 8, characterized in that the TNR feature is extracted by: (k) applying a frequency transform to the windowed signal for each frame; (l) calculating the magnitude of the spectral lines of the frequency transform; (m) estimating a noise level; (n) determining a ratio of the energy of the identified tonal components to the noise level; and (o) outputting the ratio as the TNR feature.
13. The method according to claim 12, characterized in that the length of the frequency transform is equal to the number of audio samples in the frame for a certain time-frequency resolution.

14. The method according to claim 13, further characterized in that it comprises: repeating steps (k) to (n) for different frame and transform lengths; and averaging the ratios from the different resolutions over a period of time equal to the baseline frame.

15. The method according to claim 12, characterized in that the noise level is estimated by: (p) applying a low-pass filter over the magnitudes of the spectral lines, (q) marking components sufficiently above the filter output, (r) replacing the marked components with the result of the low-pass filter, (s) repeating steps (q) and (r) a number of times, and (t) outputting the resulting components as the noise level estimate.

16. The method according to claim 1, characterized in that the neural network classifier includes a plurality of output neurons that each indicate the presence of a certain audio source in the monophonic audio signal.

17. The method according to claim 16, characterized in that the value of each output neuron indicates a confidence that the baseline frame includes a certain audio source.

18. The method according to claim 1, further characterized in that it comprises using the measure to remix the monophonic audio signal into a plurality of audio channels for the respective audio sources in the representative set.

19. The method according to claim 18, characterized in that the monophonic audio signal is remixed by switching it to the audio channel identified as the most prominent.

20. The method according to claim 18, characterized in that the neural network classifier produces a measure for each of the audio sources in the representative set indicating a confidence that the frame includes the corresponding audio source, the monophonic audio signal being attenuated by each of the measures and directed to the respective audio channels.

21. The method according to claim 18, further characterized in that it comprises processing the plurality of audio channels using a source separation algorithm that requires at least as many input audio channels as audio sources, to separate the plurality of audio channels into an equal or lesser plurality of audio sources.

22. The method according to claim 21, characterized in that the source separation algorithm is based on blind source separation (BSS).

23. The method according to claim 1, further characterized in that it comprises passing the monophonic audio signal and the sequence of measures to a post-processor that uses the measures to augment the post-processing of the monophonic audio signal.
24. A method for separating audio sources from a monophonic audio signal, characterized in that it comprises: (a) providing a monophonic audio signal comprising a down-mix of a plurality of unknown audio sources; (b) separating the audio signal into a sequence of baseline frames; (c) windowing each frame; (d) extracting from each baseline frame a plurality of audio features that tend to distinguish audio sources; (e) repeating steps (b) to (d) for a different frame size to extract features at multiple resolutions; (f) scaling the audio features extracted at the different resolutions to the baseline frame; and (g) applying the audio features to a neural network (NN) classifier trained on a representative set of audio sources with the audio features, the neural network classifier having a plurality of output neurons that each indicate the presence of a certain audio source in the monophonic audio signal for each baseline frame.

25. An audio source classifier, characterized in that it comprises: a frame former for separating a monophonic audio signal comprising a down-mix of a plurality of unknown audio sources into a sequence of windowed baseline frames; a feature extractor for extracting from each baseline frame a plurality of audio features that tend to distinguish audio sources; and a neural network (NN) classifier trained on a representative set of audio sources with the audio features, the neural network classifier receiving the extracted audio features and producing at least one measure of an audio source included in each baseline frame of the monophonic audio signal.

26. The audio source classifier according to claim 25, characterized in that the feature extractor extracts one or more of the audio features at multiple time-frequency resolutions.

27. The audio source classifier according to claim 25, characterized in that the NN classifier has a plurality of output neurons each signaling the presence of a certain audio source in the monophonic audio signal for each baseline frame.
MX/A/2008/004572A 2005-10-06 2008-04-04 Neural network classifier for seperating audio sources from a monophonic audio signal MX2008004572A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11244554 2005-10-06

Publications (1)

Publication Number Publication Date
MX2008004572A 2008-10-03

Similar Documents

Publication Publication Date Title
RU2418321C2 (en) Neural network based classfier for separating audio sources from monophonic audio signal
Sharma et al. Trends in audio signal feature extraction methods
KR101101384B1 (en) Parameterized temporal feature analysis
US9313593B2 (en) Ranking representative segments in media data
AU2002240461B2 (en) Comparing audio using characterizations based on auditory events
JP4067969B2 (en) Method and apparatus for characterizing a signal and method and apparatus for generating an index signal
Ramalingam et al. Gaussian mixture modeling of short-time Fourier transform features for audio fingerprinting
JP2004530153A6 (en) Method and apparatus for characterizing a signal and method and apparatus for generating an index signal
CN106997765B (en) Quantitative characterization method for human voice timbre
EP3504708B1 (en) A device and method for classifying an acoustic environment
WO2019053544A1 (en) Identification of audio components in an audio mix
Doets et al. Distortion estimation in compressed music using only audio fingerprints
Pilia et al. Time scaling detection and estimation in audio recordings
Rizzi et al. Genre classification of compressed audio data
Haubrick et al. Robust audio sensing with multi-sound classification
Uhle et al. Speech enhancement of movie sound
de León et al. A complex wavelet based fundamental frequency estimator in singlechannel polyphonic signals
MX2008004572A (en) Neural network classifier for separating audio sources from a monophonic audio signal
Htun Analytical approach to MFCC based space-saving audio fingerprinting system
Uzun et al. A preliminary examination technique for audio evidence to distinguish speech from non-speech using objective speech quality measures
Lewis et al. Blind signal separation of similar pitches and instruments in a noisy polyphonic domain
Kaur et al. Audio Post-Processing Identification Using MFCC and LPC Feature
Fenton Audio Dynamics: Towards a Perceptual Model of'punch'.
Alías Pujol et al. A Review of physical and perceptual feature extraction techniques for speech, music and environmental sounds
CN115620731A (en) Voice feature extraction and detection method