WO2011076284A1 - An apparatus - Google Patents

An apparatus

Info

Publication number
WO2011076284A1
Authority
WO
WIPO (PCT)
Prior art keywords
frequency component
audio signal
least
fundamental frequency
speech
Prior art date
Application number
PCT/EP2009/067894
Other languages
French (fr)
Inventor
Koray Ozcan
Jukka Vesa Rauhala
Original Assignee
Nokia Corporation
Priority date
Filing date
Publication date
Application filed by Nokia Corporation filed Critical Nokia Corporation
Priority to PCT/EP2009/067894 priority Critical patent/WO2011076284A1/en
Publication of WO2011076284A1 publication Critical patent/WO2011076284A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 Pitch determination of speech signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain

Definitions

  • the present invention relates to apparatus for processing of audio signals.
  • the invention further relates to, but is not limited to, apparatus for processing audio and speech signals in audio playback devices.
  • Audio processing and in particular audio processing in mobile devices have been a growing area in recent years.
  • the use of apparatus to process audio signals, for example music, to separate instruments from vocals or different sources from each other in order to focus on one component of an audio signal is a broadly researched area.
  • Music audio signals typically contain both "speech" (voice) and “non-speech” (non-voice) frequency components where the non-speech frequency components may originate from instruments and other background noise components.
  • the result of the extractions may be used to process the audio signals to attempt to enhance the voice components, for example where a music recording is incorrectly mixed or recorded and the non-speech frequency components dominate the speech components, rendering the recording difficult to understand.
  • the audio processor may attempt to remove the voice component, for example in "home” karaoke devices.
  • One known approach to enhance the speech components is to configure the apparatus to increase the overall amplitude or volume level.
  • Such an operation would increase both the speech and non-speech frequency components equally and, although making the speech components clearer, would also make the non-speech frequency components louder and possibly reach dangerous volume levels.
  • spectral amplifiers are typically difficult for untrained operators to balance. Spectral amplifiers require a trained "ear" to operate as they remove or enhance a broad spectrum of audio signals. Some commercial software packages attempt to identify the speech components of audio signals by using the centre channel assumption and mixing the multichannels to a centre channel. In other words a multichannel recording is assumed to have the majority of speech components as common components present in both the left and right channels.
  • a method comprising: dividing at least one audio signal into a plurality of segments; determining at least one fundamental frequency component over a predetermined number of segments; and processing the at least one audio signal dependent on the at least one fundamental speech frequency component.
  • Selecting each of the at least one consistent peak frequency component may comprise determining for at least a first range of consecutive segments a peak frequency component within a range of frequency values.
  • the first range of consecutive segments may comprise at least: a first range defined by a minimum number of segments; and a first range defined by a minimum and maximum number of segments.
  • Processing the at least one audio signal dependent on the at least one fundamental frequency component may comprise at least one of: filtering the audio signal with a comb filter set at the fundamental frequency; suppressing the fundamental frequency and harmonics of the fundamental frequency in the audio signal; and enhancing the fundamental frequency and harmonics of the fundamental frequency in the audio signal.
  • the at least one audio signal may comprise a centre channel mix of a multiple channel signal.
  • the at least one audio signal may comprise the output of a band pass filter configured to output speech frequencies.
  • an apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: dividing at least one audio signal into a plurality of segments; determining at least one fundamental frequency component over a predetermined number of segments; and processing the at least one audio signal dependent on the at least one fundamental frequency component. Determining at least one fundamental frequency component may cause the apparatus at least to perform: determining at least one frequency component; and selecting as the fundamental frequency component the frequency component with the greatest magnitude.
  • Determining at least one frequency component may cause the apparatus at least to perform selecting at least one consistent peak frequency component. Selecting each of the at least one consistent peak frequency component may cause the apparatus at least to perform determining for at least a first range of consecutive segments a peak frequency component within a range of frequency values.
  • the first range of consecutive segments may comprise at least: a first range defined by a minimum number of segments; and a first range defined by a minimum and maximum number of segments.
  • Processing the at least one audio signal dependent on the at least one fundamental frequency component may cause the apparatus at least to perform at least one of: filtering the audio signal with a comb filter set at the fundamental frequency; suppressing the fundamental frequency and harmonics of the fundamental frequency in the audio signal; and enhancing the fundamental frequency and harmonics of the fundamental frequency in the audio signal.
  • the at least one audio signal may comprise a centre channel mix of a multiple channel signal.
  • the at least one audio signal may comprise the output of a band pass filter configured to output speech frequencies.
  • the at least one fundamental frequency component may be an at least one fundamental speech frequency component.
  • the at least one frequency component may be an at least one speech frequency component.
  • an apparatus comprising: an audio signal processor configured to divide at least one audio signal into a plurality of segments; a component identifier configured to determine at least one fundamental frequency component over a predetermined number of segments; and a filter configured to process the at least one audio signal dependent on the at least one fundamental frequency component.
  • the component identifier may further be configured to determine at least one frequency component; and select as the fundamental frequency component the frequency component with the greatest magnitude.
  • the component identifier may further be configured to select at least one consistent peak frequency component as the frequency component.
  • the component identifier may further be configured to select the at least one consistent peak frequency component by determining for at least a first range of consecutive segments a peak frequency component within a range of frequency values.
  • the first range of consecutive segments may comprise at least: a first range defined by a minimum number of segments; and a first range defined by a minimum and maximum number of segments.
  • the filter may be configured to perform at least one of: filtering the audio signal with a comb filter set at the fundamental frequency; suppressing the fundamental frequency and harmonics of the fundamental frequency in the audio signal; and enhancing the fundamental frequency and harmonics of the fundamental frequency in the audio signal.
  • the at least one audio signal may comprise a centre channel mix of a multiple channel signal.
  • the at least one audio signal may comprise the output of a band pass filter configured to output speech frequencies.
  • an apparatus comprising: means for dividing at least one audio signal into a plurality of segments; means for determining at least one fundamental frequency component over a predetermined number of segments; and means for processing the at least one audio signal dependent on the at least one fundamental frequency component.
  • a computer-readable medium encoded with instructions that, when executed by a computer perform: dividing at least one audio signal into a plurality of segments; determining at least one fundamental frequency component over a predetermined number of segments; and processing the at least one audio signal dependent on the at least one fundamental frequency component.
  • An electronic device may comprise apparatus as described above.
  • a chipset may comprise apparatus as described above.
  • Figure 1 shows schematically an apparatus employing embodiments of the application
  • Figure 2 shows schematically the apparatus shown in Figure 1 according to some embodiments of the application
  • Figure 3 shows an overview of the operation of the analyser as employed in some embodiments
  • Figure 4 shows schematically the analyser as shown in Figure 2 in further detail
  • Figure 5 shows a flow diagram illustrating the operation of the analyser according to some embodiments of the application.
  • Figure 6 shows schematically the speech gain filter as shown in Figure 2 according to some embodiments of the application
  • Figure 7 shows schematically a further view of the apparatus according to some further embodiments of the application.
  • Figure 8 shows schematically a view of the apparatus according to some other embodiments of the application.
  • Figure 9 shows a time domain trace of an example audio signal before and after processing according to some embodiments of the application.
  • Figure 1 shows a schematic block diagram of an exemplary electronic device 10 or apparatus, which may incorporate a voice/speech component extractor.
  • the apparatus 10 may for example be a mobile terminal or user equipment for a wireless communication system.
  • the electronic device may be a Television (TV) receiver, portable digital versatile disc (DVD) player, or audio player such as an mp3 player.
  • the apparatus 10 comprises a processor 21 which may be linked via a digital-to-analogue converter 32 to a playback speaker configured to provide a suitable audio playback.
  • the playback speaker in some embodiments may be any suitable loudspeaker. In some other embodiments the playback speaker may be a headphone or ear worn speaker (EWS) set. In some embodiments the apparatus 10 may comprise a headphone connector for receiving a headphone or headset 33.
  • the processor 21 is in some embodiments further linked to a transceiver (TX/RX) 13, to a user interface (Ul) 15 and to a memory 22.
  • the processor 21 may be configured to execute various program codes.
  • the implemented program codes comprise a voice/speech component extractor for extracting a voice/speech component from an audio signal and thus process the audio signal based on the extraction output.
  • the implemented program codes 23 may be stored for example in the memory 22 for retrieval by the processor 21 whenever needed.
  • the memory 22 could further provide a section 24 for storing data, for example data that has been processed in accordance with the embodiments.
  • the voice/speech component extracting code may in embodiments be implemented in hardware or firmware.
  • the user interface 15 in some embodiments enables a user to input commands to the electronic device 10, for example via a keypad, and/or to obtain information from the electronic device 10, for example via a display.
  • the transceiver 13 enables a communication with other electronic devices, for example via a wireless communication network.
  • the apparatus 10 may in some embodiments further comprise at least two microphones 11 for inputting audio or speech that is to be processed according to embodiments of the application or transmitted to some other electronic device or stored in the data section 24 of the memory 22.
  • a corresponding application to capture stereo audio signals using the at least two microphones may be activated to this end by the user via the user interface 15.
  • the apparatus 10 in such embodiments may further comprise an analogue-to-digital converter 14 configured to convert the input analogue audio signal into a digital audio signal and provide the digital audio signal to the processor 21.
  • the apparatus 10 may in some embodiments receive a bit stream with a correspondingly encoded audio data from another electronic device via the transceiver 13.
  • the processor 21 may execute the voice/speech component processing program code stored in the memory 22.
  • the processor 21 in these embodiments may process the received audio signal data, and output the extracted and processed data.
  • the headphone connector 33 may be configured to communicate to a headphone set or earplugs wirelessly, for example by a Bluetooth profile, or using a conventional wired connection.
  • the received stereo audio data may in some embodiments also be stored, instead of being processed immediately, in the data section 24 of the memory 22, for instance for enabling a later processing and presentation or forwarding to still another electronic device.
  • each channel is processed independently of other channels.
  • the audio signal speech extraction and processing processor 151 is shown as a series of slices, each slice configured to extract voice parameters from, and process, a different audio signal input channel. Therefore with respect to Figure 2, each audio channel is input to a separate slice.
  • a first channel input X1 is input to a first slice 1511 which outputs a processed signal Y1
  • a second audio channel input X2 is input to a second slice 1512 which outputs a processed signal Y2
  • an Nth audio channel input XN is input to an Nth slice 151N which outputs a processed signal YN.
  • each of the slices 1511 to 151N may be at least partially implemented in the processor 21.
  • the slice operations may thus be implemented at least partially as code stored within the memory 22 and implemented on the processor 21.
  • each slice may be implemented in dedicated hardware separate from the processor 21.
  • although each slice is shown implemented in parallel, it would be understood that in some embodiments at least some slices may be implemented sequentially on the same apparatus.
  • the slice 1511 receives the audio signal first channel X1 and inputs the audio signal first channel X1 into an analyser 101 and a speech gain filter 103.
  • the analyser 101 is configured in some embodiments to determine whether or not the audio signal input Xi comprises speech (or voice) signal components and the location of such components in order to output a control signal to the speech gain filter 103 to control the filtering of the audio signal and output a voice/speech processed audio signal.
  • the analyser on receiving the audio signal is configured to detect all fundamental frequencies within the defined frequency range. In other words, within an expected frequency range of speech (or voice) signals, any fundamental activity is detected.
  • after detecting any activity in the fundamental frequencies within the defined frequency range, the analyser 101 attempts to determine any speech components within the fundamental frequency activity. The operation of determining speech components within the fundamental frequencies is shown in Figure 3 by step 113.
  • the analyser 101 is configured in some embodiments to select the most likely speech component fundamental frequency.
  • the step of selecting the most likely speech component fundamental frequency from possible multiple fundamental frequencies is shown in Figure 3 by step 115.
  • the analyser 101 may then output to the speech gain filter 103 an indication of whether or not the audio signal comprises any voice/speech components and where the speech/voice component is located in frequency relative to the audio signal.
  • in Figure 4 a schematic view of the analyser is shown in detail according to some embodiments.
  • Figure 5 shows the operations implemented by the analyser 101 according to these embodiments.
  • the audio signal in some embodiments is first received within the analyser 101 by a segment/window generator 201.
  • the segment/window generator receives the time domain audio samples and generates a segment (or frame) of these audio sample values.
  • the segment/window generator 201 may apply a 40 millisecond window length to the received audio signal samples.
  • the segment/window generator generates a segment (or frame or window) every 20 milliseconds. In other words a new segment is started after 20 milliseconds with the consequence that each segment overlaps by 20 milliseconds with the previous segment and by 20 milliseconds with the next segment. It would be understood that any suitable window size and repetition may be used, for example in further embodiments the window size may be from 20 to 50 milliseconds long and have a similar size range of overlaps,
  • segment/window generator 201 may further apply a weighting function to the samples.
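The segmentation described in the preceding bullets can be sketched as follows. This is a minimal illustration only: the 40 ms window started every 20 ms follows the text, while the 48 kHz sample rate and the Hann window as the weighting function are assumptions the patent does not fix.

```python
import numpy as np

def segment_signal(x, fs=48000, win_ms=40, hop_ms=20):
    """Split a 1-D audio signal into overlapping, windowed segments.

    A 40 ms window is started every 20 ms, so consecutive segments
    overlap by 20 ms with their neighbours, as in the text.
    """
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_seg = 1 + max(0, (len(x) - win) // hop)
    w = np.hanning(win)  # weighting function applied to each segment
    return np.stack([x[i * hop : i * hop + win] * w for i in range(n_seg)])

frames = segment_signal(np.random.randn(48000))  # one second of test audio
```

With the 40 ms window and 20 ms hop above, one second of 48 kHz audio yields 49 overlapping segments of 1920 samples each.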
  • the generation of the segment is shown in Figure 5 by step 401.
  • the segment samples in some embodiments are output to the time to frequency domain converter 203.
  • the input audio signal is not a digital audio signal
  • the segment/window generator 201 may further comprise an analogue to digital converter of any suitable type in order to generate digital signals suitable for further processing.
  • the time to frequency domain converter 203 receives the segmented audio samples and generates frequency domain component values for each segment period.
  • the time to frequency domain converter 203 may in some embodiments be a Fast Fourier Transformer, a Discrete Cosine Transformer, a Wavelet Transformer or any suitable time to frequency domain converter.
  • the time to frequency domain converter is configured to output only real frequency domain components whereas in other embodiments, both real and imaginary components may be output in order to preserve both amplitude and phase information.
  • any suitable practical time to frequency domain converter may be used in embodiments of the application.
  • suitable time to frequency domain transforms include the Discrete Fourier Transform, the short-term FFT, and Goertzel's algorithm.
  • a zero-padding operation may further be applied to improve the accuracy of the application.
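The time to frequency domain conversion with zero-padding might be sketched as below; the 4096-point FFT length is an assumed value used only for illustration, and the real-valued FFT stands in for whichever transform an embodiment uses.

```python
import numpy as np

def segment_spectrum(frame, n_fft=4096):
    """Magnitude spectrum of one segment, zero-padded to n_fft points.

    Zero-padding interpolates the spectrum and so refines the frequency
    grid available to the later peak search.
    """
    return np.abs(np.fft.rfft(frame, n=n_fft))

fs = 48000
frame = np.sin(2 * np.pi * 200 * np.arange(int(0.04 * fs)) / fs)  # 40 ms tone
spec = segment_spectrum(frame)
peak_hz = int(np.argmax(spec)) * fs / 4096  # convert bin index back to Hz
```

For the 200 Hz test tone the strongest bin lands within one bin spacing (about 11.7 Hz here) of the true frequency.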
  • the frequency domain components may then be output to the sub-band peak detector 207 and a peak energy determiner 209.
  • the sub-band peak detector 207 may be configured in some embodiments to compare the received frequency domain components (or sub-bands) and output an indication or signals indicating the activity in the frequency domain components or sub-bands. For example the sub-band peak detector 207 may indicate peak value sub-bands of the frequency domain signal. The sub-band peak detector 207 may be configured in some embodiments to determine a "peak" by determining the derivative of the sub-band coefficient value and choosing a sub-band as a peak sub-band where the derivative sign switches before and after the sub-band.
  • the sub-bands may be processed to calculate the difference between neighbouring sub-bands and a sub-band selected as being a potential peak value if the sign of the difference between the current sub-band and the previous sub band is different from the sign of the difference between the next sub-band and the current sub-band.
  • only a predetermined number of the sub-bands are selected to be passed to the peak comparator: these are the sub-bands with the nth highest energies for the current segment that have been determined to be peak sub-bands.
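The sign-change peak picking of the preceding bullets can be illustrated as follows; keeping only the n_top highest-energy peaks mirrors the predetermined-number selection, although n_top = 4 is an assumed value.

```python
import numpy as np

def peak_subbands(spec, n_top=4):
    """Pick peak sub-bands by the sign-change rule from the text: a bin
    is a peak when the difference to its predecessor is positive and the
    difference to its successor is negative. Only the n_top
    highest-energy peaks are kept.
    """
    d = np.diff(spec)
    peaks = [i for i in range(1, len(spec) - 1) if d[i - 1] > 0 and d[i] < 0]
    peaks.sort(key=lambda i: spec[i], reverse=True)
    return peaks[:n_top]

spec = np.array([0.0, 1.0, 0.5, 2.0, 0.2, 3.0, 0.1])
print(peak_subbands(spec))  # → [5, 3, 1]
```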
  • the peak indicator values may be passed, in some embodiments, to the segment peak comparator and to the peak energy determiner 209.
  • the peak energy determiner 209 receives the peak sub-band indicator values representing frequencies where there is activity from the sub-band peak detector 207 and also the frequency domain components of the segment. The peak energy determiner 209 may then calculate the ratio of the energy at the detected fundamental frequency and its harmonics compared to the total energy of the segment.
  • the determination of the peak sub-band energy is shown in Figure 5 by step 409.
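The energy ratio computed by the peak energy determiner 209 can be illustrated as below; the number of harmonics considered and the +/- width bins gathered around each harmonic are assumptions made for the sketch.

```python
import numpy as np

def harmonic_energy_ratio(spec, f0_bin, n_harmonics=5, width=1):
    """Ratio of the energy at the detected fundamental and its harmonics
    to the total energy of the segment."""
    energy = spec ** 2
    harm = 0.0
    for k in range(1, n_harmonics + 1):
        b = k * f0_bin
        if b >= len(spec):
            break
        harm += energy[max(0, b - width) : b + width + 1].sum()
    return harm / energy.sum()

spec = np.zeros(100)
spec[10], spec[20], spec[50] = 2.0, 1.0, 1.0    # fundamental + two harmonics
ratio = harmonic_energy_ratio(spec, f0_bin=10)  # here all energy is harmonic
```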
  • the segment peak comparator/speech determiner 211 receives in some embodiments both the indications of peak sub-band activity and also the peak energy values from the sub-band peak detector 207 and the peak energy determiner 209 respectively and compares these values against previous segment peak values.
  • the segment peak comparator 211 compares the current indicated peak values against the previous M segment peak values to determine whether or not any of the current peak values is substantially similar to a previous peak sub-band value and/or how many previous peak sub-bands it is similar to.
  • where a current segment peak value is similar to a range of previous segment peak sub-band values then this current segment peak value is selected as a potential fundamental frequency candidate and output to the fundamental frequency selector.
  • where each sub-band has an index value i, then where at least one of the current segment peak value sub-band index values {it} is within a range Nrange of index values of previous segment peak values {it-1}, the sub-band index (either the current or the previous peak value) is noted.
  • the search range is split equally so that for each of the current sub-band peak values a range of previous peak values of [it - Nrange/2, it + Nrange/2] is searched. This is because speech may have transients where the fundamental frequency slides over the same syllable.
  • furthermore the operation of selecting substantially similar peak sub-bands over a range of segments is shown in Figure 5 by step 413.
  • the segment peak comparator comprises a linked list memory where each element in the linked list comprises a sub-band index value and an integer value representing the number of consecutive segments the sub-band has been "active" for.
  • the segment peak comparator 211 compares the current "active" or peak sub-band index value against the list. If the current peak sub-band index is outside the search range [it - Nrange/2, it + Nrange/2] then the list is incremented by an additional element with the current peak sub-band index and a consecutive segment value of 1. If the current peak sub-band index is inside the search range [it - Nrange/2, it + Nrange/2] then the list entries within the search range are amended so that their sub-band index value is the current index value and the consecutive segment value is incremented by 1. In such embodiments where the search generates more than one 'match' within the search range the entry with the largest consecutive count is chosen.
  • the list is then pruned to remove all entries which have not been updated or added in the current segment.
  • the list contains a series of entries of frequency indicators and the number of consecutive segments within which the frequency has been a 'peak' value.
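The bookkeeping performed by the segment peak comparator 211 might be sketched as below, using a plain Python list in place of the linked list; Nrange = 4 is an assumed value for the search range.

```python
def track_peaks(segments_peaks, n_range=4):
    """Track how many consecutive segments each peak sub-band persists.

    Entries are [sub-band index, consecutive-segment count]. An incoming
    peak within +/- n_range/2 bins of an entry continues that entry
    (taking over its index), otherwise it starts a new entry with a
    count of 1; entries not updated in the current segment are pruned.
    """
    entries = []
    for peaks in segments_peaks:
        updated = []
        for p in peaks:
            matches = [e for e in entries if abs(e[0] - p) <= n_range // 2]
            if matches:
                # more than one match: the largest consecutive count wins
                best = max(matches, key=lambda e: e[1])
                updated.append([p, best[1] + 1])
            else:
                updated.append([p, 1])
        entries = updated  # prune: only entries touched this segment survive
    return entries

# a peak drifting 17 -> 18 -> 18 persists for three consecutive segments
print(track_peaks([[17], [18, 40], [18]]))  # → [[18, 3]]
```

An entry whose count reaches the predefined threshold of consecutive segments would then trigger the speech decision described in the following bullets.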
  • the segment peak comparator 211/speech determiner may in some embodiments output a binary decision of whether or not there are any speech components within the segment where the number of active consecutive sub-bands is within the predefined range of values.
  • the range of values is defined as being equal to or greater than a predefined value.
  • the segment peak comparator 211/speech determiner may determine that speech has been detected where there is at least one peak sub-band over two consecutive segments and the segment peak comparator/speech determiner 211 outputs a binary decision Sspeech value of 1.
  • the segment peak comparator 211/speech determiner may search the list for consecutive values greater than the predefined value and output the flag Sspeech as being positive and pass at least the index value of the entry to the fundamental frequency selector 213 if any entries are found.
  • the range is defined by a predefined lower and upper level, in other words the number of consecutive "active" segments for a sub-band is between a minimum number and a maximum number.
  • the segment peak comparator 211/speech determiner may search the list for consecutive values greater than a lower predefined value but less than an upper value and output the flag Sspeech as being positive and pass at least the entry index value to the fundamental frequency selector 213 if any entries are found.
  • the fundamental frequency selector 213 may in some embodiments select the fundamental frequency by selecting the fundamental frequency candidate with the greatest energy component.
  • the selection of the highest energy fundamental frequency candidate is shown in Figure 5 by step 415.
  • the fundamental frequency selector then may output the fundamental frequency candidate selected.
  • the outputting of the indicator of the binary decision whether or not there is speech within the audio signal segment Sspeech and the fundamental frequency of the speech component for the segment is shown in Figure 5 by step 417.
  • the speech gain filter 103 may be configured to perform a filtering operation on the digital audio signal input X1 based on the control signals from the analyser 101, specifically the fundamental frequency and the speech detection binary indicator.
  • the speech gain filter is implemented as a comb filter.
  • a comb filter adds a delayed version of a signal to itself causing constructive and destructive interference which may be used to significantly enhance the fundamental frequency components.
  • the input signal shown in Figure 6 for the speech gain filter 103 is input to a first input of a first combiner 501.
  • the first combiner 501 is configured to output the combination to a filter gain amplifier 503 which multiplies the signal by a gain gcomp.
  • the output of the first combiner 501 is input to a switching amplifier 505.
  • the gain of the switching amplifier in some embodiments is controlled by the speech detection binary indicator Sspeech so that when the speech gain filter 103 receives an indicator that there is a speech component the switching amplifier 505 passes the signal to the feedback loop and when no speech components are detected the feedback loop is switched off.
  • the output of the switching amplifier 505 is in some embodiments input to a feedback gain amplifier 507 which multiplies the signal by a feedback gain Sg.
  • the feedback gain Sg is a parameter which controls how much the speech gain filter boosts or reduces the speech component.
  • the value of Sg may be greater than 1. In some other embodiments where the speech component of the audio signal is to be removed the value of Sg may be less than 1.
  • the output of the speech gain amplifier 507 is in some embodiments passed to a low pass filter comprising a second combiner 509 which receives the output of the speech gain amplifier and outputs a combined signal both to a controllable delay element 515 of the speech gain filter and also to a short delay element 511 for the low pass filter.
  • the short delay element 511 of the low pass filter further outputs a delayed signal to a low pass filter amplifier 513.
  • the low pass filter gain amplifier has in some embodiments a gain of a.
  • the output of the low pass filter gain amplifier is in some embodiments passed to the second input of the second combiner 509.
  • the low pass filter in some embodiments may be used in order to focus the impact of the comb filter to the frequencies where speech is present.
  • a low pass filter is inserted into the comb filter's feedback loop, causing the magnitude difference between consecutive peaks and notches to decrease towards higher frequencies, whereas a regular comb filter maintains a magnitude difference which is constant throughout the frequency range.
  • the introduction of the low pass filter creates a filter where the magnitude response of the regular comb filter is multiplied by the magnitude response of the low pass filter.
  • the controllable delay element 515 in some embodiments is controlled by the fundamental frequency value to delay the feedback loop sufficiently for the required interference value.
  • a look-up table (not shown) receives the fundamental frequency value and determines the delay value.
  • the output of the delay element 515 is passed to a controllable fractional delay filter.
  • the controllable fractional delay filter 517 increases the specificity of the filter by providing the ability to delay the signal by a non-integer number of samples.
  • the controllable fractional delay filter 517 may thus in some embodiments, in combination with the controllable delay element 515, also be controlled by the same look-up table.
  • the output of the fractional delay filter 517 in some embodiments is then passed to the second input of the first combiner 501 in order to produce the constructive and destructive interference effects.
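The complete Figure 6 signal path (first combiner 501, output gain 503, switching amplifier 505, feedback gain 507, low pass 509/511/513, delay 515 and fractional delay 517) might be sketched in the time domain as follows; the gain values and the linear-interpolation fractional delay are assumptions made for illustration, not values from the patent.

```python
import numpy as np

def speech_comb_filter(x, f0, fs=48000, g_comp=1.0, s_g=0.6, a=0.4,
                       speech=True):
    """Time-domain sketch of the Figure 6 comb filter: the input is
    summed with a delayed, low-pass-filtered copy of itself, and the
    feedback loop is switched by the speech indicator. The loop delay is
    fs / f0 samples, split into an integer part and a fractional part.
    """
    delay = fs / f0
    d_int, frac = int(delay), delay - int(delay)
    buf = np.zeros(d_int + 1)       # feedback delay line, newest at buf[-1]
    lp = 0.0                        # one-pole low-pass state (combiner 509)
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        # fractional delay 517: interpolate between two delay-line taps
        fb = (1 - frac) * buf[-d_int] + frac * buf[-d_int - 1]
        w = x[n] + fb               # first combiner 501
        y[n] = g_comp * w           # filter gain amplifier 503
        u = w if speech else 0.0    # switching amplifier 505
        lp = s_g * u + a * lp       # feedback gain 507 + low pass 509/511/513
        buf = np.roll(buf, -1)
        buf[-1] = lp
    return y

dry = speech_comb_filter(np.ones(480), f0=200.0, speech=False)  # loop off
```

When the speech indicator is off the feedback path contributes nothing and the filter passes the input through scaled by gcomp; when it is on, the fs/f0-sample delay places the constructive-interference peaks at the fundamental frequency and its harmonics.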
  • the speech gain filter 103 is shown with respect to a time domain representation of the filter, it would be understood that in some embodiments the speech gain filter 103 may be implemented in the frequency domain.
  • the frequency domain components generated by the analyser in overall operation of extracting the fundamental frequency value may themselves be modified based on the output of the analyser.
  • the speech gain filter has been shown in the above example as a comb filter any suitable filter or group of filters may be implemented.
  • the speech gain filter comprises at least one band pass or notch filter where each filter may be tuned to output the fundamental frequency (or in filter bank embodiments comprising more than one filter a fundamental and harmonics of the fundamental frequency).
  • the speech gain filter 103 outputs the processed audio signal which in some embodiments can be saved in the memory overwriting the input audio signal.
  • the apparatus may determine based on an input from the user to save the processed audio signal in the memory with the original audio signal, in other words not overwriting the pre-processed audio signal.
  • pre or post processing may assist the main fundamental frequency processing embodiments.
  • the audio signal analysed and processed may be pre-processed by being band limited before the processing, or post-processed by band limiting the output of the filter.
  • a digital equaliser may be implemented to convolve the equaliser response with the processed signal so that speech can be understood more clearly by enhancing this 'sensitive' frequency region.
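A hedged sketch of such a digital equaliser: a linear-phase FIR built by the window method as 'flat response plus boosted band-pass', convolved with the signal. The band edges (300-3400 Hz as a nominal speech-sensitive region), the tap count and the 6 dB boost are illustrative assumptions, not values from the application.

```python
import numpy as np

def speech_boost_eq(x, fs, lo=300.0, hi=3400.0, taps=101, boost_db=6.0):
    """Boost a speech-sensitive band with a window-method FIR equaliser."""
    n = np.arange(taps) - (taps - 1) / 2

    def sinc_lp(fc):
        # windowed-sinc low-pass prototype with cutoff fc
        h = np.sinc(2 * fc / fs * n) * (2 * fc / fs)
        return h * np.hamming(taps)

    bandpass = sinc_lp(hi) - sinc_lp(lo)   # band-pass = difference of low-passes
    identity = np.zeros(taps)
    identity[(taps - 1) // 2] = 1.0        # flat (pure delay) response
    gain = 10 ** (boost_db / 20.0) - 1.0   # extra in-band gain
    h = identity + gain * bandpass
    return np.convolve(x, h, mode="same")
```

Because both terms share the same linear-phase delay, their responses add coherently, giving roughly unity gain outside the band and the boosted gain inside it.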
  • with respect to Figure 7, a further schematic view of a speech extractor controlled audio signal processor according to some further embodiments of the application is shown.
  • the multiple input channels Xi to XN are passed to a speech gain filter similar to that described previously and also to a mixer 601.
  • the mixer 601 may then produce a single channel mixed signal which is passed to an analyser.
  • the analyser in these embodiments is similar to the analyser configuration described previously which outputs a speech indicator decision and a fundamental frequency value but outputs the value to all of the filters.
  • the processing requirement is simplified over the previous embodiments as there is only one analyser operation.
  • with respect to Figure 8, a further schematic view of a speech extractor controlled audio signal processor is shown.
  • the multiple input channels Xi to X N are pre-processed prior to the operation of the analyser 101.
  • the multiple input channels Xi to XN are also pre-processed prior to the application of a single speech gain filter 103.
  • the multiple input channels Xi to X N are input to a centre channel extractor 701 which is configured to implement a centre channel extraction or mixing of the input channels audio signals to generate an audio signal which most closely represents the central channel information.
  • the centre channel extraction may be any suitable centre channel extraction operation and may be implemented by any suitable apparatus.
  • the centre channel extractor may output signals using both magnitude and phase information at lower frequency values but only magnitude information for higher frequencies to extract the centre channel audio components.
  • frequency dependent magnitude and phase difference information between pairs of channels may be compared against user specific inter-aural level difference (ILD) and inter-aural time difference (ITD) cues. The result of the comparison may be used in some embodiments to determine whether or not a signal component is located at the centre channel and therefore suitable for extraction.
  • Such embodiments may be further customised according to the user of the apparatus.
  • Such embodiments may use centre channel extraction dependent on the operator's own head related transfer function. In such embodiments, sources in the medial plane of a binaurally recorded signal may be extracted.
  • the centre channel signal Xc may then in these embodiments be passed to the analyser, which may be schematically similar to the analyser 101 described with respect to Figures 3, 4 and 5, and is configured to output an indicator of whether or not there are any detected speech components within the centre channel extracted audio signal and the fundamental frequencies Fo associated with this detected speech component.
  • These values may be passed to the speech gain filter 103 which also receives the centre channel audio signal Xc and in some embodiments performs a similar gain filtering as described with respect to Figure 6 to output an enhanced speech audio signal Yc.
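One possible sketch of the centre channel extraction described above: keep FFT bins whose left/right magnitudes agree and, below a crossover bin, whose phases also agree — mirroring the idea of using magnitude and phase at lower frequencies but magnitude only at higher frequencies. The thresholds, frame length and crossover bin are illustrative assumptions; a practical implementation would also window and overlap-add the frames.

```python
import numpy as np

def extract_centre(left, right, frame=1024, mag_tol=0.5, phase_tol=0.5,
                   crossover_bin=128):
    """Crude per-frame centre-channel extraction sketch."""
    n = min(len(left), len(right)) // frame * frame
    out = np.zeros(n)
    for start in range(0, n, frame):
        L = np.fft.rfft(left[start:start + frame])
        R = np.fft.rfft(right[start:start + frame])
        magL, magR = np.abs(L), np.abs(R)
        # magnitude similarity mask (guard against divide-by-zero)
        denom = np.maximum(np.maximum(magL, magR), 1e-12)
        mag_ok = np.abs(magL - magR) / denom < mag_tol
        # phase agreement, enforced only for low-frequency bins
        phase_diff = np.angle(L * np.conj(R))
        phase_ok = np.abs(phase_diff) < phase_tol
        phase_ok[crossover_bin:] = True
        mask = mag_ok & phase_ok
        centre = 0.5 * (L + R) * mask       # keep only 'centre' bins
        out[start:start + frame] = np.fft.irfft(centre, frame)
    return out
```

A component panned to the centre (identical in both channels) passes the mask unchanged, while a hard-panned component fails the magnitude test and is suppressed.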
  • the pre-processing may comprise an equalizer configured to boost frequencies within a predefined speech frequency range. With respect to Figure 9, an example of the operation of such speech gain filtering on an audio signal is shown.
  • Figure 9 shows two separate time sample values where an audio signal input 800 and filtered output 802 comprise music containing both instrumental and vocal sources.
  • where there is a voice component, such as in the regions shown by the periods 801, 803, 805, 807 and 809, the output trace 802 is shown to be significantly amplified with respect to the input trace 800.
  • although the above description has been with respect to the enhancement of speech/voice components within the audio signal, it would be understood that by changing the gain within the speech gain filter 103 the speech components may be enhanced, by using a gain greater than 1, or diminished, by using a gain less than 1.
  • embodiments of the application perform a method comprising: dividing at least one audio signal into a plurality of segments; determining at least one fundamental frequency component over a predetermined number of segments; and processing the at least one audio signal dependent on the at least one fundamental speech frequency component.
  • user equipment may comprise an audio processor such as those described in embodiments of the invention above.
  • the terms electronic device and user equipment are intended to cover any suitable type of wireless user equipment, such as mobile telephones, portable data processing devices or portable web browsers.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • At least some embodiments may be an apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: dividing at least one audio signal into a plurality of segments; determining at least one fundamental frequency component over a predetermined number of segments; and processing the at least one audio signal dependent on the at least one fundamental frequency component.
  • the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, and CD.
  • At least some embodiments may be a computer-readable medium encoded with instructions that, when executed by a computer, perform: dividing at least one audio signal into a plurality of segments; determining at least one fundamental frequency component over a predetermined number of segments; and processing the at least one audio signal dependent on the at least one fundamental frequency component.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate. Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
  • As used in this application, the term 'circuitry' refers to all of the following: (a) hardware-only circuit implementations (such as implementations in only analogue and/or digital circuitry); (b) combinations of circuits and software (and/or firmware), such as (i) a combination of processor(s) or (ii) portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
  • This definition of 'circuitry' applies to all uses of this term in this application, including any claims.
  • the term 'circuitry' would also cover an implementation of merely a processor (or multiple processors) or portion of a processor and its (or their) accompanying software and/or firmware.
  • the term 'circuitry' would also cover, for example and if applicable to the particular claim element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone, or a similar integrated circuit in a server, a cellular network device, or other network device.

Abstract

An apparatus comprising at least one processor and at least one memory including computer program code. The at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform dividing at least one audio signal into a plurality of segments, determining at least one fundamental frequency component over a predetermined number of segments, and processing the at least one audio signal dependent on the at least one fundamental frequency component.

Description

An Apparatus
The present invention relates to apparatus for processing of audio signals. The invention further relates to, but is not limited to, apparatus for processing audio and speech signals in audio playback devices.
Audio processing and in particular audio processing in mobile devices have been a growing area in recent years. The use of apparatus to process audio signals, for example music, to separate instruments from vocals or different sources from each other in order to focus on one component of an audio signal is a broadly researched area. Music audio signals typically contain both "speech" (voice) and "non-speech" (non-voice) frequency components where the non-speech frequency components may originate from instruments and other background noise components. The result of the extractions may be used to process the audio signals to attempt to enhance the voice components. For example where a music recording is incorrectly mixed or recorded and the non-speech frequency components dominate the speech components rendering the recording difficult to understand. Similarly the audio processor may attempt to remove the voice component, for example in "home" karaoke devices.
One known approach to enhance the speech components is to configure the apparatus to increase the overall amplitude or volume level. However such an operation would increase both the speech and non-speech frequency components equally and, although making the speech components clearer, would also make the non-speech frequency components louder and possibly reach dangerous volume levels.
Further approaches may use a spectral amplifier (analyser) to attempt to manually select frequency bands to enhance. However, spectral amplifiers are typically difficult for untrained operators to balance. Spectral amplifiers require a trained "ear" to operate as they remove or enhance a broad spectrum of audio signals. Some commercial software packages attempt to identify the speech components of audio signals by using the centre channel assumption and mixing the multiple channels to a centre channel. In other words a multi-channel recording is assumed to have the majority of speech components as common components present in both the left and right channels. Although these approaches can lead to improved voice component separation (as the speech or vocal components are typically mixed as signals common to left and right channels), it is also common to have bass and lead instruments, or significant components of the bass and lead instruments, also at the centre channel, which will be enhanced in level along with any of the other centre channel components. Furthermore these approaches fail where there is any panning of the audio scene to move the voice or speech components off the audio centre stage.
This invention thus proceeds from the consideration that prior art solutions for speech frequency component selection and filtering to enhance or diminish the speech component currently do not produce good quality audio signals. Listening to such processed signals produces a poor listening experience. Embodiments of the present invention aim to address the above problem.
There is provided according to a first aspect of the invention a method comprising: dividing at least one audio signal into a plurality of segments; determining at least one fundamental frequency component over a predetermined number of segments; and processing the at least one audio signal dependent on the at least one fundamental speech frequency component.
Determining at least one fundamental frequency component may comprise: determining at least one frequency component; and selecting as the fundamental frequency component the frequency component with the greatest magnitude. Determining at least one frequency component may comprise selecting at least one consistent peak frequency component.
Selecting each of the at least one consistent peak frequency component may comprise determining for at least a first range of consecutive segments a peak frequency component within a range of frequency values.
The first range of consecutive segments may comprise at least: a first range defined by a minimum number of segments; and a first range defined by a minimum and maximum number of segments.
Processing the at least one audio signal dependent on the at least one fundamental frequency component may comprise at least one of: filtering the audio signal with a comb filter set at the fundamental frequency; suppressing the fundamental frequency and harmonics of the fundamental frequency in the audio signal; and enhancing the fundamental frequency and harmonics of the fundamental frequency in the audio signal.
The at least one audio signal may comprise a centre channel mix of a multiple channel signal.
The at least one audio signal may comprise the output of a band pass filter configured to output speech frequencies.
According to a second aspect of the application there is provided an apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: dividing at least one audio signal into a plurality of segments; determining at least one fundamental frequency component over a predetermined number of segments; and processing the at least one audio signal dependent on the at least one fundamental frequency component. Determining at least one fundamental frequency component may cause the apparatus at least to perform: determining at least one frequency component; and selecting as the fundamental frequency component the frequency component with the greatest magnitude.
Determining at least one frequency component may cause the apparatus at least to perform selecting at least one consistent peak frequency component. Selecting each of the at least one consistent peak frequency component may cause the apparatus at least to perform determining for at least a first range of consecutive segments a peak frequency component within a range of frequency values. The first range of consecutive segments may comprise at least: a first range defined by a minimum number of segments; and a first range defined by a minimum and maximum number of segments.
Processing the at least one audio signal dependent on the at least one fundamental frequency component may cause the apparatus at least to perform at least one of: filtering the audio signal with a comb filter set at the fundamental frequency; suppressing the fundamental frequency and harmonics of the fundamental frequency in the audio signal; and enhancing the fundamental frequency and harmonics of the fundamental frequency in the audio signal.
The at least one audio signal may comprise a centre channel mix of a multiple channel signal.
The at least one audio signal may comprise the output of a band pass filter configured to output frequencies. The at least one fundamental frequency component may be an at least one fundamental speech frequency component.
The at least one frequency component may be an at least one speech frequency component.
According to a third aspect of the invention there is provided an apparatus comprising: an audio signal processor configured to divide at least one audio signal into a plurality of segments; a component identifier configured to determine at least one fundamental frequency component over a predetermined number of segments; and a filter configured to process the at least one audio signal dependent on the at least one fundamental frequency component.
The component identifier may further be configured to determine at least one frequency component; and select as the fundamental frequency component the frequency component with the greatest magnitude.
The component identifier may further be configured to select at least one consistent peak frequency component as the frequency component.
The component identifier may further be configured to select the at least one consistent peak frequency component by determining for at least a first range of consecutive segments a peak frequency component within a range of frequency values.
The first range of consecutive segments may comprise at least: a first range defined by a minimum number of segments; and a first range defined by a minimum and maximum number of segments. The filter may be configured to perform at least one of: filtering the audio signal with a comb filter set at the fundamental frequency; suppressing the fundamental frequency and harmonics of the fundamental frequency in the audio signal; and enhancing the fundamental frequency and harmonics of the fundamental frequency in the audio signal.
The at least one audio signal may comprise a centre channel mix of a multiple channel signal.
The at least one audio signal may comprise the output of a band pass filter configured to output frequencies.
According to a fourth aspect of the invention there is provided an apparatus comprising: means for dividing at least one audio signal into a plurality of segments; means for determining at least one fundamental frequency component over a predetermined number of segments; and means for processing the at least one audio signal dependent on the at least one fundamental frequency component.
According to a fifth aspect of the invention there is provided a computer-readable medium encoded with instructions that, when executed by a computer, perform: dividing at least one audio signal into a plurality of segments; determining at least one fundamental frequency component over a predetermined number of segments; and processing the at least one audio signal dependent on the at least one fundamental frequency component.
An electronic device may comprise apparatus as described above. A chipset may comprise apparatus as described above.
Brief Description of Drawings
For better understanding of the present invention, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically an apparatus employing embodiments of the application;
Figure 2 shows schematically the apparatus shown in Figure 1 according to some embodiments of the application;
Figure 3 shows an overview of the operation of the analyser as employed in some embodiments;
Figure 4 shows schematically the analyser as shown in Figure 2 in further detail;
Figure 5 shows a flow diagram illustrating the operation of the analyser according to some embodiments of the application;
Figure 6 shows schematically the speech gain filter as shown in Figure 2 according to some embodiments of the application;
Figure 7 shows schematically a further view of the apparatus according to some further embodiments of the application;
Figure 8 shows schematically a view of the apparatus according to some other embodiments of the application; and
Figure 9 shows a time domain trace of an example audio signal before and after processing according to some embodiments of the application.
The following describes apparatus and methods for the provision of enhancing voice/speech component extraction for audio signal processing. In this regard reference is first made to Figure 1 which shows a schematic block diagram of an exemplary electronic device 10 or apparatus, which may incorporate a voice/speech component extractor.
The apparatus 10 may for example be a mobile terminal or user equipment for a wireless communication system. In other embodiments the electronic device may be a Television (TV) receiver, portable digital versatile disc (DVD) player, or audio player such as an mp3 player.
The apparatus 10 comprises a processor 21 which may be linked via a digital-to-analogue converter 32 to a playback speaker configured to provide a suitable audio playback. The playback speaker in some embodiments may be any suitable loudspeaker. In some other embodiments the playback speaker may be a headphone or ear worn speaker (EWS) set. In some embodiments the apparatus 10 may comprise a headphone connector for receiving a headphone or headset 33. The processor 21 is in some embodiments further linked to a transceiver (TX/RX) 13, to a user interface (UI) 15 and to a memory 22.
The processor 21 may be configured to execute various program codes. The implemented program codes comprise a voice/speech component extractor for extracting a voice/speech component from an audio signal and thus process the audio signal based on the extraction output. The implemented program codes 23 may be stored for example in the memory 22 for retrieval by the processor 21 whenever needed. The memory 22 could further provide a section 24 for storing data, for example data that has been processed in accordance with the embodiments. The voice/speech component extracting code may in embodiments be implemented in hardware or firmware.
The user interface 15 in some embodiments enables a user to input commands to the electronic device 10, for example via a keypad, and/or to obtain information from the electronic device 10, for example via a display. The transceiver 13 enables a communication with other electronic devices, for example via a wireless communication network.
It is to be understood again that the structure of the electronic device 10 could be supplemented and varied in many ways.
The apparatus 10 may in some embodiments further comprise at least two microphones 11 for inputting audio or speech that is to be processed according to embodiments of the application, or transmitted to some other electronic device, or stored in the data section 24 of the memory 22. A corresponding application to capture stereo audio signals using the at least two microphones may be activated to this end by the user via the user interface 15. The apparatus 10 in such embodiments may further comprise an analogue-to-digital converter 14 configured to convert the input analogue audio signal into a digital audio signal and provide the digital audio signal to the processor 21. The apparatus 10 may in some embodiments receive a bit stream with correspondingly encoded audio data from another electronic device via the transceiver 13. In these embodiments, the processor 21 may execute the voice/speech component processing program code stored in the memory 22. The processor 21 in these embodiments may process the received audio signal data, and output the extracted and processed data.
In some embodiments the headphone connector 33 may be configured to communicate to a headphone set or earplugs wirelessly, for example by a Bluetooth profile, or using a conventional wired connection.
The received stereo audio data may in some embodiments also be stored, instead of being processed immediately, in the data section 24 of the memory 22, for instance for enabling a later processing and presentation or forwarding to still another electronic device.
It would be appreciated that the schematic structures described in Figures 2, 4, 6 to 8 and the method steps shown in Figures 3 and 5 represent only a part of the operation of a complete audio processing chain comprising some embodiments, as exemplarily shown implemented in the apparatus shown in Figure 1.
The key concept with regards to embodiments of the application is the extraction of voice (or speech) components within an audio signal relative to the non-voice (or non-speech) components in order to enable the apparatus to modify the playback characteristics of the audio signal. The embodiments described below are applicable to mobile phone applications and other devices and apparatus such as PCs, laptops, servers and databases. As shown in Figure 2, in some embodiments, each channel is processed independently of other channels. In Figure 2 only the audio signal speech extraction and processing processor 151 is shown as a series of slices, each slice configured to extract voice parameters and process audio for a different audio signal input channel. Therefore with respect to Figure 2, each audio channel is input to a separate slice. Thus a first channel input Xi is input to a first slice 1511 which outputs a processed signal Yi, a second audio channel X2 is input to a second slice 1512 which outputs a processed signal Y2, and an Nth audio channel input XN is input to an Nth slice 151N which outputs a processed signal YN. With respect to Figure 2, only the components with respect to the first slice 1511 are shown and are described hereafter. However it would be understood by the person skilled in the art that each slice would comprise similar components.
Furthermore the speech extraction and processing processor may be implemented as part of the processor 21 as shown in Figure 1. In some other embodiments, each slice 1511 may be at least partially implemented in the processor 21. In some embodiments the slice operations may thus be implemented at least partially as code stored within the memory 22 and implemented on the processor 21. In still further embodiments of the application, each slice may be implemented in dedicated hardware separate from the processor 21.
Although as shown in Figure 2 each slice is implemented in parallel it would be understood that in some embodiments at least some slices may be implemented sequentially on the same apparatus.
With respect to Figure 2 (and specifically with regards to a first channel input Xi which outputs an enhanced audio signal Yi using the first slice 1511 ), the apparatus and operation of a slice of the speech extraction and processing processor will be described in further detail hereafter. The slice 151i receives the audio signal first channel Xi and inputs the audio signal first channel Xi into an analyser 101 and a speech gain filter 103. The analyser 101 is configured in some embodiments to determine whether or not the audio signal input Xi comprises speech (or voice) signal components and the location of such components in order to output a control signal to the speech gain filter 103 to control the filtering of the audio signal and output a voice/speech processed audio signal.
With respect to Figure 3, an overview of the operation of the analyser 101 is shown. The analyser on receiving the audio signal is configured to detect all fundamental frequencies within the defined frequency range. In other words, within an expected frequency range of speech (or voice) signals, any fundamental activity is detected.
The operation of detecting the fundamental frequencies is shown in Figure 3 by step 111.
After detecting any activity in the fundamental frequencies within the defined frequency range, the analyser 101 attempts to determine any speech components within the fundamental frequency activity. The operation of determining speech components within the fundamental frequencies is shown in Figure 3 by step 113.
Furthermore where there are multiple speech component fundamental frequencies identified, then the analyser 101 is configured in some embodiments to select the most likely speech component fundamental frequency. The step of selecting the most likely speech component fundamental frequency from possible multiple fundamental frequencies is shown in Figure 3 by step 115.
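The application leaves the detector itself open; one common choice, shown here purely as an illustrative sketch, is autocorrelation restricted to lags corresponding to an expected speech range. The 80-400 Hz range and the 0.3 peak-strength threshold are assumptions introduced for illustration.

```python
import numpy as np

def fundamental_frequency(segment, fs, f_min=80.0, f_max=400.0):
    """Estimate a fundamental frequency within an expected speech range.

    Searches the autocorrelation over lags corresponding to
    f_min..f_max and returns the strongest lag's frequency, or None
    when no sufficiently strong periodicity is found.
    """
    seg = segment - np.mean(segment)
    ac = np.correlate(seg, seg, mode="full")[len(seg) - 1:]
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(ac) - 1)
    if lag_min >= lag_max or ac[0] <= 0:
        return None
    best = np.argmax(ac[lag_min:lag_max]) + lag_min
    # require the peak to carry a reasonable fraction of total energy
    if ac[best] < 0.3 * ac[0]:
        return None
    return fs / best
```

Running this per segment and selecting the strongest candidate across segments corresponds to the "select the most likely speech component fundamental frequency" step above.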
In some embodiments the analyser 101 may then output to the speech gain filter 103 an indication of whether or not the audio signal comprises any voice/speech components and where the speech/voice component is located in frequency relative to the audio signal. With respect to Figure 4 a schematic view of the analyser in detail is shown according to some embodiments. Figure 5 shows the operations implemented by the analyser 101 according to these embodiments. The audio signal in some embodiments is first received within the analyser 101 by a segment/window generator 201. The segment/window generator receives the time domain audio samples and generates a segment (or frame) of these audio sample values. In some embodiments, for example, the segment/window generator 201 may apply a 40 millisecond window length to the received audio signal samples. Furthermore in some embodiments the segment/window generator generates a segment (or frame or window) every 20 milliseconds. In other words a new segment is started after 20 milliseconds with the consequence that each segment overlaps by 20 milliseconds with the previous segment and by 20 milliseconds with the next segment. It would be understood that any suitable window size and repetition may be used; for example in further embodiments the window size may be from 20 to 50 milliseconds long and have a similar size range of overlaps.
In some embodiments the segment/window generator 201 may further apply a weighting function to the samples.
The generation of the segment is shown in Figure 5 by step 401.
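Purely as an illustrative sketch (not part of the disclosed apparatus), the 40 millisecond window / 20 millisecond hop segmentation described above can be expressed in a few lines of Python. The Hann weighting function, the 8 kHz example rate and the function name `segment_audio` are this sketch's own assumptions:

```python
import numpy as np

def segment_audio(x, fs, window_ms=40, hop_ms=20):
    """Split a mono signal into overlapping, Hann-weighted segments."""
    win_len = int(fs * window_ms / 1000)   # 40 ms -> 320 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)          # a new segment every 20 ms
    window = np.hanning(win_len)           # optional weighting function
    segments = []
    for start in range(0, len(x) - win_len + 1, hop):
        segments.append(x[start:start + win_len] * window)
    return np.array(segments)

fs = 8000
x = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)  # 1 s of a 200 Hz tone
frames = segment_audio(x, fs)
print(frames.shape)  # (49, 320)
```

With a 40 ms window and a 20 ms hop each segment shares half its samples with the previous one, matching the overlap described in the text.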
The segment samples in some embodiments are output to the time to frequency domain converter 203.
In some embodiments the input audio signal is not a digital audio signal. In such embodiments the segment/window generator 201 may further comprise an analogue to digital converter of any suitable type in order to generate digital signals suitable for further processing.

The time to frequency domain converter 203 in some embodiments receives the segmented audio samples and generates frequency domain component values for a segment period. The time to frequency domain converter 203 may in some embodiments be a Fast Fourier Transformer, a Discrete Cosine Transformer, a Wavelet Transformer or any suitable time to frequency domain converter. In some embodiments the time to frequency domain converter is configured to output only real frequency domain components, whereas in other embodiments both real and imaginary components may be output in order to preserve both amplitude and phase information. It would be appreciated that any suitable practical time to frequency domain converter may be used in embodiments of the application. For example other suitable time to frequency domain transforms include the Discrete Fourier Transform, the short term FFT, and Goertzel's algorithm. In some embodiments a zero-padding operation may further be applied to improve the accuracy of the application.
The frequency domain components may then be output to the sub-band detector 207 and a peak energy determiner 209.
The generation of frequency domain components is shown in Figure 5 by step 403.
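As a hedged illustration of step 403, the time to frequency domain conversion with zero-padding might look as follows; the use of a real-input FFT and the zero-padding factor of 4 are choices made for this sketch, not values taken from the disclosure:

```python
import numpy as np

def to_frequency_domain(frame, pad_factor=4):
    """Magnitude spectrum of one segment; zero-padding refines the bin spacing."""
    n_fft = len(frame) * pad_factor
    spectrum = np.fft.rfft(frame, n=n_fft)  # real-input FFT: positive-frequency bins only
    return np.abs(spectrum)

fs = 8000
t = np.arange(320) / fs                     # one 40 ms segment at 8 kHz
frame = np.sin(2 * np.pi * 200 * t) * np.hanning(320)
mags = to_frequency_domain(frame)
peak_hz = np.argmax(mags) * fs / (320 * 4)  # padded bin index -> Hz
print(round(peak_hz))  # 200
```

Using `np.fft.rfft` keeps only the real-signal half of the spectrum; returning the full complex output instead would preserve phase as well as amplitude, as the text notes.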
The sub-band peak detector 207 may be configured in some embodiments to compare the received frequency domain components (or sub-bands) and output an indication or signals indicating the activity in the frequency domain components or sub-bands. For example the sub-band peak detector 207 may indicate peak value sub-bands of the frequency domain signal. The sub-band peak detector 207 may be configured in some embodiments to determine a "peak" by determining the derivative of the sub-band coefficient value and choosing a sub-band as a peak sub-band where the derivative sign switches before and after the sub-band. For example the sub-bands may be processed to calculate the difference between neighbouring sub-bands, and a sub-band selected as being a potential peak value if the sign of the difference between the current sub-band and the previous sub-band is different from the sign of the difference between the next sub-band and the current sub-band. In some embodiments only a predetermined number of the sub-bands are selected to be passed to the peak comparator; these are the n sub-bands with the highest energy for the current segment that have been determined to be peak sub-bands.
The peak indicator values may be passed, in some embodiments, to the segment peak comparator and to the peak energy determiner 209.
The determination of the peak sub-bands is shown in Figure 5 by step 407.
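The sign-switch peak rule described above might be sketched as below; the number of peaks kept (`n_peaks=3`) is an illustrative stand-in for the text's n highest-energy peak sub-bands:

```python
import numpy as np

def detect_peak_subbands(mags, n_peaks=3):
    """Indices where the neighbour difference changes sign (local maxima),
    kept only for the n highest-energy peaks."""
    peaks = [i for i in range(1, len(mags) - 1)
             if mags[i] - mags[i - 1] > 0 and mags[i + 1] - mags[i] < 0]
    peaks.sort(key=lambda i: mags[i], reverse=True)  # strongest peaks first
    return peaks[:n_peaks]

mags = np.array([0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.4, 0.2])
print(detect_peak_subbands(mags))  # [1, 4, 6]
```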
The peak energy determiner 209 in some embodiments receives the peak sub-band indicator values representing frequencies where there is activity from the sub-band peak detector 207 and also the frequency domain components of the segment. The peak energy determiner 209 may then calculate the ratio of energy at the detected fundamental frequency and its harmonics compared to the total energy of the segment.
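A minimal sketch of the energy ratio computed by the peak energy determiner 209, assuming harmonics fall on integer multiples of the fundamental bin (an assumption of this sketch, not a detail given in the text):

```python
import numpy as np

def harmonic_energy_ratio(mags, f0_bin, n_harmonics=4):
    """Ratio of energy at the fundamental bin and its harmonics to total energy."""
    total = np.sum(mags ** 2)
    harmonic_bins = [f0_bin * k for k in range(1, n_harmonics + 1)
                     if f0_bin * k < len(mags)]       # bins at integer multiples
    harmonic = np.sum(mags[harmonic_bins] ** 2)
    return harmonic / total

mags = np.zeros(100)
mags[[10, 20, 30]] = [1.0, 0.5, 0.25]  # fundamental at bin 10 plus two harmonics
ratio = harmonic_energy_ratio(mags, 10)
print(ratio)  # 1.0 - all of the energy sits on the harmonic series
```

A ratio near 1 indicates a strongly harmonic (voiced-like) segment; background noise spread across the spectrum drives the ratio down.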
The determination of the peak sub-band energy is shown in Figure 5 by step 409.

The segment peak comparator/speech determiner 211 receives in some embodiments both the indications of peak sub-band activity and also the peak energy values from the sub-band peak detector 207 and the peak energy determiner 209 respectively, and compares these values against previous segment peak values.
The segment peak comparator 211, for example, compares the current indicated peak values against the previous M segment peak values to determine whether or not any of the current peak values is substantially similar to a previous peak sub-band value and/or how many previous peak sub-bands it is similar to.
The operation of comparing peak sub-band values against a range of previous peak sub-bands is shown in Figure 5 by step 411.

Where a current segment peak value is similar to a range of previous segments' peak sub-band values, then this current segment peak value is selected as a potential fundamental frequency candidate and output to the fundamental frequency selector. In other words, if each sub-band has an index value i, then where at least one of the current segment peak value sub-band index values {it} is within a range Nrange of index values of previous segment peak values {it-1}, then the sub-band index (either the current or the previous peak value) is noted. In some embodiments the search range is split equally so that for each of the current sub-band peak values a range of previous peak values of [it-Nrange/2, it+Nrange/2] is searched. This is because speech may have transients where the fundamental frequency slides over the same syllable.
Furthermore the operation of selecting substantially similar peak sub-bands over a range of segments is shown in Figure 5 by step 413.
For example in some embodiments the segment peak comparator comprises a linked list memory where each element in the linked list comprises a sub-band index value and an integer value representing the number of consecutive segments the sub-band has been "active" for.
In such embodiments, for each segment the segment peak comparator 211 compares the current "active" or peak sub-band index value against the list. If the current peak sub-band index is outside the search range [it-Nrange/2, it+Nrange/2] then the list is extended by an additional element with the current peak sub-band index and a consecutive segment value of 1. If the current peak sub-band index is inside the search range [it-Nrange/2, it+Nrange/2] then the list entries within the search range are amended so that their sub-band index value is the current index value and the consecutive segment value is incremented by 1. In such embodiments, where the search generates more than one 'match' within the search range, the entry with the largest consecutive value is chosen. Finally the list is then pruned to remove all entries which have not been updated or added in the current segment. In some embodiments other suitable search and memory operations may be used. Thus the list contains a series of entries of frequency indicators and the number of consecutive segments within which the frequency has been a 'peak' value.
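A sketch of this bookkeeping, using a plain Python list of dictionaries in place of the linked list; the search half-width (`n_range=2`) and the example peak sequence are illustrative assumptions:

```python
def update_tracks(tracks, current_peaks, n_range=2):
    """One segment's update of the consecutive-peak list.
    Each entry: {'index': sub-band index, 'count': consecutive active segments}."""
    new_tracks = []
    for peak in current_peaks:
        # entries within +-n_range/2 of the current peak sub-band index
        matches = [t for t in tracks if abs(t['index'] - peak) <= n_range / 2]
        if matches:
            best = max(matches, key=lambda t: t['count'])  # largest consecutive run wins
            new_tracks.append({'index': peak, 'count': best['count'] + 1})
        else:
            new_tracks.append({'index': peak, 'count': 1})
    return new_tracks  # entries not matched this segment are pruned

tracks = []
for peaks in [[32], [33], [33], [60]]:  # the peak drifts slightly, then jumps
    tracks = update_tracks(tracks, peaks)
print(tracks)  # [{'index': 60, 'count': 1}] - the jump to 60 started a new run
```

After the third segment the entry (then at sub-band 33) carried a count of 3; the jump outside the search range pruned it and started a fresh entry with count 1.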
The segment peak comparator 211/speech determiner may in some embodiments output a binary decision of whether or not there are any speech components within the segment, where the number of consecutive "active" segments for a sub-band is within the predefined range of values.
For example, in some embodiments, the range of values is defined as being equal to or greater than a predefined value. For example in some embodiments the segment peak comparator 211/speech determiner may determine that speech has been detected where there is at least one peak sub-band over two consecutive segments, and the segment peak comparator/speech determiner 211 outputs a binary decision Sspeech value of 1. For example the segment peak comparator 211/speech determiner may search the list for consecutive values greater than the predefined value, output the flag Sspeech as being positive, and pass at least the index value of the entry to the fundamental frequency selector 213 if any entries are found.
In some other embodiments the range is defined by a predefined lower and upper level, in other words that the number of consecutive "active" segments for a sub-band is between a minimum number and a maximum number. For example the segment peak comparator 211/speech determiner may search the list for consecutive values greater than a lower predefined value but less than an upper value, output the flag Sspeech as being positive, and pass at least the entry index value to the fundamental frequency selector 213 if any entries are found.

In some embodiments the fundamental frequency selector 213, having received the fundamental frequency candidates from the segment peak comparator 211, selects one of the fundamental frequency candidate values. The fundamental frequency selector 213 may in some embodiments select the fundamental frequency by selecting the fundamental frequency candidate with the greatest energy component.
The selection of the highest energy fundamental frequency candidate is shown in Figure 5 by step 415.
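Putting the decision and selection steps together, a hypothetical sketch (the function name, the energy table, and the minimum run of two segments are illustrative choices consistent with the example in the text):

```python
def decide_speech(tracks, energies, min_run=2):
    """Binary speech flag plus the highest-energy fundamental candidate."""
    candidates = [t['index'] for t in tracks if t['count'] >= min_run]
    if not candidates:
        return 0, None        # S_speech = 0, no fundamental reported
    f0 = max(candidates, key=lambda i: energies[i])
    return 1, f0              # S_speech = 1 with the chosen sub-band index

energies = {32: 0.9, 40: 0.4}  # energy per candidate sub-band
tracks = [{'index': 32, 'count': 3}, {'index': 40, 'count': 2},
          {'index': 55, 'count': 1}]
print(decide_speech(tracks, energies))  # (1, 32)
```

The entry at sub-band 55 is ignored because its run of one segment is below the minimum, and of the remaining candidates the highest-energy one wins.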
The fundamental frequency selector then may output the selected fundamental frequency candidate. The outputting of the indicator of the binary decision whether or not there is speech within the audio signal segment Sspeech, and of the fundamental frequency of the speech component for the segment, is shown in Figure 5 by step 417.
Although we have described the operation of the analyser as being carried out with regards to the frequency domain, it would be understood that a time domain implementation, where sub-band time domain filtering and integration are carried out, may produce similar output results.
With respect to Figure 6 a schematic view of the speech gain filter 103 is shown in further detail. The speech gain filter 103 may be configured to perform a filtering operation on the digital audio signal input Xi based on the control signals from the analyser 101, specifically the fundamental frequency and the speech detection binary indicator. In Figure 6 the speech gain filter is implemented as a comb filter. A comb filter adds a delayed version of a signal to itself, causing constructive and destructive interference which may be used to significantly enhance the fundamental frequency components. The input signal shown in Figure 6 for the speech gain filter 103 is input to the first input of a first combiner 501. The first combiner 501 is configured to output the combination to a filter gain amplifier 503 which multiplies the signal by a gain gcomp.
Furthermore the output of the first combiner 501 is input to a switching amplifier 505. The gain of the switching amplifier in some embodiments is controlled by the speech detection binary indicator Sspeech, so that when the speech gain filter 103 receives an indicator that there is a speech component the switching amplifier 505 passes the signal to the feedback loop, and when no speech components are detected the feedback loop is switched off.
The output of the switching amplifier 505 is in some embodiments input to a feedback gain amplifier 507 which multiplies the signal by a feedback gain Sg. The feedback gain Sg is a parameter which controls how much the speech gain filter boosts or reduces the speech component.
Thus in some embodiments where the speech component is to be amplified the value of Sg may be greater than 1. In some other embodiments where the speech component of the audio signal is to be removed the value of Sg may be less than 1.
The output of the feedback gain amplifier 507 is in some embodiments passed to a low pass filter comprising a second combiner 509, which receives the output of the feedback gain amplifier and outputs a combined signal both to a controllable delay element 515 of the speech gain filter and also to a short delay element 511 for the low pass filter. The short delay element 511 of the low pass filter further outputs a delayed signal to a low pass filter gain amplifier 513. The low pass filter gain amplifier has in some embodiments a gain of a. The output of the low pass filter gain amplifier is in some embodiments passed to the second input of the second combiner 509. The low pass filter in some embodiments may be used in order to focus the impact of the comb filter on the frequencies where speech is present. In the embodiments where a low pass filter is inserted into the comb filter's feedback loop, the magnitude difference between consecutive peaks and notches decreases towards higher frequencies, whereas a regular comb filter maintains a magnitude difference which is constant throughout the frequency range. In other words, in the spectral domain the introduction of the low pass filter creates a filter where the magnitude response of the regular comb filter is multiplied by the magnitude response of the low pass filter.
The controllable delay element 515 in some embodiments is controlled by the fundamental frequency value to delay the feedback loop sufficiently for the required interference value. In some embodiments a look-up table (not shown) receives the fundamental frequency value and determines the delay value.
The output of the delay element 515 is passed to a controllable fractional delay filter 517. The controllable fractional delay filter 517 increases the specificity of the filter by providing the ability to delay the signal by a non-integer number of samples. The controllable fractional delay filter 517 may thus in some embodiments, in combination with the controllable delay element 515, also be controlled by the same look-up table. The output of the fractional delay filter 517 in some embodiments is then passed to the second input of the first combiner 501 in order to produce the constructive and destructive interference effects.
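The comb structure of Figure 6 can be approximated sample by sample as below. This is a sketch, not the patented implementation: the fractional delay filter 517 is omitted, all parameter values are illustrative, and the feedback gain is kept below one here so that the loop stays stable, the boost of the fundamental and its harmonics coming from the resonance of the loop rather than from a gain greater than one.

```python
import numpy as np

def comb_speech_filter(x, delay, s_g=0.6, g_comp=0.7, alpha=0.3, s_speech=1):
    """Feedback comb with a one-tap low pass in the loop (sketch of Figure 6).
    delay    : loop delay in samples, e.g. round(fs / F0) for a fundamental F0
    s_g      : feedback gain (kept < 1 here for stability)
    alpha    : low pass coefficient 'a'
    s_speech : switches the feedback loop on (1) or off (0)."""
    y = np.zeros_like(x)
    fb = np.zeros(len(x) + delay)            # delayed feedback line
    lp_state = 0.0                           # one-sample low pass memory
    for n in range(len(x)):
        combined = x[n] + fb[n]              # first combiner 501
        y[n] = g_comp * combined             # output gain 503
        loop = s_speech * s_g * combined     # switching 505 + feedback gain 507
        lp_state = loop + alpha * lp_state   # low pass 509/511/513
        fb[n + delay] = lp_state             # controllable delay 515
    return y

fs = 8000
x = np.sin(2 * np.pi * 200 * np.arange(2 * fs) / fs)  # 2 s tone at F0 = 200 Hz
y = comb_speech_filter(x, delay=round(fs / 200))      # 40-sample loop delay
print(bool(np.max(np.abs(y[fs:])) > np.max(np.abs(x))))  # True: 200 Hz is reinforced
```

With `s_speech=0` the loop stays open and `y` is simply `g_comp * x`, matching the behaviour described for the switching amplifier.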
Although the speech gain filter 103 is shown with respect to a time domain representation of the filter, it would be understood that in some embodiments the speech gain filter 103 may be implemented in the frequency domain. For example the frequency domain components generated by the analyser in the overall operation of extracting the fundamental frequency value may themselves be modified based on the output of the analyser. Furthermore although the speech gain filter has been shown in the above example as a comb filter, any suitable filter or group of filters may be implemented. For example in some embodiments the speech gain filter comprises at least one band pass or notch filter, where each filter may be tuned to output the fundamental frequency (or, in filter bank embodiments comprising more than one filter, a fundamental and harmonics of the fundamental frequency). The speech gain filter 103 outputs the processed audio signal, which in some embodiments can be saved in the memory overwriting the input audio signal. In further embodiments the apparatus may determine, based on an input from the user, to save the processed audio signal in the memory with the original audio signal, in other words not overwriting the pre-processed audio signal.
Also in some embodiments pre- or post-processing may assist the main fundamental frequency processing embodiments. For example, as the human hearing mechanism is more sensitive to the frequency region between 2 and 4 kHz (primarily around 3.5 kHz), the audio signal to be analysed and processed may be pre-processed by being band limited before the processing, or post-processed by band limiting the output of the filter. Furthermore in some embodiments a digital equaliser may be implemented to convolve the equaliser with the processed signal so that speech could be understood more clearly by enhancing this 'sensitive' frequency region.
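A band-limiting pre- or post-processing stage of the kind mentioned above could be sketched with an off-the-shelf Butterworth band pass; SciPy is assumed to be available, and the filter order and exact band edges are this sketch's own choices:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def bandlimit(x, fs, low_hz=2000.0, high_hz=4000.0, order=4):
    """Band-limit a signal to the perceptually sensitive 2-4 kHz region."""
    sos = butter(order, [low_hz, high_hz], btype='bandpass', fs=fs, output='sos')
    return sosfilt(sos, x)

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 3000 * t) + np.sin(2 * np.pi * 100 * t)
y = bandlimit(x, fs)
# the 100 Hz component is strongly attenuated while the 3 kHz component passes
```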
With respect to Figure 7 a further schematic view of a speech extractor controlled audio signal processor according to some further embodiments of the application is shown. In such embodiments the multiple input channels Xi to XN are passed to a speech gain filter similar to that described previously and also to a mixer 601. The mixer 601 may then produce a single channel mixed signal which is passed to an analyser. The analyser in these embodiments is similar to the analyser configuration described previously, which outputs a speech indicator decision and a fundamental frequency value, but outputs the value to all of the filters. In such embodiments, the processing requirement is simplified over the previous embodiments as there is only one analyser operation.
With respect to Figure 8 a further schematic view of a speech extractor controlled audio signal processor schematic is shown. These embodiments are similar to the embodiments shown with respect to Figure 7 in that the multiple input channels Xi to XN are pre-processed prior to the operation of the analyser 101. Furthermore the multiple input channels Xi to XN are also pre-processed prior to the application of a single speech gain filter 103. In such embodiments the multiple input channels Xi to XN are input to a centre channel extractor 701 which is configured to implement a centre channel extraction or mixing of the input channels audio signals to generate an audio signal which most closely represents the central channel information. The centre channel extraction may be any suitable centre channel extraction operation and be implemented by any suitable apparatus.
In some embodiments, for example, the centre channel extractor may output signals using both magnitude and phase information at lower frequency values but only magnitude information for higher frequencies to extract the centre channel audio components. In other embodiments, frequency dependent magnitude and phase difference information between pairs of channels may be compared against user specific interaural level difference (ILD) and interaural time difference (ITD) cues. The result of the comparison may be used in some embodiments to determine whether or not a signal component is located at the centre channel and therefore suitable for extraction. Such embodiments may be further customised according to the user of the apparatus. Such embodiments may use centre channel extraction dependent on the operator's own head related transfer function. In such embodiments, sources in the medial plane of a binaurally recorded signal may be extracted.
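The patent leaves the centre channel extraction method open. Purely as one hypothetical per-bin rule (not taken from the disclosure), common content can be favoured by keeping, in each frequency bin, the smaller of the two channel magnitudes together with the phase of the channel sum, so that content panned to one side is suppressed:

```python
import numpy as np

def extract_centre(left, right):
    """Hypothetical centre estimate: per-bin minimum magnitude, summed phase."""
    L = np.fft.rfft(left)
    R = np.fft.rfft(right)
    mag = np.minimum(np.abs(L), np.abs(R))  # a one-sided source is near zero in one channel
    phase = np.angle(L + R)
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(left))

fs = 8000
t = np.arange(fs) / fs
voice = np.sin(2 * np.pi * 220 * t)     # centre-panned source, present in both channels
guitar = np.sin(2 * np.pi * 440 * t)    # left-only source
centre = extract_centre(voice + guitar, voice.copy())
# centre is close to the voice alone; the left-only guitar is removed
```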
The centre channel signal Xc may then in these embodiments be passed to the analyser which may be schematically similar to the analyser 101 described with respect to Figure 3, 4 and 5 and is configured to output an indicator whether or not there are any detected speech components within the centre channel extracted audio signal and the fundamental frequencies Fo associated with this detected speech component. These values may be passed to the speech gain filter 103 which also receives the centre channel audio signal Xc and in some embodiments performs a similar gain filtering as described with respect to Figure 6 to output an enhanced speech audio signal Yc. It would be further understood that in some embodiments the pre-processing may comprise an equalizer configured to boost frequencies within a predefined speech frequency range. With respect to Figure 9, an example of the operation of such speech gain filtering on an audio signal is shown.
Figure 9 shows two separate time sample values where an audio signal input 800 and filtered output 802 comprises music comprising instrumentals and also vocal sources.
The instrument-only component portions can be clearly seen with respect to the time periods 851 and 853, where the filtered output signal 802 is very similar to the input audio signal 800 for these ranges.
However where a voice component is present, such as in those regions shown by the periods 801, 803, 805, 807 and 809, the output trace is shown to be significantly amplified with respect to the input trace 800. Furthermore although the above description has been with respect to the enhancement of speech/voice components within the audio signal, it would be understood that by changing the gain within the speech gain filter 103, the speech components may be enhanced, by using a gain greater than 1, or diminished, by using a gain less than 1. Thus in some embodiments of the application, it would be possible to extract the music from the speech by significantly diminishing the voice components.
Although the above embodiments describe the identification of a speech signal within an audio signal and the associated processing of the audio signal using a fundamental frequency associated with the speech signal it would be appreciated that the above methods may be used for example to extract other audio sources with well defined fundamental and harmonic components. For example the above examples are suitable for implementation to remove specific instruments with well defined tonal values.
Thus in summary embodiments of the application perform a method comprising: dividing at least one audio signal into a plurality of segments; determining at least one fundamental frequency component over a predetermined number of segments; and processing the at least one audio signal dependent on the at least one fundamental speech frequency component. Although the above examples describe embodiments of the invention operating within an electronic device 10 or apparatus, it would be appreciated that the invention as described below may be implemented as part of any audio processor. Thus, for example, embodiments of the invention may be implemented in an audio processor which may implement audio processing over fixed or wired communication paths.
Thus user equipment may comprise an audio processor such as those described in embodiments of the invention above. It shall be appreciated that the terms electronic device and user equipment are intended to cover any suitable type of wireless user equipment, such as mobile telephones, portable data processing devices or portable web browsers.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. Thus at least some embodiments may be an apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: dividing at least one audio signal into a plurality of segments; determining at least one fundamental frequency component over a predetermined number of segments; and processing the at least one audio signal dependent on the at least one fundamental frequency component.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, or CD.
Thus at least some embodiments may be a computer-readable medium encoded with instructions that, when executed by a computer, perform: dividing at least one audio signal into a plurality of segments; determining at least one fundamental frequency component over a predetermined number of segments; and processing the at least one audio signal dependent on the at least one fundamental frequency component.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate. Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
As used in this application, the term 'circuitry' refers to all of the following:
(a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry) and
(b) to combinations of circuits and software (and/or firmware), such as: (i) to a combination of processor(s) or (ii) to portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions and (c) to circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present. This definition of 'circuitry' applies to all uses of this term in this application, including any claims. As a further example, as used in this application, the term 'circuitry' would also cover an implementation of merely a processor (or multiple processors) or portion of a processor and its (or their) accompanying software and/or firmware. The term 'circuitry' would also cover, for example and if applicable to the particular claim element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or other network device.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

CLAIMS:
1. A method comprising:
dividing at least one audio signal into a plurality of segments;
determining at least one fundamental frequency component over a predetermined number of segments; and
processing the at least one audio signal dependent on the at least one fundamental speech frequency component.
2. The method as claimed in claim 1, wherein determining at least one fundamental frequency component comprises:
determining at least one frequency component; and
selecting as the fundamental frequency component the frequency component with the greatest magnitude.
3. The method as claimed in claim 2, wherein determining at least one frequency component comprises selecting at least one consistent peak frequency component.
4. The method as claimed in claim 3, wherein selecting each of the at least one consistent peak frequency component comprises:
determining for at least a first range of consecutive segments a peak frequency component within a range of frequency values.
5. The method as claimed in claim 4, wherein the first range of consecutive segments comprises at least:
a first range defined by a minimum number of segments; and
a first range defined by a minimum and maximum number of segments.
6. The method as claimed in claims 2 to 5, wherein the at least one frequency component may be an at least one speech frequency component.
7. The method as claimed in claims 1 to 6, wherein processing the at least one audio signal dependent on the at least one fundamental frequency component comprises at least one of:
filtering the audio signal with a comb filter set at the fundamental frequency; suppressing the fundamental frequency and harmonics of the fundamental frequency in the audio signal; and
enhancing the fundamental frequency and harmonics of the fundamental frequency in the audio signal.
8. The method as claimed in claims 1 to 7, wherein the at least one audio signal comprises a centre channel mix of a multiple channel signal.
9. The method as claimed in claims 1 to 8, wherein the at least one audio signal comprises the output of a band pass filter configured to output speech frequencies.
10. The method as claimed in claims 1 to 9, wherein the at least one fundamental frequency component may be an at least one fundamental speech frequency component.
11. An apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform:
dividing at least one audio signal into a plurality of segments;
determining at least one fundamental frequency component over a predetermined number of segments; and
processing the at least one audio signal dependent on the at least one fundamental frequency component.
12. The apparatus as claimed in claim 11, wherein determining at least one fundamental frequency component causes the apparatus at least to perform:
determining at least one frequency component; and
selecting as the fundamental frequency component the frequency component with the greatest magnitude.
13. The apparatus as claimed in claim 12, wherein determining at least one frequency component causes the apparatus at least to perform selecting at least one consistent peak frequency component.
14. The apparatus as claimed in claim 13, wherein selecting each of the at least one consistent peak frequency component causes the apparatus at least to perform determining for at least a first range of consecutive segments a peak frequency component within a range of frequency values.
15. The apparatus as claimed in claim 14, wherein the first range of consecutive segments comprises at least:
a first range defined by a minimum number of segments; and
a first range defined by a minimum and maximum number of segments.
16. The apparatus as claimed in claims 12 to 15, wherein the at least one frequency component may be an at least one speech frequency component.
17. The apparatus as claimed in claims 11 to 16, wherein processing the at least one audio signal dependent on the at least one fundamental frequency component causes the apparatus at least to perform at least one of:
filtering the audio signal with a comb filter set at the fundamental frequency; suppressing the fundamental frequency and harmonics of the fundamental frequency in the audio signal; and
enhancing the fundamental frequency and harmonics of the fundamental frequency in the audio signal.
18. The apparatus as claimed in claims 11 to 17, wherein the at least one audio signal comprises a centre channel mix of a multiple channel signal
19. The apparatus as claimed in claims 11 to 18, wherein the at least one audio signal comprises the output of a band pass filter configured to output frequencies.
20. The apparatus as claimed in claims 11 to 19, wherein at least one fundamental frequency component may be at least one fundamental speech frequency component.
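Claim 17 lists comb filtering at the fundamental frequency, and suppression or enhancement of the fundamental and its harmonics. A feed-forward comb filter whose delay equals one pitch period can do both, depending on the sign of its gain: a positive gain reinforces the fundamental and its harmonics, a negative gain attenuates them. The sketch below is illustrative only; the normalisation and parameter names are assumptions, not taken from the patent:

```python
import numpy as np

def comb_filter(x, sample_rate, f0, gain=0.7):
    """Feed-forward comb filter tuned to f0: y[n] = x[n] + gain * x[n - D],
    with D one pitch period in samples. gain > 0 enhances f0 and its
    harmonics (e.g. to emphasise voiced speech); gain < 0 suppresses them.
    Assumes len(x) > D; the first D output samples are passed unfiltered."""
    delay = int(round(sample_rate / f0))
    y = x.astype(float).copy()
    y[delay:] += gain * x[:-delay]
    return y / (1.0 + abs(gain))  # normalise the peak filter gain to 1
```

Applied to a signal dominated by a 150 Hz tone, the positive-gain version keeps most of the tone's energy while the negative-gain version largely cancels it, which matches the claimed suppression/enhancement alternatives.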
PCT/EP2009/067894 2009-12-23 2009-12-23 An apparatus WO2011076284A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2009/067894 WO2011076284A1 (en) 2009-12-23 2009-12-23 An apparatus

Publications (1)

Publication Number Publication Date
WO2011076284A1 true WO2011076284A1 (en) 2011-06-30

Family

ID=42674631

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2009/067894 WO2011076284A1 (en) 2009-12-23 2009-12-23 An apparatus

Country Status (1)

Country Link
WO (1) WO2011076284A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1065655A1 (en) * 1992-03-18 2001-01-03 Sony Corporation High efficiency encoding method
EP0982713A2 (en) * 1998-06-15 2000-03-01 Yamaha Corporation Voice converter with extraction and modification of attribute data
WO2003015077A1 (en) * 2001-08-08 2003-02-20 Amusetec Co., Ltd. Pitch determination method and apparatus on spectral analysis

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109524016A (en) * 2018-10-16 2019-03-26 广州酷狗计算机科技有限公司 Audio-frequency processing method, device, electronic equipment and storage medium
CN109524016B (en) * 2018-10-16 2022-06-28 广州酷狗计算机科技有限公司 Audio processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US10523168B2 (en) Method and apparatus for processing an audio signal based on an estimated loudness
US8577676B2 (en) Method and apparatus for maintaining speech audibility in multi-channel audio with minimal impact on surround experience
US9881635B2 (en) Method and system for scaling ducking of speech-relevant channels in multi-channel audio
US10210883B2 (en) Signal processing apparatus for enhancing a voice component within a multi-channel audio signal
US9282419B2 (en) Audio processing method and audio processing apparatus
EP2484127B1 (en) Method, computer program and apparatus for processing audio signals
US9633667B2 (en) Adaptive audio signal filtering
JP2009520419A (en) Apparatus and method for synthesizing three output channels using two input channels
WO2013144422A1 (en) A method and apparatus for filtering an audio signal
WO2006126473A1 (en) Sound image localization device
WO2011076284A1 (en) An apparatus
CN110996205A (en) Earphone control method, earphone and readable storage medium
RU2384973C1 (en) Device and method for synthesising three output channels using two input channels

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09807607

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09807607

Country of ref document: EP

Kind code of ref document: A1