EP2380171A2 - Method and device for processing acoustic speech signals - Google Patents

Method and device for processing acoustic speech signals

Info

Publication number
EP2380171A2
EP2380171A2 (application EP09808931A)
Authority
EP
European Patent Office
Prior art keywords
frequency
speech
signal
signals
sounds
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP09808931A
Other languages
German (de)
English (en)
Inventor
Hans-Dieter Bauer
Axel Plinge
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BAUER, HANS-DIETER
PLINGE, AXEL
Original Assignee
Forschungsgesellschaft fuer Arbeitsphysiologie und Arbeitsschutz eV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from DE102009018469A external-priority patent/DE102009018469A1/de
Priority claimed from DE102009018470A external-priority patent/DE102009018470A1/de
Priority claimed from DE102009032238A external-priority patent/DE102009032238A1/de
Priority claimed from DE102009032236A external-priority patent/DE102009032236A1/de
Application filed by Forschungsgesellschaft fuer Arbeitsphysiologie und Arbeitsschutz eV filed Critical Forschungsgesellschaft fuer Arbeitsphysiologie und Arbeitsschutz eV
Publication of EP2380171A2
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04Time compression or expansion
    • G10L21/057Time compression or expansion for improving intelligibility
    • G10L2021/0575Aids for the handicapped in speaking

Definitions

  • the present invention relates to a method for processing acoustic speech signals and a device suitable for this purpose.
  • Corresponding methods and devices are used, for example, in hearing aid technology in order to improve the intelligibility of human speech for persons with hearing damage.
  • Such conventional electro-acoustic systems usually have arrangements of linearly amplifying assemblies.
  • Such an assembly may be for example a microphone input, a filter bank, a compressor or an output amplifier.
  • the acoustic speech signals are first converted via a microphone into electrical speech signals, which are input to the microphone input.
  • in the filter bank, which usually has a plurality of bandpass filters, the electrical speech signal is divided into a plurality of frequency bands, which are individually compressed by the compressor, which for this purpose has a plurality of compressor subunits. Subsequently, the compressed frequency bands are combined into a compressed speech signal which is amplified by the output amplifier.
  • the conventional level-controlled compressor significantly degrades the speech signal when the available dynamic range becomes narrower than the gap between the levels of weak and strong sounds. If the articulatorily weak sounds are then raised well above threshold, the articulatorily strong sounds are driven into the compressor's limiting characteristic branch, resulting in distortion of the rhythm and co-modulation of those sounds.
  • It is the object of the invention to provide an improved method and an improved apparatus for processing acoustic speech signals.
  • This object is achieved in a method of the type mentioned in that a class-specific processing of the speech signals is carried out, in which weakly articulated sounds are extended in time. This can be done by reinforcing the energy of weak sounds through time-delayed repetition of a feature-carrying part of the sound waveform.
  • a sound class comprises all sound variations of a sound which can be distinguished from another sound. For example, an "i" can be pronounced high, low or long without leaving the boundaries of the "i" sound class.
  • the speech signals are divided into several frequency bands.
  • this enables a further possibility for the individual processing of the speech signals, so that the processing can also be adapted very precisely to the respectively present hearing deficit.
  • the speech signals are split into high-frequency frequency bands which are above an upper limit frequency and frequency bands which are below the upper limit frequency.
  • the cut-off frequency preferably corresponds to the upper edge of the audible range and can be adjusted individually to the extent of the particular high-frequency loss present.
  • the invention further proposes that the high-frequency frequency bands are shifted to lower frequencies below the upper limit frequency and above a lower limit frequency.
  • sounds at the upper edge of the audible range, or beyond the limit of audibility, are thereby spectrally shifted into a more usable low-frequency region, so that the effectiveness of these sounds is increased.
  • the shift of the high-frequency frequency bands to lower frequencies below the upper limit frequency must leave the physiological class formation of the speech sounds completely intact. The shift may therefore only be done so far, or only in such a way, that the natural class boundaries found in the physiological classification space are not exceeded.
  • inter-sound transformations are to be excluded. For example, the frequency shift does not allow an "i" to become a "ü".
  • the frequency shift may only take place in the form of intra-sound transformations in which no conversion of sounds takes place and in which, for example, a high and acute perceptible "i" becomes a dull perceptible "i".
  • the shift of the high-frequency frequency bands to lower frequencies takes place above a lower limit frequency.
  • a further advantageous embodiment of the invention provides that the displacement of the high-frequency frequency bands takes place individually as a function of the respective frequency position of the high-frequency frequency band.
  • the frequency bands lying below the upper limit frequency are provided with different pre-emphasis.
  • This embodiment of the invention serves in particular to improve the signal-to-noise ratio. Since the individual frequency bands lying below the upper limit frequency are arranged in different frequency ranges, it makes sense to modulate each of these frequency bands with a different pre-emphasis. This procedure also benefits the individual adaptability of the method to the respective hearing deficit.
  • the frequency bands lying below the upper limit frequency are expediently compressed differently. This also allows the respective requirements for the processing of the acoustic speech signals to be satisfied, since they are processed very individually.
  • the speech signals are each assigned a specific sound class.
  • a sound class selector can be used, with which the speech signals are compared against predetermined characteristics of the individual sound classes, so that it can be determined to which sound class the sound received with the respective speech signal belongs.
  • an individual control of the individual processing measures of the speech signals according to the invention takes place.
  • the high-frequency frequency bands shifted to low frequencies are combined to form an intermediate speech signal as a function of the sound class respectively assigned to the speech signals. Whether and in what form this combination is carried out can also be individually adapted to the respective requirements. It is further considered advantageous to combine the low-frequency-shifted high-frequency bands with the closest frequency band located below the upper limit frequency to form a high-pitch intermediate speech signal.
  • the intermediate speech signal or the high-pitch intermediate speech signal is stored as a function of the sound class assigned to the speech signals, retrieved at predetermined time intervals, individually compressed, and combined with the other individually compressed frequency bands below the upper limit frequency to generate an output speech signal.
  • in voiced speech, the natural attenuation of the upper formant resonances is so strong that the envelopes have relatively narrow peaks and broad valleys.
  • the valleys can be filled smoothly by the time-delayed repetition of the respective waveform according to this embodiment of the invention, whereby the formant energy content of the overall vibration is substantially increased, for example by up to 6 dB. Since the ear integrates energy over segments of about 10 ms, this can produce a considerable increase in physiological activity, for example with regard to loudness and clarity. A weakly articulated sound is prolonged by these processing measures.
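The time-delayed repetition described above can be sketched as follows; this is a minimal illustration, not from the patent (the function name, delay choice and gain parameter are assumptions):

```python
import numpy as np

def extend_weak_sound(x, delay_samples, gain=1.0):
    """Add a time-delayed copy of a feature-carrying waveform to itself.

    Filling the envelope valleys with the delayed repetition raises the
    formant energy of the overall vibration (up to ~6 dB for an in-phase,
    equal-amplitude copy) and prolongs the sound in time.
    """
    y = np.zeros(len(x) + delay_samples)
    y[:len(x)] += x                 # original waveform
    y[delay_samples:] += gain * x   # time-delayed repetition
    return y
```

For a periodic formant oscillation, choosing the delay as a whole number of pitch periods keeps the copy in phase, which is where the full energy gain is obtained.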
  • the output speech signal is preferably modulated by means of an equalizer to compensate for unwanted spectral characteristics of signal output units that can be connected to the processing device. Preferably, the equalizer has a programmable FIR filter.
  • a preamplification of the frequency bands takes place before their compression as a function of the respective sound class assigned to the speech signals and / or the volume of ambient noise.
  • the gain per band is adaptively reduced so that a medium level is created which is either just well perceptible or just no longer perceptible.
  • the hearing impaired person can select the presetting of the just good, permanent perceptibility of the environmental noise if a control option is to remain, or the presetting "just no longer perceptible" if any environmental noise is to be considered disturbing.
  • the sound-class-specific processing of the speech signals comprises, for each sound class, its own processing measures or at least two cross-class processing measures.
  • as the at least two cross-class processing measures, in particular those are to be selected which are equally applicable to a plurality of sound classes and produce a perceptual gain without interference.
  • the lowest frequency-shifted band always delivers signals, whereas the higher, low-frequency-shifted, high-frequency frequency bands are switched on according to the respective sound class.
  • in the method according to the invention, a non-linear time-domain modification and a non-linear frequency-domain modification of the incoming acoustic speech signals thus take place, these modifications being closely coordinated.
  • the signal modification in the time domain in the form of a temporal extension of a speech signal preferably takes place only in the case of a spectrally preselected part of the speech signal, especially where such a modification makes sense and does not cause interference.
  • the formant energy content of the overall vibration is substantially increased.
  • the plosive region with its peak can be extended by up to 10 ms with the same frequency-domain prefiltering; even for these one-time events, the above-mentioned delay and summation yield a significant increase in activity relative to the non-impulsive feature signals.
  • the second formant of the "i" is made much more robust, i.e. emphasized, by frequency-shifting the third formant by a factor of about 0.8 and superimposing it; the second formant is found at 2.1 kHz. Limiting the frequency shift by the lower limit frequency yields a fixed truncation of the shifted energies at 2.3 kHz.
  • the feature energies of the other fricative sounds are compatibly concentrated and frequency-limited. Above all in people with high-frequency hearing loss, this causes a better effect of the characteristic energies in "ch" and also in "f".
  • the lower frequency limit always ensures that an excitation of the physiological "sch" channels, i.e. an inter-sound-class violation, is avoided.
  • the upper frequency range of 5 to 9 kHz contains feature energies of the "s” but also of the "t” and the "ch.”
  • a different mid-frequency shift by an individual factor is required so that the natural perception of sharpness in the high-frequency range corresponds to an equivalent perception of sharpness in the shifted feature energy, and thus for a physiologically natural-sound-adequate perception.
  • the frequency shift factors of the individual frequency band shifting units are made programmable in hearing aid applications to allow adjustment to the individual hearing loss.
  • to avoid artifacts on vocalic sounds, such as the vowels mentioned, including their formant transitions, the computational processing windows must be synchronized with the real-time pitch periods; a pitch synchronizer is therefore indispensable.
  • the feature burst extension can not fill pauses, as there are none.
  • an overlay after delay is not harmful either. A special suppression of the delay is therefore not required.
  • speech signal transformations can be used to compensate for the hearing loss, wherein a transformed speech signal is derived more or less directly from the original signal by comparatively simple modifications in the spectral range or in the time domain.
  • this is only effective if transformation-limiting boundary conditions are adhered to, which are designed so that phonetic changes caused by inadequate superposition of the original spectrum with the transformed spectrum are prevented.
  • weakly articulated sounds in the speech signal to be processed are recognized in the shortest possible time and replaced by corresponding, synthetic sounds.
  • a selective replacement of speech signal elements which are weakly articulated.
  • This selectivity is generated by means of a speech signal recognition method specifically tailored for the purpose of sound classification.
  • individual sound classes are selected in a short time.
  • replacement sounds are synthesized from stored components and inserted in place of the sounds to be replaced in the original speech signal.
  • a classification must be made at the sound level in a relatively short time in order to avoid any perception of asynchrony between the lip image and the sound.
  • the permitted time offset, which does not yet cause any perception of asynchrony, is about 30 ms.
  • the synthetic replacement sounds or their components are preferably largely pre-calculated and stored in a memory. In advance, it can be ensured that these new sounds are largely similar to the natural sounds in terms of perception.
  • before their insertion into the speech signal, the synthetic sounds are adapted in terms of energy (volume) and/or frequency center of gravity (pitch) to the weakly articulated sounds to be replaced. As a result, the synthetic sounds are largely similar in perception to the weakly articulated sounds they replace.
  • the speech signal is delayed in time prior to insertion of the synthetic sounds. This delay is used for the temporal synchronization of speech signal and synthetic sound. Since the processing of the speech signal, for example in the form of a compression, and the speech signal recognition and production of the synthetic sound take different times, the temporal synchronization is almost essential.
  • a further advantageous embodiment of the invention provides that the synthetic sounds are dynamically switched into and out of the received speech signal, so that annoying and unnatural jumps in the sound image are avoided.
  • the speech signal is divided into a plurality of frequency bands, which can be compressed individually to allow an ideal adaptation of the speech signal processing to a specific hearing damage.
  • each of the weakly articulated sounds is assigned a predetermined sound prototype. This is done by voice signal detection.
  • speech signal features are extracted from the untreated speech signal, which were identified as being optimally suitable in preliminary tests for speech recognition.
  • the assignment of the sound prototypes to the weakly articulated sounds takes place taking into account at least one speech signal feature.
  • the spectral energy ratios of the speech signal can be used.
  • Another suitable speech signal feature may be the voicing of the speech signal.
  • the normalized cross-correlation is defined as the cross-correlation (CC) for the displacement t divided by the square root of the product of the autocorrelation (AK) at the points 0 and t.
  • the maximum of this function in the range between 1 and 10 ms is interpreted as an indicator of voicing.
  • NCC = max { NCCF(t) : t ∈ [1 ms, 10 ms] }, where NCCF(t) = CC(t) / √(AC(0) · AC(t))
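Under one common reading of this definition (normalizing by the energies of the two compared windows, as in RAPT-style NCCF), the voicing indicator can be sketched as follows; the helper names are assumptions:

```python
import numpy as np

def nccf(x, t):
    """Normalized cross-correlation at lag t: the cross-correlation of
    the window with its shifted copy, divided by the square root of the
    product of the two windows' autocorrelation energies."""
    n = len(x) - t
    a, b = x[:n], x[t:t + n]
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def voicing(x, fs):
    """Maximum of NCCF over lags between 1 and 10 ms, interpreted as an
    indicator of voicing."""
    lags = range(int(0.001 * fs), int(0.010 * fs) + 1)
    return max(nccf(x, t) for t in lags)
```

A strongly periodic (voiced) frame yields a value near 1; noise-like (voiceless) frames stay well below the 0.45/0.55 bounds mentioned later in the training section.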
  • the voice signal should be precleaned from DC offsets and low-pass filtered.
  • a 4th-order Chebyshev low-pass with a 3 kHz cutoff frequency is used for this purpose.
  • another suitable speech signal feature may be a pause in the speech signal.
  • a peak-over-average pause detector can be used to detect the occlusion pauses in plosives, whereby the local modulation of the speech signal is determined by comparing the fast absolute signal values (0.1 ms) with a slower (10 ms) energy average.
  • Both single-channel pause detectors and multi-channel pause detectors with "min-max tracking" can be used. The latter are less susceptible to interference than the single-channel detectors.
  • the last pause value can be held for 20 ms.
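A single-channel peak-over-average pause detector along these lines might look as follows (the threshold ratio and exact window handling are assumptions):

```python
import numpy as np

def pause_detector(x, fs, fast_ms=0.1, slow_ms=10.0, ratio=0.25):
    """Flag samples as 'pause' when the fast local magnitude (0.1 ms)
    falls well below the slower (10 ms) running energy average, as in
    the occlusion pauses of plosives."""
    fast_n = max(1, int(fast_ms * 1e-3 * fs))
    slow_n = max(1, int(slow_ms * 1e-3 * fs))
    mag = np.abs(x)
    fast = np.convolve(mag, np.ones(fast_n) / fast_n, mode='same')
    slow = np.convolve(mag, np.ones(slow_n) / slow_n, mode='same')
    return fast < ratio * slow      # boolean pause mask per sample
```

In a multi-channel variant, each frequency band would get its own detector and the per-band masks would be combined, which is what makes it less susceptible to interference.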
  • Another suitable speech signal feature is the rate-of-rise (ROR) of the speech signal.
  • the rate of rise may be used, for example, to detect the plosive burst; it measures the increase in the local relative energy of the speech signal between short-term averages at times t − 1 ms and t ms.
  • the ratio of the time averages over 20 ms at time t ms and t-1 ms can be formed.
  • the maximum of this value can be kept, for example, at 50 ms.
  • the speech signal can be prefiltered by an FIR bandpass with a passband of 2 - 10 kHz.
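The rate-of-rise feature can be sketched as the ratio of 20 ms short-term energy averages taken 1 ms apart (the epsilon guard is an assumption to keep silence well-defined):

```python
import numpy as np

def rate_of_rise(x, fs, win_ms=20.0, step_ms=1.0):
    """Ratio of the 20 ms short-term energy average at time t to the
    average at t - 1 ms; a sharp increase marks a plosive burst."""
    win = int(win_ms * 1e-3 * fs)
    step = int(step_ms * 1e-3 * fs)
    energy = np.convolve(x ** 2, np.ones(win) / win, mode='valid')
    eps = 1e-12                     # avoids division by zero in silence
    return (energy[step:] + eps) / (energy[:-step] + eps)
```

In the recognizer, the running maximum of this value would then be held for about 50 ms, with the input prefiltered by the 2 – 10 kHz FIR bandpass mentioned above.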
  • a Gaussian classifier with stored sound prototypes for the sound classes "f", "sch", "ch", "s", "z", "k" and "t" can be used as the speech signal recognizer.
  • the classifier preferably works in three steps: the input value for each speech signal feature is first windowed with a valid range; for the speech signal features remaining after this filtering, the distance to the normally distributed sound prototypes is calculated using a normal-distribution-based distance measure d_k(x); finally, the prototype closest to the input vector, which consists of components of different speech signal features, is selected and assigned to the weakly articulated sound.
  • the resulting decision time series can additionally be smoothed with a stochastic filter.
  • the probability of the sound class prototypes can be determined according to predetermined Gaussian densities. As usual, the decision is made by means of a distance measure; the sound class with the smallest corresponding distance is assumed to be detected. Without covariances, the distance of a sound k over all dimensions i is calculated as d_k(x) = Σ_i ((x_i − μ_k,i) / σ_k,i)².
  • a temporal smoothing is preferably carried out.
  • the first two steps can be carried out continuously, so that there is a decision for a sound prototype class per input sample. For example, all individual decisions can be written to a 20 ms ring buffer, of which the most common class value in the 20 ms interval is used as the final recognition result (MAXWINS operator).
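The covariance-free distance decision and the MAXWINS majority vote can be sketched as follows (the prototype data layout is an assumption):

```python
import numpy as np
from collections import Counter

def classify(x, prototypes):
    """Covariance-free Gaussian distance: d_k(x) = sum over dimensions i
    of ((x_i - mu_k,i) / sigma_k,i)^2; the sound class with the smallest
    distance is assumed to be detected."""
    def dist(p):
        return float(np.sum(((x - p['mu']) / p['sigma']) ** 2))
    return min(prototypes, key=lambda k: dist(prototypes[k]))

def maxwins(decisions):
    """MAXWINS operator: the most common class value in the 20 ms ring
    buffer of per-sample decisions is the final recognition result."""
    return Counter(decisions).most_common(1)[0][0]
```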
  • the classifier should be trained with natural language before use. To determine valid parameters, the following method has proven itself.
  • the function "set ranges by agglomeration" is used, whereby a sufficiently variable sample is employed; for example, speakers of different ages and sexes each speak at least five different possible utterances per phoneme.
  • the ranges can be obtained with the "extended median": the values at 30% and 70% of the sorted sequence of a segment's values are defined as its limits.
  • the valid range per speech signal feature is determined from the union of the ranges of all training words. Three limits, for example, cannot be trained but are preselected according to empirical values:
  • the range of the NCC maximum is fixed for voiceless sounds by an upper bound of 0.45 and for voiced sounds by a lower bound of 0.55.
  • the pause-length range is set to at least 30 ms for plosives.
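The "extended median" range estimation and the union over training words can be sketched as:

```python
import numpy as np

def extended_median_range(values, lo=0.30, hi=0.70):
    """'Extended median' of one training segment: the values at 30% and
    70% of the sorted sequence serve as the segment's range limits."""
    s = np.sort(np.asarray(values))
    n = len(s)
    return s[int(lo * (n - 1))], s[int(hi * (n - 1))]

def valid_range(segments):
    """Valid range per feature: the union (envelope) of the ranges of
    all training segments."""
    ranges = [extended_median_range(seg) for seg in segments]
    return min(r[0] for r in ranges), max(r[1] for r in ranges)
```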
  • the invention further proposes that the synthetic sounds are generated by generating a noise signal component and a sinusoidal signal component for a synthetic sound and combining them together.
  • the multidimensional Gaussian distributions can be calculated directly from the training material.
  • a combination of band limited noise with limited level distribution and a variable frequency sinusoidal tone can be chosen.
  • controlled frequency shifts of the added sine tone can be introduced to make the shifts of the spectral center of the original sound perceptually transferable.
  • the shapes of the noise signals are preferably selected such that a maximum similarity to the original sound is achieved despite changed frequencies. This can be achieved by special synthesis measures, which generate at the replacement sound perception values of sharpness and roughness, which are as equivalent as possible to those of the original sound despite changed timbre.
  • the noise signal of all components can first be generated by FIR filtering of white noise (random number generator).
  • the feature-carrying frequency range of the replacement "s" is usually positioned at 1.6 kHz; this position produces good perceptual distances to the natural "s" and "ch".
  • accordingly, the noise can be filtered narrowband between 1.4 and 1.8 kHz. To shape the amplitude distribution for the most pleasant possible perception of sharpness (little noise character), the resulting signal can be hard-limited, for example replaced by its sign, and filtered again. This process is preferably repeated several times. As a result, one obtains an amplitude distribution with strong asymmetry, i.e. there are only a few slight exceedances of the limiting level.
  • Such a signal maximizes the perception of sharpness. Even at high presentation levels, the generation of an unpleasant noise character is avoided. Furthermore, the sensory cells are protected from high loads by short peak levels.
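The limit-and-refilter shaping can be sketched as follows; the FIR design, tap count and number of iterations are assumptions, not patent values:

```python
import numpy as np

def _fir_bandpass(taps, lo, hi):
    """Windowed-sinc FIR bandpass; lo and hi are cutoffs relative to fs."""
    n = np.arange(taps) - (taps - 1) / 2
    lowpass = lambda fc: 2 * fc * np.sinc(2 * fc * n)
    return (lowpass(hi) - lowpass(lo)) * np.hamming(taps)

def shaped_noise(n, fs, band=(1400.0, 1800.0), iters=3, taps=257, seed=0):
    """Bandpass white noise, then repeatedly replace the signal by its
    sign (hard limiting) and refilter; this skews the amplitude
    distribution so the limiting level is only slightly exceeded."""
    rng = np.random.default_rng(seed)
    h = _fir_bandpass(taps, band[0] / fs, band[1] / fs)
    x = np.convolve(rng.standard_normal(n), h, mode='same')
    for _ in range(iters):
        x = np.convolve(np.sign(x), h, mode='same')
    return x / np.max(np.abs(x))
```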
  • the noise thus generated can be stored as a time signal.
  • segments of random length can be selected at random from a sufficiently large buffer (about 500 ms). These can be concatenated by means of a sinusoidal transition into a longer pseudorandom noise signal.
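Concatenating two such segments with a sinusoidal (raised-cosine) transition can be sketched as:

```python
import numpy as np

def sine_crossfade(a, b, n):
    """Join two noise segments with an n-sample sinusoidal crossfade,
    avoiding clicks at the splice point; the fade weights sum to 1."""
    w = np.sin(np.linspace(0.0, np.pi / 2, n)) ** 2   # fades 0 -> 1
    spliced = a[-n:] * (1 - w) + b[:n] * w
    return np.concatenate([a[:-n], spliced, b[n:]])
```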
  • a second, broadband noise component can be generated for the "s". This can be cut out of the white-noise spectrum by FIR filtering with a passband of 800 Hz to 4 kHz. The amplitude-distribution shaping described above is also used here. This second component should not be omitted, since it ensures that the binding of the replacement "s" to context sounds with features in this spectral range is improved and "stream segregation" is avoided. It can be added to the first noise component at a level lower by about 6 to 12 dB; the exact level should be adjusted to the individual hearing impairment.
  • the argument of the sine function can be obtained by integrating phase values.
  • the current frequency can be obtained from the count of zero crossings.
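Both steps can be sketched in a few lines (a simplified illustration; the linear mapping of the frequency estimate mentioned later is omitted):

```python
import numpy as np

def zc_frequency(x, fs):
    """Estimate the current (mean) frequency from the count of zero
    crossings: each full period contributes two crossings."""
    crossings = np.count_nonzero(np.diff(np.signbit(x)))
    return crossings * fs / (2.0 * len(x))

def sine_from_freq(freqs, fs):
    """Obtain the sine argument by integrating (summing) the phase
    increments derived from an instantaneous-frequency track."""
    phase = np.cumsum(2 * np.pi * np.asarray(freqs) / fs)
    return np.sin(phase)
```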
  • the interpolation positions for "middle s", "high s" and "low s” can be set individually by a simple hearing test.
  • the "ch” may for example be composed of two spectral-centered components.
  • the substitute "ch" can be formed from a low-frequency component of about 400 Hz and a higher-frequency component of about 2 kHz, whose positions are preferably determined beforehand by means of listening experiments. Both components can be generated from white noise.
  • the amplitude distribution can be modified similarly to the "s":
  • the signal can be modified here by limiting and filtering twice, so that high signal peaks are avoided while noticeable fluctuations, as in the natural "ch", still occur.
  • the noise generated in this way can in turn be stored as a time signal.
  • random segments of random length can be selected from a sufficiently large buffer (approximately 500 ms). These segments can be concatenated by means of a sinusoidal crossfade into a longer pseudorandom noise signal.
  • a peculiarity can be introduced here: preliminary tests have shown that the natural perceptual image of the "ch" is influenced by fluctuations of the envelope in the range of 5 to 20 ms. This percept can be called "single-element roughness". It can be reproduced, for example, by introducing random short (5 to 10 ms) pauses between the sine windows of the aforementioned noise segments. In this way, maximum similarity between the natural "ch" and the synthetic "ch" is achieved. It is expected that this characteristic can also be evaluated well by the damaged ear.
  • a zero-crossing counter can again be used when generating the sine component of the replacement "ch", this time applied to the input signal bandpass-filtered in the range of 5 to 10 kHz.
  • the value thus obtained, to be understood as an estimate of the mean frequency, can again be transformed with a linear mapping function and then summed up so that it can be used as the argument of a sine function.
  • the "t" can be generated by inserting a complex consisting of a stored synthetic plosive burst and an additional wideband noise component which equals the high-frequency portion of the synthetic "s" signal.
  • the stored plosive burst can be obtained from a bipolar triangular signal, which is filtered, for example, with an FIR bandpass filter with a passband between 100 Hz and 800 Hz and repeated twice every 10 ms.
  • the insertion time can be set to that of the maximum ROR.
  • a continuous, broadband, higher-frequency noise signal (800 Hz to 4 kHz) forms the additional noise component.
  • the implementation is preferably carried out in the recognizer, which holds the pause and ROR signals for 50 ms.
  • the "t" processing is normally maintained for 50 ms, and the process is preferably interrupted only if the spectral shape changes greatly, such that the band energy values fall outside the range permitted for "t".
  • the soft-switch of the "t" is operated, for example, with 2 instead of 10 ms switch-on time.
  • the amplitudes of the synthetic sounds are compressed individually before insertion into the speech signal.
  • the levels of the synthesized sounds can be adapted to the individual recruitment characteristics of the injured ear.
  • the original signal is bandpass-filtered, the moving average of its magnitude is formed, and the resulting original energy is transformed by a compression characteristic appropriate to the new spectral position.
  • a 4-segment compression characteristic curve can be provided: 1. Below th0, no compression is applied.
  • in the limiting segment, a compression ratio of r2 : 1 (r2 about 10 to "infinity"), or even a negative slope, is made adjustable.
  • the multiplication factor m is calculated as a function of the mean value x as follows:
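Only the outer segments of the characteristic are specified above (no compression below th0, limiting at roughly r2 = 10:1 up to "infinity":1); a sketch with assumed knee points and an assumed middle ratio r1 could look like:

```python
def compression_gain(x, th0=0.01, th1=0.2, r1=3.0, r2=1e9):
    """Multiplication factor m as a function of the moving-average
    magnitude x of the band-filtered original signal.
    Below th0 the gain is linear (m = 1); between th0 and th1 the
    output grows with slope 1/r1 (compression r1:1, an assumed middle
    segment); above th1 it is limited with ratio r2:1."""
    if x <= th0:
        return 1.0
    if x <= th1:
        return (th0 + (x - th0) / r1) / x
    y1 = th0 + (th1 - th0) / r1
    return (y1 + (x - th1) / r2) / x
```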
  • the sum signal resulting from the addition of the noise and sine signals can be multiplied by the original compressed amplitude signal.
  • the addition of the replacement sounds can be controlled by the recognizer signal via a soft-switch: on detection of a weakly articulated sound to be replaced, its synthesis signal is added with an amplitude that increases linearly over a switch-on time t_on (about 10 ms).
  • the signal is faded out over a switch-off time t_off (approximately 20 to 50 ms), its amplitude decreasing linearly to 0.
  • the input signal is preferably delayed by 20 ms with respect to the synthesis signal in order to compensate for the delay by detecting and switching on.
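The soft-switch behaviour described in the last three points can be sketched as a per-sample gain envelope:

```python
import numpy as np

def soft_switch_gain(flags, fs, t_on=0.010, t_off=0.030):
    """Ramp the synthesis-signal gain linearly up over t_on while the
    recognizer flags a sound to replace, and linearly down over t_off
    once the flag clears."""
    up = 1.0 / (t_on * fs)
    down = 1.0 / (t_off * fs)
    g = np.empty(len(flags))
    gain = 0.0
    for i, on in enumerate(flags):
        gain = min(1.0, gain + up) if on else max(0.0, gain - down)
        g[i] = gain
    return g
```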
  • the invention further relates to a speech synthesis method, in particular for the production of synthetic sounds in a method of the type described above, wherein two or more formant waveforms are each generated by modulating a source signal oscillating at a formant frequency with an envelope function; the two or more formant waveforms are added, and the added formant waveforms are concatenated into a suprasegmental speech signal according to a pitch interval length and suprasegmental chaining rules.
  • when the ear is stimulated with a sine wave, a pure tone is perceived. The quality of this sensation is called tonality.
  • Natural speech contains no tonality, and synthetic speech must not contain any either. Tonal sensations within synthetic speech sequences are disturbances. Frequency changes of complexes containing tonality disturbances can create particularly annoying "chirping".
  • the invention proposes a synthesis method in which the source signals are frequency-modulated in the generation of the formant waveforms.
  • tonality in repetitive waveforms consisting primarily of sine signal packets is largely eliminated by frequency modulation.
  • the source signals oscillating at the respective formant frequency are wobbled (frequency-swept) according to a predetermined function.
  • the varying frequency of the source signal prevents the basilar membrane from producing only a narrow distribution of time intervals in the acoustic nerve over time.
  • the distribution is broadened by the frequency modulation.
  • the frequency position of the cortically extracted maximum of the distribution becomes (controllable) more undefined.
  • the frequency modulation of the source signals is cyclostationary.
  • This type of frequency modulation is practically particularly easy to implement and produces the desired naturalness of the synthesized speech.
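A cyclostationary wobble of a single formant source can be sketched as follows (the sinusoidal modulation function is an assumption; the text only requires a predetermined, cyclostationary function):

```python
import numpy as np

def wobbled_formant(f_formant, deviation, f_mod, duration, fs):
    """Frequency-modulate a sine source oscillating at the formant
    frequency: the instantaneous frequency is swept around f_formant by
    up to +/- deviation (a fraction, e.g. 0.1 = 10%), broadening the
    interval distribution in the acoustic nerve and suppressing the
    tonality percept."""
    t = np.arange(int(duration * fs)) / fs
    inst_f = f_formant * (1.0 + deviation * np.sin(2 * np.pi * f_mod * t))
    phase = 2 * np.pi * np.cumsum(inst_f) / fs
    return np.sin(phase)
```

Integrating the instantaneous frequency into a phase track (rather than modulating the argument directly) keeps the waveform continuous, which matters for the click-free concatenation described below.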
  • tonality can also be heard with intermittently presented sine packets, almost through the repetition pitch. This is especially true of the periodically repeated sine bursts of formant waveforms.
  • the tonality percept can thus be integrated over short pauses. With shorter sampling times this percept weakens and virtually disappears behind the strength of the periodicity-pitch perception.
  • the concatenation of the frequency-modulated wave packets is carried out with pitch-adaptive envelope shaping such that no perceptible modulation disturbances occur in the overlap region of the wave trains.
  • the modulation swing of the frequency modulation in the generation of formant waveforms depends on the respective average formant frequency.
  • since the frequency-swept sinusoidal packet according to the invention is intended to represent an optimally classifiable vowel formant, the frequency of a source signal cannot be deflected arbitrarily far from the original sinusoidal frequency. The cognitive range of the "good vowel prototype" must not be left; this can be ensured by setting the swing-range functions accordingly.
• in natural articulation, a formant frequency contains large micro-fluctuations within a period, which may be the reason that tonality is never a problem there.
• the extent of the vowels' regions of existence, insofar as they are spanned by two formants without varying the frequency of the source signals, can be determined beforehand by psychophysical experiments. This extent of the respective regions of existence of both formants depends essentially on the average position of the source signal frequency. In the synthesis of 2-formant vowels, for example, the following two swing functions can be specified for the two oscillating source signals: one for first formants in the range up to 1000 Hz and one for second formants in the range of 500 Hz to 4 kHz.
• the modulation swing of the frequency modulation is up to 20 %, preferably up to 10 %, of the respective average formant frequency.
  • the modulation swing of the frequency modulation in the synthesis of female speech is smaller than in the synthesis of male speech.
• for male speakers, the typical deviation is, for example, constant at 10 % for broad u-formants below 200 Hz, then falls (in percentage terms) linearly up to 1 kHz and rises slightly again towards 4 kHz. With the high pitch of female speakers, less frequency modulation can be used; for example, the percentage deviation chosen for men is halved.
• a further advantageous embodiment of the invention provides that the pitch interval length is varied in the superposition and concatenation of the added formant waveforms.
  • a randomized variation of the pitch interval length is preferably introduced, whereby the maximum occurring deviation can be predefined.
  • This embodiment serves to avoid the occurrence of tonality with equivalent synthesis of voiced pitch excitation intervals.
• a precisely repeated pitch waveform generates a very narrow and high-energy frequency distribution of the pitch-interval-assigned pulse spikes in the acoustic nerve when the repetition intervals are evaluated neuronally; this evaluation is conceivable as a cross-correlation.
• the pitch interval length is varied so that its instantaneous value is provided with stochastic fluctuations amounting to a maximum of 1 % to 2 % in the synthesis of male speakers, but only ≤ 0.5 % in the synthesis of female speakers.
  • a further advantageous embodiment additionally provides a rule according to which an absolute constancy of the stylized synthesized pitch curve (without the abovementioned stochastic fluctuations) over a typical syllable interval (approximately 200 ms) is prohibited; the deviation from a horizontal course must be> 3% here.
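The stochastic variation of the pitch interval length can be sketched as follows (the 2 % / 0.5 % figures come from the text; the uniform distribution and the fixed seed are illustrative assumptions):

```python
import random

def jittered_pitch_intervals(mean_f0, n, max_jitter=0.02, rng=None):
    """Pitch interval lengths (in seconds) with stochastic fluctuations.

    max_jitter: up to 0.01-0.02 for male voices, <= 0.005 for female
    voices, as stated in the text. Uniform jitter is an assumption.
    """
    rng = rng or random.Random(0)
    base = 1.0 / mean_f0
    return [base * (1.0 + rng.uniform(-max_jitter, max_jitter)) for _ in range(n)]

intervals = jittered_pitch_intervals(120.0, 25, max_jitter=0.02)
```

Each interval stays within the prescribed deviation of the mean pitch period, so the narrow interval distribution in the acoustic nerve described above is broadened without destroying the pitch percept.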
• the envelope functions consist of three temporally successive segments: a transient segment in which the amplitude of the source signal rises from zero, a holding segment in which the amplitude of the source signal is constant, and a decay segment in which the amplitude of the source signal drops back to zero.
  • the windowing of the source signal by the transient segment is preferably chosen as a function of the formant frequency.
  • the underlying model idea is that with natural articulation the transient segment is triggered by the abrupt closure of the glottis.
• the envelope slope is given by the "filter quality" of the cavity, with closed glottis, at the formant resonance frequency.
  • the time length of the holding segment is dependent on the frequency.
  • the decay segment is provided analogous to the transient process with a window whose length is preferably made dependent on the frequency of the source signal.
  • the state of the system changes, so that different, varying losses must be expected, which in turn can influence the decay segment.
  • This system assumption is used later to vary the swing-out segment as a function of the pitch frequency close to nature.
  • the segments of the envelope function should be changed as a function of the frequencies of the source signals, for example, as follows: For the holding segment, linear segment functions in three carrier frequency ranges are used.
  • the swing-out segment is defined as a percentage of the pitch period. The percentage is a function of the frequency of the source signal, which is preferably selected to be constant below 800 Hz and moreover linearly drops to 4 kHz.
  • the duration of the transient segment, the holding segment and / or the decay segment depend on the pitch interval length.
  • the duration of the swing-out segment is shortened to a minimum value, and then the duration of the hold segment is shortened, so that interferences of formant waveforms of successive pitch intervals are avoided.
  • a cascading shortening strategy ensures that initially no unwanted bandwidth increase takes place.
  • the holding segment of the formant waveform is shortened as the excitation frequency increases further; in the limiting case, the holding segment disappears completely.
  • the duration of the transient segment preferably corresponds to an integer number of zero crossings of the oscillations of the source signal.
• the number of zero crossings is determined as a function of the formant centre frequency. It preferably increases up to 1 kHz in order to obtain a realistic transient response of lower formants. From 1 to 2.6 kHz it preferably continues to rise, flattening out up to 3 kHz, and then drops off again with a steep gradient. This prevents the occurrence of an unnaturally overemphatic percept of the second formant, if a near-natural percept is desired rather than an over-clear one. However, if the latter is desired to increase intelligibility in the presence of noise, an "over-clear" setting can also be selected.
  • the swing-out segment of the envelope function is designed such that the amplitude at the end of the pitch interval has fallen to at most 35%, preferably to a maximum of 25% of the constant amplitude during the holding segment.
  • the value of the final amplitude is preferably pitch-adaptively set.
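The three-segment envelope described above (transient, holding and decay segment, with a residual end amplitude of at most 25-35 %) can be sketched as follows (the raised-cosine flank shape and segment lengths are illustrative assumptions; the text prescribes only the segment structure and end amplitude):

```python
import numpy as np

def envelope(n_attack, n_hold, n_decay, end_level=0.25):
    """Three-segment envelope: transient (0 -> 1), hold (1), decay (1 -> end_level).

    end_level is the residual amplitude at the end of the pitch interval
    (at most 0.35, preferably 0.25, per the text). Raised-cosine flanks
    are an illustrative assumption.
    """
    attack = 0.5 - 0.5 * np.cos(np.linspace(0.0, np.pi, n_attack))
    hold = np.ones(n_hold)
    decay = 1.0 - (1.0 - end_level) * (0.5 - 0.5 * np.cos(np.linspace(0.0, np.pi, n_decay)))
    return np.concatenate([attack, hold, decay])

env = envelope(32, 64, 48)  # segment lengths in samples (illustrative)
```

In the method above, the segment lengths would be set as functions of formant frequency and pitch interval length, with the cascading shortening strategy applied at high excitation frequencies.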
  • the speech signal undergoes high-pass filtering.
• a high-pass filtering by means of an IIR filter with a cutoff frequency of 100 Hz is provided. In this way, unwanted low-frequency signal components can be eliminated which arise from the superposition of waveforms with variable pitch interval length.
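A minimal sketch of such a high-pass (the text specifies only "IIR filter, 100 Hz cutoff"; the first-order RC design below is an illustrative choice, not necessarily the filter order used in the invention):

```python
import math

def highpass_iir(x, fs, fc=100.0):
    """First-order IIR high-pass: y[n] = a * (y[n-1] + x[n] - x[n-1]).

    A minimal illustrative design; the text fixes only the 100 Hz cutoff.
    """
    a = 1.0 / (1.0 + 2.0 * math.pi * fc / fs)
    y, y_prev, x_prev = [], 0.0, 0.0
    for v in x:
        y_prev = a * (y_prev + v - x_prev)
        x_prev = v
        y.append(y_prev)
    return y

# A DC offset (extreme low-frequency component) is removed almost completely
settled = highpass_iir([1.0] * 1000, 16000)[-1]
```

Components well below the cutoff decay towards zero while speech-band components pass nearly unattenuated.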
  • the level ratio is defined as a two-dimensional function depending on the frequencies of the first and second formants (F1 and F2, respectively).
• the table below shows values for typical vowel positions.
• ratio values for intermediate positions can be interpolated from the tabulated interpolation points. This is done by calculating the triangulation of the F1/F2 vertices once and then computing each required value as a point on a corner-to-side line of the surrounding triangle. The values are determined by comparing the resulting synthesis spectrum with the spectrum of natural sounds after all other parameters have been specified.
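The interpolation within one triangle of the F1/F2 triangulation can be illustrated with barycentric weights (the vertex coordinates and dB values below are hypothetical placeholders, not the patent's table):

```python
def barycentric_interp(p, tri, values):
    """Interpolate a level-ratio value at p = (F1, F2) inside a triangle.

    tri: three (F1, F2) vertices of the surrounding triangle; values: the
    tabulated dB ratios at those vertices. Coordinates and dB values used
    below are hypothetical, not taken from the patent's table.
    """
    (x1, y1), (x2, y2), (x3, y3) = tri
    px, py = p
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (px - x3) + (x3 - x2) * (py - y3)) / det
    w2 = ((y3 - y1) * (px - x3) + (x1 - x3) * (py - y3)) / det
    return w1 * values[0] + w2 * values[1] + (1.0 - w1 - w2) * values[2]

tri = [(300.0, 2300.0), (700.0, 1200.0), (250.0, 800.0)]
vals = [-12.0, -3.0, -8.0]
ratio = barycentric_interp((400.0, 1500.0), tri, vals)  # between -12 and -3 dB
```

At a vertex, the interpolation reproduces the tabulated value exactly; inside the triangle it blends the three surrounding values.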
• the F1/F2 value for "i" was deliberately estimated rather high at -12 dB, so that the synthesis does not become unnecessarily difficult to understand.
• the method according to the invention makes it possible to synthesise "super-clear" vowels, which is advantageous, for example, for generating test signals for the fitting of hearing aids. Furthermore, such vowels can be better understood by persons with hearing deficits.
• the spectral valley depth and thus the degree of spectral modulation are driven as far as the naturalness constraints allow.
• By lengthening the holding time of higher formants relative to natural window lengths and by concentrating the feature-carrying energies into the perceptually effective spectral feature-detection areas, super-clear or highly noise-resistant vowel prototypes can be produced. This is a particular advantage for speech output with vowels generated in this way in disturbed environments.
  • the (mean) formant frequencies are pitch-varied, in such a way that the formant frequencies are increased as the pitch interval length is shortened.
• intonated, rhythmic, suprasegmental sequences can be generated in which a natural perceptual vowel stability is ensured.
• only the measurable formant changes required as a function of pitch changes are needed for optimum identity preservation of the vowel image.
• a shift of the mean formant frequency position can not only give an impression of unnaturalness; with significant shifts of the average pitch, perception can skip a class boundary, so that the vowel can perceptually mutate into another class (male - female - child - soprano).
• for pitch-intonation variations including male-female differences, it is found that, to prevent these disturbances of vowel perceptual constancy, the formant frequencies must be changed on the suprasegmental time scale according to unique functions. The perceptive and cognitive mechanism underlying the established vowel constancy has not yet been fully elucidated.
  • the formant frequency can be varied in the same way as for complex suprasegmental pitch contours.
• the formant frequencies are varied in the same direction as dictated by the pitch change. For this purpose, for example, a positive coupling of 1 to 5 % formant frequency change per 10 % pitch change in the suprasegment can be used.
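The positive coupling of formant frequency to pitch change can be sketched as follows (the coupling factor 0.3, i.e. 3 % per 10 %, is one value within the 1-5 % range stated above; the linear coupling law is an assumption):

```python
def pitch_coupled_formant(f_formant, f0, f0_ref, coupling=0.3):
    """Shift a mean formant frequency in the same direction as a pitch change.

    coupling = 0.3 corresponds to a 3 % formant change per 10 % pitch
    change, one value within the 1-5 % per 10 % range stated in the text.
    """
    rel_pitch_change = (f0 - f0_ref) / f0_ref
    return f_formant * (1.0 + coupling * rel_pitch_change)

# A 10 % pitch rise (100 -> 110 Hz) raises a 500 Hz formant by 3 % to 515 Hz
f_new = pitch_coupled_formant(500.0, 110.0, 100.0)
```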
• the spectral movement of the formants towards the central plosive frequency centre of gravity or maximum promotes naturalness and clarity and can also be introduced with this method via the formant correction function that was previously responsible for the pitch adjustment.
• the invention relates to a method for controlling the adaptation of a hearing aid, in particular a hearing aid whose function is based on the above-described method, wherein the hearing aid has a filter bank for spectrally selective amplification and dynamic compression of audio signals; to compensate a hearing deficit of a hearing aid wearer, a test signal is generated by means of a signal source and the perception of the test signal is evaluated by the hearing aid wearer.
  • Modern hearing aids allow in principle to compensate for hearing deficits individually.
  • a multiplicity of parameters of the hearing aid have to be set and precise checks carried out. These are the amplification and compression parameters of the filters of the hearing aid filter bank responsible for the various spectral ranges.
• the necessary time or the necessary means for sufficient checking and adjustment of the hearing aid are often not available. It is noted that the quality of fitting methods in the recent past could not keep pace with the development of hearing aid technology and, in general, the technical processing possibilities of audio signals, in particular of speech signals. Therefore, suboptimal fitting results are often found.
  • a suboptimal adaptation of a hearing aid to the individual hearing deficit of the hearing device wearer has unacceptable effects on the communication ability of the hearing aid wearer, especially if there are high-grade sensory damage with severely restricted dynamic ranges. For such damage, the fitting criteria must not be aligned with a general compensation of gain factors in the spectrum. Instead, the adaptation must be aimed specifically at restoring the voice communication ability in the most important conversation situations (possibly with the respective interference environment).
• the invention provides a method for controlling the adaptation of a hearing aid in which the test signal comprises at least one natural or nature-like speech element which is spectrally filtered or selected in such a way that the spectrum of the test signal corresponds to the spectral range of at least one filter of the filter bank of the hearing aid.
  • the review and adjustment of the hearing aid can be done according to the invention with filtered (natural) language material.
• speaker-independent test signals can be used. These must, however, be produced to sound as natural as possible.
  • the above-described speech synthesis method according to the invention is particularly well suited.
• speech elements that are known to be problem cases anyway are used for checking and adjusting the hearing aid. These are elements articulated too weakly, e.g. /ch, s, f, sch/, elements articulated too strongly, e.g. /a, ä/, or very short and weak elements.
• the selection of the vocalic speech elements as test signals, or their spectral filtering, is carried out according to the invention so that the test signals cover critical transmission areas of the spectrum, so that specific conclusions about suboptimally set parameters of the hearing aid's filter bank can be drawn from the hearing aid wearer's assessment of the perception of the corresponding test signals.
  • the spectrum of the test signals corresponds to the spectral range of at least one filter of the filter bank of the hearing device.
• the test signals have a certain spectral concentration, so that it can be concluded in a targeted manner which device parameters or which parameter groups of the respective hearing aid are not set optimally.
  • the invention does not necessarily require that the test signals are matched with respect to their spectrum 1: 1 to the spectral configuration of the filter bank of the hearing aid. It is important that the test signals are still perceived as a language even after filtering the underlying language elements.
• the entirety of the (essential, representative) speech features that can be mapped into the residual hearing range at practical distances in the various communication situations must really also be provided with a pleasant, usable loudness, i.e. a loudness that produces distinctness and clarity.
  • This must be checked in the relevant communication situations.
• the generation of the test signals and the corresponding evaluation by the hearing aid wearer must be modeled on partner speech at standard communication distances of 0.5 to 2 m, preferably 1 m.
  • the corresponding levels of the test signals can be determined by natural sound signal pressure level measurements.
• the distance between the microphone of the hearing aid (e.g. behind the ear) and the mouth must be simulated for one's own speech as the basis for a good check of self-articulation.
• situation-dependent test signals for speakers at greater distances, e.g. in lecture situations, can also be used. The hearing aid wearer assesses the perception of the respective test signals and preferably gives a graduated evaluation, indicating e.g. whether the respective test signal is perceived as too loud, loud, pleasant, quiet or too quiet. As a quantitative measure of the individual quality of the adaptation, the totality of the usable distance ranges and their intersection can then be considered, in which a cognitively utilizable transmission of the speech elements is achieved.
  • the central feature of the invention is thus the use of test signals which are speech elements and at the same time are spectrally concentrated.
  • the test signals should be made testable at user controllable variable distances to ensure good matching in all relevant communication situations.
  • the spectral concentration of the test signals allows a targeted adjustment of the hearing aid in accordance with the evaluation of the test signals by the hearing aid wearer. For this purpose, a plurality of test signals must be generated, correspondingly covering the different spectral ranges of the filter bank of the hearing aid.
• the test signals are on the one hand speech elements, i.e. have speech character, and at the same time have the greatest possible spectral concentration.
• a major problem with the fitting of a hearing aid is that large level differences in one and the same spectral range must be dynamically mapped correctly. This applies in particular to the second formants of /i/ and /ä/. It is especially important to check whether spectrally high, weak features are adequately processed at an acceptable distance so that they can be heard loudly enough. On the other hand, it has to be examined whether loudness levels in the same spectral range become unpleasantly loud or entail an unacceptably strong masking of neighboring phonemes.
• the furthest and the shortest distance that does not yet lead to disturbances of perception should be determined.
  • the respectively same test signal should be generated repeatedly with a different volume, the characteristic curves of the spectrally selective dynamic compression of the hearing aid being adjusted in accordance with the evaluation by the hearing aid wearer.
• in a further method step, test signals are generated that correspond to natural fricatives.
• Such test signals are spectrally far extended, with feature energies that often lie far beyond the usable residual hearing range, so that only very weak residual energies fall into the residual hearing range. It must be examined whether these are made sufficiently usable cognitively. Again, the furthest and the shortest distance that does not yet lead to loss of perception should be determined.
• specific questions may need to be asked, e.g.: Which of these sounds are audible at all? How distinguishable are they? Frequently, the maximum possible amplification in the upper frequency bands of the hearing aid limits the usable distance for fricatives to distances that are too short.
• with unfavorable constructions or leaks in the otoplasty, feedback whistling begins even at low gains, which are too low for adequate amplification of fricative energies.
  • another hearing aid must be selected or the acoustic adjustment and tightness of the earmould must be improved.
  • the assessment by the hearing device wearer may also result in additional requirements for the technique of the hearing aid, such as an additional selective speech feature enhancement or a spectral transposition.
• Plosives should also be tested according to the invention in a further process step.
• These speech elements are short-term stimuli of impulsive character with coarticulatively distributed features, some of which have very low levels, i.e. they often get lost in environmental noise.
• the spectra of plosives are extensive. This in turn means that, with high-frequency hearing losses, large parts fall outside the residual hearing range and thus cannot be utilized.
  • the evaluation questions are therefore similar to fricatives. The evaluation may indicate that a higher fundamental gain in high frequency bands (> 2000 Hz) is required. If necessary, a hearing aid with spectral transposition must be used.
  • test signals are generated in a further method step which correspond to different vowels with high second formants, wherein the hearing device wearer assesses the distinctness of the test signals.
• Two-formant vowels with high-lying second formants, e.g. /Y, i, e/, are used.
• an individually poorly adapted dynamic characteristic in the critical spectral range and a missing tuning of a limiting function can lead to serious mismatch. Excessive resonances in the earmold can also shift category boundaries.
• the resulting unstable feature transfer leads to poor distinguishability of these vowels among one another and also to confusion with /u/.
  • the invention makes it possible to directly check whether a spectral increase in the energies of the second and third formants of the critical vowels would improve their perceptibility. This can be implemented directly in a corresponding setting of the parameters of the hearing aid. According to the evaluation by the hearing aid wearer, the dynamic characteristics in the spectral regions of the high lying formants can be adjusted accordingly.
• the vowel energies are the carriers of speech rhythm or segmental stress. So-called recruitment, i.e. abnormal loudness increase in sensory damage, alters the natural perception of stress and rhythm. With strong variation of the threshold and the dynamics as a function of spectral location, the transmission of rhythm is strongly distorted and requires a transformation to a constant perceptual measure for constantly articulated rhythm strength. This is achieved by a spectrally correspondingly different characteristic slope of the compression of the relevant vowel feature signals in the region of the dominant rhythm transmission (from about 250 to about 1400 Hz). In addition, the transmission of level differences is less critical. To achieve this, according to the invention, pairs of vowel-type test signals can be used. The rhythm intensity perceived by the hearing aid wearer should be approximately the same for the test signals used up to frequencies of approximately 1400 Hz. In case of deviations, the characteristic slope should be adjusted in the affected spectral range.
  • the transmission of the essential speech atoms in the residual hearing range must be ensured even if there are disturbances due to environmental noise. Therefore, ensuring sufficient suppression of ambient noise is equally indispensable.
• in the method according to the invention, interference noise signals can be generated simultaneously with the test signals.
• the interference noise signals can be generated from a non-frontal direction relative to the hearing aid wearer. In this way, the effectiveness of the directivity of the hearing aid can be checked.
  • parts of the useful signal are disadvantageously changed, since a true spectral separation of speech feature signals and noise signals may not be possible. This depends on the individual environmental noise to which the respective hearing device wearer is exposed.
  • the method according to the invention makes it possible to intentionally bring about a reduction in the distance between the microphone of the hearing device and the speaker mouth while at the same time reducing the gain, so that the effective interference level is lowered. It is determined whether there is a usable distance range in which all speech features are transmitted undisturbed. The hearing aid wearer can learn from this, which distance he must comply with in the presence of appropriate ambient noise to his interlocutor in order to ensure optimum intelligibility.
• the invention relates to an apparatus for processing acoustic speech signals, with an electronic processing device, wherein the processing device is adapted for the class-specific processing of the speech signals and has means with which a temporal extension of weakly articulated sounds can be realized.
  • the device makes it possible to realize the method described above, according to which an individual emphasis can be made on slightly articulated sounds, this emphasis not being based on amplification of the sounds but on a temporal extension thereof.
  • the device has a filter device by means of which the speech signals can be split into high-frequency frequency bands lying above an upper limit frequency and into frequency bands lying below the upper limit frequency.
• the high-frequency frequency bands can then be shifted individually, by means of frequency band shifting units, into the useful hearing range below the upper limit frequency.
• the frequency bands lying below the upper limit frequency can be provided individually with a pre-emphasis by means of filter units of the filter device.
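The spectral effect of shifting a high-frequency band into the useful hearing range can be illustrated by ring modulation (one common shifting technique used here only for illustration; the patent's shifting units operate differently, with sampling modification):

```python
import numpy as np

def shift_band_down(band, fs, shift_hz):
    """Shift a band-limited signal in frequency by ring modulation.

    Multiplying by a cosine at shift_hz creates images at f - shift_hz and
    f + shift_hz; a post-filter (cf. post-filters 14, 15) would keep only
    the downshifted image inside the useful hearing range.
    """
    t = np.arange(len(band)) / fs
    return band * np.cos(2 * np.pi * shift_hz * t)

fs = 16000
t = np.arange(fs) / fs
band = np.sin(2 * np.pi * 6000 * t)        # component above the limit frequency
shifted = shift_band_down(band, fs, 4000)  # downshifted image at 2000 Hz
```

After post-filtering, only the 2000 Hz image remains in the output band.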
  • the device expediently has a sound class selector with which a speech signal can be assigned a specific sound class.
• the temporal extension of the weakly articulated sounds preferably takes place sound-specifically.
• the frequency bands are individually compressible, this compression also being controlled as a function of the sound class assigned to the respective speech signal.
• an apparatus for processing acoustic signals has an electronic processing device, wherein the processing device is set up to replace weakly articulated sounds by corresponding synthetic sounds.
  • the invention relates to a speech synthesizer having means for generating two or more formant waveforms each by modulating a source signal oscillating at a formant frequency with an envelope function, means for adding the two or more formant waveforms, and means for superimposing and concatenating the added formant waveforms according to a pitch interval length to a speech signal.
  • the speech synthesizer is adapted to carry out the above-described synthesis method in which two or more formant waveforms are respectively generated by modulating a source signal oscillating at a formant frequency with an envelope function; the two or more formant waveforms are added and the added formant waveforms are concatenated to a suprasegmental speech signal according to a pitch interval length and according to suprasegmental chaining rules.
  • the source signals are frequency-modulated in the generation of the formant waveforms.
  • the above-described method for controlling the adaptation of a hearing aid can be very easily applied by the hearing aid wearer himself.
• Specialist staff is not mandatory. All that is required is a suitable arrangement comprising a personal computer, an audio interface connected to the personal computer, and at least one loudspeaker connected to the audio interface (e.g. via an amplifier).
  • a corresponding computer program for the personal computer makes it possible to carry out the method described above.
  • Important for the reproducibility of the perception of the test signals according to the invention is an approximately linear output frequency response of the device.
• Low-cost active PC speakers typically have unacceptable variations in frequency response, requiring electronic compensation.
• the required frequency response can be achieved with little effort by means of a linearization filter implemented in software. For this, e.g. an FIR filter with constant group delay can be used.
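A linear-phase (constant group delay) FIR linearization filter can be sketched by frequency sampling (the correction curve below is a hypothetical example; in practice it would be the inverse of the measured loudspeaker response):

```python
import numpy as np

def linearization_fir(freqs, gains_db, fs, n_taps=255):
    """Linear-phase FIR equalizer designed by frequency sampling.

    freqs / gains_db describe the desired correction (hypothetical values
    below). The odd tap count and even symmetry yield a constant group
    delay of (n_taps - 1) / 2 samples.
    """
    grid = np.linspace(0.0, fs / 2.0, n_taps // 2 + 1)
    target = 10.0 ** (np.interp(grid, freqs, gains_db) / 20.0)
    # Inverse real FFT of the zero-phase target gives an even-symmetric
    # impulse response; roll centers it and a window smooths the response
    h = np.fft.irfft(target, n=n_taps)
    return np.roll(h, n_taps // 2) * np.hamming(n_taps)

# Hypothetical correction: boost 6 dB below 200 Hz, flat above 400 Hz
h = linearization_fir([0, 200, 400, 8000], [6, 6, 0, 0], fs=16000)
```

The symmetric impulse response guarantees the constant group delay mentioned above.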
  • a microphone connected to the audio interface can be used as a reference for calibration of the linearization filter.
  • the microphone should have as linear a frequency response as possible.
  • Commercially available highly linear electret microphones are suitable.
• the electret microphone can be calibrated, e.g. in conjunction with a calibrated signal source.
• simple personal computers, e.g. laptops, and loudspeakers can be used to carry out the method according to the invention. Care should be taken that the loudspeakers have sufficient power to provide enough low-distortion reserve for higher test signal levels.
  • Figure 1 a schematic representation of an embodiment of a device according to the invention
  • Figure 2 is a schematic representation of another embodiment of a device according to the invention.
• Figure 3 an embodiment for the synthesis of the replacement "s" and the replacement "ch";
  • FIG. 5 shows the frequency modulation according to the invention of the source signal during the generation of a formant waveform.
• FIG. 6 spectrum and time signal of the test signal /u/ according to the invention;
• FIG. 7 test signal /o1/;
• FIG. 8 test signal /o2/;
• FIG. 9 test signal /a/;
• FIG. 11 test signal /ü/;
• FIG. 12 test signal /i/;
• FIG. 13 test signal /a/;
• FIG. 14 test signal /e/;
  • FIG. 16 test signal / kh /, time signal and spectrum
  • FIG. 17 test signal / t-e / (time signal);
  • FIG. 18 test signal / e-t-e / (time signal);
  • FIG. 19 test signal / i-e / (time signal);
  • FIG. 20 test signal / sch-f-ch-s / (time signal and spectrum of / s /);
  • Figure 21 Schematic representation of the inventive arrangement for controlling the adaptation of a hearing aid
  • FIG. 22 arrangement with microphone
  • Figure 23 Arrangement for controlling the directivity of a hearing aid.
• the embodiment of the device 1 shown in FIG. 1 has a filter device 2 by means of which the incoming acoustic speech signals 3 are split into frequency bands FB1, FB2 and FB3 lying below the upper limit frequency and into high-frequency frequency bands FB4 and FB5 lying above the upper limit frequency.
• the illustrated upper area 4 of the filter device 2 serves to process the frequency bands FB1, FB2 and FB3 of the speech signals 3 which lie below the upper limit frequency and are not to be shifted.
• the lower area 5 of the filter device 2 filters out of the incoming speech signals 3 the high-frequency frequency bands FB4 and FB5 which lie above the upper limit frequency and are to be shifted into the useful hearing range below the upper limit frequency.
• the apparatus 1 further comprises a pitch synchronizer 6, which serves to synchronize the windowing of the frequency band shifting units 7, taking into account the phase of the envelope of the speech signals 3, via the control line 8. Furthermore, the device 1 has a sound class selector 9 which assigns a received speech signal 3 to a specific sound class. The result of this assignment is used to control other components of the device 1 via control lines 10, 11 and 12, as described below.
  • One of these components of the device 1 is a frequency shift module 13, which in this embodiment has two programmable frequency band shifting units 7.
• the frequency band shifting units 7 preferably operate with sampling modification.
• the spectrum generated by each frequency band shifting unit 7 is limited by a downstream post-filter 14, 15. These are designed as bandpass filters which limit the shifted signal in the spectral range and prevent physiological loudness limits from being exceeded.
• the output signal of the post-filters 14, 15 is in each case switched on or off by a combiner 16 comprising an adapted soft switch. This switching is controlled by the sound class selector 9, via the control line 10, as a function of the sound class assigned to the respective speech signal 3.
• the device 1 further comprises a means 17 for the sound-specific temporal extension of weakly articulated sounds.
• This means samples the signal stream arriving at it from the combiner 18 with overlapping windows, stores the window contents and outputs them again after a predetermined delay, e.g. between 2 and 10 ms, adding them to the input signal stream.
• the delay and addition operations can be performed in parallel with multiple delay times. In the exemplary embodiment shown, the signal is delayed by 4 ms and added in each case. Different modes of operation can be used for different sound classes; this too is controlled by the sound class selector 9 via the control line 11.
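The delay-and-add operation of means 17 can be reduced to the following sketch (the overlapping-window bookkeeping is omitted; the 4 ms delay is the value of the embodiment):

```python
import numpy as np

def delay_and_add(x, fs, delay_ms=4.0, gain=1.0):
    """Extend a sound in time by delaying the signal and adding it to itself.

    A minimal sketch: 4 ms is the delay used in the embodiment; the
    overlapping-window storage of means 17 is omitted here.
    """
    xa = np.asarray(x, dtype=float)
    d = int(round(delay_ms * 1e-3 * fs))
    y = xa.copy()
    y[d:] += gain * xa[:-d]
    return y

fs = 16000
x = np.concatenate([np.ones(80), np.zeros(200)])  # 5 ms burst, then silence
y = delay_and_add(x, fs)  # burst energy now extends 4 ms (64 samples) longer
```

The weak sound is thereby emphasized by temporal extension rather than by amplification, as described above.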
• the compressor unit 19 is a 3-band compressor with one compressor unit K1, K2 or K3 per band and three different time constants per band, the time constants of each band being adapted to the speech characteristics. There is a slow gain setting identical in all bands, a medium-fast syllabic compression, and a fast limiting with different speed characteristics. All work with "look-ahead" technology, avoiding transient spikes; an attenuated feedback from the second to the first band and from the third to the second band counteracts the physiological "upward spread of masking".
• a pre-amplification of the frequency bands to be compressed takes place before their compression; for this purpose, the compressor units K1, K2 and K3 are controlled individually by the control device 20 via the control lines 21.
• the control device 20 itself is controlled via the control line 12 as a function of the sound class assigned to the respective speech signal 3 by the sound class selector 9.
• the output signals of the individual compressor units K1, K2 and K3 are combined with one another by means of the combiner 22 and supplied to an equalizer 23, which generates the output speech signal 24 of the device 1.
  • FIG. 2 schematically shows a further exemplary embodiment of a device 201 according to the invention.
  • This device has a processing unit 202 for processing the incoming speech signals 203 in the remaining hearing range.
  • This processing device 202 has a plurality of compressor units with different compression characteristics in order to be able to process the incoming speech signals 203 individually according to the particular hearing impairment present.
  • speech signal features are filtered out by means of a feature extractor 204.
  • the extracted speech signal features are then output to the classifier 205, with which sound prototypes stored in a training database 206 are associated with the speech signals 203.
  • a soft switch 207 is used, with which synthetic sounds corresponding to the weakly articulated sounds can be added via the linker 208 to the speech signals processed by the processing device 202.
  • the synthetic sounds are generated in a synthesizer 208 and then compressed and modulated by a processor 210.
  • the compression and modulation takes place as a function of the recognized speech signal features, in that the feature extractor 204 controls the processing device 210 correspondingly via the control line 211.
  • FIG. 3 schematically shows an embodiment for the synthesis of the replacement sound /s/ and the replacement sound /ch/.
  • the incoming speech signals 213 are split and used in the upper illustrated branch to produce a frequency modulated sine signal and in the lower branch to generate a noise signal.
  • the speech signal 213 first passes through a bandpass filter 214 which has a sound-specific passband.
  • the bandpass-filtered speech signal is then fed to a zero-crossing counter 215 in order to obtain the instantaneous frequency from the count of zero crossings. This instantaneous frequency is used to determine the frequency centroid of the speech signal, which is used for the modulation of the replacement sound, i.e. for ideally adapting it to the weakly articulated sound to be replaced.
  • the speech signal is fed to a sine wave generator 216 with which the desired sine signal to be superimposed on the noise signal is generated.
  • This sinusoidal signal is then linked to the noise signal of the noise generator 217 via a linker 218.
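The instantaneous-frequency estimate delivered by a zero-crossing counter such as counter 215 can be sketched as follows (a numpy sketch; the frame-based formulation is an assumption about how the counter is read out):

```python
import numpy as np

def zero_crossing_frequency(frame, sample_rate):
    """Estimate the dominant frequency of a band-limited frame from its
    zero crossings: each period contributes two sign changes, so
    f ~= crossings * fs / (2 * N)."""
    signs = np.sign(frame)
    crossings = np.count_nonzero(np.diff(signs) != 0)
    return crossings * sample_rate / (2.0 * len(frame))
```

After bandpass filtering, this estimate approximates the frequency centroid that is used to modulate the replacement sound.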
  • the lower branch first uses a bandpass filter 219, by means of which a sound-specific frequency range is filtered out of the speech signal 213.
  • This bandpass filtered speech signal is then fed to a device 220, which forms the moving average.
  • the resulting original energy is then transformed by the compression characteristic 221 or 222, respectively, according to the new spectral location.
  • the transformed speech signals are then combined in the lower branch with the noise signal of the noise generator 223 via the linker 224.
  • the transformed speech signal of the compression characteristic 221 is linked via the linker 225 to the speech signal linked by the linker 218.
  • the further linker 226 links the speech signal generated by the linker 224 to the speech signal generated by the linker 225 and feeds the result to a soft switch 227 which corresponds to the soft switch 207 of FIG. 2.
  • the soft switch 227 is controlled in a speech-feature-dependent manner via the feature extractor 204 of FIG. 2 and the control line 211, so that when a weakly articulated sound occurs, it is replaced by the corresponding synthetic sound.
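The moving-average energy measurement (device 220) and the subsequent compression characteristic can be sketched as follows. The window length and the power-law form of the characteristic are illustrative assumptions; the patent only states that the original energy is transformed according to the new spectral location:

```python
import numpy as np

def moving_average_envelope(signal, window=64):
    """Moving average of the rectified signal (cf. device 220/230)."""
    kernel = np.ones(window) / window
    return np.convolve(np.abs(signal), kernel, mode="same")

def compression_characteristic(energy, exponent=0.5):
    """Illustrative power-law curve standing in for the compression
    characteristics 221/222: compresses the measured band energy before
    it modulates the noise carrier at the new spectral location."""
    return np.asarray(energy) ** exponent

# modulate a noise carrier with the compressed envelope
rng = np.random.default_rng(0)
speech_band = np.sin(2 * np.pi * 300 * np.arange(1600) / 16000)
envelope = compression_characteristic(moving_average_envelope(speech_band))
replacement = envelope * rng.standard_normal(len(speech_band))
```

The resulting `replacement` signal corresponds to the noise-branch output that is subsequently combined with the frequency-modulated sine branch.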
  • incoming speech signals 228 are filtered with a bandpass filter 229 having a passband between 100 and 800 Hz.
  • the filtered speech signal is fed to a device 230 for forming the moving average.
  • the speech signal emanating from this device 230 is split and fed to compression characteristics 231 and 232 for the transformation of the resulting original energy according to the new spectral position.
  • the speech signal processed by the compression characteristic 232 is linked via a linker 233 to the noise signal of the noise generator.
  • the speech signal of the upper compression characteristic 231 is also fed to a linker 235, which links these speech signals to plosive sounds stored in a device 236.
  • the speech signals combined in the linkers 233 and 235 are linked together by means of the linker 238 and fed to a soft switch 239 corresponding to the soft switch 207 shown in FIG. 2, which outputs a signal in response to the control via the control line 211.
  • FIG. 5, which illustrates the speech synthesis method according to the invention, shows in the upper part an envelope 301 of a formant waveform.
  • the formant waveform is generated by modulating a source signal oscillating at a formant frequency with the envelope function 301.
  • to the right and left of the envelope 301, the temporally preceding and following envelopes 302, 303 of other formant waveforms of the speech signal are shown dotted.
  • Such chained and superimposed waveforms together make up the synthesized speech signal.
  • the formant waveform consists of the temporally successive transient segment E, sustain segment H and decay segment A, which are generated according to the method described above, wherein the decay segment A of a preceding formant waveform overlaps the transient segment E of the following formant waveform, depending on the pitch interval length.
  • the two lower graphics show embodiments of functions with which the source signal is frequency-modulated in the generation of the formant waveform to prevent the occurrence of tonality.
  • the modulation depth x is about 10% in the embodiments shown.
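A single formant waveform with the tonality-avoiding frequency modulation can be sketched as follows. The raised-cosine envelope and the sinusoidal modulation function are simple stand-ins for the envelope 301 and the modulation functions of FIG. 5; only the roughly 10% modulation depth is taken from the description:

```python
import numpy as np

def formant_waveform(formant_hz, duration_s, sample_rate=16000,
                     fm_depth=0.10, fm_rate_hz=50.0):
    """Source signal oscillating at the formant frequency, frequency-
    modulated by about 10% to prevent tonality, and shaped by an
    envelope function (raised cosine as a simple stand-in)."""
    n = int(duration_s * sample_rate)
    t = np.arange(n) / sample_rate
    # instantaneous frequency wanders +/- fm_depth around the formant
    inst_freq = formant_hz * (1.0 + fm_depth * np.sin(2 * np.pi * fm_rate_hz * t))
    phase = 2 * np.pi * np.cumsum(inst_freq) / sample_rate
    envelope = 0.5 * (1.0 - np.cos(2 * np.pi * t / duration_s))
    return envelope * np.sin(phase)
```

Chaining such waveforms with an overlap between the decay segment of one and the transient segment of the next, as described above, then yields the synthesized speech signal.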
  • FIGS. 6 to 20 show, by way of example, test signals that can be used according to the invention for checking the adaptation of a hearing aid.
  • the individual fine-tuning of hearing aids with sine signals, narrowband sounds, word material and logatoms is not suitable for testing or adjusting an optimal transmission of speech.
  • the weakly articulated speech elements are not reproduced with sufficient quality or at a sufficient level.
  • test signals can be obtained from natural language recordings or through digital synthesis.
  • the test signals according to the invention are designed so that they are always perceived as natural language elements and can be named accordingly, even if they consist only of spectral parts of the same.
  • An essential feature of the invention is that the vowel-equivalent test signals are selected or spectrally filtered in such a way that adjustments of the filter banks of hearing aids can be made directly. In other words, it is important that natural or near-natural signals with spectrally concentrated feature energies are used as test signals.
  • Filtering rules may be established to produce vowel-like test signals suitable for use in accordance with the invention.
  • for example, the spectral range is divided into five sub-ranges: 250 to 400 Hz, 400 to 600 Hz, 700 to 1400 Hz, 1400 to 2000 Hz and 2000 to 3500 Hz. In these ranges, the formants of the vowels must be filtered out in different ways in order to be able to directly identify or avoid common fitting errors.
  • the second formant of / ä / would be shifted without limitation into regions beyond the discomfort limit.
  • a two-part characteristic curve with a suitably increasing actual passband and subsequent limitation is absolutely necessary.
  • An acceptable / ä / loudness must be set very precisely.
  • the vowel equivalent test signals of Figures 6 to 14 are each filtered out of the overall natural sound spectrum using phase linear FIR filters.
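A phase-linear (symmetric) FIR bandpass of the kind used to cut the vowel-equivalent test signals out of the natural sound spectrum can be sketched with a windowed-sinc design. The tap count and the Hamming window are illustrative choices, not taken from the patent:

```python
import numpy as np

def fir_bandpass(low_hz, high_hz, sample_rate, num_taps=511):
    """Linear-phase FIR bandpass as the difference of two windowed-sinc
    lowpasses. Symmetric taps guarantee exact phase linearity."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0

    def sinc_lowpass(cutoff_hz):
        return 2.0 * cutoff_hz / sample_rate * np.sinc(2.0 * cutoff_hz / sample_rate * n)

    return (sinc_lowpass(high_hz) - sinc_lowpass(low_hz)) * np.hamming(num_taps)

# e.g. the 250 to 500 Hz band of the /u/ test signal at 16 kHz
taps = fir_bandpass(250.0, 500.0, 16000)
```

Applying `np.convolve(vowel_recording, taps, mode="same")` to a natural vowel recording then yields the band-limited test signal.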
  • FIG. 6a shows the time signal and FIG. 6b shows the spectrum of the test signal /u/.
  • in the low-frequency range up to 1000 Hz it must be noted that strong noise is often present, with excitation frequencies below 200 Hz and a strong harmonic spectrum. Therefore, the lowest possible amplification should always be selected, so that the interfering noises have as little influence as possible on speech perception in the range above 1000 Hz. It follows that the transmission of the /u/ must be adjusted so that a quiet /u/ is just barely perceived with the lowest possible basic gain. For the control of the fitting of the hearing aid according to the invention, the effect of the /u/ is very important. All components of the /u/ that may be relevant to perception must be available. In the test signal shown in FIG. 6, the natural spectrum of the /u/ is bandpass-filtered between 250 and 500 Hz.
  • FIG. 7a shows the spectrum and FIG. 7b shows the time signal of the test signal / o1 /.
  • It is the open /o/, which, like the /u/, still lies in the range of low-frequency noise components and in a region where high levels can mask strongly upwards. Despite the possibly large dynamic range, care must be taken that no excessive loudness levels occur. The steepness of the characteristic curve must reflect optimal rhythm, and the horizontal limit must be set rather conservatively where the hearing aid wearer judges the test signal to be loud.
  • the first formant of the open / o / is filtered out to generate the test signal shown in FIG. 7 over a broadband between 250 and 700 Hz.
  • FIG. 8 shows the test signal /o2/ in a corresponding manner. It is the closed /o/. The same applies as for the open /o/. In accordance with the natural bandwidth, the first formant is filtered out broadband between 300 and 900 Hz.
  • FIG. 9 shows the test signal / a /. All variants of / a / have much higher levels relative to the neighboring vowel energies. Therefore, there is a danger that the / a / excitation leads to excessive loudness.
  • the test signal /a/ according to FIG. 9 is filtered extremely broadband in the range between 600 Hz and 1600 Hz, covering the two main formants. If the loudness of this complex is dynamically adjusted isophonically relative to the features of the other vowels, it can be assumed that excessive masking is prevented.
  • FIG. 10 shows the test signal / ö /.
  • the relatively weak feature energy of / ö / lies at the end of the / a / spectral range and can therefore be increased by appropriate amplification.
  • the energy is filtered out between 1100 and 1800 Hz, as shown in FIG.
  • FIG. 11 shows the test signal /ü/.
  • the feature energy of /ü/ is extremely weak and narrowband. Good suprathreshold audibility must be ensured by setting a suitable gain in this spectral range. At a subthreshold level the /ü/ becomes a /u/. Accordingly, to generate the test signal according to FIG. 11, the filter range is selected as 1750 to 2100 Hz.
  • FIG. 12 shows the test signal / i /.
  • the / i / can be even lower in level than the / ü / and therefore requires even more base gain.
  • the /i/ possesses not only one but two higher feature-carrying formants, which can amplify loudness summation through a broadening of critical bands. Both must therefore be taken into account in the control of the fitting of a hearing aid according to the invention.
  • the band filtering of the test signal /i/ according to FIG. 12 therefore takes place in the range from 2050 to 3300 Hz. The /ä/ poses problems.
  • the corresponding test signal is shown in FIG. 13.
  • the feature energy of the / ä / is filtered out in the range between 1000 Hz and 2600 Hz for the test signal in order to be able to take into account all the spectral components which are distributed around the position of the maximum and which generate loudness during the control. This is the only way to adequately adjust the limitation of the strong / ä / -Energies by suitably selecting the horizontal branches of the corresponding dynamic characteristics in this area.
  • FIG. 14 shows the spectrum and the time signal of the test signal /e/.
  • the /e/ has feature energy in the range of 1900 to 2600 Hz and is filtered accordingly. After prior adjustment of the /i/ and /ä/, the /e/ automatically falls within an adequate intermediate range of the dynamic characteristic curve. The subtleties of the slope in the main passband can still be fine-adjusted.
  • FIG. 15 shows by way of example the corresponding rhythm pair of the test signal / a /.
  • the first four test signals ( Figures 6 to 9) should produce a very similar and distinct rhythmic strength.
  • the following five signals ( Figures 10 to 14) should produce at least one rhythm that is just perceptible.
  • test signals shown in the further FIGS. 16 to 20 can additionally be used in the method according to the invention for the purpose of further refinement of the adaptation. It is about the evaluation of the perceptibility of plosive features and fricatives. Simple natural language recordings can be used.
  • FIG. 16a shows the time signal and FIG. 16b shows the associated spectrum of the test signal /kh/. It is a plosive burst with aspiration, which should be easy for the hearing aid wearer to perceive. The exact mechanisms of the spectral and temporal energy summation of the spectrally broad and temporally narrowly limited burst energy are largely unknown. The sensitivities of the damaged ear cannot be deduced from threshold measurements; it is therefore necessary to determine the perception directly with representative prototypes. For this purpose, test signals can be prepared from clearly articulated speech samples of the plosive bursts /p/, /t/ and /k/ with aspiration in the unvoiced /h/ context. Alternatively, synthetically generated prototypes can be used. To check the fitting of a hearing aid, it should be examined whether the hearing thresholds are well exceeded for the individual test signals. Furthermore, the distinguishability of the prototypes should be examined.
  • plosive-vowel logatoms are also required as test signals. Combinations with transitions in all spectral ranges should be tested. Particularly critical, however, are combinations with high-lying second formants, i.e. with /ü/, /e/, /i/, /ä/. FIG. 17 representatively shows the time signal of the test signal /t-e/. Using this test signal, the transmission can be finely adjusted especially in the critical range above 2000 Hz.
  • the logatoms with the inverted order may also be provided as test signals, e.g. /e-p/, /e-t/, /e-k/.
  • plosives can be produced with exaggerated articulation pressure. This can significantly increase the perceptibility of the plosive for persons with hearing deficits.
  • signals can be used as test signals in the method according to the invention, in which the plosive is embedded between two vowels, the first vowel being unstressed and the second emphasized.
  • FIG. 18 shows the time signal of the corresponding test signal /e-t-e/ (with emphasis on the second /e/).
  • Alternative test signals are e.g. /e-p-e/ and /e-k-e/.
  • in the case of high-frequency deficits, the frequency range above 2000 Hz is often not transmitted in sufficient quality.
  • the adjustment of the excitation energies of weak, high-lying second formants relative to the excitation energies of the first formants and the pitch harmonics may be inadequate.
  • the auditory image of the sound that results from the combined action of the first and second formants is blurred, or the effect of the second formant is absent, so that the perceptual images collapse into the image of the /u/.
  • the good perception of the individual formants in the spectrum must therefore be ensured.
  • a fine adjustment of the ratio of the energies of both formants, which takes into account the simultaneous influence of the two energies, is indispensable for adjusting or testing the best cognitively classifiable auditory images.
  • FIG. 19 shows by way of example the time signal of the test signal /i-e/.
  • Other possible test signals are /i-ü/, /i-u/ as well as /u-i-ü-e/.
  • test signals for checking the perception of fricatives can be used in the method according to the invention.
  • the perception of fricative energies, which are naturally characteristic in the higher spectral ranges, is severely disturbed by sensory high-frequency deficits. This may mean that with conventional, merely amplifying hearing aids, the /s/ and /ch/ are perceived so weakly that these sounds are practically unusable for the perception of running speech.
  • Transformed or spectrally transposed feature energies that replace natural feature energy must then be made available.
  • the /sch/ lies in a lower frequency range and, compared with the other fricatives, is produced at the highest articulation level. Good perceptibility can often be achieved here by sufficient amplification.
  • the /ch/ also has feature-carrying energies in the low-frequency range. However, these have very low levels, so that excessive amplification would be needed.
  • test signal / sch-f-ch-s / shown in FIG. 20 can be used.
  • FIG. 20a shows the time signal and FIG. 20b the spectrum of the component /s/.
  • the recognizability of the voiced fricatives can also be checked.
  • the balance between the low-frequency voiced portion and the high-frequency unvoiced portion of the voiced fricatives may be important for good distinguishability.
  • the voiced portion must not mask the unvoiced portion.
  • FIGS 21 to 23 show an arrangement for controlling the adaptation of a hearing aid according to the invention.
  • the arrangement comprises a personal computer 401 (a laptop) which is connected, for example via a USB interface, to an audio interface 402 of a conventional type.
  • an amplifier 403 with a control element 404 for gain adjustment is connected to the audio interface 402.
  • a loudspeaker 405 is connected to the amplifier 403.
  • the loudspeaker 405 is located in front of a hearing aid wearer 406.
  • the wearer carries a hearing aid 407 behind the ear.
  • an absorption funnel 408 may be used which consists of commercially available acoustic insulation mats.
  • the hearing device wearer 406 is located at a distance of preferably about 1 meter in front of the loudspeaker 405.
  • the inventive method described above is implemented by means of appropriate software.
  • the software can be operated by the hearing aid wearer 406 himself, so that the implementation of the method according to the invention requires no further specialist personnel.
  • FIG. 22 shows the arrangement with additional microphone 409. It is a highly linear electret microphone which serves to calibrate a linearization filter realized by software in the personal computer 401.
  • the linearization filter is required to linearize the frequency response of the loudspeaker when outputting the test signals.
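Such a linearization filter can be sketched as the frequency-sampled inverse of the measured loudspeaker magnitude response. This is a simplified illustration (magnitude-only inversion with a regularization floor for deep notches); the patent does not specify the design method:

```python
import numpy as np

def inverse_magnitude_filter(measured_mag, num_taps=255):
    """Linear-phase FIR whose magnitude response approximates the
    inverse of `measured_mag` (one value per bin from DC to Nyquist).
    Deep notches are floored to avoid excessive correction gain."""
    target = 1.0 / np.maximum(np.asarray(measured_mag, dtype=float), 1e-3)
    # build a Hermitian spectrum, take the real impulse response,
    # then shift and window it to obtain a causal symmetric filter
    spectrum = np.concatenate([target, target[-2:0:-1]])
    impulse = np.real(np.fft.ifft(spectrum))
    return np.roll(impulse, num_taps // 2)[:num_taps] * np.hamming(num_taps)
```

Convolving the test signals with this filter before playback then compensates the loudspeaker's frequency response, as measured with the calibration microphone 409.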
  • another loudspeaker 405' is located obliquely behind the hearing device wearer 406.
  • the loudspeaker 405' is used to generate an interference signal in order to check the directivity of the hearing aid 407 under test.
  • the standard measuring position of the speaker 405 'with respect to the speaker 405 is 115 °.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The present invention relates to a method for processing acoustic speech signals by means of an electronic processing device. The object of the invention is to improve the processing of acoustic speech signals over the prior art. To this end, phoneme-based processing of the speech signals is carried out, with weakly articulated phonemes being prolonged. The invention also relates to a method for processing acoustic speech signals in which weakly articulated phonemes are detected quickly and replaced by corresponding synthetic phonemes. A further aspect of the invention concerns a speech synthesis method in which at least two formant waveforms are generated by modulating a source signal oscillating at a formant frequency with an envelope function; the formant waveforms are summed, and the summed formant waveforms are concatenated into a suprasegmental speech signal according to a pitch interval length and suprasegmental concatenation rules. According to the invention, the source signals are frequency-modulated during the generation of the formant waveforms. Finally, the invention relates to a method for checking the fitting of a hearing aid, in which test signals are used that are natural or near-natural speech elements, selected or spectrally filtered so that the spectrum of the test signal corresponds to the spectral range of at least one filter of a filter bank of the hearing aid.
EP09808931A 2008-12-18 2009-12-18 Procédé et dispositif de traitement de signaux acoustiques vocaux Withdrawn EP2380171A2 (fr)

Applications Claiming Priority (9)

Application Number Priority Date Filing Date Title
DE102008062776 2008-12-18
DE102008062775 2008-12-18
DE102008063279 2008-12-29
DE102008063367 2008-12-30
DE102009018469A DE102009018469A1 (de) 2008-12-18 2009-04-22 Verfahren und Vorrichtung zum Verarbeiten von akustischen Sprachsignalen
DE102009018470A DE102009018470A1 (de) 2008-12-18 2009-04-22 Verfahren und Vorrichtung zum Verarbeiten von akustischen Sprachsignalen
DE102009032238A DE102009032238A1 (de) 2008-12-30 2009-07-08 Verfahren zur Kontrolle der Anpassung eines Hörgerätes
DE102009032236A DE102009032236A1 (de) 2008-12-29 2009-07-08 Sprachsyntheseverfahren
PCT/EP2009/009129 WO2010078938A2 (fr) 2008-12-18 2009-12-18 Procédé et dispositif de traitement de signaux acoustiques vocaux

Publications (1)

Publication Number Publication Date
EP2380171A2 true EP2380171A2 (fr) 2011-10-26

Family

ID=42236434

Family Applications (1)

Application Number Title Priority Date Filing Date
EP09808931A Withdrawn EP2380171A2 (fr) 2008-12-18 2009-12-18 Procédé et dispositif de traitement de signaux acoustiques vocaux

Country Status (2)

Country Link
EP (1) EP2380171A2 (fr)
WO (1) WO2010078938A2 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9686620B2 (en) 2012-03-02 2017-06-20 Sivantos Pte. Ltd. Method of adjusting a hearing apparatus with the aid of the sensory memory

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102011083736B4 (de) * 2011-09-29 2014-11-20 Siemens Medical Instruments Pte. Ltd. Verstärkungseinstellung bei einem Hörhilfegerät
EP3588984B1 (fr) * 2018-06-29 2022-04-20 Interacoustics A/S Système de validation d'appareils auditifs pour enfants à l'aide d'un signal vocal

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7065485B1 (en) * 2002-01-09 2006-06-20 At&T Corp Enhancing speech intelligibility using variable-rate time-scale modification
JP4759052B2 (ja) * 2005-06-27 2011-08-31 ヴェーデクス・アクティーセルスカプ 高周波数再生が強化された補聴器および音声信号処理方法
JP4946293B2 (ja) * 2006-09-13 2012-06-06 富士通株式会社 音声強調装置、音声強調プログラムおよび音声強調方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO2010078938A2 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9686620B2 (en) 2012-03-02 2017-06-20 Sivantos Pte. Ltd. Method of adjusting a hearing apparatus with the aid of the sensory memory

Also Published As

Publication number Publication date
WO2010078938A2 (fr) 2010-07-15
WO2010078938A3 (fr) 2010-12-29

Similar Documents

Publication Publication Date Title
DE10041512B4 (de) Verfahren und Vorrichtung zur künstlichen Erweiterung der Bandbreite von Sprachsignalen
DE102005032724B4 (de) Verfahren und Vorrichtung zur künstlichen Erweiterung der Bandbreite von Sprachsignalen
US5933801A (en) Method for transforming a speech signal using a pitch manipulator
EP3074974B1 (fr) Dispositif de correction auditive avec modification de la fréquence fondamentale
EP2364646A1 (fr) Procédé de test auditif
DE602004007953T2 (de) System und verfahren zur audiosignalverarbeitung
EP4017031A1 (fr) Système auditif à programmation spécifique d'un utilisateur
WO2010078938A2 (fr) Procédé et dispositif de traitement de signaux acoustiques vocaux
EP2584795B1 (fr) Procédé de détermination d'une ligne caractéristique de compression
DE102009032238A1 (de) Verfahren zur Kontrolle der Anpassung eines Hörgerätes
Alexander et al. Spectral tilt change in stop consonant perception
DE19525944C2 (de) Hörhilfe
DE102011006472B4 (de) Verfahren zur Verbesserung der Sprachverständlichkeit mit einem Hörhilfegerät sowie Hörhilfegerät
EP2394271B1 (fr) Procédé de séparation de cheminements de signaux et application de ce procédé pour améliorer la qualité de la voix d'un larynx électronique
DE102020210918A1 (de) Verfahren zum Betrieb einer Hörvorrichtung in Abhängigkeit eines Sprachsignals
EP3961624A1 (fr) Procédé de fonctionnement d'un dispositif auditif en fonction d'un signal vocal
EP3962115A1 (fr) Procédé d'évaluation de la qualité de parole d'un signal vocal au moyen d'un dispositif auditif
DE102009018469A1 (de) Verfahren und Vorrichtung zum Verarbeiten von akustischen Sprachsignalen
EP3834723A1 (fr) Procédé de détermination du seuil auditif d'un sujet humain
EP2506255A1 (fr) Procédé d'amélioration de l'intelligibilité de la parole avec un appareil auditif ainsi qu'appareil auditif
Munoz et al. Enhancement of Spectral Contrast in Speech for Hearing Impaired Listeners
DE102009018470A1 (de) Verfahren und Vorrichtung zum Verarbeiten von akustischen Sprachsignalen
Wheeler Perceptual weighting of acoustic cues to contrastive prosody for sentences in quiet and in noise
WO2002075721A1 (fr) Procede et dispositif de reconnaissance de la parole a compensation du bruit
DE102009032236A1 (de) Sprachsyntheseverfahren

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20110718

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO SE SI SK SM TR

DAX Request for extension of the european patent (deleted)
RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: PLINGE, AXEL

Owner name: BAUER, HANS-DIETER

RIN1 Information on inventor provided before grant (corrected)

Inventor name: BAUER, HANS-DIETER

Inventor name: PLINGE, AXEL

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20160701