EP3962115A1 - Verfahren zur bewertung der sprachqualität eines sprachsignals mittels einer hörvorrichtung - Google Patents

Verfahren zur bewertung der sprachqualität eines sprachsignals mittels einer hörvorrichtung (Method for evaluating the speech quality of a speech signal by means of a hearing device)

Info

Publication number
EP3962115A1
EP3962115A1 (Application EP21190918.9A)
Authority
EP
European Patent Office
Prior art keywords
signal
speech
determined
input audio
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP21190918.9A
Other languages
German (de)
English (en)
French (fr)
Inventor
Jana Thiemt
Marko Lugger
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sivantos Pte Ltd
Original Assignee
Sivantos Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sivantos Pte Ltd filed Critical Sivantos Pte Ltd
Publication of EP3962115A1
Legal status: Pending


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/15Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/84Detection of presence or absence of voice signals for discriminating voice from noise
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
    • H04R25/30Monitoring or testing of hearing aids, e.g. functioning, settings, battery power
    • H04R25/305Self-monitoring or self-testing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
    • H04R25/40Arrangements for obtaining a desired directivity characteristic
    • H04R25/405Arrangements for obtaining a desired directivity characteristic by combining a plurality of transducers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
    • H04R25/40Arrangements for obtaining a desired directivity characteristic
    • H04R25/407Circuits for combining signals of a plurality of transducers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
    • H04R25/43Electronic input selection or mixing based on input signal analysis, e.g. mixing or selection between microphone and telecoil or between microphones with different directivity characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
    • H04R25/50Customised settings for obtaining desired overall acoustical characteristics
    • H04R25/505Customised settings for obtaining desired overall acoustical characteristics using digital signal processing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2225/00Details of deaf aids covered by H04R25/00, not provided for in any of its subgroups
    • H04R2225/43Signal processing in hearing aids to enhance the speech intelligibility

Definitions

  • The invention relates to a method for evaluating the speech quality of a speech signal by means of a hearing device, wherein a sound containing the speech signal is picked up from the surroundings of the hearing device by means of an acousto-electrical input converter of the hearing device and converted into an input audio signal, and wherein at least one property of the speech signal is quantitatively detected by analyzing the input audio signal by means of signal processing.
  • An important task in the use of hearing devices often consists of outputting a speech signal as precisely as possible, i.e. in particular in as acoustically comprehensible a form as possible, to a user of the hearing device.
  • For this purpose, interfering noise is often suppressed in an audio signal which is generated from a sound containing a speech signal, in order to emphasize the signal components which represent the speech signal and thus improve its intelligibility.
  • However, algorithms for noise reduction can often reduce the sound quality of the resulting output signal: the signal processing of the audio signal can in particular produce artifacts, and/or the hearing impression is generally felt to be less natural.
  • Noise suppression is usually carried out using parameters which primarily relate to the noise or the overall signal, i.e., for example, a signal-to-noise ratio (SNR), a background noise level ("noise floor"), or a level of the audio signal.
  • Controlling the noise suppression by such parameters can ultimately result in noise suppression also being applied when it would not actually be necessary: namely when noticeable background noise is present, but the speech components remain easy to understand despite it.
  • In such cases, the risk of degrading sound quality, e.g. due to noise-reduction artifacts, is taken without real necessity.
  • Conversely, a speech signal with only little superimposed noise, such that the associated audio signal has a good SNR, can nevertheless have a low speech quality if the speaker articulates weakly.
  • the invention is therefore based on the object of specifying a method by means of which the quality of a speech component in an audio signal to be processed by a hearing device can be assessed objectively.
  • The invention is also based on the object of specifying a hearing device which is set up to objectively evaluate the quality of a speech component contained in an audio signal internal to the hearing device.
  • The first-mentioned object is achieved according to the invention by a method for evaluating the speech quality of a speech signal using a hearing device, in which an acousto-electric input converter of the hearing device records a sound containing the speech signal from the surroundings of the hearing device and converts it into an input audio signal; in which, by analysis of the input audio signal by means of signal processing, in particular signal processing of the hearing device and/or of an auxiliary device connectable to the hearing device, at least one articulatory and/or prosodic property of the speech signal is quantitatively detected; and in which a quantitative measure of the speech quality is derived as a function of the at least one articulatory or prosodic property.
  • Advantageous and partly inventive configurations are the subject matter of the subclaims and the following description.
  • The second-mentioned object is achieved according to the invention by a hearing device which comprises an acousto-electrical input converter and a signal processing device, in particular one having a signal processor, the acousto-electrical input converter being set up to record a sound from the surroundings of the hearing device and to convert it into an input audio signal, and the signal processing device being set up to quantitatively detect at least one articulatory and/or prosodic property of a speech signal component contained in the input audio signal by analyzing the input audio signal, and to derive a quantitative measure of the speech quality as a function of the at least one articulatory or prosodic property.
  • the hearing device according to the invention shares the advantages of the method according to the invention, which can be carried out in particular by means of the hearing device according to the invention.
  • the advantages mentioned below for the method and for its further developments can be transferred analogously to the hearing device.
  • An acousto-electrical input transducer includes, in particular, any transducer that is set up to generate an electrical audio signal from an ambient sound, so that air movements and air-pressure fluctuations at the location of the transducer caused by the sound are reproduced in the generated audio signal by corresponding oscillations of an electrical variable, in particular a voltage.
  • the acousto-electric input converter can be provided by a microphone.
  • The signal processing takes place in particular by means of a corresponding signal processing device, which uses at least one signal processor to carry out the calculations and/or algorithms provided for the signal processing.
  • the signal processing device is arranged in particular on the hearing device.
  • the signal processing device can also be arranged on an auxiliary device, which is set up for a connection with the hearing device for data exchange, e.g. a smartphone, a smartwatch, etc.
  • The hearing device can then, for example, transmit the input audio signal to the auxiliary device, where the analysis is performed using the computing resources provided by the auxiliary device; as a result of the analysis, the quantitative measure can then be transmitted back to the hearing device.
  • the analysis can be carried out directly on the input audio signal or using a signal derived from the input audio signal.
  • Such a derived signal can be given in particular by the isolated speech signal component, but also by an audio signal such as is generated in a hearing device by a feedback loop with a compensation signal for compensating acoustic feedback or the like, or by a directional signal which is generated on the basis of a further input audio signal of a further input converter.
  • An articulatory property of the speech signal includes in particular a precision of formants, especially of vowels, and a dominance of consonants, especially of fricatives and/or plosives.
  • the statement can be made that the higher the precision of the formants or the higher the dominance and/or precision of consonants, the higher the speech quality.
  • a prosodic property of the speech signal includes, in particular, a time stability of a fundamental frequency of the speech signal and a relative sound intensity of accents.
  • Sound generation usually involves three physical components of a sound source: a mechanical oscillator such as a string or membrane, which causes the air surrounding the oscillator to oscillate, an excitation of the oscillator (e.g. by plucking or bowing), and a resonator.
  • The oscillator is set into oscillation by the excitation, so that the air surrounding the oscillator is set into pressure oscillations by the oscillations of the oscillator, which propagate as sound waves.
  • In the mechanical oscillator, not only vibrations of a single frequency are excited, but vibrations of different frequencies, with the spectral composition of the propagating vibrations determining the sound pattern.
  • the frequencies of certain vibrations are often given as integer multiples of a fundamental frequency and are referred to as "harmonics" or overtones of this fundamental frequency.
  • more complex spectral patterns can also develop, so that not all frequencies generated can be represented as harmonics of the same fundamental frequency.
  • The resonance behaviour of the resonance chamber is also relevant for the sound image, since certain frequencies generated by the oscillator are often attenuated in the resonance chamber relative to the dominant frequencies of a sound.
  • In speech production, the mechanical oscillator is given by the vocal cords and its excitation by the air flowing from the lungs past the vocal cords, with the resonance chamber being formed primarily by the pharynx and oral cavity.
  • The fundamental frequency of a male voice is usually in the range of 60 Hz to 150 Hz, that of a female voice mostly in the range of 150 Hz to 300 Hz.
  • The formants, i.e. the frequency ranges amplified by resonances of the vocal tract, form independently of the fundamental frequency, i.e. the frequency of the fundamental oscillation.
  • The precision of formants is to be understood in particular as a degree of concentration of the acoustic energy in mutually distinguishable formant ranges, in particular at individual frequencies within the formant ranges, and the resulting ability to identify the individual vowels from the formants.
  • For the generation of consonants, the airflow flowing past the vocal cords is partially or completely blocked in at least one place, which among other things also creates turbulence in the airflow. For this reason, only some consonants can be assigned a formant structure similarly clear to that of vowels, while other consonants have a more broadband frequency structure.
  • Nevertheless, consonants can also be assigned specific frequency bands in which the acoustic energy is concentrated. Due to the rather percussive "noise quality" of consonants, these bands generally lie above the formant ranges of vowels, namely primarily in the range from approx. 2 to 8 kHz, while the ranges of the most important formants F1 and F2 of vowels generally extend up to approx. 1.5 kHz (F1) and 4 kHz (F2), respectively.
  • The precision of consonants is determined in particular from a degree of concentration of the acoustic energy in the corresponding frequency ranges and the resulting identifiability of the individual consonants.
  • The ability to distinguish the individual components of a speech signal does not depend on articulatory aspects alone. While these primarily concern the acoustic precision of the smallest isolated sound events of speech, the so-called phonemes, prosodic aspects also determine the speech quality, since intonation and accentuation, especially across several segments, i.e. several phonemes or phoneme groups, can imprint a particular meaning on an utterance, for example by raising the pitch at the end of a sentence to mark a question, by stressing a specific syllable of a word to distinguish between otherwise identically written words with different meanings, or by stressing a word to emphasize it.
  • A speech quality for a speech signal can therefore also be measured quantitatively on the basis of prosodic properties, in particular those just mentioned, by determining, for example, measures of a temporal variation in the pitch of the voice, i.e. its fundamental frequency, and of the clarity with which amplitude and/or level maxima stand out.
  • the quantitative measure for the speech quality can thus be derived on the basis of one or more of the named and/or further, quantitatively recorded articulatory and/or prosodic properties of the speech signal.
  • As the at least one articulatory property of the speech signal, preferably a parameter correlated with the precision of predetermined formants of vowels in the speech signal, a parameter correlated with the dominance of consonants, in particular fricatives, in the speech signal, and/or a parameter correlated with the precision of the transitions from voiced to unvoiced sounds is recorded.
  • The quantitative measure of the speech quality can then be given directly by said detected parameter, or formed on its basis, for example by weighting two parameters for different formants or the like, or by a weighted averaging of at least two of the different parameters mentioned.
  • The quantitative measure of speech quality thus refers to the speech production of a speaker, whose pronunciation can deviate from one perceived as "clean" through deficits (such as lisping or mumbling) up to outright speech errors, which reduce the speech quality accordingly.
  • The present measure of speech quality is in particular independent of external properties of a transmission channel, such as propagation in a possibly reverberant room or a noisy environment, and preferably depends only on the intrinsic properties of the speaker's speech production.
  • To detect the parameter correlated with the dominance of consonants, advantageously a first energy contained in a low frequency range is calculated, a second energy contained in a higher frequency range lying above the low frequency range is calculated, and the correlated parameter is formed using a ratio of the first energy and the second energy, and/or using such a ratio weighted by the respective bandwidths of the two frequency ranges.
  • the voice signal can be smoothed over time in advance.
  • the input audio signal can in particular be divided into the lower and the higher frequency range, e.g. by means of a filter bank and, if necessary, by means of a corresponding selection of individual resulting frequency bands.
  • the low frequency range is preferably selected in such a way that it lies within the frequency interval [0 Hz, 2.5 kHz], particularly preferably within the frequency interval [0 Hz, 2 kHz].
  • The higher frequency range is preferably selected in such a way that it lies within the frequency interval [3 kHz, 10 kHz], particularly preferably within the frequency interval [4 kHz, 8 kHz]. A minimal sketch of this band-energy ratio follows below.
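  • Purely as an illustration of this energy-ratio parameter, the following Python sketch computes the bandwidth-weighted quotient of high-band to low-band energy; the band edges, the filter order and the 50 Hz lower edge (a practical stand-in for 0 Hz so that a band-pass design can be used) are assumptions, not values prescribed by the text.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def band_energy(x, fs, f_lo, f_hi):
    """Energy of x restricted to [f_lo, f_hi] Hz via a Butterworth band-pass."""
    sos = butter(4, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    return float(np.sum(sosfilt(sos, x) ** 2))

def consonant_dominance(x, fs):
    """Bandwidth-weighted quotient QE = (E2 / bw2) / (E1 / bw1).

    Assumes fs > 16 kHz so that the 8 kHz band edge lies below Nyquist.
    """
    e1 = band_energy(x, fs, 50.0, 2000.0)     # low range: vowel/formant energy
    e2 = band_energy(x, fs, 4000.0, 8000.0)   # high range: fricative/plosive energy
    bw1, bw2 = 2000.0 - 50.0, 8000.0 - 4000.0
    return (e2 / bw2) / (e1 / bw1 + 1e-12)    # QE >> 1: dominant consonants
```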
  • Expediently, the parameter correlated with the precision of the transitions between voiced and unvoiced sounds is detected as follows: using a correlation measurement and/or a zero-crossing rate of the input audio signal, or of a signal derived from the input audio signal, a distinction is made between voiced and unvoiced time sequences; a transition from a voiced time sequence to an unvoiced time sequence, or from an unvoiced time sequence to a voiced time sequence, is determined; for at least one frequency range, the energy contained in the voiced or unvoiced time sequence before the transition is determined; for the at least one frequency range, the energy contained in the unvoiced or voiced time sequence after the transition is determined; and the parameter is determined on the basis of the energy before the transition and the energy after the transition.
  • In other words, the voiced and unvoiced time sequences of the speech signal are first determined in the input audio signal, and from these a transition from voiced to unvoiced, or from unvoiced to voiced, is identified.
  • For at least one frequency range, which is predetermined in particular on the basis of empirical findings about the precision of transitions, the energy before the transition is now determined in the input audio signal or in a signal derived from it. This energy can be taken, for example, over the voiced or unvoiced time sequence immediately preceding the transition.
  • the energy in the relevant frequency range is determined after the transition, e.g. via the unvoiced or voiced time sequence following the transition.
  • a characteristic value can now be determined on the basis of these two energies, which in particular enables a statement to be made about a change in the energy distribution at the transition.
  • This characteristic value can be determined, for example, as a quotient or a relative deviation of the two energies before and after the transition.
  • the characteristic value can also be formed as a comparison of the energy before or after the transition with the total (broadband) signal energy.
  • The energies can also be determined for a further frequency range before and after the transition, so that the characteristic value can also be determined using the energies before and after the transition in the further frequency band, e.g. as a rate of change of the energy distribution over the frequency ranges involved across the transition (i.e. a comparison of the distribution of the energies over both frequency ranges before the transition with the distribution after it).
  • The parameter correlated with the precision of the transitions, used for the measure of the speech quality, can then be determined on the basis of said characteristic value.
  • the characteristic value can be used directly, or the characteristic value can be compared with a reference value determined in advance for good articulation, in particular on the basis of corresponding empirical knowledge (e.g. as a quotient or relative deviation).
  • The concrete configuration, in particular with regard to the frequency ranges and the limit or reference values to be used, can generally be based on empirical results about the corresponding significance of the respective frequency bands or groups of frequency bands.
  • frequency bands 13 to 24, preferably 16 to 23, of the Bark scale can be used as the at least one frequency range.
  • A frequency range of lower frequencies can be used as a further frequency range; a minimal sketch of such a transition analysis follows below.
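  • A minimal sketch of the transition analysis, under simplifying assumptions: the voiced/unvoiced classification uses only the zero-crossing rate (the text also allows correlation measurements), broadband frame energy stands in for the empirically chosen per-frequency-range energies, and all thresholds and frame lengths are illustrative.

```python
import numpy as np

def short_time_frames(x, frame_len, hop):
    """Slice x into hopped frames of frame_len samples."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def transition_energy_change(x, fs, frame_ms=30.0, hop_ms=15.0, zcr_voiced_max=0.1):
    """Relative energy change at the first voiced-to-unvoiced transition found."""
    frame_len, hop = int(fs * frame_ms / 1000.0), int(fs * hop_ms / 1000.0)
    frames = short_time_frames(x, frame_len, hop)
    # Zero-crossing rate per frame: fraction of sign changes between samples;
    # voiced speech tends to have a low rate, unvoiced (noise-like) speech a high one.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    voiced = zcr < zcr_voiced_max
    energy = np.sum(frames ** 2, axis=1)
    for i in range(len(voiced) - 1):
        if voiced[i] and not voiced[i + 1]:
            ev, eu = energy[i], energy[i + 1]   # energy before/after the transition
            return (eu - ev) / (ev + 1e-12)     # relative change at the transition
    return None  # no voiced-to-unvoiced transition found in this excerpt
```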
  • For detecting the parameter correlated with the precision of predetermined formants of vowels in the speech signal, the acoustic energies of the speech signal concentrated in at least two different formant ranges are preferably compared with one another.
  • Expediently, a signal component of the speech signal is determined in at least one formant range in the frequency domain, a signal variable correlated with the level is determined for this signal component of the speech signal in the at least one formant range, and the parameter is determined on the basis of a maximum value and/or a time stability of the values of this level-correlated signal variable.
  • In particular, the frequency range of the first formant F1 (preferably 250 Hz to 1 kHz, particularly preferably 300 Hz to 750 Hz) or of the second formant F2 (preferably 500 Hz to 3.5 kHz, particularly preferably 600 Hz to 2.5 kHz) can be selected, or two formant ranges of the first and second formants are selected.
  • Several first and/or second formant ranges assigned to different vowels can also be used (i.e. the frequency ranges which are assigned to the first or second formant of the respective vowel).
  • The signal portion is now determined for the selected formant range or ranges, and a level-correlated signal variable of the respective signal portion is determined.
  • The signal variable can be given by the level itself, or by the maximum signal amplitude, suitably smoothed if necessary. On the basis of a time stability of the signal variable, which can in turn be determined via a variance of the signal variable over a suitable time window, and/or on the basis of a deviation of the signal variable from its maximum value over a suitable time window, a statement can now be made about the precision of formants: a small variance and a small deviation from the maximum level during an articulated sound indicate high precision (the length of the time window can be chosen in particular according to the duration of an articulated sound).
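  • A sketch of such a formant-stability analysis for a single formant range (here the preferred F1 range of 300 Hz to 750 Hz); the mapping of variance and maximum deviation to a single score is an illustrative choice, not taken from the text.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def formant_level_track(x, fs, f_lo=300.0, f_hi=750.0, frame_ms=10.0):
    """Short-time level in dB of the signal component inside one formant range."""
    sos = butter(4, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    y = sosfilt(sos, x)
    n = int(fs * frame_ms / 1000.0)
    frames = y[: (len(y) // n) * n].reshape(-1, n)
    rms = np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-12
    return 20.0 * np.log10(rms)

def formant_precision_score(levels_db):
    """Small variance and small distance of the mean level from the maximum
    over the window are taken to indicate precise formants (larger is better)."""
    variance = float(np.var(levels_db))
    max_dev = float(np.max(levels_db) - np.mean(levels_db))
    return 1.0 / (1.0 + variance + max_dev)
```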
  • the fundamental frequency of the speech signal is detected in a time-resolved manner, and a parameter that is characteristic of the time stability of the fundamental frequency is determined as a prosodic property of the speech signal.
  • This parameter can be determined, for example, based on a relative deviation of the fundamental frequency accumulated over time, or by detecting a number of maxima and minima of the fundamental frequency over a predetermined period of time.
  • The time stability of the fundamental frequency is particularly relevant to the monotony of the speech melody and to accentuation, which is why quantifying it also permits a statement about the speech quality of the speech signal.
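  • The following sketch estimates the fundamental frequency per frame with a plain autocorrelation maximum and accumulates its relative deviation over time, one of the options named above; the frame length and search range are assumptions.

```python
import numpy as np

def estimate_f0(frame, fs, f_min=60.0, f_max=300.0):
    """Per-frame f0 estimate from the autocorrelation maximum in the lag
    range corresponding to [f_min, f_max] (covering male and female voices)."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_lo, lag_hi = int(fs / f_max), int(fs / f_min)
    if lag_hi >= len(ac) or lag_lo >= lag_hi:
        return None
    lag = lag_lo + int(np.argmax(ac[lag_lo:lag_hi]))
    return fs / lag

def f0_variability(x, fs, frame_ms=40.0):
    """Accumulated relative deviation of f0 over time: small values indicate a
    monotonous, large values a lively speech melody."""
    n = int(fs * frame_ms / 1000.0)
    frames = x[: (len(x) // n) * n].reshape(-1, n)
    f0s = np.array([f for f in (estimate_f0(fr, fs) for fr in frames) if f is not None])
    return float(np.sum(np.abs(np.diff(f0s)) / f0s[:-1])) if len(f0s) > 1 else 0.0
```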
  • Preferably, a variable correlated with the volume, in particular an amplitude and/or a level, is recorded in a time-resolved manner for the speech signal, in particular by a corresponding analysis of the input audio signal or of a signal derived from it; a quotient of a maximum value of the volume-correlated variable to a mean value of said variable, determined over a predetermined period of time, is formed; and a characteristic variable depending on this quotient is determined as a prosodic property of the speech signal.
  • In this way, a statement about the clarity of the accentuation can be made on the basis of the indirectly recorded volume dynamics of the speech signal.
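  • A minimal sketch of this accentuation measure: the quotient of maximum to mean short-time RMS level over the analysed period, with the frame length as an assumption.

```python
import numpy as np

def accentuation_quotient(x, fs, frame_ms=10.0):
    """Quotient of maximum to mean short-time RMS level over the excerpt;
    a larger value indicates more clearly set accents."""
    n = int(fs * frame_ms / 1000.0)
    frames = x[: (len(x) // n) * n].reshape(-1, n)
    levels = np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-12
    return float(np.max(levels) / np.mean(levels))
```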
  • Advantageously, at least two parameters characteristic of articulatory and/or prosodic properties are determined on the basis of the analysis of the input audio signal, and the quantitative measure for the speech quality is formed on the basis of a product of these parameters and/or on the basis of a weighted average and/or a maximum or minimum value of these parameters. This is particularly advantageous when a single measure of speech quality is required or desired, or when a single measure is intended to capture all articulatory or all prosodic properties.
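  • A sketch of such a combination, mirroring the fusion options named above (weighted average, product, maximum or minimum); the concrete weights would come from empirical tuning and are placeholders here.

```python
import numpy as np

def combined_quality_measure(params, weights=None, mode="weighted_mean"):
    """Fuse several articulatory/prosodic parameters into one quality measure."""
    p = np.asarray(params, dtype=float)
    w = np.ones_like(p) if weights is None else np.asarray(weights, dtype=float)
    if mode == "weighted_mean":
        return float(np.sum(w * p) / np.sum(w))
    if mode == "product":
        return float(np.prod(p))
    if mode == "max":
        return float(np.max(p))
    if mode == "min":
        return float(np.min(p))
    raise ValueError(f"unknown mode: {mode}")

# Example with four parameters and equal (placeholder) weights:
# combined_quality_measure([0.8, 0.6, 0.7, 0.9], weights=[1, 1, 1, 1])
```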
  • Expediently, the analysis of the speech quality of the speech signal can be limited to those cases in which a speech signal is actually present, or in which the SNR lies above a predetermined limit value, so that it can be assumed that the signal components of the speech signal can be recognized in the input audio signal sufficiently well to permit a meaningful assessment in the first place.
  • the hearing device is preferably designed as a hearing aid.
  • The hearing aid can be a monaural device, or a binaural device with two local units to be worn by the user of the hearing aid on the right and left ear.
  • The hearing aid can also have at least one further acousto-electrical input converter, which converts the sound of the environment into a corresponding further input audio signal, so that the quantitative detection of the at least one articulatory and/or prosodic property of a speech signal can take place by analyzing a plurality of the input audio signals involved.
  • In particular, two of the input audio signals used can each be generated in different local units of the hearing device (that is to say, one on the left and one on the right ear).
  • The signal processing device can in particular comprise the signal processors of both local units, with the locally generated measures for the speech quality preferably being combined for both local units in a suitable manner, e.g. by averaging or by taking a maximum or minimum value, depending on the articulatory and/or prosodic property considered.
  • In figure 1, a hearing device 1 is shown schematically in a circuit diagram; in the present case it is embodied as a hearing aid 2.
  • The hearing aid 2 has an acousto-electric input converter 4, which is set up to convert a sound 6 from the surroundings of the hearing aid 2 into an input audio signal 8.
  • An embodiment of the hearing device 2 with a further input converter (not shown), which generates a corresponding further input audio signal from the sound 6 of the environment, is also conceivable here.
  • the hearing device 2 is designed as a stand-alone, monaural device. Equally conceivable is an embodiment of the hearing device 2 as a binaural hearing device with two local devices (not shown), which are to be worn by the user of the hearing device 2 on his right and left ear.
  • The input audio signal 8 is fed to a signal processing device 10 of the hearing aid 2, in which the input audio signal 8 is processed in accordance with the audiological requirements of the user of the hearing aid 2, for example amplified and/or compressed per frequency band.
  • The signal processing device 10 is provided in particular by means of a corresponding signal processor (not shown in detail in figure 1) and a main memory addressable by the signal processor. Any pre-processing of the input audio signal 8, such as A/D conversion and/or pre-amplification of the generated input audio signal 8, is to be regarded as part of the input converter 4 in this case.
  • By processing the input audio signal 8, the signal processing device 10 generates an output audio signal 12, which is converted into an output sound signal 16 of the hearing aid 2 by means of an electro-acoustic output converter 14.
  • The input transducer 4 is preferably provided by a microphone; the output transducer 14 is provided, for example, by a loudspeaker (such as a balanced-armature receiver), but can also be provided by a bone-conduction transducer or the like.
  • The sound 6 in the surroundings of the hearing aid 2 that is detected by the input transducer 4 includes, among other things, a speech signal 18 from a speaker (not shown in detail) and further sound components 20, which can include in particular directed and/or diffuse background noise (interfering noise), but can also contain noises which, depending on the situation, could be regarded as a useful signal, for example music or acoustic warning or information signals relating to the environment.
  • The signal processing of the input audio signal 8 that takes place in the signal processing device 10 to generate the output audio signal 12 can in particular include a suppression of the signal components representing the background noise contained in the sound 6, or a boost of the signal components representing the speech signal 18 relative to the signal components representing the further sound components 20.
  • a frequency-dependent or broadband dynamic compression and/or amplification as well as algorithms for noise suppression can also be used here.
  • In the signal processing device 10, a quantitative measure of the speech quality of the speech signal 18 can be determined for controlling the algorithms to be applied to the input audio signal 8. This is described with reference to figure 2.
  • Figure 2 shows the processing of the input audio signal 8 of the hearing device 2 in a block diagram.
  • First, a voice activity VAD is detected for the input audio signal 8. If there is no significant speech activity (path "n"), the signal processing of the input audio signal 8 to generate the output audio signal 12 takes place using a first algorithm 25.
  • In a manner specified in advance, the first algorithm 25 evaluates signal parameters of the input audio signal 8 such as level, background noise, transients or the like, broadband and/or in particular per frequency band, and determines from these the individual parameters of the signal processing that are to be applied to the input audio signal 8.
  • The first algorithm 25 can also provide a classification of the hearing situation realized in the sound 6 and set individual parameters as a function of the classification, possibly as a hearing program provided for a specific hearing situation. Furthermore, the individual audiological requirements of the user of the hearing device 2 can be taken into account in the first algorithm 25, in order to compensate for a hearing impairment of the user as well as possible by applying the first algorithm 25 to the input audio signal 8.
  • If speech activity is detected, an SNR is determined next and compared with a predefined limit value Th_SNR. If the SNR is not above the limit value, i.e. SNR ≤ Th_SNR, the first algorithm 25 is again applied to the input audio signal 8 to generate the output audio signal 12. However, if the SNR is above the specified limit value, i.e. SNR > Th_SNR, a quantitative measure 30 for the speech quality of the speech component 18 contained in the input audio signal 8 is determined for the further processing of the input audio signal 8, in the manner described below. For this purpose, articulatory and/or prosodic properties of the speech signal 18 are recorded quantitatively.
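  • A sketch of this gating decision; the threshold of 5 dB is a placeholder, since the text leaves the value of Th_SNR unspecified.

```python
def select_processing(speech_active: bool, snr_db: float, th_snr_db: float = 5.0) -> str:
    """Decision logic of figure 2: rate speech quality only when speech is
    present and the SNR permits a reliable analysis."""
    if not speech_active:
        return "first_algorithm"        # path "n": default processing
    if snr_db <= th_snr_db:
        return "first_algorithm"        # SNR <= Th_SNR: no quality rating
    return "determine_measure_30"       # SNR > Th_SNR: derive the measure 30
```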
  • The term speech signal component 26 contained in the input audio signal 8 is to be understood as meaning those signal components of the input audio signal 8 which represent the speech component 18 of the sound 6 from which the input audio signal 8 is generated by the input converter 4.
  • the input audio signal 8 is divided into individual signal paths.
  • In a first signal path 32, a centroid wavelength λ_c is first determined and compared with a predetermined limit value Th_λ for the centroid wavelength. If it is determined on the basis of said limit value that the signal components in the input audio signal 8 are sufficiently high-frequency, the signal components of a low frequency range NF and of a higher frequency range HF lying above the low frequency range NF are selected in the first signal path 32, possibly after suitable temporal smoothing (not shown). A sketch of this centroid gate follows after the frequency-range definitions below.
  • The low frequency range NF includes all frequencies f_N ≤ 2500 Hz, in particular f_N ≤ 2000 Hz.
  • The higher frequency range HF includes frequencies f_H with 2500 Hz ≤ f_H ≤ 10000 Hz, in particular 4000 Hz ≤ f_H ≤ 8000 Hz or 2500 Hz ≤ f_H ≤ 5000 Hz.
  • The selection can be made directly in the input audio signal 8, or in such a way that the input audio signal 8 is divided into individual frequency bands by means of a filter bank (not shown), with individual frequency bands belonging to the lower or higher frequency range NF or HF depending on their respective band limits.
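  • A sketch of the gate of the first signal path via the spectral centroid; since the centroid wavelength is c / f_c (with c the speed of sound), a sufficiently short centroid wavelength corresponds to a sufficiently high centroid frequency. The threshold value is an illustrative stand-in for Th_λ.

```python
import numpy as np

def spectral_centroid_hz(x, fs):
    """Power-weighted mean frequency of the excerpt."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return float(np.sum(freqs * spec) / (np.sum(spec) + 1e-12))

def first_path_active(x, fs, min_centroid_hz=1500.0):
    """Proceed with the NF/HF energy analysis only if the signal content
    is sufficiently high-frequency."""
    return spectral_centroid_hz(x, fs) > min_centroid_hz
```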
  • a first energy E1 is then determined for the signal contained in the low frequency range NF and a second energy E2 for the signal contained in the higher frequency range HF.
  • A quotient QE is now formed, with the second energy E2 as numerator and the first energy E1 as denominator.
  • The quotient QE can now be used as a parameter 33, which is correlated with the dominance of consonants in the speech signal 18.
  • The parameter 33 thus enables a statement about an articulatory property of the speech signal components 26 in the input audio signal 8: for a value of the quotient QE >> 1 (i.e. QE > Th_QE with a predetermined limit value Th_QE >> 1, not shown in detail), a high dominance of consonants can be inferred, while for a value QE ≤ 1 a low dominance can be inferred.
  • In a second signal path 34, a differentiation 36 into voiced time sequences V and unvoiced time sequences UV is carried out in the input audio signal 8, based on correlation measurements and/or on a zero-crossing rate of the input audio signal 8. On the basis of the voiced and unvoiced time sequences V and UV, a transition TS from a voiced time sequence V to an unvoiced time sequence UV is determined.
  • the length of a voiced or unvoiced time sequence can be, for example, between 10 and 80 ms, in particular between 20 and 50 ms.
  • An energy Ev is determined for the voiced time sequence V before the transition TS, and an energy En for the unvoiced time sequence UV after the transition TS; both can also be determined separately for more than one frequency range. It is then determined how the energy changes at the transition TS, for example via a relative change ΔE_TS or via a quotient (not shown) of the energies Ev, En before and after the transition TS.
  • The measure of the change in energy is now compared with a limit value Th_E for the energy distribution at transitions, determined in advance for good articulation.
  • A parameter 35 can be formed on the basis of a ratio of the relative change ΔE_TS to said limit value Th_E, or on the basis of a relative deviation of the relative change ΔE_TS from this limit value Th_E.
  • Said parameter 35 is correlated with the articulation of the transitions between voiced and unvoiced sounds in the speech signal 18, and thus provides information about a further articulatory property of the speech signal components 26 in the input audio signal 8.
  • The statement applies here that a transition between voiced and unvoiced time sequences is articulated the more precisely, the faster, i.e. the more temporally delimited, the change in the energy distribution over the frequency ranges relevant for voiced and unvoiced sounds takes place.
  • For this purpose, an energy distribution over two frequency ranges (e.g. the above-mentioned frequency ranges according to the Bark scale, or also the lower and higher frequency ranges NF, HF) can be considered, e.g. via a quotient of the respective energies or a comparable characteristic value, and a change in the quotient or the characteristic value across the transition can be used for the parameter.
  • In particular, a rate of change of the quotient or of the characteristic value can be determined and compared with a reference value for the rate of change that has previously been determined to be suitable.
  • Transitions from unvoiced to voiced time sequences can also be considered in an analogous manner.
  • The concrete configuration, in particular with regard to the frequency ranges and the limit or reference values to be used, can generally be based on empirical results about the corresponding significance of the respective frequency bands or groups of frequency bands.
  • In a third signal path 38, a fundamental frequency f_G of the speech signal component 26 in the input audio signal 8 is detected in a time-resolved manner, and a time stability 40 of said fundamental frequency f_G is determined using a variance of the fundamental frequency f_G.
  • The time stability 40 can be used as a parameter 41, which enables a statement about a prosodic property of the speech signal components 26 in the input audio signal 8.
  • Here, a greater variance of the fundamental frequency f_G can be used as an indicator of better speech intelligibility, while a monotonous fundamental frequency f_G indicates lower speech intelligibility.
  • In a fourth signal path 42, a level LVL is detected in a time-resolved manner for the input audio signal 8 and/or for the speech signal component 26 contained therein, and a time average MN_LVL is formed over a time period 44 that is specified in particular on the basis of corresponding empirical findings. Furthermore, the maximum MX_LVL of the level LVL over the period 44 is determined. The maximum MX_LVL of the level LVL is now divided by the time average MN_LVL, and a parameter 45 correlated with a volume of the speech signal 18 is thus determined, which enables a further statement about a prosodic property of the speech signal components 26 in the input audio signal 8.
  • Instead of the level LVL, another variable correlated with the volume and/or with the energy content of the speech signal component 26 can also be used.
  • The parameters 33, 35, 41 and 45 determined as described in the first to fourth signal paths 32, 34, 38, 42 can now each be used individually as the quantitative measure 30 for the quality of the speech component 18 contained in the input audio signal 8.
  • Depending on the quantitative measure 30, a second algorithm 46 is now applied to the input audio signal 8 for the signal processing.
  • the second algorithm 46 can result from the first algorithm 25 through a corresponding change in one or more parameters of the signal processing, depending on the relevant quantitative measure 30, or can provide a completely independent hearing program.
  • However, a single value can also be determined as the quantitative measure 30 for the speech quality using the parameters 33, 35, 41 and 45 determined as described, e.g. as a weighted mean value or a product of the parameters 33, 35, 41, 45 (shown schematically in figure 2 by the combining of the parameters 33, 35, 41, 45).
  • The weighting of the individual parameters can take place in particular using previously empirically determined weighting factors, which can be set according to the significance, for the speech quality, of the articulatory or prosodic property captured by the respective parameter.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Neurosurgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Circuit For Audible Band Transducer (AREA)
EP21190918.9A 2020-08-28 2021-08-12 Verfahren zur bewertung der sprachqualität eines sprachsignals mittels einer hörvorrichtung Pending EP3962115A1 (de)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
DE102020210919.2A DE102020210919A1 (de) 2020-08-28 2020-08-28 Verfahren zur Bewertung der Sprachqualität eines Sprachsignals mittels einer Hörvorrichtung

Publications (1)

Publication Number Publication Date
EP3962115A1 true EP3962115A1 (de) 2022-03-02

Family

ID=77316824

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21190918.9A Pending EP3962115A1 (de) 2020-08-28 2021-08-12 Verfahren zur bewertung der sprachqualität eines sprachsignals mittels einer hörvorrichtung

Country Status (4)

Country Link
US (1) US12009005B2
EP (1) EP3962115A1
CN (1) CN114121040A
DE (1) DE102020210919A1

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040167774A1 (en) * 2002-11-27 2004-08-26 University Of Florida Audio-based method, system, and apparatus for measurement of voice quality
US7165025B2 (en) * 2002-07-01 2007-01-16 Lucent Technologies Inc. Auditory-articulatory analysis for speech quality assessment
US20180255406A1 (en) * 2017-03-02 2018-09-06 Gn Hearing A/S Hearing device, method and hearing system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2948214A1 (en) * 2013-01-24 2015-12-02 Advanced Bionics AG Hearing system comprising an auditory prosthesis device and a hearing aid
US9814879B2 (en) * 2013-05-13 2017-11-14 Cochlear Limited Method and system for use of hearing prosthesis for linguistic evaluation
DE102013224417B3 (de) * 2013-11-28 2015-05-07 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Hörhilfevorrichtung mit Grundfrequenzmodifizierung, Verfahren zur Verarbeitung eines Sprachsignals und Computerprogramm mit einem Programmcode zur Durchführung des Verfahrens
US11253193B2 (en) * 2016-11-08 2022-02-22 Cochlear Limited Utilization of vocal acoustic biomarkers for assistive listening device utilization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7165025B2 (en) * 2002-07-01 2007-01-16 Lucent Technologies Inc. Auditory-articulatory analysis for speech quality assessment
US20040167774A1 (en) * 2002-11-27 2004-08-26 University Of Florida Audio-based method, system, and apparatus for measurement of voice quality
US20180255406A1 (en) * 2017-03-02 2018-09-06 Gn Hearing A/S Hearing device, method and hearing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ASGER HEIDEMANN ANDERSEN ET AL: "Nonintrusive Speech Intelligibility Prediction Using Convolutional Neural Networks", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE, USA, vol. 26, no. 10, 1 October 2018 (2018-10-01), pages 1925 - 1939, XP058416624, ISSN: 2329-9290, DOI: 10.1109/TASLP.2018.2847459 *

Also Published As

Publication number Publication date
US20220068294A1 (en) 2022-03-03
DE102020210919A1 (de) 2022-03-03
CN114121040A (zh) 2022-03-01
US12009005B2 (en) 2024-06-11

Similar Documents

Publication Publication Date Title
DE602004004242T2 (de) System und Verfahren zur Verbesserung eines Audiosignals
Tchorz et al. SNR estimation based on amplitude modulation analysis with applications to noise suppression
EP2364646B1 (de) Hörtestverfahren
DE102008031150B3 (de) Verfahren zur Störgeräuschunterdrückung und zugehöriges Hörgerät
US20110178799A1 (en) Methods and systems for identifying speech sounds using multi-dimensional analysis
Alku et al. Measuring the effect of fundamental frequency raising as a strategy for increasing vocal intensity in soft, normal and loud phonation
EP1563487B1 (de) Verfahren zur ermittlung akustischer merkmale von schallsignalen fuer die analyse unbekannter schallsignale und modifikation einer schallerzeugung
DE602004007953T2 (de) System und verfahren zur audiosignalverarbeitung
Hansen et al. A speech perturbation strategy based on “Lombard effect” for enhanced intelligibility for cochlear implant listeners
DE102014207437B4 (de) Spracherkennung mit einer Mehrzahl an Mikrofonen
Messing et al. A non-linear efferent-inspired model of the auditory system; matching human confusions in stationary noise
Henrich et al. Just noticeable differences of open quotient and asymmetry coefficient in singing voice
Chennupati et al. Spectral and temporal manipulations of SFF envelopes for enhancement of speech intelligibility in noise
Parida et al. Underlying neural mechanisms of degraded speech intelligibility following noise-induced hearing loss: The importance of distorted tonotopy
DE60110541T2 (de) Verfahren zur Spracherkennung mit geräuschabhängiger Normalisierung der Varianz
WO2010078938A2 (de) Verfahren und vorrichtung zum verarbeiten von akustischen sprachsignalen
EP3962115A1 (de) Verfahren zur bewertung der sprachqualität eines sprachsignals mittels einer hörvorrichtung
EP2548382B1 (de) Verfahren zum test des sprachverstehens einer mit einem hörhilfegerät versorgten person
Rao et al. Speech enhancement for listeners with hearing loss based on a model for vowel coding in the auditory midbrain
Bapineedu et al. Analysis of Lombard speech using excitation source information.
DE102009032238A1 (de) Verfahren zur Kontrolle der Anpassung eines Hörgerätes
EP3961624A1 (de) Verfahren zum betrieb einer hörvorrichtung in abhängigkeit eines sprachsignals
DE102020210918A1 (de) Verfahren zum Betrieb einer Hörvorrichtung in Abhängigkeit eines Sprachsignals
Alku et al. On the linearity of the relationship between the sound pressure level and the negative peak amplitude of the differentiated glottal flow in vowel production
EP2394271B1 (de) Methode zur trennung von signalpfaden und anwendung auf die verbesserung von sprache mit elektro-larynx

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220901

RBV Designated contracting states (corrected)

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

RIC1 Information provided on ipc code assigned before grant

Ipc: G10L 25/15 20130101ALI20240322BHEP

Ipc: G10L 25/60 20130101ALI20240322BHEP

Ipc: H04R 25/00 20060101AFI20240322BHEP

INTG Intention to grant announced

Effective date: 20240419

GRAJ Information related to disapproval of communication of intention to grant by the applicant or resumption of examination proceedings by the epo deleted

Free format text: ORIGINAL CODE: EPIDOSDIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE