EP3962115A1 - Method for evaluating the speech quality of a speech signal by means of a hearing device - Google Patents
- Publication number: EP3962115A1 (application EP21190918.9A)
- Authority: EP (European Patent Office)
- Prior art keywords
- signal
- speech
- determined
- input audio
- audio signal
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G10L21/0364 — Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude for improving intelligibility
- G10L25/60 — Speech or voice analysis techniques specially adapted for measuring the quality of voice signals
- G10L25/15 — Speech or voice analysis techniques characterised by the extracted parameters being formant information
- G10L25/84 — Detection of presence or absence of voice signals for discriminating voice from noise
- H04R25/305 — Self-monitoring or self-testing of hearing aids
- H04R25/405 — Arrangements for obtaining a desired directivity characteristic by combining a plurality of transducers
- H04R25/407 — Circuits for combining signals of a plurality of transducers
- H04R25/43 — Electronic input selection or mixing based on input signal analysis, e.g. mixing or selection between microphone and telecoil or between microphones with different directivity characteristics
- H04R25/505 — Customised settings for obtaining desired overall acoustical characteristics using digital signal processing
- H04R2225/43 — Signal processing in hearing aids to enhance the speech intelligibility
Definitions
- The invention relates to a method for evaluating the speech quality of a speech signal by means of a hearing device, in which a sound containing the speech signal is picked up from the surroundings of the hearing device by an acousto-electrical input transducer of the hearing device and converted into an input audio signal, and in which at least one property of the speech signal is quantitatively detected by analyzing the input audio signal by means of signal processing.
- An important task in the use of hearing devices is often to output a speech signal as accurately, i.e. in particular as acoustically intelligibly, as possible to a user of the hearing device.
- For this purpose, interference noise is often suppressed in an audio signal generated from a sound containing a speech signal, in order to emphasize the signal components representing the speech signal and thus improve its intelligibility.
- The sound quality of the resulting output signal can, however, often be reduced by noise-reduction algorithms: the signal processing of the audio signal can in particular produce artifacts, and/or the hearing impression can generally be perceived as less natural.
- Noise suppression is usually controlled by parameters that primarily relate to the noise or to the overall signal, e.g. a signal-to-noise ratio (SNR), a background noise level ("noise floor"), or the level of the audio signal.
- This approach to controlling the noise suppression can ultimately result in the suppression also being applied when it is not necessary at all, for example when noticeable background noise is present but the speech components are still easy to understand despite it.
- In such cases, the risk of degrading the sound quality, e.g. through noise-reduction artifacts, is taken without real necessity.
- Conversely, a speech signal with only little superimposed noise, and thus a good SNR of the associated audio signal, can nevertheless have low speech quality if the speaker articulates weakly.
- The invention is therefore based on the object of specifying a method by which the quality of a speech component in an audio signal to be processed by a hearing device can be assessed objectively.
- the invention is also based on the object of specifying a hearing device which is set up to objectively evaluate the quality of a speech component contained in an internal audio signal.
- The first-mentioned object is achieved according to the invention by a method for evaluating the speech quality of a speech signal by means of a hearing device, in which an acousto-electrical input transducer of the hearing device picks up a sound containing the speech signal from the surroundings of the hearing device and converts it into an input audio signal, in which at least one articulatory and/or prosodic property of the speech signal is quantitatively detected by analyzing the input audio signal by means of signal processing, in particular signal processing of the hearing device and/or of an auxiliary device connectable to the hearing device, and in which a quantitative measure of the speech quality is derived as a function of the at least one articulatory or prosodic property.
- Advantageous and partly inventive configurations are the subject matter of the subclaims and the following description.
- The second-mentioned object is achieved according to the invention by a hearing device comprising an acousto-electrical input transducer and a signal processing device having in particular a signal processor, the input transducer being set up to pick up a sound from the surroundings of the hearing device and convert it into an input audio signal, and the signal processing device being set up to quantitatively detect at least one articulatory and/or prosodic property of a speech-signal component contained in the input audio signal by analyzing the input audio signal, and to derive a quantitative measure of the speech quality as a function of the at least one articulatory or prosodic property.
- the hearing device according to the invention shares the advantages of the method according to the invention, which can be carried out in particular by means of the hearing device according to the invention.
- the advantages mentioned below for the method and for its further developments can be transferred analogously to the hearing device.
- An acousto-electrical input transducer includes, in particular, any transducer set up to generate an electrical audio signal from ambient sound, so that the air movements and air-pressure fluctuations caused by the sound at the location of the transducer are reproduced in the generated audio signal by corresponding oscillations of an electrical quantity, in particular a voltage.
- the acousto-electric input converter can be provided by a microphone.
- The signal processing takes place in particular by means of a corresponding signal processing device, which uses at least one signal processor for carrying out the calculations and/or algorithms provided for the signal processing.
- the signal processing device is arranged in particular on the hearing device.
- the signal processing device can also be arranged on an auxiliary device, which is set up for a connection with the hearing device for data exchange, e.g. a smartphone, a smartwatch, etc.
- The hearing device can then, for example, transmit the input audio signal to the auxiliary device, and the analysis is performed using the computing resources provided by the auxiliary device. As a result of the analysis, the quantitative measure can finally be transmitted back to the hearing device.
- the analysis can be carried out directly on the input audio signal or using a signal derived from the input audio signal.
- Such a derived signal can be given in particular by the isolated speech-signal component, but also by an audio signal such as can be generated in a hearing device by a feedback loop using a compensation signal to compensate for acoustic feedback, or similar, or by a directional signal generated on the basis of a further input audio signal of a further input transducer.
- An articulatory property of the speech signal includes in particular a precision of formants, especially vowels, and a dominance of consonants, especially fricatives and/or plosives.
- the statement can be made that the higher the precision of the formants or the higher the dominance and/or precision of consonants, the higher the speech quality.
- a prosodic property of the speech signal includes, in particular, a time stability of a fundamental frequency of the speech signal and a relative sound intensity of accents.
- Sound generation usually comprises three physical components of a sound source: a mechanical oscillator such as a string or membrane, which causes the surrounding air to oscillate; an excitation of the oscillator (e.g. by plucking or bowing); and a resonator.
- the oscillator is set into oscillations by the excitation, so that the air surrounding the oscillator is set into pressure oscillations by the oscillations of the oscillator, which propagate as sound waves.
- In the mechanical oscillator, not only vibrations of a single frequency are excited but vibrations of different frequencies, the spectral composition of the propagating vibrations determining the sound pattern.
- the frequencies of certain vibrations are often given as integer multiples of a fundamental frequency and are referred to as "harmonics" or overtones of this fundamental frequency.
- more complex spectral patterns can also develop, so that not all frequencies generated can be represented as harmonics of the same fundamental frequency.
- The resonance of the generated frequencies in the resonance chamber is also relevant for the sound pattern, since certain frequencies generated by the oscillator are often attenuated in the resonance chamber relative to the dominant frequencies of a sound.
- In the case of the human voice, the mechanical oscillator is given by the vocal cords, their excitation by the air flowing past the vocal cords from the lungs, and the resonance chamber is formed primarily by the pharynx and oral cavity.
- The fundamental frequency of a male voice is usually in the range from 60 Hz to 150 Hz, that of a female voice mostly in the range from 150 Hz to 300 Hz.
- the formants form independently of the fundamental frequency, i.e. the frequency of the fundamental oscillation.
- Precision of formants is to be understood in particular as the degree of concentration of the acoustic energy on mutually distinguishable formant ranges, in particular on individual frequencies within the formant ranges, and the resulting ability to identify the individual vowels from the formants.
- For the generation of consonants, the airflow past the vocal cords is partially or completely blocked in at least one place, which among other things creates turbulence in the airflow; for this reason only some consonants can be assigned a formant structure as clear as that of vowels, while other consonants have a more broadband frequency structure.
- Nevertheless, consonants can also be assigned specific frequency bands in which their acoustic energy is concentrated. Owing to the rather percussive "noise quality" of consonants, these bands generally lie above the formant ranges of vowels, primarily in the range from approx. 2 kHz to 8 kHz, while the ranges of the most important vowel formants F1 and F2 generally lie below approx. 1.5 kHz (F1) and 4 kHz (F2), respectively.
- the precision of consonants is determined in particular from a degree of concentration of the acoustic energy on the corresponding frequency ranges and the resulting determinability of the individual consonants.
- However, the ability to distinguish the individual components of a speech signal does not depend on articulatory aspects alone. While these primarily concern the acoustic precision of the smallest isolated sound events of speech, the so-called phonemes, prosodic aspects also determine the speech quality: through intonation and accentuation, especially across several segments, i.e. several phonemes or phoneme groups, a particular meaning can be imprinted on an utterance, for example by raising the pitch at the end of a sentence to mark a question, by stressing a specific syllable of a word to distinguish between different meanings (cf. "OBject" vs. "obJECT"), or by stressing a word in order to emphasize it.
- Accordingly, a speech quality of a speech signal can also be measured quantitatively on the basis of prosodic properties, in particular those just mentioned, by determining, for example, measures of the temporal variation of the pitch of the voice, i.e. of its fundamental frequency, and of the distinctness of amplitude and/or level maxima.
- the quantitative measure for the speech quality can thus be derived on the basis of one or more of the named and/or further, quantitatively recorded articulatory and/or prosodic properties of the speech signal.
- As the preferred articulatory property of the speech signal, a parameter correlated with the precision of predetermined formants of vowels in the speech signal, a parameter correlated with the dominance of consonants, in particular fricatives, in the speech signal, and/or a parameter correlated with the precision of the transitions from voiced to unvoiced sounds is detected.
- the quantitative measure of the voice quality can then be given directly by the said detected parameter, or formed on the basis of this, for example by weighting two parameters for different formants or the like, or by the weighting, ie by a weighted averaging, of at least two different parameters mentioned.
- The quantitative measure of speech quality thus refers to the speech production of a speaker, which can exhibit anything from deficits (such as lisping or mumbling) to outright speech errors, all of which reduce the speech quality relative to a pronunciation perceived as "clean".
- The present measure is in particular independent of the external properties of a transmission channel, such as propagation in a possibly reverberant room or in a noisy environment, and preferably depends only on the intrinsic properties of the speech production of the speaker.
- For detecting the parameter correlated with the dominance of consonants, a first energy contained in a low frequency range is advantageously calculated, a second energy contained in a higher frequency range lying above the low frequency range is calculated, and the correlated parameter is formed using a ratio of the first energy and the second energy and/or such a ratio weighted by the respective bandwidths of the two frequency ranges.
- the voice signal can be smoothed over time in advance.
- the input audio signal can in particular be divided into the lower and the higher frequency range, e.g. by means of a filter bank and, if necessary, by means of a corresponding selection of individual resulting frequency bands.
- the low frequency range is preferably selected in such a way that it lies within the frequency interval [0 Hz, 2.5 kHz], particularly preferably within the frequency interval [0 Hz, 2 kHz].
- The higher frequency range is preferably selected in such a way that it lies within the frequency interval [3 kHz, 10 kHz], particularly preferably within the frequency interval [4 kHz, 8 kHz].
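The band-energy variant described above can be sketched as follows. This is an illustrative Python sketch, not taken from the patent: a plain DFT stands in for the filter bank mentioned in the text, and the function names, sampling rate, and frame length are assumptions. The preferred bands [0 Hz, 2 kHz] and [4 kHz, 8 kHz] are used, and the bandwidth-weighted energy ratio serves as the consonant-dominance parameter.

```python
import math

def band_energy(frame, fs, f_lo, f_hi):
    # Energy of the frame between f_lo and f_hi (in Hz), computed with a
    # plain DFT; a filter bank, as mentioned in the text, would serve the
    # same purpose.
    n = len(frame)
    energy = 0.0
    for k in range(n // 2 + 1):
        f = k * fs / n
        if f_lo <= f < f_hi:
            re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
            im = sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
            energy += re * re + im * im
    return energy

def consonant_dominance(frame, fs=16000):
    # Bandwidth-weighted ratio of the energy in the higher band to the
    # energy in the low band (preferred ranges from the text).
    e_low = band_energy(frame, fs, 0.0, 2000.0)
    e_high = band_energy(frame, fs, 4000.0, 8000.0)
    bw_low, bw_high = 2000.0, 4000.0
    return (e_high / bw_high) / (e_low / bw_low + 1e-12)  # guard for silence
```

A frame dominated by low-frequency (vowel-like) energy yields a value near zero, while a frame whose energy sits in the 4 kHz to 8 kHz band, as is typical for fricatives, yields a large value.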
- Expediently, the parameter correlated with the precision of the transitions between voiced and unvoiced sounds is detected by distinguishing between voiced and unvoiced time sequences using a correlation measurement and/or a zero-crossing rate of the input audio signal or of a signal derived from it, determining a transition from a voiced to an unvoiced time sequence or from an unvoiced to a voiced time sequence, determining for at least one frequency range the energy contained in the voiced or unvoiced time sequence before the transition, determining for the at least one frequency range the energy contained in the unvoiced or voiced time sequence after the transition, and determining the parameter on the basis of the energy before the transition and the energy after the transition.
- the voiced and unvoiced time sequences of the speech signal are first determined in the input audio signal, and from this a transition from voiced to unvoiced or from unvoiced to voiced is identified.
- For at least one frequency range, which is predetermined in particular on the basis of empirical findings on the precision of transitions, the energy before the transition is now determined for the input audio signal or for a signal derived from it. This energy can be taken, for example, over the voiced or unvoiced time sequence immediately before the transition.
- the energy in the relevant frequency range is determined after the transition, e.g. via the unvoiced or voiced time sequence following the transition.
- On the basis of these two energies, a characteristic value can now be determined which in particular allows a statement about a change in the energy distribution at the transition.
- This parameter can be determined, for example, as a quotient or a relative deviation of the two energies before and after the transition.
- the characteristic value can also be formed as a comparison of the energy before or after the transition with the total (broadband) signal energy.
- In addition, the energies can also be determined for a further frequency range before and after the transition, so that the characteristic value can also be determined using the energies before and after the transition in the further frequency band, e.g. as a rate of change of the distribution of energy among the frequency ranges involved across the transition (i.e. a comparison of the distribution of the energies over both frequency ranges before the transition with the distribution after it).
- The parameter correlated with the precision of the transitions, used for the measure of the speech quality, can then be determined on the basis of said characteristic value.
- the characteristic value can be used directly, or the characteristic value can be compared with a reference value determined in advance for good articulation, in particular on the basis of corresponding empirical knowledge (e.g. as a quotient or relative deviation).
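The steps above can be sketched as a short Python example. This is a hedged illustration, not the patent's implementation: the zero-crossing-rate threshold is a common heuristic rather than a value from the text, and broadband frame energy stands in for "at least one frequency range".

```python
def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs with a sign change.
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)

def frame_energy(frame):
    return sum(x * x for x in frame)

def transition_parameter(frames, zcr_threshold=0.25):
    # Label each frame voiced (low zero-crossing rate) or unvoiced (high
    # zero-crossing rate).  At the first voiced/unvoiced transition, return
    # the quotient of the energies after and before the transition, one of
    # the characteristic-value variants described in the text.
    voiced = [zero_crossing_rate(f) < zcr_threshold for f in frames]
    for i in range(1, len(frames)):
        if voiced[i] != voiced[i - 1]:
            return frame_energy(frames[i]) / (frame_energy(frames[i - 1]) + 1e-12)
    return None  # no transition found
```

The returned quotient can then be compared with a reference value for good articulation, as described above.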
- The concrete configuration, in particular with regard to the frequency ranges and limit or reference values to be used, can generally be based on empirical results about the corresponding significance of the respective frequency bands or groups of frequency bands.
- frequency bands 13 to 24, preferably 16 to 23, of the Bark scale can be used as the at least one frequency range.
- a frequency range of lower frequencies can be used as a further frequency range.
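The Bark bands named above can be mapped to frequencies, for instance with the classical Zwicker critical-band edges; the edge table below is the commonly cited one, and the helper name is an assumption for illustration.

```python
# Edges (Hz) of the 24 critical bands of the classical Zwicker Bark scale;
# band n spans BARK_EDGES[n - 1] .. BARK_EDGES[n].
BARK_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
              9500, 12000, 15500]

def bark_band_range(first_band, last_band):
    # Frequency interval (Hz) jointly covered by Bark bands
    # first_band .. last_band.
    return BARK_EDGES[first_band - 1], BARK_EDGES[last_band]
```

Bands 13 to 24 then correspond to roughly 1.7 kHz to 15.5 kHz, and the preferred bands 16 to 23 to roughly 2.7 kHz to 12 kHz, consistent with the consonant frequency ranges discussed earlier.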
- For detecting the parameter correlated with the precision of predetermined formants of vowels in the speech signal, the acoustic energies of the speech signal concentrated in at least two different formant ranges are preferably compared with one another.
- In particular, a signal component of the speech signal is determined in at least one formant range in the frequency domain,
- a signal variable correlated with the level is determined for the signal component of the speech signal in the at least one formant range, and
- the parameter is determined on the basis of a maximum value and/or a time stability of the level-correlated signal variable.
- As the formant range, the frequency range of the first formant F1 (preferably 250 Hz to 1 kHz, particularly preferably 300 Hz to 750 Hz) or of the second formant F2 (preferably 500 Hz to 3.5 kHz, particularly preferably 600 Hz to 2.5 kHz) can be selected, or two formant ranges of the first and second formants are selected.
- In particular, several first and/or second formant ranges assigned to different vowels can also be selected (i.e. the frequency ranges assigned to the first or second formant of the respective vowel).
- the signal portion is now determined for the selected formant range or ranges, and a signal magnitude of the respective signal portion that is correlated with the level is determined.
- The signal variable can be given by the level itself, or also by the maximum signal amplitude, suitably smoothed if necessary. On the basis of a time stability of this signal variable, which in turn can be determined via a variance of the signal variable over a suitable time window, and/or on the basis of a deviation of the signal variable from its maximum value over a suitable time window, a statement can now be made about the precision of formants: a small variance and a small deviation from the maximum level over the duration of an articulated sound (the length of the time window can be chosen in particular depending on the length of an articulated sound) indicate high precision.
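The level-stability idea above can be sketched as follows. In this illustrative Python sketch the Goertzel recurrence serves as a cheap stand-in for isolating a formant-range component, and the mapping from variance to a score in (0, 1] is an assumed choice, not prescribed by the text.

```python
import math

def goertzel_power(frame, fs, freq):
    # Power of the frame at the DFT bin nearest `freq`, via the Goertzel
    # recurrence; used here as a stand-in for a formant-band component.
    n = len(frame)
    k = round(freq * n / fs)
    coeff = 2 * math.cos(2 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in frame:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

def formant_precision(frames, fs=8000, formant_hz=500):
    # Precision score from the time stability of the formant-band level:
    # 1.0 for a perfectly stable level, smaller as the level varies.
    levels = [goertzel_power(f, fs, formant_hz) for f in frames]
    peak = max(levels)
    if peak <= 0:
        return 0.0
    rel = [lv / peak for lv in levels]
    mean = sum(rel) / len(rel)
    var = sum((r - mean) ** 2 for r in rel) / len(rel)
    return 1.0 / (1.0 + var)
```

A steady vowel whose formant-band level stays near its maximum scores close to 1, while a fluctuating level lowers the score.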
- the fundamental frequency of the speech signal is detected in a time-resolved manner, and a parameter that is characteristic of the time stability of the fundamental frequency is determined as a prosodic property of the speech signal.
- This parameter can be determined, for example, based on a relative deviation of the fundamental frequency accumulated over time, or by detecting a number of maxima and minima of the fundamental frequency over a predetermined period of time.
- the time stability of the fundamental frequency is particularly important for a monotony of the speech melody and accentuation, which is why a quantitative recording also allows a statement about the speech quality of the speech signal.
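The two variants named above for the fundamental-frequency parameter can be sketched directly; the function names and the frame-wise F0 track as input are illustrative assumptions.

```python
def f0_stability(f0_track):
    # Accumulated relative frame-to-frame deviation of the fundamental
    # frequency (first variant in the text); lower means more stable.
    total = 0.0
    for prev, cur in zip(f0_track, f0_track[1:]):
        if prev > 0:
            total += abs(cur - prev) / prev
    return total

def count_extrema(f0_track):
    # Number of local maxima and minima of the F0 track over the
    # observation period (second variant in the text).
    extrema = 0
    for a, b, c in zip(f0_track, f0_track[1:], f0_track[2:]):
        if (b > a and b > c) or (b < a and b < c):
            extrema += 1
    return extrema
```

A flat track yields zero for both measures, while a lively speech melody yields larger values; either can serve as the characteristic prosodic parameter.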
- Preferably, a variable correlated with the volume, in particular an amplitude and/or a level, is recorded in a time-resolved manner for the speech signal, in particular by a corresponding analysis of the input audio signal or of a signal derived from it; a quotient of a maximum value of the volume-correlated variable to a mean value of this variable determined over a predetermined period is formed; and a characteristic variable depending on this quotient is determined as a prosodic property of the speech signal.
- In this way, a statement about the distinctness of the accentuation can be made on the basis of the indirectly recorded volume dynamics of the speech signal.
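The maximum-to-mean quotient described above is simple enough to state directly; this hypothetical helper assumes per-frame levels as input.

```python
def accentuation_measure(levels):
    # Quotient of the maximum of a volume-correlated variable (here: a
    # per-frame level) to its mean over the observation window; larger
    # values indicate clearer accentuation.
    mean = sum(levels) / len(levels)
    if mean <= 0:
        return 0.0
    return max(levels) / mean
```

A monotone level sequence yields 1.0, while a clearly stressed syllable raises the quotient above 1.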
- Favorably, at least two parameters characteristic of articulatory and/or prosodic properties are determined on the basis of the analysis of the input audio signal, the quantitative measure of the speech quality being formed on the basis of a product of these parameters and/or a weighted average and/or a maximum or minimum value of these parameters. This is particularly advantageous when a single measure of speech quality is required or desired, or when a single measure intended to capture all articulatory or all prosodic properties is desired.
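The combination variants named above can be sketched as one hypothetical helper; parameter values, weights, and mode names are illustrative.

```python
def speech_quality(params, weights=None, mode="weighted"):
    # Combine several articulatory/prosodic parameters into one quantitative
    # measure, using the variants named in the text: a product, a weighted
    # average, or a minimum/maximum value.
    if mode == "product":
        result = 1.0
        for p in params:
            result *= p
        return result
    if mode == "min":
        return min(params)
    if mode == "max":
        return max(params)
    # default: weighted average
    if weights is None:
        weights = [1.0] * len(params)
    return sum(w * p for w, p in zip(weights, params)) / sum(weights)
```

The minimum variant is the most conservative (a single poor property caps the measure), while the weighted average allows individual properties to be emphasized.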
- The analysis of the speech quality of the speech signal can in particular be limited to those cases in which a speech signal is actually present, or in which the SNR lies above a predetermined limit value, so that it can be assumed that a sufficiently good recognition of the speech-signal components in the input audio signal is possible at all for making a corresponding assessment.
- the hearing device is preferably designed as a hearing aid.
- The hearing aid can be a monaural device or a binaural device with two local units that are worn by the user of the hearing aid on the right and left ear, respectively.
- In particular, the hearing aid can also have at least one further acousto-electrical input transducer, which converts the sound of the environment into a corresponding further input audio signal, so that the quantitative detection of the at least one articulatory and/or prosodic property of a speech signal can be performed by analyzing several of the input audio signals involved.
- two of the input audio signals used can each be generated in different local units of the hearing device (that is to say in each case on the left or on the right ear).
- the signal processing device can in particular include the signal processors of both local units, with the locally determined measures for the speech quality preferably being combined for both local units in a suitable manner, e.g. by averaging or by taking a maximum or minimum value, depending on the articulatory and/or prosodic property considered.
- a hearing device 1 is shown schematically in a circuit diagram, which is embodied as a hearing aid 2 in the present case.
- the hearing aid 2 has an acousto-electric input converter 4 which is set up to convert a sound 6 in the area surrounding the hearing aid 2 into an input audio signal 8 .
- An embodiment of the hearing device 2 with a further input converter (not shown), which generates a corresponding further input audio signal from the sound 6 of the environment, is also conceivable here.
- the hearing device 2 is designed as a stand-alone, monaural device. Equally conceivable is an embodiment of the hearing device 2 as a binaural hearing device with two local devices (not shown), which are to be worn by the user of the hearing device 2 on his right and left ear.
- the input audio signal 8 is fed to a signal processing device 10 of the hearing aid 2, in which the input audio signal 8 is processed in accordance with the audiological requirements of the user of the hearing aid 2 and, for example, is amplified and/or compressed by frequency band.
- the signal processing device 10 is provided in particular by means of a corresponding signal processor (not shown in detail in figure 1) and a main memory that can be addressed via the signal processor. Any pre-processing of the input audio signal 8, such as A/D conversion and/or pre-amplification of the generated input audio signal 8, should be considered part of the input converter 4 in this case.
- By processing the input audio signal 8, the signal processing device 10 generates an output audio signal 12, which is converted into an output sound signal 16 of the hearing aid 2 by means of an electro-acoustic output converter 14.
- the input transducer 4 is preferably provided by a microphone, the output transducer 14, for example, by a loudspeaker (such as a balanced metal case receiver), but can also be provided by a bone conductor or the like.
- the sound 6 in the area surrounding the hearing aid 2, which is detected by the input transducer 4, includes, among other things, a speech signal 18 from a speaker (not shown in detail) and further sound components 20, which can include, in particular, directed and/or diffuse background noise (interfering noise), but can also contain sounds which, depending on the situation, could be regarded as a useful signal, for example music or acoustic warning or information signals relating to the environment.
- the signal processing of the input audio signal 8 that takes place in the signal processing device 10 to generate the output audio signal 12 can in particular include a suppression of the signal components representing the background noise contained in the sound 6, or a relative increase of the signal components representing the speech signal 18 compared to the signal components representing the further sound components 20.
- a frequency-dependent or broadband dynamic compression and/or amplification as well as algorithms for noise suppression can also be used here.
- in the signal processing device 10, a quantitative measure of the speech quality of the speech signal 18 can be determined for controlling the algorithms to be applied to the input audio signal 8. This is described with reference to figure 2.
- figure 2 shows the processing of the input audio signal 8 of the hearing device 2 in a block diagram.
- a voice activity VAD is recognized for the input audio signal 8 . If there is no significant speech activity (path "n"), the signal processing of the input audio signal 8 to generate the output audio signal 12 takes place using a first algorithm 25.
- the first algorithm 25 evaluates signal parameters of the input audio signal 8 in a previously specified manner, such as level, background noise, transients or the like, broadband and/or in particular frequency band by frequency band, and determines from these individual parameters of the signal processing that are to be applied to the input audio signal 8.
- the first algorithm 25 can also provide a classification of a hearing situation, which is realized in the sound 6, and set individual parameters as a function of the classification, possibly as a hearing program provided accordingly for a specific hearing situation. Furthermore, the individual audiological requirements of the user of the hearing device 2 can also be taken into account for the first algorithm 25 in order to be able to compensate for a hearing impairment of the user as well as possible by applying the first algorithm 25 to the input audio signal 8 .
- an SNR is determined next and compared with a predefined limit value Th SNR . If the SNR is not above the limit value, i.e. SNR ≤ Th SNR , then the first algorithm 25 is again applied to the input audio signal 8 to generate the output audio signal 12 . However, if the SNR is above the specified limit value Th SNR , i.e. SNR > Th SNR , a quantitative measure 30 for the speech quality of the speech component 18 contained in the input audio signal 8 is determined for the further processing of the input audio signal 8 in the manner described below. For this purpose, articulatory and/or prosodic properties of the speech signal 18 are recorded quantitatively.
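The gating described above — apply the first algorithm 25 when no significant speech activity is present or when SNR ≤ Th SNR, and determine the quality measure 30 only when SNR > Th SNR — can be sketched as follows. This is a minimal Python illustration with made-up threshold values and a crude energy-based voice activity check, not the patent's implementation:

```python
import math

def choose_processing(frame, noise_floor, vad_threshold=1e-4, snr_limit_db=5.0):
    """Decide the processing branch for one audio frame.

    Returns "first_algorithm" when no speech activity is detected or the
    SNR is at or below the limit, and "quality_based" when a quantitative
    speech-quality measure should additionally be determined.
    All thresholds here are illustrative placeholders.
    """
    frame_power = sum(x * x for x in frame) / len(frame)
    if frame_power < vad_threshold:      # crude energy-based VAD: path "n"
        return "first_algorithm"
    snr_db = 10.0 * math.log10(frame_power / max(noise_floor, 1e-12))
    # SNR <= Th_SNR: fall back to the first algorithm 25
    return "quality_based" if snr_db > snr_limit_db else "first_algorithm"
```

A frame of silence or a frame barely above the noise floor both route to the first algorithm; only a frame clearly above the noise floor triggers the quality assessment.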
- the term speech signal component 26 contained in input audio signal 8 is to be understood as meaning those signal components of input audio signal 8 which represent speech component 18 of sound 6 from which input audio signal 8 is generated by input converter 4 .
- the input audio signal 8 is divided into individual signal paths.
- In the first signal path 32, a centroid wavelength λ c is first determined and compared with a predetermined limit value Th λ for the centroid wavelength. If it is determined on the basis of said limit value Th λ that the signal components in the input audio signal 8 are sufficiently high-frequency, the signal components for a low frequency range NF and for a higher frequency range HF lying above the low frequency range NF are selected in the first signal path 32, possibly after suitable temporal smoothing (not shown).
- the low frequency range NF includes all frequencies f N ≤ 2500 Hz, in particular f N ≤ 2000 Hz
- the higher frequency range HF includes frequencies f H with 2500 Hz ≤ f H ≤ 10000 Hz, in particular 4000 Hz ≤ f H ≤ 8000 Hz or 2500 Hz ≤ f H ≤ 5000 Hz.
- the selection can be made directly in the input audio signal 8, or in such a way that the input audio signal 8 is divided into individual frequency bands by means of a filter bank (not shown), with individual frequency bands being assigned to the lower or higher frequency range NF or HF depending on their respective band limits.
- a first energy E1 is then determined for the signal contained in the low frequency range NF and a second energy E2 for the signal contained in the higher frequency range HF.
- a quotient QE is now formed with the second energy E2 as numerator and the first energy E1 as denominator.
- the quotient QE can now be used as a parameter 33 which is correlated with the dominance of consonants in the speech signal 18 .
- the parameter 33 thus enables a statement to be made about an articulatory property of the speech signal components 26 in the input audio signal 8. For example, for a value of the quotient QE >> 1 (i.e. QE > Th QE with a predetermined limit value Th QE >> 1, not detailed here), a high dominance of consonants can be concluded, while for a value QE ≤ 1 a low dominance can be concluded.
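The first signal path 32 — energies E1 and E2 in the ranges NF and HF and their quotient QE as parameter 33 — reduces to a few lines. The (frequency, energy) filter-bank representation and the guard against division by zero below are illustrative assumptions, not details from the patent:

```python
def consonant_dominance(band_energies, split_hz=2500.0):
    """Quotient QE = E2/E1 of spectral energy above vs. below split_hz.

    band_energies: list of (center_frequency_hz, energy) pairs, e.g. taken
    from a filter bank. A large QE suggests a high dominance of consonants,
    which concentrate energy in the higher range HF. The 2500 Hz split
    follows the NF/HF ranges given in the text.
    """
    e1 = sum(e for f, e in band_energies if f < split_hz)   # low range NF
    e2 = sum(e for f, e in band_energies if f >= split_hz)  # higher range HF
    return e2 / max(e1, 1e-12)
```

A vowel-dominated spectrum (energy mostly below 2500 Hz) yields QE well below 1; a fricative-rich spectrum yields QE above 1.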
- a differentiation 36 into voiced time sequences V and unvoiced time sequences UV is carried out in the input audio signal 8 based on correlation measurements and/or based on a zero crossing rate of the input audio signal 8. Based on the voiced and unvoiced time sequences V and UV, a transition TS from a voiced time sequence V to an unvoiced time sequence UV is determined.
- the length of a voiced or unvoiced time sequence can be, for example, between 10 and 80 ms, in particular between 20 and 50 ms.
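The differentiation 36 into voiced and unvoiced time sequences via the zero-crossing rate, and the detection of a transition TS, could look roughly like the sketch below. The sequence length, sampling rate, and ZCR threshold are placeholder assumptions; the patent leaves them open:

```python
def classify_sequences(samples, rate_hz=16000, seq_ms=30, zcr_threshold=0.25):
    """Split a signal into fixed-length time sequences and label each
    'V' (voiced) or 'UV' (unvoiced) by its zero-crossing rate: voiced
    sounds (vocal-cord oscillation) cross zero rarely, unvoiced
    fricatives cross it often."""
    n = max(2, int(rate_hz * seq_ms / 1000))
    labels = []
    for start in range(0, len(samples) - n + 1, n):
        seq = samples[start:start + n]
        crossings = sum(1 for a, b in zip(seq, seq[1:]) if (a >= 0) != (b >= 0))
        zcr = crossings / (n - 1)
        labels.append('UV' if zcr > zcr_threshold else 'V')
    return labels

def find_transitions(labels):
    """Indices i of a transition TS: labels[i] == 'V', labels[i+1] == 'UV'."""
    return [i for i in range(len(labels) - 1)
            if labels[i] == 'V' and labels[i + 1] == 'UV']
```

A slowly varying segment followed by a rapidly alternating one is labelled ['V', 'UV'], and the V→UV boundary is reported as a transition.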
- An energy Ev for the voiced time sequence V before the transition TS and an energy En for the unvoiced time sequence UV after the transition TS can also be determined separately for more than one frequency range. It is now determined how the energy changes at the transition TS, for example by a relative change ⁇ E TS or by a quotient (not shown) of the energies Ev, En before and after the transition TS.
- the measure of the change in energy is now compared with a limit value Th E for energy distribution at transitions determined in advance for good articulation.
- a parameter 35 can be formed based on a ratio of the relative change ⁇ E TS and said limit value Th E or based on a relative deviation of the relative change ⁇ E TS from this limit value Th E .
- Said parameter 35 is correlated with the articulation of the transitions from voiced and unvoiced sounds in the speech signal 18, and thus provides information about a further articulatory property of the speech signal components 26 in the input audio signal 8.
- the statement applies here that the faster, i.e. the more temporally delimited, the change in the energy distribution over the frequency ranges relevant for voiced and unvoiced sounds occurs, the more precisely a transition between voiced and unvoiced time sequences is articulated.
- an energy distribution in two frequency ranges (e.g. the above-mentioned frequency ranges according to the Bark scale, or also the lower and higher frequency ranges NF, HF) can be considered, e.g. via a quotient of the respective energies or a comparable characteristic value, and a change in the quotient or the characteristic value over the transition can be used for the parameter.
- a rate of change of the quotient or of the parameter can be determined and compared with a reference value for the rate of change that has previously been determined to be suitable.
- transitions from unvoiced to voiced time sequences can also be considered in an analogous manner.
- the concrete configuration, in particular with regard to the frequency ranges and limit or reference values to be used, can generally be based on empirical results about a corresponding significance of the respective frequency bands or groups of frequency bands.
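A minimal sketch of the energy-change parameter 35 at a transition TS, under the assumption that the relative change ΔE TS is taken as |En − Ev| / Ev and rated against the limit value Th E (the default for Th E below is invented, standing in for the empirically determined reference):

```python
def transition_articulation(e_voiced, e_unvoiced, th_e=0.5):
    """Relative energy change at a V->UV transition and a parameter rating
    it against a reference Th_E assumed to represent good articulation.

    Returns (delta_e, parameter):
      delta_e   -- relative change of energy across the transition TS
      parameter -- its ratio to Th_E; values >= 1 indicate a change at
                   least as pronounced (i.e. as precisely articulated)
                   as the reference.
    """
    delta_e = abs(e_unvoiced - e_voiced) / max(e_voiced, 1e-12)
    return delta_e, delta_e / th_e
```

A halving of the energy across the transition reaches the (assumed) reference exactly; a near-constant energy yields a parameter well below 1, indicating a weakly articulated transition.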
- a fundamental frequency f G of the speech signal component 26 is detected in the input audio signal 8 in a time-resolved manner, and a time stability 40 is determined for said fundamental frequency f G using a variance of the fundamental frequency f G .
- the time stability 40 can be used as a parameter 41 which enables a statement to be made about a prosodic property of the speech signal components 26 in the input audio signal 8 .
- a greater variance in the fundamental frequency f G can be used as an indicator for better speech intelligibility, while a monotonic fundamental frequency f G indicates lower speech intelligibility.
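The time stability 40 of the fundamental frequency f G in the third signal path 38 could be computed, for example, as the variance of the time-resolved f0 track. The normalisation by the squared mean below is an added assumption (not from the patent) to make the parameter independent of the speaker's register; `None` marking unvoiced frames is likewise a sketch convention:

```python
def f0_stability_parameter(f0_track):
    """Prosody parameter 41 from the time-resolved fundamental frequency:
    the variance of the f0 track over its voiced frames (None = unvoiced),
    normalised by the squared mean. A lively, varied f0 yields a larger
    value; a monotonic f0 yields a value near zero."""
    voiced = [f for f in f0_track if f is not None]
    if len(voiced) < 2:
        return 0.0
    mean = sum(voiced) / len(voiced)
    var = sum((f - mean) ** 2 for f in voiced) / len(voiced)
    return var / (mean ** 2)  # squared coefficient of variation
```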
- a level LVL is detected in a time-resolved manner for the input audio signal 8 and/or for the voice signal component 26 contained therein, and a time average MN LVL is formed over a time period 44 specified in particular on the basis of corresponding empirical findings. Furthermore, the maximum MX LVL of the level LVL is determined over the period 44. The maximum MX LVL of the level LVL is now divided by the time average MN LVL of the level LVL, and a parameter 45 correlated with a volume of the speech signal 18 is thus determined, which enables a further statement to be made about a prosodic property of the speech signal components 26 in the input audio signal 8.
- instead of the level LVL, another variable correlated with the volume and/or with the energy content of the speech signal component 26 can also be used.
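The fourth signal path 42 — the quotient of the maximum MX LVL and the time average MN LVL of the level over the period 44 — reduces to a few lines. Linear level values and the zero-guard are assumptions of this sketch; the period length is assumed to be given by the input sequence:

```python
def volume_dynamics(level_track):
    """Parameter 45: quotient MX_LVL / MN_LVL of the maximum level over a
    period and its time average, correlated with the volume dynamics of
    the speech signal. level_track is the time-resolved level over the
    predetermined period 44 (linear values assumed)."""
    mean = sum(level_track) / len(level_track)
    return max(level_track) / max(mean, 1e-12)
```

A flat level track yields a quotient of 1 (no accentuation); pronounced peaks relative to the average drive the quotient up.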
- the parameters 33, 35, 41 and 45 determined as described in the first to fourth signal paths 32, 34, 38, 42 can now be used individually as the quantitative measure 30 for the quality of the speech component 18 contained in the input audio signal 8.
- a second algorithm 46 is now applied to the input audio signal 8 for signal processing.
- the second algorithm 46 can result from the first algorithm 25 through a corresponding change in one or more parameters of the signal processing, depending on the relevant quantitative measure 30, or can provide a completely independent hearing program.
- a single value can also be determined as a quantitative measure 30 for the voice quality using the parameters 33, 35, 41 or 45 determined as described, e.g. by a weighted mean value or a product of the parameters 33, 35, 41, 45 (shown schematically in figure 2 by combining the parameters 33, 35, 41, 45).
- the weighting of the individual parameters can take place in particular using previously empirically determined weighting factors, which can be determined using the significance of the articulatory or prosodic property for the speech quality recorded by the respective parameter.
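The combination of the parameters 33, 35, 41, 45 into a single quantitative measure 30 by a weighted mean or a product could be sketched as follows; the uniform default weights are placeholders for the empirically determined weighting factors mentioned above:

```python
def combined_quality_measure(params, weights=None, mode="weighted_mean"):
    """Collapse the individual parameters (e.g. 33, 35, 41, 45) into one
    quantitative measure 30 for the speech quality. The weights would come
    from empirical findings about each articulatory or prosodic property's
    significance; the uniform default is a placeholder."""
    if weights is None:
        weights = [1.0] * len(params)
    if mode == "weighted_mean":
        return sum(w * p for w, p in zip(weights, params)) / sum(weights)
    if mode == "product":
        result = 1.0
        for p in params:
            result *= p
        return result
    raise ValueError("mode must be 'weighted_mean' or 'product'")
```

A maximum or minimum over the parameters, as also mentioned in the text, would simply be `max(params)` or `min(params)`.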
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Otolaryngology (AREA)
- Neurosurgery (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
- Circuit For Audible Band Transducer (AREA)
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE102020210919.2A DE102020210919A1 (de) | 2020-08-28 | 2020-08-28 | Verfahren zur Bewertung der Sprachqualität eines Sprachsignals mittels einer Hörvorrichtung |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3962115A1 true EP3962115A1 (de) | 2022-03-02 |
Family
ID=77316824
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP21190918.9A Pending EP3962115A1 (de) | 2020-08-28 | 2021-08-12 | Verfahren zur bewertung der sprachqualität eines sprachsignals mittels einer hörvorrichtung |
Country Status (4)
Country | Link |
---|---|
US (1) | US12009005B2 (zh) |
EP (1) | EP3962115A1 (zh) |
CN (1) | CN114121040A (zh) |
DE (1) | DE102020210919A1 (zh) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040167774A1 (en) * | 2002-11-27 | 2004-08-26 | University Of Florida | Audio-based method, system, and apparatus for measurement of voice quality |
US7165025B2 (en) * | 2002-07-01 | 2007-01-16 | Lucent Technologies Inc. | Auditory-articulatory analysis for speech quality assessment |
US20180255406A1 (en) * | 2017-03-02 | 2018-09-06 | Gn Hearing A/S | Hearing device, method and hearing system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2948214A1 (en) * | 2013-01-24 | 2015-12-02 | Advanced Bionics AG | Hearing system comprising an auditory prosthesis device and a hearing aid |
US9814879B2 (en) * | 2013-05-13 | 2017-11-14 | Cochlear Limited | Method and system for use of hearing prosthesis for linguistic evaluation |
DE102013224417B3 (de) * | 2013-11-28 | 2015-05-07 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Hörhilfevorrichtung mit Grundfrequenzmodifizierung, Verfahren zur Verarbeitung eines Sprachsignals und Computerprogramm mit einem Programmcode zur Durchführung des Verfahrens |
US11253193B2 (en) * | 2016-11-08 | 2022-02-22 | Cochlear Limited | Utilization of vocal acoustic biomarkers for assistive listening device utilization |
- 2020
  - 2020-08-28 DE DE102020210919.2A patent/DE102020210919A1/de active Pending
- 2021
  - 2021-08-12 EP EP21190918.9A patent/EP3962115A1/de active Pending
  - 2021-08-27 CN CN202110993782.3A patent/CN114121040A/zh active Pending
  - 2021-08-30 US US17/460,555 patent/US12009005B2/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7165025B2 (en) * | 2002-07-01 | 2007-01-16 | Lucent Technologies Inc. | Auditory-articulatory analysis for speech quality assessment |
US20040167774A1 (en) * | 2002-11-27 | 2004-08-26 | University Of Florida | Audio-based method, system, and apparatus for measurement of voice quality |
US20180255406A1 (en) * | 2017-03-02 | 2018-09-06 | Gn Hearing A/S | Hearing device, method and hearing system |
Non-Patent Citations (1)
Title |
---|
ASGER HEIDEMANN ANDERSEN ET AL: "Nonintrusive Speech Intelligibility Prediction Using Convolutional Neural Networks", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE, USA, vol. 26, no. 10, 1 October 2018 (2018-10-01), pages 1925 - 1939, XP058416624, ISSN: 2329-9290, DOI: 10.1109/TASLP.2018.2847459 * |
Also Published As
Publication number | Publication date |
---|---|
US20220068294A1 (en) | 2022-03-03 |
DE102020210919A1 (de) | 2022-03-03 |
CN114121040A (zh) | 2022-03-01 |
US12009005B2 (en) | 2024-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
DE602004004242T2 (de) | System und Verfahren zur Verbesserung eines Audiosignals | |
Tchorz et al. | SNR estimation based on amplitude modulation analysis with applications to noise suppression | |
EP2364646B1 (de) | Hörtestverfahren | |
DE102008031150B3 (de) | Verfahren zur Störgeräuschunterdrückung und zugehöriges Hörgerät | |
US20110178799A1 (en) | Methods and systems for identifying speech sounds using multi-dimensional analysis | |
Alku et al. | Measuring the effect of fundamental frequency raising as a strategy for increasing vocal intensity in soft, normal and loud phonation | |
EP1563487B1 (de) | Verfahren zur ermittlung akustischer merkmale von schallsignalen fuer die analyse unbekannter schallsignale und modifikation einer schallerzeugung | |
DE602004007953T2 (de) | System und verfahren zur audiosignalverarbeitung | |
Hansen et al. | A speech perturbation strategy based on “Lombard effect” for enhanced intelligibility for cochlear implant listeners | |
DE102014207437B4 (de) | Spracherkennung mit einer Mehrzahl an Mikrofonen | |
Messing et al. | A non-linear efferent-inspired model of the auditory system; matching human confusions in stationary noise | |
Henrich et al. | Just noticeable differences of open quotient and asymmetry coefficient in singing voice | |
Chennupati et al. | Spectral and temporal manipulations of SFF envelopes for enhancement of speech intelligibility in noise | |
Parida et al. | Underlying neural mechanisms of degraded speech intelligibility following noise-induced hearing loss: The importance of distorted tonotopy | |
DE60110541T2 (de) | Verfahren zur Spracherkennung mit geräuschabhängiger Normalisierung der Varianz | |
WO2010078938A2 (de) | Verfahren und vorrichtung zum verarbeiten von akustischen sprachsignalen | |
EP3962115A1 (de) | Verfahren zur bewertung der sprachqualität eines sprachsignals mittels einer hörvorrichtung | |
EP2548382B1 (de) | Verfahren zum test des sprachverstehens einer mit einem hörhilfegerät versorgten person | |
Rao et al. | Speech enhancement for listeners with hearing loss based on a model for vowel coding in the auditory midbrain | |
Bapineedu et al. | Analysis of Lombard speech using excitation source information. | |
DE102009032238A1 (de) | Verfahren zur Kontrolle der Anpassung eines Hörgerätes | |
EP3961624A1 (de) | Verfahren zum betrieb einer hörvorrichtung in abhängigkeit eines sprachsignals | |
DE102020210918A1 (de) | Verfahren zum Betrieb einer Hörvorrichtung in Abhängigkeit eines Sprachsignals | |
Alku et al. | On the linearity of the relationship between the sound pressure level and the negative peak amplitude of the differentiated glottal flow in vowel production | |
EP2394271B1 (de) | Methode zur trennung von signalpfaden und anwendung auf die verbesserung von sprache mit elektro-larynx |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
17P | Request for examination filed |
Effective date: 20220901 |
|
RBV | Designated contracting states (corrected) |
Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: GRANT OF PATENT IS INTENDED |
|
RIC1 | Information provided on ipc code assigned before grant |
Ipc: G10L 25/15 20130101ALI20240322BHEP Ipc: G10L 25/60 20130101ALI20240322BHEP Ipc: H04R 25/00 20060101AFI20240322BHEP |
|
INTG | Intention to grant announced |
Effective date: 20240419 |
|
GRAJ | Information related to disapproval of communication of intention to grant by the applicant or resumption of examination proceedings by the epo deleted |
Free format text: ORIGINAL CODE: EPIDOSDIGR1 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |