EP4303874A1 - Providing a measure of intelligibility of an audio signal - Google Patents

Providing a measure of intelligibility of an audio signal

Info

Publication number
EP4303874A1
Authority
EP
European Patent Office
Prior art keywords
audio signal
determining
intelligibility
measure
time period
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP22183533.3A
Other languages
English (en)
French (fr)
Inventor
Richard Friedrich Schiller
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Interprefy AG
Original Assignee
Interprefy AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Interprefy AG
Priority to EP22183533.3A
Publication of EP4303874A1
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information

Definitions

  • the present invention relates to processing an audio signal to determine a measure of intelligibility of the audio signal, in particular an audio signal comprising speech.
  • Such audio signals arise, for example, in voice over internet protocol (VoIP) calls and in videoconferencing applications such as Skype, MS Teams and Zoom.
  • people taking part in these events are usually unaware how audible or inaudible, or how intelligible, their own speech is to their correspondents.
  • the muffled interlocutor believes that the whole event was clear and so will not make any effort to improve their intelligibility.
  • the nature of the issue means that those who most need to know about shortcomings usually know least.
  • In terms of live speech communication software, the focus has been on maintaining the volume or loudness of an audio signal, on the assumption that if the speech is loud enough then it will be to some extent intelligible.
  • Typically, the volume is indicated in a volume meter of some kind, such as a VU (Volume Unit) meter or a peak meter, sometimes known as a Peak Programme Meter.
  • the present invention provides a measure of intelligibility (i.e. a quantitative parameter) which can be notified to the user in real time via e.g. a visual and/or audible indicator.
  • the user's intelligibility could be displayed on a screen as a number and/or as a position on a scale.
  • Speakers would then be aware when they are not coming across clearly, while listeners would be able to quantify their reporting and discussion of issues.
  • the speaker can then adjust their setup and/or speech right away and make improvements. They can, for example, speak louder, get closer to the microphone or close windows to reduce background noise. In the slightly longer term, they can try another microphone or move to another room and so on.
  • a computer implemented method for providing a measure of intelligibility of an audio signal comprising: receiving an audio signal; determining a first energy level of the audio signal at frequencies above a first threshold frequency during an energy time period; determining a second energy level of the audio signal during the energy time period; comparing the first energy level and the second energy level; determining the measure of intelligibility of the audio signal based on at least the comparison between the first energy level and the second energy level; and outputting the measure of intelligibility.
  • the first energy level may correspond to an energy of high frequency components of the audio signal
  • the second energy level may correspond to an energy of low and/or medium frequency components of the audio signal.
  • low frequencies may comprise frequencies below around 1 kHz
  • high frequencies may comprise frequencies above around 3 kHz
  • medium frequencies may comprise frequencies between the low and high frequency ranges.
  • the present invention is not limited to these frequency ranges and that the computer implemented method could equally be applied with the first and second energy levels corresponding to energies of signal components within other frequency ranges.
  • high frequency components refers to frequency components of an audio signal that have higher frequencies than “low frequency components”
  • medium frequency components of the audio signal have frequencies that are higher than the low frequency components but lower than the high frequency components
  • the “energy time period” refers to a period of time over which the first and second energy levels are determined.
  • the measure of intelligibility could comprise a number.
  • determining the measure of intelligibility based on the comparison between the first energy level and the second energy level enables the measure of intelligibility to quantify a balance between the first and second energy levels (e.g. corresponding to high and low frequency components of the audio signal).
  • the measure of intelligibility could comprise, or be determined from, a ratio of the first (high frequency) energy level to the second (low/medium frequency) energy level, and/or a difference between the first energy level and the second energy level.
  • the second energy level is an average energy of the whole audio signal (i.e. over all frequencies), and the measure of intelligibility is determined based on a comparison between the first (high frequency) energy level and the average energy of the audio signal.
  • a reduction or loss of signal components above the first threshold frequency indicates poor intelligibility of speech.
  • frequencies above the first threshold frequency may correspond to signal components generated during sibilants or other consonant sounds, e.g. in English a hard letter "t” or a letter "s" pronounced such that there is a hiss sound.
  • the first energy level is too great in comparison to the second energy level (e.g. too much high frequency content of the audio signal), which might indicate e.g. over-filtering of the audio signal, which could also indicate poor intelligibility.
  • outputting a measure of intelligibility that is determined based on a comparison between the first energy level and the second energy level provides a useful quantification of a listener's expected ability to discern what is being said by a speaker, in other words the general quality of an audio signal.
  • the outputted measure of intelligibility may then, for example, be used to provide visual and/or audible feedback to the speaker and/or the listener such that appropriate action can be taken to improve the audio signal.
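  • By way of illustration only, the energy comparison above can be sketched in a few lines of code. The following Python sketch follows the variant in which the second energy level is the energy of the whole audio signal over all frequencies; the function name, the FFT-based band split and the 3 kHz default threshold are illustrative assumptions, not taken from the claims.

```python
import numpy as np

def intelligibility_ratio(signal, sample_rate, threshold_hz=3000.0):
    """Compare the energy above a first threshold frequency with the
    energy of the whole signal during the energy time period.

    Returns a ratio in [0, 1]; a very small value suggests missing
    high-frequency (sibilant/consonant) energy and hence poor
    intelligibility. All defaults are illustrative assumptions.
    """
    spectrum = np.fft.rfft(np.asarray(signal, dtype=float))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    power = np.abs(spectrum) ** 2

    first_energy = power[freqs > threshold_hz].sum()  # high-frequency components
    second_energy = power.sum()                       # all frequencies
    return first_energy / second_energy if second_energy > 0 else 0.0
```

For example, a 440 Hz tone yields a ratio near zero (no energy above 3 kHz), whereas a 5 kHz tone yields a ratio near one.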
  • determining the second energy level may comprise determining an energy level of the audio signal at frequencies below a second threshold frequency.
  • the second energy level may comprise a peak energy level.
  • determining the second energy level may comprise determining a peak energy level.
  • the first energy level comprises a peak energy level
  • determining the first energy level preferably comprises determining a peak energy level of the audio signal at frequencies above the first threshold frequency during the energy time period.
  • In some situations, comparing the average energy of the audio signal at frequencies above the first threshold frequency (e.g. high frequency components) with the average energy at frequencies below the second threshold frequency (e.g. low and/or medium frequency components) would not be effective.
  • the peak energy value of e.g. frequencies below the second threshold frequency may be a reasonable approximation of the average (and/or RMS) energy of the low and/or medium frequency components of the audio signal (frequencies below the second threshold frequency), and vice versa.
  • the first threshold frequency and the second threshold frequency may be the same. It will be further understood that, in general, the second threshold frequency is not greater than the first threshold frequency.
  • the first and/or second energy level(s) may comprise a quasi-peak energy level.
  • the computer implemented method may further comprise determining a depth of modulation of the audio signal, wherein determining the depth of modulation comprises: determining a first volume level of the audio signal within a first modulation time window during a modulation time period; determining whether a second volume level of the audio signal is present within a second modulation time window during the modulation time period, wherein the second volume level is lower than the first volume level; determining, when a second volume level of the audio signal is present, a difference between the first volume level and the second volume level; and determining the measure of intelligibility of the audio signal based additionally on the difference between the first and second volume levels.
  • modulation time period refers to a period of time over which the first and second volume levels are determined.
  • a comparison between the first and second energy levels could provide a measure of intelligibility that appears to indicate a good quality signal (e.g. a good balance between high and low/medium frequency components), but that may in fact correspond to a signal that is dominated by noise (e.g. background noise), and that is therefore unintelligible.
  • determining a depth of modulation may advantageously enable a determination of whether or not a user is actually speaking, since modulation occurs naturally in speech (e.g. pausing for breath), thus further distinguishing an intelligible signal from one that is dominated by noise.
  • first and second volume levels may be referred to as high and low volume levels, respectively.
  • the first volume level may be an average volume level (but the first volume level could also, or instead, comprise a peak and/or RMS volume level, for example).
  • the second volume level may then be determined relative to the first volume. In some examples, if the difference between the first and second volume levels is greater than or equal to a minimum volume difference, the depth of modulation can be considered to be indicative of real speech and the measure of intelligibility may be adjusted to reflect an increased intelligibility.
  • the computer implemented method may further comprise comparing, when the second volume level of the audio signal is present within the second modulation time window, the difference between the first volume level and the second volume level with a minimum volume difference; and determining the measure of intelligibility of the audio signal based additionally on the comparison between the difference between the first and second volume levels, and the minimum volume difference.
  • the measure of intelligibility may be varied continuously depending on the difference between the first and second volume levels.
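  • As a minimal sketch of the depth-of-modulation determination (assuming a short-time RMS formulation, a 200 ms modulation time window and a dB scale, none of which are mandated by the disclosure):

```python
import numpy as np

def modulation_depth_db(signal, sample_rate, window_s=0.2):
    """Depth of modulation: difference (in dB) between the loudest and
    quietest short-time RMS levels within the modulation time period.
    Window length and the dB formulation are illustrative assumptions."""
    win = int(window_s * sample_rate)
    n = len(signal) // win
    frames = np.asarray(signal, dtype=float)[: n * win].reshape(n, win)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    rms = np.maximum(rms, 1e-12)  # guard against log(0) in silent frames
    return 20.0 * np.log10(rms.max() / rms.min())
```

A steady tone (or steady noise) gives a depth near 0 dB, while real speech, with its natural pauses, gives a depth at or above some minimum volume difference, at which point the measure of intelligibility could be adjusted upwards.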
  • the computer implemented method may further comprise: obtaining a first plurality of samples of the audio signal during a first amplitude time period, each sample having an associated amplitude; determining a distribution of the associated amplitudes; detecting any peaks in the distribution; determining whether any of the detected peaks correspond to a non-zero amplitude; and determining the measure of intelligibility of the audio signal based additionally on a number of peaks in the distribution corresponding to a non-zero amplitude.
  • first amplitude time period refers to a period of time over which the first plurality of samples of the audio signal is determined.
  • a curve (i.e. distribution) of amplitudes should fall monotonically, or substantially monotonically, between the peak at zero amplitude and the maximum amplitude. If this is not the case, it may be that the audio signal is subject to a non-linear distortion (e.g. clipping). Detecting non-zero peaks in an amplitude distribution of an audio signal may therefore advantageously enable the detection of non-linear distortion of the audio signal. For example, the measure of intelligibility could be adjusted to reflect a reduced intelligibility if non-linear distortion is present.
  • the amplitude may be sampled as a signed value (i.e. there may be positive and negative amplitude values, or in some cases only positive or only negative amplitude values may be sampled), and/or an absolute value (a magnitude) of the amplitude may be sampled.
  • determining a distribution of the associated amplitudes may comprise sorting the amplitudes into bins, each bin corresponding to a range of amplitudes, and determining a distribution may comprise determining a distribution of the bins.
  • Sorting the amplitudes into bins may advantageously reduce the time and/or processing power required to process the amplitude values to determine the distribution.
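  • The binning and peak detection described above could be sketched as follows (Python; the 64-bin default and the 5% significance floor used to ignore statistically insignificant bins are illustrative assumptions):

```python
import numpy as np

def nonzero_amplitude_peaks(samples, n_bins=64):
    """Sort absolute amplitudes into bins and count peaks of the
    histogram away from the zero-amplitude bin. A healthy speech signal
    falls roughly monotonically from zero amplitude, so any non-zero
    peak hints at non-linear distortion such as clipping.
    Bin count and significance floor are illustrative assumptions."""
    hist, _ = np.histogram(np.abs(samples), bins=n_bins)
    floor = 0.05 * hist.max()  # ignore statistically insignificant bins
    peaks = 0
    for i in range(1, n_bins - 1):
        if hist[i] > floor and hist[i] > hist[i - 1] and hist[i] >= hist[i + 1]:
            peaks += 1
    # A pile-up in the top (full-scale) bin also counts as a non-zero peak.
    if hist[-1] > floor and hist[-1] > hist[-2]:
        peaks += 1
    return peaks
```

For a speech-like (monotonically falling) amplitude distribution this returns zero, while hard clipping produces a pile-up at the clipping level and hence at least one non-zero peak.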
  • the computer implemented method may further comprise: obtaining a second plurality of samples of the audio signal during a second amplitude time period, each sample having an associated amplitude; determining a difference between each of the associated amplitudes and a maximum amplitude; determining whether the difference between any of the associated amplitudes and the maximum amplitude is less than a minimum amplitude difference; and determining the measure of intelligibility of the audio signal based additionally on whether the difference between any of the associated amplitudes and the maximum amplitude is less than the minimum amplitude difference.
  • second amplitude time period refers to a period of time over which the second plurality of samples of the audio signal is determined.
  • determining whether the difference between any of the associated amplitudes and the maximum amplitude is less than a minimum amplitude difference (where the minimum amplitude difference may be zero or non-zero), enables detection of samples of the audio signal at or around the maximum measurable amplitude, and may therefore further enable the detection of non-linear distortion of the audio signal (in particular clipping).
  • the measure of intelligibility could be adjusted to reflect a reduced intelligibility if non-linear distortion (e.g. clipping of the audio signal) is present.
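  • A sketch of this check, where the full-scale value of 1.0 and the minimum amplitude difference of 0.01 are illustrative assumptions:

```python
import numpy as np

def near_full_scale_fraction(samples, max_amplitude=1.0, min_difference=0.01):
    """Fraction of samples whose magnitude lies within min_difference of
    the maximum measurable amplitude; a noticeable fraction indicates
    probable clipping. Defaults are illustrative assumptions."""
    samples = np.asarray(samples, dtype=float)
    return float((np.abs(samples) > max_amplitude - min_difference).mean())
```

The measure of intelligibility could then be reduced whenever this fraction exceeds some small tolerance.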
  • the computer implemented method according to the present disclosure may further comprise: determining an overall volume level of the audio signal during a volume time period; determining whether the overall volume level of the audio signal is between a minimum volume level and a maximum volume level; and determining the measure of intelligibility of the audio signal based additionally on whether the overall volume level is between the minimum volume level and the maximum volume level.
  • the "overall volume level” refers to a volume level of the audio signal that is representative of the volume of the audio signal over the entire volume time period.
  • the overall volume level may be an average volume of the audio signal over the volume time period.
  • the overall volume level could be a peak volume level of the audio signal over the volume time period.
  • volume time period refers to a period of time over which the overall volume level is determined.
  • Determining whether the overall volume level of the audio signal is between a minimum volume level and a maximum volume level advantageously provides determination of a further aspect of intelligibility.
  • the measure of intelligibility may be reduced if a speaker is too quiet to be comprehensible, and/or the measure of intelligibility may be reduced if a speaker is too loud e.g. where being too loud may introduce distortion to the audio signal.
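  • For example, using an RMS volume level expressed in dB relative to full scale (the -40 dB and -3 dB bounds are illustrative assumptions):

```python
import numpy as np

def volume_in_range(signal, min_rms_db=-40.0, max_rms_db=-3.0):
    """Check whether the overall (RMS) volume during the volume time
    period lies between a minimum and a maximum level, here in dB
    relative to full scale. The thresholds are illustrative assumptions."""
    rms = np.sqrt(np.mean(np.asarray(signal, dtype=float) ** 2))
    level_db = 20.0 * np.log10(max(rms, 1e-12))  # guard against silence
    return min_rms_db <= level_db <= max_rms_db
```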
  • the computer implemented method according to the present disclosure may further comprise: obtaining a third plurality of samples of the audio signal during a step change time period; detecting a step change between temporally adjacent samples; and determining the measure of intelligibility of the audio signal based additionally on whether a step change between temporally adjacent samples is detected.
  • step change time period refers to a period of time over which the third plurality of samples is determined.
  • a step change in an audio signal may indicate a signal disruption or other data loss. For example, a valid sample may be followed by a zero (or other substitute value). Such erroneous readings may take the form of clicks or spikes in the audio signal.
  • the computer implemented method may comprise detecting a step change in the volume, amplitude, average frequency, and/or other measurable parameter of the audio signal.
  • the measure of intelligibility could be adjusted to reflect a reduced intelligibility where step changes are detected in the audio signal.
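  • One possible sketch, treating any jump between temporally adjacent samples larger than a threshold as a step change (the 0.5 full-scale threshold is an illustrative assumption; real speech sampled at typical rates changes far more gradually):

```python
import numpy as np

def count_step_changes(samples, max_jump=0.5):
    """Count step changes (clicks, spikes, dropouts) as jumps between
    temporally adjacent samples larger than max_jump. The threshold is
    an illustrative assumption."""
    jumps = np.abs(np.diff(np.asarray(samples, dtype=float)))
    return int((jumps > max_jump).sum())
```

Replacing a valid sample near a waveform peak with zero, as might happen on data loss, produces two such jumps.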
  • the computer implemented method according to the present disclosure may further comprise: determining a volume of an echo of the audio signal; and determining the measure of intelligibility of the audio signal based additionally on the volume of the echo of the audio signal.
  • an echo may be a repeated portion of the audio signal that occurs due to e.g. foldback between a user's speakers or headphones and their microphone.
  • the measure of intelligibility could be adjusted to reflect a reduced intelligibility where one or more echoes is detected and/or based on the volume(s) of the detected echo(es).
  • quiet echoes may not be as detrimental to intelligibility as loud echoes.
  • the computer implemented method according to the present disclosure may further comprise: determining a temporal separation between a portion of the audio signal within an echo time period and an echo of the portion of the audio signal within the echo time period; and determining the measure of intelligibility of the audio signal based additionally on the temporal separation.
  • an echo that is close in time to the original sound may not be as detrimental to intelligibility as an echo that is further temporally separated from the original sound.
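  • Both the relative volume and the temporal separation of an echo could, for example, be estimated from the normalised autocorrelation of the signal within the echo time period. This is only one possible sketch; the delay search range and the normalisation are illustrative assumptions:

```python
import numpy as np

def strongest_echo(signal, sample_rate, min_delay_s=0.05, max_delay_s=0.5):
    """Return (delay in seconds, relative strength) of the strongest
    echo candidate, found as the peak of the normalised autocorrelation
    within the delay search range. Note np.correlate is O(N^2), so this
    sketch suits short echo time periods."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    corr = np.correlate(x, x, mode="full")[len(x) - 1:]
    corr /= corr[0]  # normalise so zero-lag correlation is 1
    lo = int(min_delay_s * sample_rate)
    hi = min(int(max_delay_s * sample_rate), len(corr) - 1)
    lag = lo + int(np.argmax(corr[lo:hi]))
    return lag / sample_rate, corr[lag]
```

The measure of intelligibility could then be reduced more for echoes that are louder and/or further separated in time from the original sound.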
  • the computer implemented method according to the present disclosure may further comprise: transforming the audio signal, during a frequency time period, into a frequency domain; determining whether any frequencies are constantly present and/or constantly absent in the audio signal during the frequency time period; and determining the measure of intelligibility of the audio signal based additionally on whether any frequencies are constantly present and/or constantly absent during the frequency time period.
  • frequency time period refers to a period of time of the audio signal over which the transform of the audio signal is performed.
  • Frequencies may be constantly present during the frequency time period, for example, due to fixed tones that occur in addition to any speech, e.g. in environments where there is a constant or substantially constant background sound such as machine rooms.
  • frequencies may be constantly absent during the frequency time period where, e.g. due to filtering of the audio signal, parts of the audible spectrum have been removed. Both constantly present and constantly absent frequencies may further contribute to a reduced intelligibility, and so the measure of intelligibility could be adjusted to reflect such a reduced intelligibility.
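  • A sketch of this check using a frame-wise Fourier transform (the 1024-sample frame length and the dB thresholds are illustrative assumptions):

```python
import numpy as np

def constant_frequency_bins(signal, sample_rate, frame_len=1024,
                            present_db=-30.0, absent_db=-80.0):
    """Transform frames of the signal to the frequency domain and flag
    bins that stay above (constantly present) or below (constantly
    absent) fixed levels in every frame of the frequency time period.
    Frame length and dB thresholds are illustrative assumptions."""
    x = np.asarray(signal, dtype=float)
    n = len(x) // frame_len
    frames = x[: n * frame_len].reshape(n, frame_len)
    mags = np.abs(np.fft.rfft(frames, axis=1)) / frame_len
    levels_db = 20.0 * np.log10(np.maximum(mags, 1e-12))
    always_present = np.all(levels_db > present_db, axis=0)
    always_absent = np.all(levels_db < absent_db, axis=0)
    return always_present, always_absent
```

A fixed background tone, for example, shows up as a constantly present bin, while a notch left by filtering shows up as constantly absent bins.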
  • any of the time periods described in the present disclosure may be a same time period as any other time period(s). That is, one or more determinations of the measure of intelligibility may be performed with respect to a same portion of the audio signal.
  • a computer implemented method for providing a measure of intelligibility could comprise: obtaining a first plurality of samples of the audio signal during a first amplitude time period, each sample having an associated amplitude; determining a distribution of the associated amplitudes; detecting any peaks in the distribution; determining whether any of the detected peaks correspond to a non-zero amplitude; and determining a measure of intelligibility of the audio signal based on a number of peaks in the distribution corresponding to a non-zero amplitude.
  • a computing device comprising a processor, the processor being configured to carry out any of the computer implemented methods described herein, the computing device further comprising: an input device configured to receive the audio signal; and an output device, wherein outputting the measure of intelligibility comprises outputting the measure of intelligibility via the output device.
  • the computing device could be a personal computer or laptop, or a mobile phone, tablet, or other portable communications device.
  • the input device may be configured to receive a live or recorded audio signal such as a VoIP call.
  • the output device may comprise a display such as a screen or other display or indicator such as a light emitting diode (LED) display or a liquid crystal display (LCD), and/or an audio output device such as one or more speakers and/or headphones.
  • the output device may output the measure of intelligibility, or a parameter derived from the measure of intelligibility, in a way that is easy for a human to interpret, such as a number or a position on a scale. It could be envisaged that an audio and/or visual warning could be provided to a user if e.g. the measure of intelligibility reflects an intelligibility that is below a minimum intelligibility level.
  • the instructions may be provided on one or more carriers, for example a non-transient memory, e.g. an EEPROM (e.g. a flash memory), a disk, CD- or DVD-ROM, programmed memory such as read-only memory (e.g. for firmware), one or more transient memories (e.g. RAM), and/or a data carrier such as an optical or electrical signal carrier.
  • the memory/memories may be integrated into a corresponding processing chip and/or separate to the chip.
  • Code (and/or data) to implement embodiments of the present disclosure may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language.
  • the audio signal could equally be a sound recording, for example recorded speech or even music.
  • the present invention may have applications in live interpretation (e.g. where an interpreter is listening to speech in one language, for example using headphones, and providing real-time interpretation into another language via a microphone).
  • the intelligibility of a person who is speaking may be affected by many different factors, for example the quality and placement of the microphone, the level of background noise, and the acoustics of the room.
  • a user who becomes aware of poor intelligibility may take corrective actions, for example speaking louder, moving closer to the microphone, or reducing background noise.
  • Figure 1A shows a communication system 100a comprising a first user 104 (User A) who is associated with a first user terminal 102 and a second user 110 (User B) who is associated with a second user terminal 108. Whilst only two users have been shown in Figure 1A for simplicity, the communication system 100a may comprise any number of users and associated user devices.
  • the user terminals 102 and 108 can communicate over the network 106 in the communication system 100, thereby allowing the users 104 and 110 to communicate with each other over the network 106.
  • the network 106 may be any suitable network which has the ability to provide a communication channel between the first user terminal 102 and the second user terminal 108.
  • the network 106 may be the Internet or another type of network such as a high data rate mobile network, for example a 3rd generation (“3G”), 4th generation (“4G”), or 5th generation (“5G”) mobile network.
  • the user terminal 102 may be, for example, a mobile phone, a personal digital assistant (“PDA”), a personal computer (“PC”) (including, for example, Windows™, Mac OS™ and Linux™ PCs), a gaming device or other embedded device able to connect to the network 106.
  • the user terminal 102 is arranged to receive information from and output information to the user 104 of the user terminal 102.
  • the user terminal 102 comprises a display such as a screen and an input device suitable for receiving an audio signal such as a microphone.
  • the user terminal 102 further comprises one or more non-audio input devices such as a keypad and/or a touch-screen.
  • Figure 1B shows a computer system 100b in which one or more inputs may be provided to a computer device 118.
  • the computer device 118 may correspond to one of the user terminals 102, 108 of Figure 1A .
  • the input(s) to the computer device 118 may comprise, for example, one or more voice recordings 112, one or more music recordings 114, and/or one or more sounds produced by a user 116 (e.g. speech) e.g. input to the computer device 118 via an input device (e.g. a microphone).
  • FIG. 2 illustrates a detailed view of an example of a user terminal 102 on which is executed a communication client for communicating over the communication system 100a.
  • the user terminal 102 comprises a central processing unit ("CPU") 202, to which are connected a display 204 such as a screen or touch screen, and non-audio input devices such as a keypad 206 and a camera 208.
  • An audio output device 210 (e.g. a speaker) and an audio input device 212 (e.g. a microphone) are also connected to the CPU 202.
  • the display 204, keypad 206, camera 208, audio output device 210 and input device 212 may be integrated into the user terminal 102 as shown in Figure 2 .
  • one or more of the display 204, the keypad 206, the camera 208, the audio output device 210 and the input device 212 may not be integrated into the user terminal 102 and may be connected to the CPU 202 via respective interfaces.
  • One example of such an interface is a USB interface.
  • the CPU 202 is connected to a network interface 224 such as a modem for communication with the network 106.
  • the network interface 224 may be integrated into the user terminal 102 as shown in Figure 2 .
  • the network interface 224 is not integrated into the user terminal 102.
  • the user terminal 102 also comprises a memory 226 for storing data as is known in the art.
  • the memory 226 may be a permanent memory, such as ROM.
  • the memory 226 may alternatively be a temporary memory, such as RAM.
  • Methods according to the present disclosure may be carried out, for example, by the user terminal(s) 102, 108, and/or by a network entity on the network 106 (e.g. a server), for example during a video or audio call or conference call.
  • methods according to the present disclosure may be carried out, for example, by software stored on, or accessible to (e.g. cloud-based software) the computer device 118.
  • methods according to the present disclosure may be applied to audio files such as speech or music, and/or as a local microphone test.
  • FIG 3 illustrates an example of an implementation of the invention according to the present disclosure.
  • In particular, there is shown a display 300 of a computing device (e.g. of a user terminal 102, 108 illustrated in Figure 1A).
  • the display 300 may additionally show a visual indicator 302 configured to indicate an intelligibility level of the speaker and/or the other users on the call.
  • a separate visual indicator could be displayed in the visual feed 306 for each user on the call to indicate an intelligibility level for each user.
  • the visual indicator 302 may display the actual measure of intelligibility, as measured, and/or the visual indicator 302 may display an indicator that is analogous to, or reflective of, the measure of intelligibility.
  • the visual indicator 302 may display a number, and/or may provide a bar and/or dial with or without a marked scale.
  • the display 300 illustrated in Figure 3 may correspond to the display 204 illustrated in Figure 2 .
  • the computing device may emit an audible indicator, for example a warning sound, when a user's intelligibility falls below a certain level.
  • a user receiving feedback in the form of an intelligibility level (equal or analogous to the measure of intelligibility), would then be able to take appropriate corrective action to improve their intelligibility.
  • a user who might be, for example, struggling to make out what one or more of the other users is saying can determine whether the problem lies with the user who is speaking, or with the listener.
  • Figure 4 illustrates a part of an audio signal 400.
  • the audio signal 400 comprises a background 404 and several examples of words or phrases 406 that may be spoken by a user. It will be understood that the audio signal 400 of Figure 4 is illustrated in terms of amplitude (shown on the vertical axis) varying with time (shown along the horizontal axis). The part of the audio signal 400 of Figure 4 may correspond to a part of the audio signal during an energy time period 420.
  • Figure 4 also shows the time-varying energies of the high frequency 410 and low and/or medium frequency 408 components of the audio signal 400, where it will be understood that energy is shown on the vertical axis in this case. It will be further understood that "high frequency" components are frequency components of the audio signal 400 that are above a first threshold frequency.
  • a measure of intelligibility can be determined by measuring a balance between the high frequency and low and/or medium frequency components of the audio signal 400.
  • a loss of high frequency components means poor intelligibility.
  • the average energy at high frequencies is generally low, so comparing average energies (high frequencies and low/medium frequencies) is not effective. For this reason, preferably the peak high frequency energy is measured.
  • an average energy of the audio signal could be used, and/or an average energy of frequencies below a second threshold frequency, and/or a peak energy of the signal at frequencies below the second threshold frequency. That is, at low/medium frequencies, the peak energy value is a reasonable approximation of the average value and vice versa.
  • high frequencies would comprise frequencies above around 3 kHz and low frequencies would comprise frequencies below around 1 kHz (e.g. the first threshold frequency could be around 3 kHz and the second threshold frequency could be around 1 kHz), although it would be understood that these are merely provided as examples and that the invention is not limited to these threshold frequencies.
  • Any method(s) for determining a peak (i.e. a signal peak) in the time-varying energy of frequency components of the audio signal known to persons skilled in the art may be applied to detect a peak energy.
  • the energy time period 420 could be 1.5 seconds.
  • a peak energy level of the high frequency component during the energy time period would be determined (e.g. an energy level of the audio signal 400 at frequencies above a first threshold frequency) and compared with either a peak or average (or RMS) energy level of the low/medium frequency components (e.g. an energy level of the audio signal 400 at frequencies below a second threshold frequency) during the same energy time period 420.
  • the measure of intelligibility may be adjusted to reflect a reduced intelligibility.
  • the measure of intelligibility may comprise, or be determined from, a ratio of the first energy level to the second energy level, and/or a difference (e.g. an absolute difference) between the first energy level and the second energy level.
  • the measure of intelligibility could vary according to a difference between the first energy level and the second energy level.
  • the measure of intelligibility may decrease with increasing distance from this optimum. It will be understood that, in some implementations, the optimum may differ depending on the language being spoken.
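The band comparison described above can be sketched as follows. This is an illustrative assumption, not the patent's exact method: the function name, the dB scale, and the linear fall-off from the optimum are all choices made here for clarity, and the per-frame band energies are assumed to have been measured already over the energy time period.

```python
import math

def intelligibility_from_balance(hf_energies, lf_energies,
                                 optimum_db=0.0, scale_db=30.0):
    """Score the high/low frequency balance between 0 and 1.

    hf_energies: per-frame linear energies above the first threshold
    frequency; lf_energies: per-frame linear energies below the second
    threshold frequency, both over the same energy time period. The peak
    is used for the high band (its average energy is generally low) and
    the average for the low/medium band, as described above.
    """
    first = max(hf_energies)                      # peak HF energy
    second = sum(lf_energies) / len(lf_energies)  # average LF/MF energy
    if first <= 0.0 or second <= 0.0:
        return 0.0
    diff_db = 10.0 * math.log10(first / second)   # band balance in dB
    # The score falls linearly with distance from an assumed optimum balance.
    return max(0.0, 1.0 - abs(diff_db - optimum_db) / scale_db)
```

A balanced signal scores 1.0, while a signal with little or no high-frequency energy (poor intelligibility) scores near 0.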
  • Figure 5 therefore illustrates an additional step that may be taken to determine the measure of intelligibility of an audio signal by detecting modulation of the audio signal.
  • Modulation occurs naturally in speech, for example when speakers pause for breath or thought, or to add emphasis. Such modulation is reflected in an audio signal as a modulation of the volume of the audio signal. Therefore, detecting a depth of modulation may enable the measure of intelligibility to reflect whether the audio signal is dominated by noise, or whether real speech is present.
  • Figure 5 illustrates a part of an audio signal 500.
  • the audio signal 500 comprises a background 504 and several examples of words or phrases 506 that may be spoken by a user. It will be understood that the audio signal 500 of Figure 5 is illustrated in terms of amplitude (shown on the vertical axis) varying with time (shown along the horizontal axis).
  • the part of the audio signal 500 illustrated in Figure 5 may be the same as the part of the audio signal 400 illustrated in Figure 4 , and the modulation time period 520 may be a same time period as the energy time period 420 (e.g. t 1 may be equal to t 2 ), or as any other time period described herein.
  • the part of the audio signal 500 illustrated in Figure 5 may be different from the part of the audio signal 400 illustrated in Figure 4, and the modulation time period 520 and the energy time period 420 may be different time periods.
  • the modulation time period 520 may partially overlap any other time period.
  • the measure of intelligibility may further be determined by monitoring the amount of modulation of the audio signal 500.
  • the method according to the present disclosure may comprise determining whether the audio signal 500 spends a certain amount of time (i.e. a first modulation time window 522) at a first (higher) volume level 502, and whether the audio signal 500 spends a certain amount of time (i.e. a second modulation time window 524) at a second (lower) volume level (e.g. around the volume level of the background 504).
  • the measure of intelligibility may be further determined based on the difference between the first and second volume levels. For example, the measure of intelligibility may increase with increasing difference between the first and second volume levels.
  • the first and second volume levels may be a maximum and a minimum volume level, respectively, measured during the modulation time period 520.
  • the second volume level may only be determined if the audio signal 500 spends the second modulation time window 524 at a volume level that is a given minimum volume difference below the first volume level, and the measure of intelligibility may be determined based on the presence, or lack, of the second volume level.
  • the audio signal 500 may be considered "unintelligible" (corresponding to a reduced measure of intelligibility) unless the audio signal 500 spends at least the second modulation time window (e.g. 1 second) at a volume level that is at least a minimum volume difference (e.g. 20 dB) below a volume level at which the audio signal spends at least the first modulation time window (e.g. 1 second).
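The two-level condition above can be sketched as follows. The frame-based volume envelope input and the candidate-level search are illustrative assumptions made here; only the window lengths and the minimum volume difference come from the description above.

```python
import math

def modulation_present(volume_db, frame_s,
                       first_window_s=1.0, second_window_s=1.0,
                       min_diff_db=20.0):
    """Return True if the signal spends at least first_window_s at some
    (higher) volume level and at least second_window_s at a level at
    least min_diff_db below it, per the condition described above.

    volume_db: per-frame volume levels in dB over the modulation time
    period, each frame lasting frame_s seconds.
    """
    need_high = math.ceil(first_window_s / frame_s)
    need_low = math.ceil(second_window_s / frame_s)
    # Try each observed level as the candidate first (higher) volume level.
    for first_level in sorted(set(volume_db), reverse=True):
        frames_high = sum(1 for v in volume_db if v >= first_level)
        frames_low = sum(1 for v in volume_db
                         if v <= first_level - min_diff_db)
        if frames_high >= need_high and frames_low >= need_low:
            return True
    return False
```

A flat envelope (e.g. steady noise) fails the check, whereas speech-like alternation between loud and quiet passages passes it.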
  • noise in an audio signal may be detected using a speech detection algorithm that uses knowledge of what are legitimate and illegitimate components of signals forming speech to remove those parts deemed illegitimate.
  • the removed (illegitimate) parts of the audio signal may be used to provide a measure of the noise in an audio signal, and the measure of intelligibility could therefore be determined based on these removed parts of the audio signal.
  • Determining a measure of intelligibility may therefore further, or alternatively, comprise detecting non-linear distortion of an audio signal.
  • non-linear distortion occurs when an equal rise in the input of a system does not lead to an equal rise in the output. For example, where the input rises from zero to one the output may go from zero to four, but when the input rises from one to two, the output rises not to eight but to six or ten.
  • Non-linear distortion of an audio signal is most often caused by clipping of the audio signal (i.e. when the highest amplitudes of the input are "clipped"). However, other factors may also cause non-linear distortion.
  • a measure of intelligibility should therefore not be determined based on a balance between high and low/medium frequency components of an audio signal alone, since non-linear distortion would artificially increase the high frequency content of the signal without any increase in the real intelligibility.
  • non-linear distortion of an audio signal may be detected by analysing sound sample values and looking at their magnitude distribution over a period of time.
  • Digital signals comprise consecutive instantaneous samples. The nature of sound is such that there will be more samples at (around) zero than at any other value. There will then generally be fewer samples at the maximum measurable amplitude value than at any value between zero and the maximum. Between these two points, the number of samples at any given amplitude should be less than at any smaller (i.e. closer to zero) amplitude. Plotted as a curve, the distribution should gradually fall from the count of samples measured at zero to the count of samples measured at the maximum (i.e. there are progressively fewer samples counted at each magnitude of the amplitude as the magnitude increases).
  • Figure 6 illustrates schematically an example of a graph 600 showing a distribution 606 of amplitudes sampled for a part of an audio signal during a first amplitude time period.
  • the amplitude is shown on the vertical axis 604 and the density of the distribution is shown on the horizontal axis 602.
  • the first amplitude time period may be a same time period as any other time period described herein, or the first amplitude time period may be a different time period. In some examples, the first amplitude time period may partially overlap any other time period.
  • any peaks 608 are detected in the distribution 606 that are not around zero amplitude (i.e. if the distribution 606 of sampled amplitudes does not fall monotonically from zero amplitude), then it can be interpreted that a non-linear distortion has taken place.
  • the peaks 608 are shown at the maximum (positive) measurable amplitude and the minimum (i.e. the maximum negative) measurable amplitude of the audio signal. This may indicate clipping of the audio signal. However, it will be understood that one or more peaks could equally be detected at any other non-zero amplitude along the distribution 606.
  • the amplitude may be sampled directly (i.e. there may be positive and negative amplitude values, or in some cases only positive or only negative amplitude values may be sampled), and/or an absolute value (a magnitude) of the amplitude may be sampled.
  • a high-pass filter may be applied to the audio signal to account for any offset of the amplitudes of the audio signal.
  • samples may be allocated into bins of a given amplitude width to reduce the required processing time and/or power. For example, each measured sample may be allocated into a bin and a count for that bin may be incremented. It will be understood that a bin comprises a range of amplitudes. In some examples, the bins may be equal in size.
  • the bin(s) around zero should contain the most samples, with the next bin(s) out from zero containing the next highest number of samples, and so on. If any bin contains significantly more samples than the bin below it (i.e. the bin corresponding to a lower range of amplitude magnitudes), this would suggest that the signal has been subject to a non-linearity.
  • the measure of intelligibility may therefore be determined according to whether the audio signal has been subject to one or more non-linear distortions. If one or more non-linear distortions is detected, the measure of intelligibility may be reduced. For example, the measure of intelligibility may decrease with increasing height of any peaks 608 corresponding to a non-zero amplitude in the distribution 606.
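The bin-count check above might be sketched as follows. The number of bins, the tolerance factor, and the assumption that samples are normalised to the range -1.0..1.0 are illustrative choices made here, not values from the description.

```python
def detect_nonlinearity(samples, n_bins=8, tolerance=1.2):
    """Histogram sample magnitudes into equal-width bins and flag a
    non-linearity if any bin holds significantly more samples than the
    bin below it (closer to zero), i.e. the distribution does not fall
    monotonically away from zero as natural sound would.
    Samples are assumed normalised to the range -1.0..1.0.
    """
    counts = [0] * n_bins
    for s in samples:
        counts[min(int(abs(s) * n_bins), n_bins - 1)] += 1
    # Natural sound: counts should only fall with increasing magnitude.
    for lower, upper in zip(counts, counts[1:]):
        if upper > max(lower, 1) * tolerance:
            return True  # a peak away from zero amplitude
    return False
```

A clipped signal adds a pile of samples in the highest-magnitude bin, which breaks the monotonic fall and trips the check.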
  • the method according to the present disclosure may comprise sampling amplitudes at or near the maximum amplitude (i.e. maximum positive and/or negative amplitudes), as amplitudes detected at these values may indicate clipping of the audio signal.
  • the measure of intelligibility may decrease with increasing samples detected at or near the maximum (positive and/or negative) measurable amplitude.
  • amplitudes "at or near" the maximum amplitude may be amplitudes that are separated from the maximum measurable amplitude by less than a minimum amplitude difference.
  • detecting clipping of the audio signal in this way may be simpler, and therefore less resource intensive, than sampling amplitudes to produce a distribution as described above.
  • Amplitudes at or near the maximum amplitude may be sampled during a second amplitude time period.
  • the second amplitude time period may be a same time period as any other time period described herein, or the second amplitude time period may be a different time period.
  • the second amplitude time period may partially overlap any other time period.
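The simpler near-maximum check described above might look like this; the margin value and the returned-ratio formulation are illustrative assumptions.

```python
def clipping_ratio(samples, full_scale=1.0, margin=0.01):
    """Fraction of samples (e.g. during the second amplitude time
    period) whose magnitude lies within `margin` of the maximum
    measurable amplitude; a cheap proxy for clipping that avoids
    building a full amplitude distribution.
    """
    near_max = sum(1 for s in samples if abs(s) >= full_scale - margin)
    return near_max / len(samples)
```

The measure of intelligibility could then be reduced as this ratio grows.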
  • the intelligibility of an audio signal may also depend on its volume. For example, if a user is speaking too far away from their microphone, they may not be easily understood. In another example, volumes that are too high may also cause a reduction in intelligibility because they may introduce distortion.
  • the measure of intelligibility may be further determined according to whether an overall volume level of the audio signal is between a minimum and a maximum volume level. If the volume is too low (i.e. below the minimum volume level), or too high (i.e. above the maximum volume level), the intelligibility value may be reduced.
  • the overall volume level may be, for example, an average volume or a peak volume determined during a volume time period.
  • the volume time period may be a same time period as any other time period described herein, or the volume time period may be a different time period. In some examples, the volume time period may partially overlap any other time period.
  • a further effect that may cause erroneous intelligibility readings is that of clicks and spikes caused by disruption to the signal path or other data loss or disruption.
  • a sudden break in the signal can cause high frequency content to be generated where a valid sample is followed by a zero (or other substitute value) formed when the system has no real data from which to create the next sample. This sudden sample step is generally heard as a click.
  • these steep changes in level between samples are not a phenomenon which often occurs naturally, and so their presence can be detected and assumed to be caused by a discontinuity in the signal.
  • the measure of intelligibility may therefore further be determined based on whether any step changes (i.e. discontinuities) are present in the audio signal.
  • Step changes may be determined by sampling the audio signal during a step change time period, and detecting a step change in one or more of e.g. the volume, amplitude, average frequency, high frequency component (i.e. component of the audio signal at frequencies above the first threshold frequency), low/medium frequency component (i.e. component of the audio signal at frequencies below the second threshold frequency), or any other time-varying parameter. Where one or more step changes is detected, the measure of intelligibility may be reduced accordingly.
  • the step change time period may be a same time period as any other time period described herein, or the step change time period may be a different time period. In some examples, the step change time period may partially overlap any other time period.
  • Any method for detecting a step change in a time-varying parameter known to persons skilled in the art may be applied to detect the step change.
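A minimal per-sample version of such a detector is sketched below; the threshold is an illustrative assumption for normalised samples, and a real detector could equally operate on volume, average frequency, or any of the other time-varying parameters listed above.

```python
def count_step_changes(samples, max_step=0.5):
    """Count sample-to-sample jumps whose magnitude exceeds max_step
    during the step change time period. Such steep level changes rarely
    occur naturally, so each one is assumed to be a discontinuity
    (generally heard as a click).
    """
    return sum(1 for a, b in zip(samples, samples[1:])
               if abs(b - a) > max_step)
```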
  • Echoes may occur in an audio signal if, for example, a speaker is in a reverberant room, from foldback between the user's headphones and microphone or because of issues with the processing of the signal.
  • the measure of intelligibility may be reduced.
  • the intelligibility of the audio signal may depend on, e.g., the severity (loudness) and/or the timing of an echo. For example, loud echoes which are close to the original sound in time may not be as detrimental to the intelligibility of the audio signal as quieter echoes which occur with a greater timing separation.
  • the measure of intelligibility may decrease with increasing volume of an echo, and/or with increasing temporal separation between a portion of an audio signal that occurs during an echo time period and an echo of that portion.
  • the echo time period may be a same time period as any other time period described herein, or the echo time period may be a different time period. In some examples, the echo time period may partially overlap any other time period.
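One sketch of an echo detector over the echo time period uses normalised autocorrelation; the lag search, normalisation, and return format below are illustrative assumptions rather than the patent's stated method.

```python
def find_echo(samples, min_lag, max_lag):
    """Return (lag, strength): the delay in samples within
    [min_lag, max_lag] at which the signal best correlates with a
    delayed copy of itself, and that correlation normalised by the
    signal energy. A strong peak suggests an echo at that delay.
    """
    energy = sum(s * s for s in samples) or 1.0
    best_lag, best = min_lag, 0.0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(samples[i] * samples[i - lag]
                   for i in range(lag, len(samples)))
        if corr / energy > best:
            best_lag, best = lag, corr / energy
    return best_lag, best
```

The detected lag and strength could then feed the volume/timing weighting described above.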
  • Another audible phenomenon that may cause reduced intelligibility of an audio signal may be tones added to the speech. Such tones can occur in environments such as machine rooms, or where devices are operating which produce a whistle. Furthermore, it is possible that through filtering, parts of the audible spectrum have been removed. Both of these cases may be detected by performing a frequency transform on the audio signal and detecting frequencies, or ranges of frequencies, which are either constant or constantly absent compared to the rest of the audio signal (e.g. the speech).
  • the measure of intelligibility may therefore be reduced when, e.g., a constantly present and/or a constantly absent frequency or range of frequencies is detected in the audio signal.
  • the frequency transform may be performed over a part of the audio signal during a frequency time period.
  • the frequency time period may be a same time period as any other time period described herein, or the frequency time period may be a different time period. In some examples, the frequency time period may partially overlap any other time period.
  • the frequency transform may be performed according to any method known to persons skilled in the art, e.g. fast Fourier transform.
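The constant-tone/constantly-absent check could be sketched as below. A naive DFT stands in for the frequency transform (an FFT would normally be used), and the relative-spread threshold and function names are illustrative assumptions.

```python
import cmath

def band_levels(frame, sample_rate, freqs):
    """Magnitude of `frame` at each frequency in `freqs`, via a naive
    DFT (a fast Fourier transform would normally be used instead)."""
    n = len(frame)
    levels = []
    for f in freqs:
        w = cmath.exp(-2j * cmath.pi * f / sample_rate)
        levels.append(abs(sum(s * w ** i for i, s in enumerate(frame))) / n)
    return levels

def constant_bins(frames, sample_rate, freqs, rel_spread=0.1):
    """Indices of frequencies whose level barely changes from frame to
    frame over the frequency time period: candidates for an added tone
    (if loud) or a filtered-out band (if constantly near zero).
    """
    per_frame = [band_levels(fr, sample_rate, freqs) for fr in frames]
    flagged = []
    for k in range(len(freqs)):
        levels = [lv[k] for lv in per_frame]
        mean = sum(levels) / len(levels)
        spread = max(levels) - min(levels)
        if spread / max(mean, 1e-12) < rel_spread:
            flagged.append(k)
    return flagged
```

A speech band varies naturally from frame to frame and is not flagged; a steady machine whistle holds an almost constant level and is.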
  • in some cases, the listener can hear the sound quite well, but the intelligibility is reduced by foldback, that is, the leakage of sound from another source.
  • an interpreter may listen to German on headphones and speak English. However, the listener may hear both the English and the German leaking from the headphones.
  • the measure of intelligibility may be further determined based on the detection of foldback in the audio signal. For example, if any foldback is detected in the audio signal, the measure of intelligibility may be reduced accordingly.
  • foldback could be detected by pattern-matching the original German sound and looking for it within the English feed.
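Such pattern-matching could be sketched with normalised cross-correlation; the feed names, lag range, and normalisation below are illustrative assumptions.

```python
import math

def foldback_strength(output_feed, reference_feed, max_lag):
    """Peak normalised cross-correlation between an outgoing feed (e.g.
    the English interpretation) and a reference feed (e.g. the original
    German) over delays of 0..max_lag samples; a value near 1 suggests
    the reference is leaking into the output.
    """
    norm = math.sqrt(sum(a * a for a in output_feed)
                     * sum(b * b for b in reference_feed)) or 1.0
    best = 0.0
    for lag in range(max_lag + 1):
        end = min(len(output_feed), len(reference_feed) + lag)
        corr = sum(output_feed[i] * reference_feed[i - lag]
                   for i in range(lag, end))
        best = max(best, abs(corr) / norm)
    return best
```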
  • Figure 7 illustrates an example of a method 700 for providing a measure of intelligibility of an audio signal according to the present disclosure.
  • the method 700 may be performed by the CPU 202 of the user terminal 102 as shown in Figure 2 . While we refer to steps of the method 700 being performed by a single processing unit (e.g. CPU 202) of a user terminal, the method 700 may be performed by multiple processing units of a user terminal. Furthermore, the method 700 may be performed by distributed processing units that may be distributed across two or more user terminals. Each processing unit of the distributed processing units may comprise one or more CPUs 202 referred to herein.
  • the CPU 202 receives an audio signal.
  • the audio signal may be captured from an environment by the input device (e.g. microphone 212) of the user terminal 102 and relayed to the CPU 202.
  • the CPU 202 determines a first energy level of the audio signal at frequencies above a first threshold frequency during an energy time period.
  • the CPU 202 determines a second energy level of the audio signal during the energy time period.
  • the CPU 202 compares the first energy level and the second energy level.
  • the CPU 202 determines the measure of intelligibility of the audio signal based on at least the comparison between the first energy level and the second energy level.
  • the CPU 202 outputs the measure of intelligibility.
  • CPU 202 may further determine the measure of intelligibility based on any of the other methods described herein, alone or in combination.
  • the measure of intelligibility may be determined in part by one or more of the methods and method steps described herein.
  • each part of the method for determining the measure of intelligibility may provide a separate numerical value representing a given aspect of the intelligibility (balance between high and low frequencies, modulation, non-linear distortion, volume, etc.).
  • These separate values may be matrixed together or otherwise combined such that, for example, an optimal difference between the first energy level and the second energy level would increase the intelligibility reading, while the detected presence of non-linear distortion would lower the reading.
  • FIG. 8 schematically illustrates an example of a system 800 for determining a measure of intelligibility for output according to the present disclosure.
  • the CPU 202 may comprise the system 800.
  • Each element of the system 800 described in detail below may be implemented in software, firmware, hardware, or a combination thereof.
  • this element comprises one or more electronic components arranged in a circuit, and one or more of the values described below in relation to the element may be provided as electrical signals, e.g. voltages.
  • reference to an element is used herein to refer to one of a detector, an operator, a module, or a filter as described herein.
  • the example system 800 of Figure 8 may comprise a first energy level detector 802.
  • the first energy level detector 802 may be configured to determine a first energy level H, e.g., a peak energy level of the audio signal at frequencies above a first threshold frequency.
  • the first energy level detector may be configured to carry out the step S704 of the method 700 illustrated in Figure 7 .
  • the system 800 may further comprise a second energy level detector 804.
  • the second energy level detector 804 may be configured to determine a second energy level W, e.g., a peak energy level or an average energy level of the audio signal e.g. at frequencies below a second threshold frequency.
  • the second energy level detector may be configured to carry out the step S706 of the method 700 illustrated in Figure 7 .
  • each of the first H and second W energy levels may be provided as values between 0 and 1.
  • the example system 800 may further comprise one or more first operators 806, 808 configured to determine e.g. a difference and/or a ratio between the first energy level H and the second energy level W.
  • two first operators 806, 808 are combined to provide an absolute value of the difference between the first energy level H and the second energy level W (e.g. abs(H - W)).
  • the one or more first operators 806, 808 may be configured to carry out the step S708 of the method 700 illustrated in Figure 7 .
  • the example system 800 may further comprise a non-linearity detector 810 configured to detect non-linearities (such as clipping) in the audio signal.
  • a non-linearity in the audio signal may correspond to a non-zero peak in a distribution of sampled amplitudes of the audio signal.
  • the non-linearity detector 810 may be configured to provide a non-linearity value C that is proportional to a degree of detected non-linearity in the audio signal.
  • the non-linearity value C may be between 0 and 1.
  • the system 800 may further comprise a second operator 812 configured to combine the non-linearity value C and the difference between the first energy level H and the second energy level W, for example as a difference (e.g. abs(H - W) - C).
  • the system 800 may further comprise a modulation detector 814 configured to determine a depth of modulation value M in the audio signal.
  • the system may further comprise a limiter 816 configured to output a depth of modulation value M only if a difference between first and second volume levels above a certain minimum value is present in the audio signal.
  • the depth of modulation value M may be between 0 and 1, where 0 corresponds to no modulation of the audio signal.
  • the system 800 may further comprise a third operator 818 configured to combine the non-linearity value C, the difference between the first energy level H and the second energy level W, and the modulation value M (e.g. M x (abs(H - W) - C)).
  • the first 806, 808, second 812, and/or third 818 operators may be configured to carry out the step S710 of the method 700 illustrated in Figure 7 .
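The combination performed by the operators above might be computed as follows; the clamp to the 0..1 range is an added illustrative assumption, while the formula itself mirrors the M x (abs(H - W) - C) combination described for Figure 8.

```python
def combine_measures(H, W, C, M):
    """Combine the 0..1 readings as sketched for the system 800: the
    absolute difference between the first (H) and second (W) energy
    levels, reduced by the non-linearity value C and scaled by the
    depth of modulation M, i.e. M x (abs(H - W) - C), clamped to 0..1.
    """
    return min(1.0, max(0.0, M * (abs(H - W) - C)))
```

With no modulation (M = 0) or with heavy distortion the measure collapses to zero, as intended.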
  • the measure of intelligibility may comprise any combination of values determined by the detectors of the system 800 and combined by the operators.
  • the system 800 may comprise an output module 820 configured to output a measure of intelligibility to an output device of the user terminal 102 (for example the display 204 and/or speaker 210 of the user terminal 102), the measure of intelligibility corresponding to the combination of values as combined by the operators described above.
  • output module 820 may output the measure of intelligibility as a value between 0 and 1.
  • the output module 820 may be configured to carry out the step S712 of the method 700 illustrated in Figure 7 .
  • the output module 820 may output the measure of intelligibility to an output device, such as a display 204 and/or a speaker 210 of the user terminal 102. Alternatively, or in addition, the measure of intelligibility may be recorded in memory 226 (e.g. in a table). In some examples, the output module 820 may output the measure of intelligibility to an external process or system, for example to provide automated control of one or more elements of an external process or system. For example, an external process or system may be configured to compensate for poor intelligibility by e.g. providing further processing or filtering of the audio signal and/or to provide automatic adjustment of hardware, e.g. automatically adjusting the position of one or more input devices 212.
  • in some examples, the system 800 could automatically switch to an alternative input device 212 (e.g. microphone).
  • the system 800 could automatically reconfigure to receive the voice of an alternative user (e.g. an alternative interpreter), for example a user in a different location and/or using a different input device 212.
  • FIG. 9 schematically illustrates a further example of a system 900 for determining a measure of intelligibility for output according to the present disclosure.
  • the CPU 202 may comprise the system 900.
  • Each element of the system 900 described in detail below may be implemented in software, firmware, hardware, or a combination thereof.
  • this element comprises one or more electronic components arranged in a circuit, and one or more of the values described below in relation to the element may be provided as electrical signals, e.g. voltages.
  • reference to an element is used herein to refer to one of a detector, an operator, a module, or a filter as described herein.
  • the system 900 illustrated in Figure 9 may be particularly useful in cases where a variety of input devices (e.g. microphones) of varying quality may be employed.
  • a lower quality microphone may have a more limited frequency range than a higher quality microphone (typically a lower quality microphone will "cut off" at lower frequencies than a higher quality microphone), and it may be desirable to limit the intelligibility score that can be achieved using a microphone having a lower frequency range.
  • the system 900 may be configured to receive an audio signal 901.
  • the system 900 may be configured to carry out the step S702 of the method 700 illustrated in Figure 7 .
  • an audio signal 901 received by the system 900 may be processed by a modulation detector 902, and/or a non-linear distortion (or clip) detector 904.
  • Each of the modulation detector 902 and the non-linear distortion detector 904 may produce an output value ("Modu" and "Clip", respectively) corresponding to a measure of the modulation and non-linear distortion of the audio signal 901, respectively.
  • the modulation detector 902 may produce an output value corresponding to a difference between first and second volume levels as described above (i.e. an amount of modulation of the audio signal 901), and the non-linear distortion detector 904 may produce an output value corresponding to a height of any peaks corresponding to a non-zero amplitude in a distribution of the kind illustrated in Figure 6 and described herein.
  • Modu and/or Clip may be values between 0 and 1.
  • the audio signal 901 may be filtered by one or more bandpass filters 906a-d, each bandpass filter 906a-d being configured to pass a particular range of frequencies, e.g. low frequencies (LF), low-high frequencies (LHF), mid-high frequencies (MHF), and ultra-high frequencies (UHF).
  • the LF range may be 0-2 kHz, or may be e.g. 20 Hz - 2 kHz, or 125 Hz - 2 kHz
  • the LHF range may be 2-4 kHz
  • the MHF range may be 4-8 kHz
  • the UHF range may be 8-20 kHz.
  • Energy level detectors 908a-d may be configured to determine energy levels of different frequency ranges of the audio signal 901.
  • a LF energy level detector 908a may be configured to determine a peak energy level or an average energy level of the audio signal 901 in the LF range
  • a LHF energy level detector 908b may be configured to determine a peak energy level of the audio signal 901 in the LHF range
  • a MHF energy level detector 908c may be configured to determine a peak energy level of the audio signal 901 in the MHF range
  • an UHF energy level detector 908d may be configured to determine a peak energy level of the audio signal 901 in the UHF range.
  • the energy level detectors 908a-d may each provide energy levels as values between 0 and 1.
  • the energy levels provided by the energy level detectors 908a-d are labelled as LF, LHF, MHF, and UHF, respectively.
  • the energy level detectors 908a-d may be configured to carry out the steps S704 and S706 of the method 700 illustrated in Figure 7 .
  • one or more of the LHF 908b, MHF 908c, and UHF 908d energy level detectors may be configured to carry out the step S704 of the method 700 (determine a first energy level of the audio signal at frequencies above a first threshold frequency during an energy time period, where the frequencies above the first threshold frequency may correspond to frequencies passed by one or more of the bandpass filters 906b-d)
  • the LF energy level detector 908a may be configured to carry out the step S706 of the method 700 (determine a second energy level of the audio signal during the energy time period; where the second energy level corresponds to frequencies below a second threshold frequency, said frequencies may correspond to frequencies passed by the bandpass filter 906a).
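The band splitting and energy detection carried out by the filters 906a-d and detectors 908a-d can be sketched as follows. The DFT-based implementation and the normalisation of each band's energy by the total energy are assumptions for illustration (the patent leaves the detectors' internals open); the band edges follow the example ranges given above:

```python
import cmath

# Illustrative band edges in Hz, taken from the example ranges above.
BANDS = {"LF": (125, 2000), "LHF": (2000, 4000),
         "MHF": (4000, 8000), "UHF": (8000, 20000)}

def band_energies(samples, sample_rate):
    """Sketch of the energy level detectors 908a-d: a direct DFT of the
    signal, with the energy in each band normalised to 0..1 by the
    total spectral energy."""
    n = len(samples)
    # Power spectrum for the positive-frequency bins (O(n^2) DFT,
    # adequate for a short illustrative frame).
    spectrum = [abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                        for t in range(n))) ** 2
                for k in range(n // 2)]
    total = sum(spectrum) or 1.0
    out = {}
    for name, (lo, hi) in BANDS.items():
        k_lo = int(lo * n / sample_rate)
        k_hi = int(hi * n / sample_rate)
        out[name] = sum(spectrum[k_lo:k_hi]) / total
    return out
```

For a pure 1 kHz tone, essentially all of the normalised energy falls in the LF band, with the LHF, MHF, and UHF outputs near zero.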
  • the algorithm provides a "Qualifier Value", which may be used in the determination of the measure of intelligibility.
  • the above example algorithm provides a Qualifier Value between 0 and 2 (based only on the modulation and non-linear distortion of the audio signal 901) if no LHF, MHF, or UHF components of the audio signal 901 are present. If an LHF component is detected in the audio signal, the maximum Qualifier Value is 3. If an MHF component is detected, the maximum Qualifier Value is 4. If a UHF component is detected, the maximum Qualifier Value is 5. It will be appreciated that the scale of 0 to 5 is arbitrary and used here for illustrative purposes only. It will be further understood that the coefficients applied in the calculation of the MHF_state and UHF_state Qualifier Values are also for illustrative purposes only, and that in practice these would be determined by experimentation.
  • the energy level LF in Figure 9 and in the above example algorithm could alternatively be a medium frequency energy level, a combination of low and medium frequency energy levels, or an average energy level of the audio signal 901.
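The Qualifier Value behaviour described above (a base of 0 to 2 from modulation and clipping, plus one achievable point per detected high-frequency band, up to a maximum of 5) can be sketched as follows. This is a hypothetical reconstruction: the band detection threshold and the exact combination of Modu and Clip are illustrative assumptions, not the patent's algorithm:

```python
def qualifier_value(modu, clip, lhf, mhf, uhf, band_threshold=0.1):
    """Hypothetical Qualifier Value sketch. modu and clip are the 0..1
    outputs of the modulation and non-linear distortion detectors; lhf,
    mhf, and uhf are the 0..1 band energy levels. Each detected
    high-frequency band raises the achievable value by one point."""
    # Base score 0..2: strong modulation raises it, clipping lowers it.
    base = modu * (1.0 - clip) * 2.0
    lhf_state = 1.0 if lhf > band_threshold else 0.0
    mhf_state = 1.0 if mhf > band_threshold else 0.0
    uhf_state = 1.0 if uhf > band_threshold else 0.0
    return base + lhf_state + mhf_state + uhf_state
```

With no high-frequency content the value is capped at 2; detecting LHF, MHF, and UHF components raises the attainable maximum to 3, 4, and 5, matching the behaviour described above.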
  • a microphone that cuts off below the MHF range cannot achieve the same maximum intelligibility score as one having a higher frequency range, since the higher frequency range microphone would be inherently capable of achieving a higher intelligibility.
  • a user can still maximise the intelligibility of the lower frequency range microphone within the allowable range of intelligibility scores.
  • a microphone having a range from LF up to UHF, but from which little to no modulation is detected (e.g. corresponding to the detection of noise rather than speech), would likewise be limited to a low intelligibility score
  • any individual element, or combination of elements, of the system 800 illustrated in Figure 8 may be configured to operate as part of a system in conjunction with any individual element, or combination of elements, of the system 900 illustrated in Figure 9 .
  • the measure of intelligibility would be displayed or otherwise indicated to a user as a score via an indicator, such as the visual indicator 302 illustrated in Figure 3 .
  • the score would rise or fall depending on the clarity (intelligibility) of the speaker.
  • the user may attempt to raise or lower their voice, reposition themselves relative to the microphone and/or take other actions that may improve their score where necessary.
  • By watching the indicator as they speak, a user would be able to see whether a particular change increases or decreases their score.
  • a computer will have more than one microphone.
  • the user may have access to a microphone built into the laptop but also one on a headset. This means that the user can switch between microphones while they are speaking. In turn, this will allow them to see which microphone is superior in terms of clarity, a task that is otherwise very hard without additional human help.
  • the intelligibility indicator could replace a traditional sound loudness or volume meter.
  • the intelligibility indicator may indicate different aspects of the intelligibility measurement individually (such as comparison between first and second energy levels, volume, non-linearity/distortion etc., or any combination of these), allowing the user to see which factor was dominant and so which issue they should seek to solve first, or which solution would provide the biggest improvement.
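A per-aspect indicator of this kind might be sketched as follows. The aspect names, the 0 to 1 inputs, and the choice of the weakest aspect as the dominant issue are assumptions for illustration:

```python
def intelligibility_breakdown(energy_balance, volume, distortion):
    """Sketch of an indicator that reports each aspect of the
    intelligibility measurement separately. Inputs are 0..1 scores:
    energy_balance (comparison of first and second energy levels),
    volume (modulation/loudness), and distortion (non-linearity,
    higher = worse). Returns the per-aspect scores and the weakest
    aspect, i.e. the issue the user should address first."""
    aspects = {
        "energy_balance": energy_balance,
        "volume": volume,
        "low_distortion": 1.0 - distortion,  # invert the penalty to a score
    }
    weakest = min(aspects, key=aspects.get)
    return aspects, weakest
```

Highlighting the weakest aspect shows the user which change (repositioning, raising their voice, switching microphone) would provide the biggest improvement.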
  • an organisation hosting a conference call may be able to accept or reject speakers according to a minimum intelligibility level set by the organisation (e.g. set by a system administrator).
  • the methods according to the present disclosure could be implemented to support remote simultaneous interpretation.
  • users can see whether they are able to create a signal with sufficient clarity for a remote interpreter to be able to translate their speech. If not, they will be able to make adjustments until sufficient quality is attained.
  • the methods according to the present disclosure may further improve automated systems, for example ensuring that speakers use a minimum quality of speech signal to be fed to e.g. a text-to-speech program or an automated translation device.
  • the methods according to the present disclosure could also be applied to an audio track on a video, or to a sound recording, to indicate the quality someone could expect were they to listen to the recording. In addition, it could be used to check along the length of a recording and indicate the intelligibility of e.g. speech at different points of the recording.
  • the methods according to the present disclosure could also be applied to a testing circuit.
  • the outputted measure of intelligibility could be transmitted through a testing circuit along with the audio signal. Then, at some point later in the circuit, the intelligibility could be measured in the same way again and the two readings compared. If a discrepancy in the intelligibility of the same audio signal is found, it would then be possible to identify that problems with the connection, such as momentary breaks in the signal or loss of samples, have occurred. In this way, a system could report not just on the quality of the original sound but also on its transmission. It is important to note that, compared to traditional methods like detecting packet loss, this is a real reading of the effect of the problem, not an abstract reading of the cause. In some cases, for example, many packets might be lost while the sound remains intelligible, whereas in others the loss of a few packets could cause a significant loss of intelligibility.
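The comparison step of such a testing circuit might look like the following sketch; the tolerance value and the report fields are assumptions, not taken from the patent:

```python
def check_link(score_before, score_after, tolerance=0.5):
    """Illustrative comparison of the two intelligibility readings taken
    before and after transmission through the testing circuit. A drop
    beyond the (illustrative) tolerance flags a transmission problem,
    i.e. a real reading of the effect rather than of the cause."""
    drop = score_before - score_after
    if drop > tolerance:
        return {"ok": False, "drop": drop,
                "note": "possible breaks in the signal or loss of samples"}
    return {"ok": True, "drop": drop,
            "note": "transmission preserved intelligibility"}
```

Note that a lossy link may still pass this check if the audio remains intelligible, which is exactly the distinction from packet-loss counting made above.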
  • any of the functions described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), or a combination of these implementations.
  • the terms “detector”, “operator”, “module”, “filter”, and “element”, as used herein, generally represent software, firmware, hardware, or a combination thereof.
  • the detector, operator, module, filter, or element represents program code that performs specified tasks when executed on a processor (e.g. CPU or CPUs).
  • the program code can be stored in one or more computer readable memory devices.
EP22183533.3A 2022-07-07 2022-07-07 Bereitstellung eines masses der verständlichkeit eines audiosignals (Providing a measure of the intelligibility of an audio signal) Pending EP4303874A1 (de)

Publications (1)

Publication Number Publication Date
EP4303874A1 (de) 2024-01-10

Family

ID=82399487
