EP3120534A2 - Interpretation system and method - Google Patents
Interpretation system and method
- Publication number
- EP3120534A2 (application EP15765582.0A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- sound interface
- sound
- participant
- interface
- interpretation system
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/56—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04B—TRANSMISSION
- H04B1/00—Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission
- H04B1/06—Receivers
- H04B1/16—Circuits
- H04B1/30—Circuits for homodyne or synchrodyne receivers
- H04B2001/305—Circuits for homodyne or synchrodyne receivers using dc offset compensation techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2203/00—Aspects of automatic or semi-automatic exchanges
- H04M2203/20—Aspects of automatic or semi-automatic exchanges related to features of supplementary services
- H04M2203/2061—Language aspects
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2242/00—Special services or facilities
- H04M2242/12—Language recognition, selection or translation arrangements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M9/00—Arrangements for interconnection not involving centralised switching
- H04M9/08—Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
- H04M9/10—Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic with switching of direction of transmission by voice frequency
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
Definitions
- the present disclosure relates to an interpretation system with a first sound interface, for a first participant such as an interviewer, a second sound interface, for a second participant such as an interviewee, and a third sound interface for a third participant such as an interpreter.
- the system has a switching subsystem that can be switched between at least a first setting, where a voice signal generated at the first sound interface is connected primarily to the third sound interface and a voice signal generated at the third interface is connected primarily to the second sound interface, and a second setting, where a voice signal generated at the second sound interface is connected primarily to the third sound interface and a voice signal generated at the third sound interface is connected primarily to the first sound interface.
- the disclosure further relates to a corresponding method.
- Such a system is described in EP-1545111-A, which provides for bi-directional simultaneous interpretation services in connection with an interpretation assistance device.
- An interpreter may be in a remote location, and the users may generate switch signals by pressing buttons. The switch signals are detected by the system which directs sound to and from different users and the interpreter in such a way that unwanted sound is cancelled or attenuated.
- An object of the present invention is therefore to provide an improved system that is reliable and easy to use.
- an interpretation system of the initially mentioned kind is provided with a processing unit capable of detecting speech originating from the first and second sound interfaces and of controlling the switching subsystem depending on this detection, such that the system switches between the first and second settings.
- the detection of beginning and termination of speech may be carried out by comparing a parameter corresponding to the first order derivative of the RMS of the AC component in a voice signal to a positive and a negative threshold, respectively. This has been shown to provide a reliable detection also in cases where there is background noise. Such detection may be carried out by detecting and removing a DC component from a voice signal resulting in an AC signal, rectifying and low-pass filtering the AC signal to obtain a detection signal, and comparing a first order derivative of the detection signal to a positive and a negative threshold.
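The RMS-derivative comparison described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the frame length, threshold values, and all function and variable names are assumptions chosen for the example.

```python
import numpy as np

def detect_speech_edges(signal, frame_len, pos_thresh, neg_thresh):
    """Detect beginning/end of speech by thresholding the first-order
    derivative of the frame-wise RMS of the AC component.

    Returns a list of (frame_index, event) tuples, where event is
    'start' or 'stop'.
    """
    signal = np.asarray(signal, dtype=float)
    ac = signal - signal.mean()                # remove the DC component
    n_frames = len(ac) // frame_len
    frames = ac[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))  # frame-wise RMS
    d_rms = np.diff(rms)                       # first-order derivative
    events = []
    talking = False
    for i, d in enumerate(d_rms):
        if not talking and d > pos_thresh:     # positive threshold: speech begins
            events.append((i + 1, "start"))
            talking = True
        elif talking and d < neg_thresh:       # negative threshold: speech ends
            events.append((i + 1, "stop"))
            talking = False
    return events
```

Because the thresholds act on the derivative of the RMS rather than on the RMS itself, a constant background-noise floor shifts the envelope but not its slope, which is why this kind of detection can remain reliable in noise.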
- the interpretation system may be adapted to switch between an idle state and at least a first active state corresponding to the first setting, in which the first participant is active, and a second active state corresponding to the second setting, in which the second participant is active.
- the system may be adapted to remain in the first active state for a predetermined time after it is detected that the first participant stops talking, and may further be adapted to remain in the first active state for a predetermined time after it is detected that the third participant, interpreting the first participant, stops talking. This ensures that the system does not switch in an undesired way when the first participant e.g. pauses to allow the interpreter to catch up in the interview process.
- the system may be adapted to provide a visual feedback signal in response to a switching of the switching subsystem, such as for instance changing the backlight colour of a display in the system.
- the present disclosure further relates to a corresponding method. That method generally involves steps corresponding to the measures carried out by the different features of the system, and the method may be varied in correspondence with the system.
- Fig 1 illustrates a system overview, where a switching device is in an interviewer-to-interviewee setting.
- Fig 2 shows the switching device in fig 1 in an interviewee-to-interviewer setting.
- Fig 3 shows a flow chart of a process for detection of speech.
- Figs 4a-4d schematically show waveforms corresponding to the first four steps of fig 3, and fig 4e illustrates an envelope, with a time frame larger than the waveforms in figs 4a-4d, where detection of speech takes place.
- Fig 5 shows a flow chart for a switching procedure.
- Fig 1 illustrates schematically an overview of a simultaneous interpretation system 1 according to the present disclosure.
- the system is intended for use in a situation where a first person, hereinafter called interviewer, talks to a second person, hereinafter called interviewee.
- This naming of the first and second persons is done to simplify the following disclosure and does not limit the scope of the present disclosure.
- the interviewer and the interviewee may have totally symmetrical roles as simply persons talking to each other.
- the system may be used in situations such as police, customs and immigration investigations as well as healthcare procedures, and other procedures.
- the interviewer and the interviewee do not share a common language, or may at least not be capable of communicating in a common language with a sufficient quality to ensure, depending on the situation, for instance legal certainty or medical safety.
- the interviewer and the interviewee may be present in the same room, although this is not necessary.
- the interpreter may also be present, or may be available via a telephone line, a mobile telephone connection, a video conference system, or the like.
- the interpreter may be present but placed e.g. in a neighboring room, simply to maintain the interpreter's anonymity.
- the system may be capable of dealing with all such configurations by applying different settings, as will be discussed later. It should be noted that the interviewer or interviewee may be remote with regard to the system as well.
- the system may comprise a first 3, a second 5 and a third 7 sound interface, each providing a sound input 9, for feeding sound to a user loudspeaker or more likely headphones, and a sound output 11 providing an output from a user microphone.
- the system may further comprise a switching subsystem 13 that directs the flow of sound in a path that is appropriate in the current situation.
- For instance, if the interviewer speaks, his or her microphone signal is transferred to the interpreter's headphones, and the signal from the latter's microphone is transferred to the interviewee's headphones. This path is achieved with the connection pattern indicated with black filled dots in the switching subsystem of fig 1. When the interviewer stops speaking and the interviewee begins to speak, this path is altered by the switching subsystem by changing the connection pattern as indicated with dashed rings, as will be discussed later.
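The two connection patterns described above can be sketched as simple routing tables that map each microphone to the headphones that should receive its signal. The names below are illustrative, not identifiers from the patent, and a real switching subsystem could equally be relays or a software routine, as discussed later in the disclosure.

```python
# Setting 1 (fig 1, black filled dots): interviewer -> interpreter -> interviewee
SETTING_1 = {
    "interviewer_mic": "interpreter_phones",
    "interpreter_mic": "interviewee_phones",
}

# Setting 2 (fig 2, dashed rings): interviewee -> interpreter -> interviewer
SETTING_2 = {
    "interviewee_mic": "interpreter_phones",
    "interpreter_mic": "interviewer_phones",
}

def route(setting, source):
    """Return the destination for a microphone signal in the given
    setting, or None if that source is muted (not connected)."""
    return setting.get(source)
```

Note how in each setting only two sources are connected; a source absent from the table, such as the interviewee's microphone in setting 1, is simply not routed anywhere.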
- the system may comprise a processor unit 15, which may be a central processing unit (CPU), a digital signal processor (DSP), a dedicated application-specific integrated circuit (ASIC), or a collection of circuits, optionally comprising both analog and digital signal processing means, as will be discussed further later.
- the system may include I/O processing means 17, a user interface 19, and additional storage means 21, as will be discussed in more detail later.
- each sound interface may be provided with an amplifier 23 that the processor unit can adjust.
- the sound interfaces may be adaptable to the configuration currently used.
- the system may allow, in one configuration, the interviewer and the interviewee to be connected directly to the system by means of a headset with earphones and a microphone, and to connect the interpreter via a video conference system or a fixed telephone line.
- all three parties may be connected directly to the system via a headset.
- Other configurations e.g. using cellphones may be considered, and it has also been considered to use more than three sound interfaces. The latter may be useful e.g. to allow having two interpreters interpreting via an intermediate language, or only interpreting in one direction, from a first to a second language.
- while unbalanced microphones, e.g. using TRS (tip/ring/sleeve) connectors, can be used, it may be preferred to use balanced microphones, e.g. using XLR connectors, to provide improved sound quality and lower susceptibility to interference.
- phantom powering may be used to provide a power source if a condenser microphone is used.
- Balanced headphones may be used as well.
- Each sound interface may also be connected to an internal mobile telephone system to connect one of the interfaces to e.g. a GSM compliant cell phone, at least as an emergency solution.
- Other options are available for wireless connection of a sound interface to a headset or the like, such as a wireless LAN, Bluetooth, etc.
- the switching subsystem may be accomplished with different means.
- the switching subsystem may, as the skilled person understands, be realized with anything from a set of mechanical relays to a software routine executed in a processor, as long as it is capable of switching between different connection patterns that connect the microphone of one speaker to the headphones of another, as necessary in the circumstances and as decided adaptively by the system.
- the system may be integrated in an IP (Internet Protocol) telephony system using session initiation protocols (SIP) and real-time transport (RTP) protocols.
- the configuration indicated with black filled dots in the switching subsystem of fig 1 is used when the interviewer speaks.
- the microphone signal from the interviewer's sound interface 3 is connected by the switching subsystem to the input/headphone line of the interpreter's sound interface 7, such that the interpreter hears the interviewer's voice.
- the signal from the interpreter's microphone is similarly transferred to the interviewee's headphones by the switching system.
- the system may be set in a conference mode, where each participant hears the others and can speak to the others.
- the connections need not switch between on and off.
- the interviewer may, in the configuration indicated in fig 2, hear the voice of the interviewee, at a low volume, together with the voice of the interpreter, at a higher volume. This may, even though the interviewer and interviewee may not share a common language, improve the mutual understanding, as the original speech, together with eye contact, body language, etc., can contribute nuances and the like.
- the processor unit may, as mentioned earlier, be a CPU, a DSP or an application specific circuit. It should further be noted that the switching subunit, the amplifiers, and at least parts of the sound interfaces, etc. may be integrated with the processing unit. Although the illustrated schematic configuration may be realised, it is primarily an example useful for understanding the overall disclosure of the system.
- One way of triggering the switching from one configuration to another is to detect when one party, typically the interviewer or the interviewee, begins to speak.
- An example of a method for accomplishing this speech detection is described with reference to the flow chart of fig 3 and the corresponding waveforms shown in figs 4a-4d.
- An analog voice signal is shown, very schematically, in fig 4a. This signal has an AC component and a DC component 27. In a first step, the DC component is detected 25, and in a second step the DC component is removed 29, leaving only the AC component in the signal, as illustrated in fig 4b. In a DSP this could be carried out with suitable subroutines, and in an analog system an operational amplifier or even a simple capacitor circuit may be used to remove the DC component directly.
- the signal is rectified 31 resulting in the waveform shown in fig 4c.
- This signal is in turn low-pass (LP) filtered 33 in a fourth step, resulting in the waveform of fig 4d.
- This resulting signal shows the instantaneous changes in voice signal amplitude, and in a fifth step there is carried out a detection, which determines 35 whether the first order derivative of the amplitude, dA/dt, exceeds a predetermined positive or negative threshold, cf. fig 4e. If a positive threshold is exceeded, it is determined that speech has begun, and if a negative threshold is exceeded it is determined that speech has ended. This to a great extent corresponds to comparing a parameter corresponding to the first order derivative of the RMS of the AC component in a voice signal to a positive and a negative threshold, respectively. The system may react on this as will now be described.
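The five steps of fig 3 can be sketched end to end as follows. This is an illustrative digital approximation: the one-pole filter coefficient `alpha`, the threshold values, and all names are assumptions, and the patent's filtering could equally be analog, as noted above.

```python
import numpy as np

def detection_signal(x, alpha=0.05):
    """Steps 1-4 of fig 3: detect and remove the DC component (fig 4b),
    rectify (fig 4c), and low-pass filter (fig 4d) to obtain the
    detection signal. alpha is an illustrative one-pole LP coefficient."""
    x = np.asarray(x, dtype=float)
    ac = x - x.mean()              # steps 1-2: remove DC component
    rect = np.abs(ac)              # step 3: rectification
    env = np.empty_like(rect)      # step 4: one-pole low-pass filter
    acc = 0.0
    for i, v in enumerate(rect):
        acc += alpha * (v - acc)
        env[i] = acc
    return env

def speech_events(env, pos_thresh, neg_thresh):
    """Step 5: compare the first order derivative dA/dt of the detection
    signal to a positive and a negative threshold (fig 4e). Returns two
    boolean arrays marking where speech begins and where it ends."""
    d = np.diff(env)
    started = d > pos_thresh       # positive threshold exceeded: speech begun
    ended = d < neg_thresh         # negative threshold exceeded: speech ended
    return started, ended
```

A practical system would add smoothing or hysteresis so that momentary dips within running speech do not trigger spurious "ended" events; the sketch only demonstrates the pipeline's structure.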
- the disclosed features allow for automatic switching between interviewer and interviewee, and vice versa. This implies an improvement, as a conversation can flow much more freely than if manual control, e.g. by the interviewer, were used. Needless to say, it is possible to override this automatic switching and carry out such manual control if needed in a specific interview situation.
- the speech quality will be much improved, as one party (interviewer/interviewee) at a time talks. This is particularly useful if the conversation is recorded e.g. as evidence. In that case it may also be possible to analyse at a later stage how the interpretation affects e.g. questions raised and answers produced in order to achieve higher legal certainty.
- the system remains in one connection pattern, e.g. interviewee-to-interviewer, as long as the interviewee speaks.
- the system may wait for a short waiting time and then switch to the reversed connection pattern in order to allow the interviewer to talk.
- the system may then produce optical and/or acoustic feedback to the users to indicate that switching has taken place and that the previously silent part can begin to talk. Different feedback features are discussed later.
- the system may remain in the first connection pattern until the interviewee is ready.
- This procedure can be summarized in an example flow chart as shown in fig 5.
- the system continuously, or at regular intervals, tests whether interviewer speech is detected 39 or whether interviewee speech is detected 41. If for instance interviewer speech is detected, the system switches 43 to an interviewer-translator-interviewee pattern as described before, and provides feedback via the user interface, as will be discussed, such that the interviewer and interviewee become aware of the switching.
- the system is thus in an interviewer-active state 45, where preferably any voice signals from the interviewee are shut down or at least substantially attenuated. If the interviewee attempts to talk, a feedback signal, e.g. optical or acoustic, may further be provided to the interviewee to inform him or her to wait. In the interviewer-active state, the interviewer may thus talk for as long as needed without being interrupted. In the interviewer-active state 45, it is regularly tested 47 whether the interviewer becomes inactive as discussed before. If the interviewer is inactive for a predetermined time period T, where T is typically in the range 0.5-5 s and preferably about 1 s, it is assumed that the interviewer has stopped talking.
- it may be the case that the interpreter lags a few seconds behind. It is therefore optionally also tested 49 whether the interpreter becomes inactive for a time period, which may also be T seconds, even if this is not necessary. If this does not happen, it is assumed that the interviewer has begun talking again, and the system remains in the interviewer-active state 45. If the interpreter however is silent long enough, the system returns to the idle state 37, and this is indicated by the user interface as feedback to the participants.
- the system may operate in the same way if, in the idle state 37, it is determined that the interviewee begins to talk, and the system enters an interviewee-active state 51. In this way, an interview situation can be handled very smoothly, and can be readily dealt with by the interpreter.
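The switching procedure of fig 5, including the hold time T and the allowance for interpreter lag, can be sketched as a small state machine. This is an interpretation of the flow chart, not the patent's implementation; the class, method, and participant names are assumptions, and the caller is assumed to feed in speech-detection events and periodic ticks with timestamps.

```python
IDLE, INTERVIEWER_ACTIVE, INTERVIEWEE_ACTIVE = "idle", "interviewer", "interviewee"

class InterpretationSwitch:
    """State machine for the fig 5 procedure: idle state 37,
    interviewer-active state 45, interviewee-active state 51."""

    def __init__(self, hold_time=1.0):
        self.hold_time = hold_time     # T: typically 0.5-5 s, about 1 s
        self.state = IDLE
        self.last_speech = None

    def on_speech(self, speaker, now):
        """Called when speech is detected from a participant."""
        if self.state == IDLE:
            if speaker == "interviewer":
                self.state = INTERVIEWER_ACTIVE   # switch 43
            elif speaker == "interviewee":
                self.state = INTERVIEWEE_ACTIVE   # enter state 51
        # Speech from the active party or the interpreter (who may lag a
        # few seconds) resets the hold timer; the other party is ignored.
        if speaker in (self._active_party(), "interpreter"):
            self.last_speech = now

    def on_tick(self, now):
        """Called periodically; return to idle (state 37) once both the
        active party and the interpreter have been silent for hold_time."""
        if self.state != IDLE and self.last_speech is not None:
            if now - self.last_speech >= self.hold_time:
                self.state = IDLE

    def _active_party(self):
        return {INTERVIEWER_ACTIVE: "interviewer",
                INTERVIEWEE_ACTIVE: "interviewee"}.get(self.state)
```

Keying the timer on both the active party and the interpreter mirrors the optional test 49 above: the system does not fall back to idle while the interpreter is still catching up.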
- the user interface 19 may typically include a keyboard 53, a screen 55, such as an LCD screen, and some indicator lamps 57.
- the keyboard 53 may be used to select different settings such as the above-described automatic switching or the previously mentioned conference mode. It can also be used to manually control switching if needed.
- Feedback to the users regarding which state the system is in may be provided in different ways, e.g. using the screen 55 or the indicator lamps 57.
- One efficient way of giving feedback is to use the screen's backlight colour. For instance, in the interviewer-active mode, the backlight may be red, while it is green in the idle mode. Other variations of course exist.
- a user interface may also be useful to choose the language e.g. the interviewee wishes to speak. For instance, a pressure sensitive screen may initially show a number of nations' flags, each representing a specific language. The interviewee may then tap a desired flag/language, and a suitable interpreter is connected to the system accordingly.
- the I/O subsystem 17 may connect the system to other functions. For instance, it is possible to provide additional feedback lights on each user's headset or the like to enhance the feedback function. Further connections to storage solutions such as a hard drive, etc. may be provided to store interview sound data produced during an interview. It is possible to store voice data in a number of separate channels.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Circuit For Audible Band Transducer (AREA)
- Telephonic Communication Services (AREA)
Abstract
Description
Claims
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SE1450295A SE1450295A1 (en) | 2014-03-17 | 2014-03-17 | System and method of simultaneous interpretation |
PCT/SE2015/050284 WO2015142249A2 (en) | 2014-03-17 | 2015-03-13 | Interpretation system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3120534A2 true EP3120534A2 (en) | 2017-01-25 |
EP3120534A4 EP3120534A4 (en) | 2017-10-25 |
Family
ID=54145455
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP15765582.0A Withdrawn EP3120534A4 (en) | 2014-03-17 | 2015-03-13 | Interpretation system and method |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP3120534A4 (en) |
SE (1) | SE1450295A1 (en) |
WO (1) | WO2015142249A2 (en) |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5867574A (en) * | 1997-05-19 | 1999-02-02 | Lucent Technologies Inc. | Voice activity detection system and method |
US20030088622A1 (en) * | 2001-11-04 | 2003-05-08 | Jenq-Neng Hwang | Efficient and robust adaptive algorithm for silence detection in real-time conferencing |
CA2501002A1 (en) * | 2002-09-27 | 2004-04-08 | Ginganet Corporation | Telephone interpretation system |
AU2003266594B2 (en) * | 2002-09-27 | 2007-10-04 | Ginganet Corporation | Telephone interpretation aid device and telephone interpretation system using the same |
WO2004030328A1 (en) * | 2002-09-27 | 2004-04-08 | Ginganet Corporation | Video telephone interpretation system and video telephone interpretation method |
US7826805B2 (en) * | 2003-11-11 | 2010-11-02 | Matech, Inc. | Automatic-switching wireless communication device |
CN1937664B (en) * | 2006-09-30 | 2010-11-10 | 华为技术有限公司 | System and method for realizing multi-language conference |
US8041018B2 (en) * | 2007-12-03 | 2011-10-18 | Samuel Joseph Wald | System and method for establishing a conference in two or more different languages |
GB2469329A (en) * | 2009-04-09 | 2010-10-13 | Webinterpret Sas | Combining an interpreted voice signal with the original voice signal at a sound level lower than the original sound level before sending to the other user |
- 2014
- 2014-03-17 SE SE1450295A patent/SE1450295A1/en not_active Application Discontinuation
- 2015
- 2015-03-13 EP EP15765582.0A patent/EP3120534A4/en not_active Withdrawn
- 2015-03-13 WO PCT/SE2015/050284 patent/WO2015142249A2/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
SE1450295A1 (en) | 2015-09-18 |
WO2015142249A2 (en) | 2015-09-24 |
EP3120534A4 (en) | 2017-10-25 |
WO2015142249A3 (en) | 2015-11-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10553235B2 (en) | Transparent near-end user control over far-end speech enhancement processing | |
US9253303B2 (en) | Signal processing apparatus and storage medium | |
US10499136B2 (en) | Providing isolation from distractions | |
US10574804B2 (en) | Automatic volume control of a voice signal provided to a captioning communication service | |
US20190066710A1 (en) | Transparent near-end user control over far-end speech enhancement processing | |
EP2466885B1 (en) | Video muting | |
US20100184488A1 (en) | Sound signal adjuster adjusting the sound volume of a distal end voice signal responsively to proximal background noise | |
US9826085B2 (en) | Audio signal processing in a communication system | |
EP3430819A1 (en) | Earphones having separate microphones for binaural recordings and for telephoning | |
US9967813B1 (en) | Managing communication sessions with respect to multiple transport media | |
US20120140918A1 (en) | System and method for echo reduction in audio and video telecommunications over a network | |
US20180269842A1 (en) | Volume-dependent automatic gain control | |
WO2012175964A2 (en) | Multi-party teleconference methods and systems | |
US10483933B2 (en) | Amplification adjustment in communication devices | |
WO2015142249A2 (en) | Interpretation system and method | |
CN115348411A (en) | Method and system for processing remotely active voice during a call | |
US20090310520A1 (en) | Wideband telephone conference system interface | |
US20150201057A1 (en) | Method of processing telephone voice output and earphone | |
TWI639344B (en) | Sound collection equipment having function of answering incoming calls and control method of sound collection | |
DE3426815A1 (en) | Level adjustment for a telephone station with a hands-free facility | |
WO2019056300A1 (en) | Adjustment system and method for automatically switching audio mode during call | |
JP2010034815A (en) | Sound output device and communication system | |
US10264116B2 (en) | Virtual duplex operation | |
JPH11275243A (en) | Loud speaker type interphone system | |
KR20220111521A (en) | Ambient noise reduction system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20161011 |
|
AK | Designated contracting states |
Kind code of ref document: A2 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
AX | Request for extension of the european patent |
Extension state: BA ME |
|
DAV | Request for validation of the european patent (deleted) | ||
DAX | Request for extension of the european patent (deleted) | ||
A4 | Supplementary search report drawn up and despatched |
Effective date: 20170922 |
|
RIC1 | Information provided on ipc code assigned before grant |
Ipc: H04N 7/15 20060101ALI20170918BHEP Ipc: H04M 3/00 20060101ALI20170918BHEP Ipc: H04M 3/56 20060101AFI20170918BHEP Ipc: G06F 17/28 20060101ALI20170918BHEP Ipc: H04M 9/10 20060101ALI20170918BHEP |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20180421 |