GB2396271A - A user terminal and method for voice communication - Google Patents

A user terminal and method for voice communication

Info

Publication number
GB2396271A
Authority
GB
United Kingdom
Prior art keywords
speech
user terminal
application
transmission
voice activity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB0228765A
Other versions
GB2396271B (en)
GB0228765D0 (en)
Inventor
David Pearce
Holly Louise Kelleher
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Motorola Solutions Inc
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc filed Critical Motorola Inc
Priority to GB0228765A priority Critical patent/GB2396271B/en
Publication of GB0228765D0 publication Critical patent/GB0228765D0/en
Priority to PCT/EP2003/050686 priority patent/WO2004053837A1/en
Priority to AU2003282110A priority patent/AU2003282110A1/en
Publication of GB2396271A publication Critical patent/GB2396271A/en
Application granted granted Critical
Publication of GB2396271B publication Critical patent/GB2396271B/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G10L 19/18 Vocoders using multiple modes
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals
    • G10L 2025/783 Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A user terminal (2), for use in a speech recognition system, the user terminal (2) comprising a client application, wherein, in use, the client application is connected to a server application (54) over a network (52), the server application performing speech recognition processing, communication between the client application and the server application depending on communication settings, wherein the user terminal (2) comprises a voice activity detector, the voice activity detector generating information that indicates which of a plurality of states (T1,S,T2) is represented by user utterance data; and the user terminal (2) is adapted to choose the communication settings, at any or all stages of the communication link between the client application and the server application, in dependence on the indicated state of the utterance data, the available communication settings comprising at least high quality transmission (H3) and low quality transmission (L1).

Description

A user terminal and method for voice communication

Technical Field

The present invention relates to the field of speech transmission.

Background

In speech transmission, delay is an obvious aspect of quality to the user. The degree of impact varies from application to application, but it would be desirable to minimise the total transmission delay between a client device and a server application (either a voice enabled service or a router to another user). An example of a client device is a portable radio communications device, such as a mobile or portable radio, or a mobile phone. This device may be wirelessly linked to a network, the server being part of the network.
An example of transmission to a voice enabled service is provided by distributed speech recognition.
In a distributed speech recognition (DSR) system, the front-end processing (feature extraction) is performed by the client application in the user terminal. The back-end processing (speech recognition) is performed at a server somewhere in a network. The front-end features are transmitted over the network from the client to the server.
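By way of illustration only — none of the following code is part of the patent disclosure — the division of labour in a DSR system can be sketched as below. The toy feature extractor, names, wire format and server address are hypothetical placeholders; a real client would use a standardised front-end and transport.

    # Illustrative DSR client sketch: frame the audio, compute front-end
    # features, send them to the server, which runs the recogniser.
    import math
    import socket
    import struct

    SERVER = ("dsr-server.example.net", 5004)  # hypothetical server address

    def extract_features(frame_samples):
        # Toy front-end: return the frame's log-energy. A real DSR
        # front-end computes a cepstral feature vector per frame.
        energy = sum(s * s for s in frame_samples) / max(len(frame_samples), 1)
        return [math.log(energy + 1e-9)]

    def send_features(sock, seq, features):
        # Sequence number plus feature vector, sent as one datagram.
        payload = struct.pack("!I", seq) + struct.pack("!%df" % len(features), *features)
        sock.sendto(payload, SERVER)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)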
The network may either be terrestrial, such as the Internet, or wireless, such as GPRS or 3G. For terrestrial networks the bandwidth is comparatively high, error rates are comparatively low, and consequently a good recognition performance is obtained. In comparative terms, bandwidth tends to be lower for wireless networks, and the transmission error rates higher, resulting in poorer recognition performance.
The user experience of DSR is strongly influenced by two important factors. The first is the recognition performance, which is dependent on the quality (integrity) of the data. The second is the latency in recognition due to transmission delays. In existing implementations there is a trade-off between the recognition performance and latency, especially for poor quality transmission channels.
Mitigating techniques, such as allowing packet retransmissions over the network, can reduce the performance degradation caused by transmission errors. However, each packet retransmission increases the delay. Thus the designer is often left with a choice between high recognition performance at the expense of delay, or accepting lower recognition performance due to the transmission errors, but with faster and dependable response times.
Thus there is perceived to be a need for a more effective means of balancing this trade-off, in order to facilitate optimal quality versus latency transmission for speech.
The signalling schemes of two known prior art arrangements are illustrated in appended figure 1.
The upper part of figure 1 shows an uneven trace, which represents the speech energy received by a user terminal plotted against time. This prior art terminal transmits to a server at a constant high quality level H1. The transmission continues at level H1, even after the received speech energy has fallen to zero.
The system shown in the upper part of figure 1 would continue to use the high quality transmission level until, for example, the user of the terminal 'hung up' the call, thereby terminating the call.
The lower part of figure 1 shows the same trace of speech energy received at the user terminal as in the upper part of figure 1. However, transmission by the user terminal to the server only continues at high quality level H2 for a finite time. The transmission ceases a certain time after the cessation of speech. This transmission scheme is referred to as discontinuous transmission. The time between the cessation of speech and the cessation of transmission at level H2 is referred to as the 'hangover time'.
Summary of the Invention
In accordance with a first aspect of the present invention, there is provided a user terminal, as claimed in claim 1.
In accordance with a second aspect of the present invention, there is provided a method for transmission, as claimed in claim 15. Further aspects of the present invention are defined in the dependent claims.

Brief description of the drawings
Figure 1 illustrates the signalling schemes of two known prior art arrangements;
Figure 2 illustrates the general signalling scheme in accordance with the invention;

Figure 3 is a more detailed illustration of various signals that may be generated by a device in accordance with the invention;

Figure 4 illustrates a determination that may be made by an enhanced version of the invention;

Figure 5 is a flowchart illustrating a method in accordance with the invention;

Figure 6 illustrates a mobile radio communications device, which is one example of the user terminal 2 of the invention;

Figure 7 illustrates a communications system in accordance with the invention.
Detailed description of the preferred embodiment
The present invention alleviates the trade-off between quality and latency. This is done by regularly updating the configuration of the communication process, either at the application level or at the network level, using information about voice activity within a user's utterance.
Speech data is sent over wireless networks using the Real-time Transport Protocol (RTP). A sequence of RTP payloads is used to transport speech data to the recognition application. The speech data represents the user's utterance at the client terminal.

A signalling scheme in accordance with the invention is illustrated in figure 2. The apparatus of the invention comprises a user terminal 2, which will be described in more detail in relation to figure 6. The user terminal 2 is for use in a speech recognition system, and includes a client application, wherein, in use, the client application is connected to a server application 54 over a network 52. This arrangement is illustrated in figure 7.
The server application 54 performs speech recognition processing.

Communication between the client application of the user terminal 2 and the server application 54 depends on communication settings. These communication settings are dynamic, and their state at any particular time depends on the output of a voice activity detector. The voice activity detector is part of the user terminal 2. The voice activity detector provides an indication of the state of an utterance on a frame-by-frame basis. Voice activity detectors are themselves known, and therefore will not be described in further detail here.
The voice activity detector generates information that indicates which of a plurality of states is represented by user utterance data. The user terminal 2 is adapted to choose the communication settings, at any or all stages of the communication link between the client application and the server application, in dependence on the indicated state of the utterance data. The available communication settings comprise at least:
(i) high quality transmission H3; and
(ii) low quality transmission L1.
Figure 2 illustrates the high quality transmission, shown as H3. Figure 2 also illustrates the low quality transmission, shown as L1. Transmission at quality L1 commences at the end of the utterance, which transition will be indicated by the output of the voice activity detector transitioning at this point.
The low quality transmission L1 in figure 2 is a period in which transmission of data packets from the user terminal 2 to the server application 54 can still occur. This period L1 allows relatively rapid transmission, since the transmission is at low quality. The transmission at L1 will ensure that the system has caught up with all necessary packet transmissions by the time that an utterance is at an end. It is also advantageous over the scheme shown in the lower trace of figure 1, because the scheme of figure 2 does not completely stop transmission. If it did, a substantial time would be needed to re-commence transmission once again.
Figure 3 illustrates the energy, SD and AFE payload values that may be observed and generated in a user terminal in accordance with the invention.
The upper 'energy' trace of figure 3 shows that possible speech energy is identifiable between points a and b, d and e, and h and i of the input signal.
The 'SD' trace relates to speech detection. The detection of speech at the client device, the user terminal 2, classifies frames as belonging to speech or non-speech. Non-speech frames may comprise noise or quiet. This is the output of a signal processing algorithm, rather than the actual speech endpoints. In particular these positions may be different in high background noise. The speech detection algorithm includes any handling of intra-word gaps, and intra-word silence is marked as speech. Examples of intra-word gaps include stop gaps before plosives or unvoiced phonemes that may have a low energy in the input signal. This low energy may be due to reduced bandwidth, or being hidden in background noise.
At this point, it is worth defining an utterance segment and a speech segment, for the purposes of the present invention. An utterance segment is a group of one or more spoken words, grouped together based on their temporal proximity. This is defined by the constraint that the start of a word in an utterance is not separated from the end of the previous word by more than a specified duration. A speech segment is a group of one or more spoken words resulting from speech detection, plus additional frames at the start and end. A speech segment contains all the frames that are needed by the recogniser to achieve good recognition performance. Typically extra frames are needed before and after speech detection to compensate for SD overshoot or undershoot in background noise. These extra frames correspond to c-d, e-f, g-h and i-j in the lower trace of figure 3, and T1 and T2 in figure 4.
In figure 3, the resulting payload in the lower 'AFE payload' trace begins at a point c, before the start of speech point d. The point c is where the voice activity detector first indicates speech. This portion of speech continues to point e, but the voice activity detector will continue to indicate speech until point f. The time periods c to d and e to f are dealt with more thoroughly in connection with figure 4, below.
The 'zig-zag' line from e to h indicates a time for which the present invention may judge that one utterance is continuing, even though the voice activity detector ceases indicating speech at point f.
The present invention may judge the entire time period from c to the end of the zig-zag after point j as being one utterance. The invention can use:
(i) high quality transmission for the periods c-f and g-j; and
(ii) low quality transmission for at least the period between f and g, and for the period beyond point j until the end of the zig-zag.
Figure 4 below explains more about how these judgements are made.
Figure 4 illustrates voice activity states. The upper trace of figure 4 shows that voice activity detection information may indicate one of the following states for the current frame of utterance data:
i) Speech T1, S, T2; or
ii) Intra-speech gap G.

In this case, the user terminal 2 will be adapted to choose:
(i) high quality transmission from the client application to the server application, when the voice activity detection indicates speech T1, S, T2; or
(ii) low quality transmission from the client application to the server application, when the voice activity detection indicates an intra-speech gap G.

In the upper trace of figure 4, speech occurs for a period S. This period is preceded by a short period T1 and followed by a short period T2.
The voice activity detector is adapted to indicate the presence of speech whilst either speech S is received from the user of terminal 2, or within the first threshold period T1 before speech commences, or until the second threshold period T2 has elapsed since speech was last received.

The first threshold period T1 is commensurate with typical speech attack times, preferably about 50ms. The second threshold period T2 is commensurate with typical speech decay times, preferably about 150ms. The periods T1 and T2 can be viewed as delays within the voice activity detector circuitry.

As also illustrated in the upper part of figure 4, the voice activity detector can indicate the presence of an intra-speech gap G. This occurs when the second threshold period T2 has elapsed since speech was last received, and until the start of the first threshold period T1 before speech commences. The voice activity detector is adapted to continue to indicate an intra-speech gap G whilst either silence or noise is received.
The arrangement of the present invention may employ discontinuous transmission. Such a transmission scheme would mean that the user terminal would cease even low quality transmission L1 under certain conditions. These conditions can be set by means of a third time threshold T3. If the voice activity detector indicates that an intra-speech gap G exceeds the threshold period T3, then the user terminal 2 may be arranged to cease transmission.
The lower trace of figure 4 shows an example of how the threshold T3 can operate. In the gap period G1, threshold T3 is not exceeded. Gap G1 might be the pause in an utterance where the speaker is drawing breath. In gap period G2, threshold T3 is exceeded. Gap G2 might be a gap of several seconds, during which a speaker is looking for a new page of notes from which to read.

In effect then, when the intra-speech gap G exceeds a threshold period T3, the voice activity detection information further indicates the end of the complete utterance for the current frame of utterance data. The user terminal is adapted to discontinue transmission from the client to the application server at this point.
The period T3 may typically be in the range of 1-3 seconds, preferably being about 1.5 seconds.
In accordance with the present invention, the user terminal 2 may be adapted to alter communication settings that control any or all of the following, in dependence on the voice activity detection information:
i) Application level protocol;
ii) Transmission quality of service; and
iii) Error mitigation scheme.
This control of the transmission quality of service may take the form of requesting or allowing a greater number of permitted retransmissions when a speech packet T1;S;T2 is indicated than when an intra-speech gap G packet is indicated.

Alternatively, the control of the transmission quality of service may comprise the assignment of different coding schemes, using a more robust coding scheme when a speech packet T1;S;T2 is indicated than when an intra-speech gap G packet is indicated.
In a further alternative, the control of the application level protocol may be achieved by the preference to use TCP when a speech packet T1;S;T2 is indicated, and the preference to use UDP when an intra-speech gap G packet is indicated.
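A hedged sketch of how the indicated state might drive these choices is given below. The TCP/UDP preference and the retransmission idea come from the text itself; the concrete numbers and field names are illustrative assumptions, not part of the disclosure.

    # Sketch: map the indicated state of the utterance data to settings.
    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class State(Enum):
        SPEECH = "T1/S/T2"        # speech plus attack/decay margins
        GAP = "G"                 # intra-speech gap
        END = "end-of-utterance"  # gap exceeded T3

    @dataclass
    class CommSettings:
        protocol: str             # application level protocol
        max_retransmissions: int  # transmission quality of service
        robust_coding: bool       # coding scheme choice

    def settings_for(state: State) -> Optional[CommSettings]:
        if state is State.SPEECH:
            # Emphasise recognition performance: guaranteed, protected delivery.
            return CommSettings(protocol="TCP", max_retransmissions=3, robust_coding=True)
        if state is State.GAP:
            # Emphasise low latency: tolerate loss during gaps.
            return CommSettings(protocol="UDP", max_retransmissions=0, robust_coding=False)
        return None  # END: discontinue transmission altogether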
Figure 5 shows a flowchart, which illustrates a method in accordance with the invention.

The method for transmission between the user terminal 2 and the server application 54 involves the user terminal 2, with its client application. In use, the client application is connected to the server application over the network 52, the server application performing speech recognition processing, and communication between the client application and the server application depending on communication settings. The flowchart of figure 5 shows a method of deriving those settings. In use, the voice activity detector of the user terminal generates information that indicates which of a plurality of states is represented by the user utterance data. The user terminal 2 chooses communication settings, at any or all stages of the communication link between the client application and the server application, in dependence on the indicated state of the utterance data, the available communication settings comprising at least:
(i) high quality transmission; and
(ii) low quality transmission.
In figure 5, signal 510 is provided to voice activity detector 512. If decision box 514 indicates that speech is present, then a clock is reset to zero, box 518. Decision box 514 indicates that speech is present during the periods T1, S and T2. This is the time for which the voice activity detector indicates that speech is present.
If decision box 514 indicates that speech is not present, then the clock is incremented by one, box 516.
If in box 520 the clock value is found not to be greater than zero, then the voice activity detector indicates that speech is present, see box 522, and the flowchart returns to box 514. If in box 520 the clock value is found to be greater than zero, then a check is made in box 524 as to whether the clock value exceeds a threshold E. If yes, then the method determines that the utterance has ended, see box 528. If the result of box 524 is no, then an indication can be made that there is an intra-speech gap, see box 526.
The indication of an intra-speech gap clearly corresponds to gap 'G' shown in Fig. 4. The clock threshold E can be set to determine how large a gap G is treated as being just part of one utterance, or is treated as being the break between different utterances. So the value E determines the threshold T3.
The value of E could, for example, correspond to a time greater than G1 in figure 4, but less than the time corresponding to gap G2. Thus the flowchart of figure 5 would classify gap G1 as simply part of one continuous utterance, see box 526 on figure 5. Gap G2 however would be the end of an utterance, box 528 on figure 5.
Clearly therefore the voice activity detection information from figure 5 may indicate, for the current frame of utterance data, either:
(i) Speech (T1, S, T2); or
(ii) Intra-speech gap (G).
The user terminal 2 then may choose communication settings that provide the following:
(i) high quality transmission from the client application to the server application, when the voice activity detection indicates speech (T1, S, T2); or
(ii) low quality transmission from the client application to the server application, when the voice activity detection indicates an intra-speech gap G.

When the clock value exceeds E, then intra-speech gap G exceeds a threshold period T3. When G exceeds T3, the voice activity detection information further indicates the end of the complete utterance for the current frame of utterance data. The user terminal can then discontinue transmission from the client to the application server. The period T3 may be in the range of 1-3 seconds, preferably being about 1.5 seconds.
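The flowchart's clock logic translates directly into a small state machine, sketched below. The 10 ms frame period and the expression of E in frames are illustrative assumptions; the boxes referenced in the comments are those of figure 5.

    # Sketch of the figure-5 logic: the clock counts non-speech frames; it
    # is reset on speech (boxes 514/518), incremented otherwise (box 516),
    # and compared against E (box 524) to separate intra-speech gaps from
    # the end of the utterance. Frame period and E are illustrative.
    FRAME_MS = 10
    E = 1500 // FRAME_MS   # a T3 of 1.5 s, expressed in frames

    def classify_frames(vad_flags):
        # vad_flags: iterable of booleans, True while the VAD indicates
        # speech (periods T1, S and T2). Yields one label per frame.
        clock = 0
        for speech in vad_flags:
            if speech:
                clock = 0                    # box 518: reset clock
            else:
                clock += 1                   # box 516: increment clock
            if clock == 0:
                yield "speech"               # box 522: T1/S/T2
            elif clock > E:
                yield "utterance-ended"      # box 528: gap exceeded E (T3)
            else:
                yield "intra-speech-gap"     # box 526: gap G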
Figure 6 illustrates a mobile radio communications device, which is one example of the user terminal 2 of the invention. The user terminal may for example be either a portable or a mobile radio.
The radio 2 of figure 6 can transmit speech from a user of the radio. The radio comprises a microphone 34, which provides a signal for transmission by the radio. The signal from the microphone is transmitted by transmission circuit 22. Transmission circuit 22 transmits via switch 24 and antenna 26.
The transmitter 2 also has a controller 30 and a read only memory (ROM) 32. Controller 30 may be a microprocessor. ROM 32 is a permanent memory, and may be a non-volatile Electrically Erasable Programmable Read Only Memory (EEPROM).
The radio 2 also comprises a display 42 and keypad 44, which serve as part of the user interface circuitry of the radio. At least the keypad 44 portion of the user interface circuitry is activatable by the user. Voice activation of the radio, or other means of interaction with a user, may also be employed.
Signals received by the radio are routed by the switch 24 to receiving circuitry 28. From there, the received signals are routed to controller 30 and audio processing circuitry 38. A loudspeaker 40 is connected to audio circuit 38. Loudspeaker 40 forms a further part of the user interface. Controller 30 performs the function of the voice activity detector of the present invention.
A data terminal 36 may be provided. Terminal 36 would provide a signal comprising data for transmission by transmitter circuit 22, switch 24 and antenna 26.
Figure 7 illustrates the relationship between the user terminal 2 of the present invention, and the network 52 and server application 54. The server application is either a Distributed Speech Recognition (DSR) application or an Automatic Speech Recognition (ASR) application.
The user terminal 2 may be adapted to communicate with the server via a packet-switched radio transmission network, the indicated state of an entire packet being determined by the indicated states of the data frames within the packet.
User terminal 2 may take the form of a portable- or mobile radio, a wirelessly linked lap-top PC, Personal Digital Assistant or personal organiser, or a mobile telephone.
Network 52 and one or more user terminals 2 comprise a communication system.
Discussion of the invention and its effects

In the enhanced arrangement of the invention explained in connection with figure 4, there is no transmission during long pauses between utterances.
The total length of the data segment to be transmitted consists of a whole utterance. Each whole utterance is made up of both speech and the gaps within the speech, provided that those gaps do not exceed period T3. The length of the gap determines the segmentation: for gaps of up to threshold T3 in duration, speech instances are categorized as being part of the same utterance. Speech instances are categorized as part of a new utterance if the gap is longer than this threshold.
There is a period after the end of the last word spoken when the system at the terminal 2 will wait to see if there is further speech that is part of the same utterance, or whether this utterance is complete, based on the segmentation threshold E, T3. Consequently, a complete utterance consists of the actual speech together with intra-speech gaps of up to (for example) 1.5 seconds between words, and a final gap of typically 1.5 seconds at the end.
The impact of transmission errors on quality is much higher during speech frames than during intra-speech gaps. These errors are such as to adversely affect a user's perception of speech, or the performance of a speech recognition system.

Frames designated 'speech' in this scheme may include a number of frames preceding/following actual detected speech to form a buffer. The respective number of frames would be commensurate with typical speech attack and decay times; typically 50ms for attack and 150ms for decay, but varying with the vocabulary used on the system. The preceding speech buffer would require a small delay. In an alternative embodiment, the voice activity detector may indicate confidence in these states, and/or sub-categorise the states, for example sub-categorising speech as voiced and unvoiced speech.
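As a sketch of this buffering, using the quoted 50 ms attack and 150 ms decay figures and an assumed 10 ms frame period (the frame period is an assumption, not stated in the text):

    # Pad detected speech regions with attack/decay buffers so that frames
    # just before and after detection are also treated as 'speech'.
    FRAME_MS = 10
    ATTACK_FRAMES = 50 // FRAME_MS    # frames prepended before detection
    DECAY_FRAMES = 150 // FRAME_MS    # frames appended after detection

    def pad_speech(flags, attack=ATTACK_FRAMES, decay=DECAY_FRAMES):
        # flags: list of booleans from speech detection. Returns a list in
        # which each True run is widened by the attack and decay buffers.
        padded = list(flags)
        n = len(flags)
        for i, f in enumerate(flags):
            if f:
                for j in range(max(0, i - attack), min(n, i + decay + 1)):
                    padded[j] = True
        return padded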
In the preferred embodiment, the indicated states of each frame in the utterance are used to control the trade-off between recognition performance and latency. This is done by selecting communication settings emphasising recognition performance during speech (T1,S,T2), and selecting communication settings emphasising low latency during intra-speech gaps (G). This selection must be done as permitted by current transmission conditions, such as packet data size (i.e. a single packet of data may span both conditions), or service availability. For communication settings operating on whole packets, a state for the whole packet can be determined from the packet content.

In the case of a packet spanning several states, in a preferred embodiment communication settings affecting the whole packet would be made to emphasise quality if speech was indicated in the packet.
In an alternative embodiment, simple rules can be employed to determine more sophisticated decisions for the situation of a packet spanning several states. An example would be deciding whether the amount of speech in a packet is significant depending on the percentage of speech frames within the packet, and/or whether they are contiguous frames. Clearly, a person skilled in the art could construct rules appropriate to the circumstances.
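One such simple rule, sketched with illustrative thresholds (the 20% fraction and the contiguity test are assumptions, offered only as an instance of the kind of rule envisaged):

    # Decide a whole packet's indicated state from its frame states.
    def packet_state(frame_states, min_speech_fraction=0.2):
        speech_flags = [s == "speech" for s in frame_states]
        if not any(speech_flags):
            return "gap"
        fraction = sum(speech_flags) / len(speech_flags)
        # Longest contiguous run of speech frames in the packet.
        longest = run = 0
        for f in speech_flags:
            run = run + 1 if f else 0
            longest = max(longest, run)
        contiguous = longest >= 3  # e.g. at least 30 ms of contiguous speech
        return "speech" if (fraction >= min_speech_fraction or contiguous) else "gap"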
Hereinafter, the 'indicated state' refers to either the indicated state of the data frame or the indicated state of the data packet as appropriate.
In one embodiment, an example of application level protocol control would be to choose between TCP or UDP protocols depending on the indicated state of the packet. This would involve using the TCP protocol for the speech components of the utterance, which would guarantee their transmission but can incur latency. It would conversely involve using UDP for the intra-speech gaps, which would risk their loss in the network, but reduce overall latency. Clearly, any appropriate protocols available on a given network may be used in a similar manner, if they exhibit similar trade-offs.

An example of transmission quality of service control would be to define the number of permitted retransmissions for a packet depending on the indicated state of the packet. A packet containing a significant amount of speech would be permitted more retransmissions than one predominantly comprising an intra-speech gap.
An additional example of transmission quality of service control would be to exploit encoding properties of the host network. For example, GPRS provides four coding levels, CS1 through to CS4. At one end of the range, CS1 is robust to channel errors but contains relatively little data. At the other end of the range, CS4 is not very robust to channel errors but contains a relatively large amount of data.

Using the more robust coding schemes for speech and the less robust coding schemes for intra-speech gaps would increase the overall payload for a given bandwidth, without compromising the protection given to the speech data, and so reduce latency. The coding decision could be either based on the indicated state of the packet, or the indicated state of the constituent data frames. This would depend on the relative size of the RTP packet and the GPRS transmission blocks.
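A sketch of such a mapping; CS1 to CS4 are the real GPRS coding schemes, but the per-state assignment below is only the illustrative policy the text describes:

    # Pick a GPRS coding scheme from the indicated state: robust low-rate
    # coding (CS1) protects speech; high-rate coding (CS4) moves gap data
    # in fewer radio blocks, reducing latency.
    GPRS_CODING = {
        "speech": "CS1",  # most robust, least data per radio block
        "gap": "CS4",     # least robust, most data per radio block
    }

    def coding_for(state):
        return GPRS_CODING.get(state, "CS1")  # default to the safe choice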
The effect of the control provided by the present invention is to minimise latency, whilst preserving the robustness of the speech within an utterance, thereby providing a more effective means of balancing the recognition performance versus latency trade-off for distributed speech recognition systems.

The above mechanisms of the present invention are employed within the user's terminal 2. However, if the voice activity indication is transmitted to or derivable by the server, one may employ state-dependent schemes at the server also.
An example of error mitigation scheme control based on the transmitted voice activity indication from the user terminal would be to select different schemes depending on the indicated state of the data frames. For intra-speech gaps, low latency but relatively poor methods could be used. Such a method would be, for example, copy-forward error correction. For speech, higher latency methods, that require both the last and next good packet, could be employed. In addition to packet error mitigation, the selection of different schemes could also be used for other recognition server based tasks, such as frame error mitigation and/or the adjustment of recognition complexity parameters (such as beamwidth) within the recogniser itself.
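A sketch of such server-side selection; the function name and the interpolation used for speech frames are illustrative stand-ins for the 'higher latency methods' mentioned above:

    # Server-side error mitigation chosen per lost frame from the indicated
    # state. Copy-forward repeats the last good frame (no added delay); the
    # speech branch waits for the next good frame and interpolates (better
    # quality, extra latency). Both concrete choices are illustrative.
    def conceal_lost_frame(state, last_good, next_good=None):
        if state == "gap" or next_good is None:
            return list(last_good)  # copy-forward error correction
        # Higher-latency method requiring the last and next good frames:
        # linear interpolation between the two feature vectors.
        return [(a + b) / 2 for a, b in zip(last_good, next_good)]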

Claims (18)

Claims
1. A user terminal (2), for use in a speech recognition system, the user terminal (2) comprising a client application, wherein, in use, the client application is connected to a server application (54) over a network (52), the server application performing speech recognition processing, communication between the client application and the server application depending on communication settings, wherein:
a) the user terminal (2) comprises a voice activity detector, the voice activity detector generating information that indicates which of a plurality of states (T1,S,T2) is represented by user utterance data; and
b) the user terminal (2) is adapted to choose the communication settings, at any or all stages of the communication link between the client application and the server application, in dependence on the indicated state of the utterance data, the available communication settings comprising at least:
(i) high quality transmission (H3); and
(ii) low quality transmission (L1).
2. A user terminal (2) in accordance with claim 1, wherein:
a) the voice activity detection information indicates one of the following states for the current frame of utterance data:
i) Speech (T1, S, T2); or
ii) Intra-speech gap (G); and
b) the user terminal (2) is adapted to choose:
(i) high quality transmission (H3) from the client application to the server application (54), when the voice activity detection indicates speech (T1, S, T2); or
(ii) low quality transmission (L1) from the client application to the server application (54), when the voice activity detection indicates an intra-speech gap (G).
3. A user terminal (2) in accordance with either claim 1 or claim 2, wherein, when the intra-speech gap (G) exceeds a threshold period T3:
(i) the voice activity detection information further indicates the end of the complete utterance for the current frame of utterance data; and
(ii) the user terminal is adapted to discontinue transmission from the client to the application server;
the period T3 being in the range of 1-3 seconds, preferably being about 1.5 seconds.
4. A user terminal (2) in accordance with claim 2, wherein the voice activity detector is adapted to continue to indicate an intra-speech gap (G) whilst either silence or noise is received.
5. A user terminal (2) in accordance with claim 2, wherein the voice activity detector is adapted to:
indicate the presence of speech whilst either speech (S) is received, or within a first threshold period (T1) before speech commences, or until a second threshold period (T2) has elapsed since speech was last received; and
indicate the presence of an intra-speech gap (G) only when the second threshold period (T2) has elapsed since speech was last received, and until either the start of the first threshold period (T1) before speech commences, or the duration of the intra-speech gap (G) exceeds the threshold period T3.
6. A user terminal in accordance with claim 5, wherein the first threshold period (T1) is commensurate with typical speech attack times, preferably about 50ms, and the second threshold period (T2) is commensurate with typical speech decay times, preferably about 150ms.
7. A user terminal (2) in accordance with any previous claim, wherein the user terminal is adapted to alter communication settings that control any or all of the following, in dependence on the voice activity detection information:
i) Application level protocol;
ii) Transmission quality of service; and
iii) Error mitigation scheme.
8. A user terminal (2) in accordance with claim 7, wherein the control of the transmission quality of service is characterized by requesting or allowing a greater number of permitted retransmissions when a speech (T1;S;T2) packet is indicated than when an intra-speech gap (G) packet is indicated.
9. A user terminal (2) in accordance with claim 7, wherein the control of the transmission quality of service comprises the assignment of different coding schemes, using a more robust coding scheme when a speech (T1;S;T2) packet is indicated than when an intra-speech gap (G) packet is indicated.
10. A user terminal (2) in accordance with claim 7, wherein the control of the application level protocol is characterized by the preference to use TCP when a speech (T1;S;T2) packet is indicated and the preference to use UDP when an intra-speech gap (G) packet is indicated.
11. A user terminal (2) in accordance with any previous claim, wherein the server application (54) is either a Distributed Speech Recognition (DSR) application or an Automatic Speech Recognition (ASR) application.
12. A user terminal (2) in accordance with any previous claim, adapted to communicate with the server (54) via a packet-switched radio transmission network (52), the indicated state of an entire packet being determined by the indicated states of the data frames within the packet.
13. A portable- or mobile radio, a wirelessly linked lap-top PC, Personal Digital Assistant or personal organiser, or a mobile telephone, comprising a user terminal (2) according to any previous claim.
14. A communication system comprising one or more user terminals (2) in accordance with any previous claim.
15. A method for transmission between a user terminal (2) and a server application (54) of a speech recognition system, the user terminal (2) comprising a client application, wherein, in use, the client application is connected to the server application over a network (52), the server application performing speech recognition processing, communication between the client application and the server application depending on communication settings, wherein, in use:
a) a voice activity detector of the user terminal generates information that indicates which of a plurality of states (T1,S,T2) is represented by user utterance data; and
b) the user terminal (2) chooses communication settings, at any or all stages of the communication link between the client application and the server application (54), in dependence on the indicated state of the utterance data, the available communication settings comprising at least:
(i) high quality transmission (H3); and
(ii) low quality transmission (L1).
16. The method of claim 15, wherein:
a) the voice activity detection information indicates one of the following states for the current frame of utterance data:
(i) Speech (T1, S, T2);
(ii) Intra-speech gap (G); and
b) the user terminal is adapted to choose communication settings that provide the following:
(i) high quality transmission (H3) from the client application to the server application (54), when the voice activity detection indicates speech (T1, S, T2); or
(ii) low quality transmission (L1) from the client application to the server application (54), when the voice activity detection indicates an intra-speech gap (G).
17. The method of any of claims 15-16, wherein, when the intra-speech gap (G) exceeds a threshold period T3:
(i) the voice activity detection information further indicates the end of the complete utterance for the current frame of utterance data; and
(ii) the user terminal (2) is adapted to discontinue transmission from the client to the application server;
the period T3 being in the range of 1-3 seconds, preferably being about 1.5 seconds.
18. A method in accordance with the arrangement of any of figures 2-7 of the drawings, and/or the description thereof.
GB0228765A 2002-12-10 2002-12-10 A user terminal and method for voice communication Expired - Fee Related GB2396271B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB0228765A GB2396271B (en) 2002-12-10 2002-12-10 A user terminal and method for voice communication
PCT/EP2003/050686 WO2004053837A1 (en) 2002-12-10 2003-10-03 A user terminal and method for distributed speech recognition
AU2003282110A AU2003282110A1 (en) 2002-12-10 2003-10-03 A user terminal and method for distributed speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0228765A GB2396271B (en) 2002-12-10 2002-12-10 A user terminal and method for voice communication

Publications (3)

Publication Number Publication Date
GB0228765D0 GB0228765D0 (en) 2003-01-15
GB2396271A true GB2396271A (en) 2004-06-16
GB2396271B GB2396271B (en) 2005-08-10

Family

ID=9949410

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0228765A Expired - Fee Related GB2396271B (en) 2002-12-10 2002-12-10 A user terminal and method for voice communication

Country Status (3)

Country Link
AU (1) AU2003282110A1 (en)
GB (1) GB2396271B (en)
WO (1) WO2004053837A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006077626A1 (en) 2005-01-18 2006-07-27 Fujitsu Limited Speech speed changing method, and speech speed changing device
WO2006082288A1 (en) * 2005-02-04 2006-08-10 France Telecom Method of transmitting end-of-speech marks in a speech recognition system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8451823B2 (en) 2005-12-13 2013-05-28 Nuance Communications, Inc. Distributed off-line voice services

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0680034A1 (en) * 1994-04-28 1995-11-02 Oki Electric Industry Co., Ltd. Mobile radio communication system using a sound or voice activity detector and convolutional coding
US5754554A (en) * 1994-10-28 1998-05-19 Nec Corporation Telephone apparatus for multiplexing digital speech samples and data signals using variable rate speech coding
WO2002043262A1 (en) * 2000-11-22 2002-05-30 Tait Electronics Limited Improvements relating to duplex transmission in mobile radio systems

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ES2225321T3 (en) * 1991-06-11 2005-03-16 Qualcomm Incorporated APPARATUS AND PROCEDURE FOR THE MASK OF ERRORS IN DATA FRAMES.
US7941313B2 (en) * 2001-05-17 2011-05-10 Qualcomm Incorporated System and method for transmitting speech activity information ahead of speech features in a distributed voice recognition system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0680034A1 (en) * 1994-04-28 1995-11-02 Oki Electric Industry Co., Ltd. Mobile radio communication system using a sound or voice activity detector and convolutional coding
US5754554A (en) * 1994-10-28 1998-05-19 Nec Corporation Telephone apparatus for multiplexing digital speech samples and data signals using variable rate speech coding
WO2002043262A1 (en) * 2000-11-22 2002-05-30 Tait Electronics Limited Improvements relating to duplex transmission in mobile radio systems

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006077626A1 (en) 2005-01-18 2006-07-27 Fujitsu Limited Speech speed changing method, and speech speed changing device
EP1840877A1 (en) * 2005-01-18 2007-10-03 Fujitsu Ltd. Speech speed changing method, and speech speed changing device
EP1840877A4 (en) * 2005-01-18 2008-05-21 Fujitsu Ltd Speech speed changing method, and speech speed changing device
US7912710B2 (en) 2005-01-18 2011-03-22 Fujitsu Limited Apparatus and method for changing reproduction speed of speech sound
WO2006082288A1 (en) * 2005-02-04 2006-08-10 France Telecom Method of transmitting end-of-speech marks in a speech recognition system
FR2881867A1 (en) * 2005-02-04 2006-08-11 France Telecom METHOD FOR TRANSMITTING END-OF-SPEECH MARKS IN A SPEECH RECOGNITION SYSTEM

Also Published As

Publication number Publication date
GB2396271B (en) 2005-08-10
AU2003282110A1 (en) 2004-06-30
GB0228765D0 (en) 2003-01-15
WO2004053837A1 (en) 2004-06-24

Similar Documents

Publication Publication Date Title
US7246057B1 (en) System for handling variations in the reception of a speech signal consisting of packets
KR100575193B1 (en) A decoding method and system comprising an adaptive postfilter
TWI390505B (en) Method for discontinuous transmission and accurate reproduction of background noise information
US9047863B2 (en) Systems, methods, apparatus, and computer-readable media for criticality threshold control
EP1017042B1 (en) Voice activity detection driven noise remediator
US6898566B1 (en) Using signal to noise ratio of a speech signal to adjust thresholds for extracting speech parameters for coding the speech signal
EP2055055B1 (en) Adjustment of a jitter memory
US20060217976A1 (en) Adaptive noise state update for a voice activity detector
KR101121212B1 (en) Method of transmitting data in a communication system
EP1838066A2 (en) Jitter buffer controller
KR20030048067A (en) Improved spectral parameter substitution for the frame error concealment in a speech decoder
US7573907B2 (en) Discontinuous transmission of speech signals
EP3815082B1 (en) Adaptive comfort noise parameter determination
CA2408890C (en) System and methods for concealing errors in data transmission
US8631295B2 (en) Error concealment
JP2002237785A (en) Method for detecting sid frame by compensation of human audibility
US7231348B1 (en) Tone detection algorithm for a voice activity detector
US8391313B2 (en) System and method for improved use of voice activity detection
KR101002405B1 (en) Controlling a time-scaling of an audio signal
RU2445737C2 (en) Method of transmitting data in communication system
US8112273B2 (en) Voice activity detection and silence suppression in a packet network
US20080103765A1 (en) Encoder Delay Adjustment
GB2396271A (en) A user terminal and method for voice communication
US11070666B2 (en) Methods and devices for improvements relating to voice quality estimation
US8204753B2 (en) Stabilization and glitch minimization for CCITT recommendation G.726 speech CODEC during packet loss scenarios by regressor control and internal state updates of the decoding process

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20071210