GB2396271A - A user terminal and method for voice communication - Google Patents
- Publication number
- GB2396271A (application GB0228765A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- speech
- user terminal
- application
- transmission
- voice activity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
- G10L19/18—Vocoders using multiple modes
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
Abstract
A user terminal (2), for use in a speech recognition system, the user terminal (2) comprising a client application, wherein, in use, the client application is connected to a server application (54) over a network (52), the server application performing speech recognition processing, communication between the client application and the server application depending on communication settings, wherein the user terminal (2) comprises a voice activity detector, the voice activity detector generating information that indicates which of a plurality of states (T1,S,T2) is represented by user utterance data; and the user terminal (2) is adapted to choose the communication settings, at any or all stages of the communication link between the client application and the server application, in dependence on the indicated state of the utterance data, the available communication settings comprising at least high quality transmission (H3) and low quality transmission (L1).
Description
A user terminal and method for voice communication

Technical Field

The present invention relates to the field of speech transmission.

Background
In speech transmission, delay is an obvious aspect of quality to the user. The degree of impact varies from application to application, but it would be desirable to minimise the total transmission delay between a client device and a server application (either a voice-enabled service or a router to another user). An example of a client device is a portable radio communications device, such as a mobile or portable radio, or a mobile phone. This device may be wirelessly linked to a network, the server being part of the network.
An example of transmission to a voice enabled service is provided by distributed speech recognition.
In a distributed speech recognition (DSR) system, the front-end processing (feature extraction) is performed by the client application in the user terminal. The back-end processing (speech recognition) is performed at a server somewhere in a network. The front-end features are transmitted over the network from the client to the server.
The network may either be terrestrial, such as the Internet, or wireless, such as GPRS or 3G. For terrestrial networks the bandwidth is comparatively high, error rates are comparatively low, and consequently a good recognition performance is obtained. In comparative terms, bandwidth tends to be lower for wireless networks, and the transmission error rates higher, resulting in poorer recognition performance.
The user experience of DSR is strongly influenced by two important factors. The first is the recognition performance, which is dependent on the quality (integrity) of the data. The second is the latency in recognition due to transmission delays. In existing implementations there is a trade-off between the recognition performance and latency, especially for poor quality transmission channels.
Mitigating techniques, such as allowing packet retransmissions over the network, can reduce the performance degradation caused by transmission errors.
However, each packet retransmission increases the delay. Thus the designer is often left with a choice between high recognition performance at the expense of delay, or accepting lower recognition performance due to the transmission errors, but with faster and more dependable response times.
Thus there is perceived to be a need for a more effective means of balancing this trade-off, in order to facilitate optimal quality versus latency transmission for speech.
The signalling schemes of two known prior art arrangements
are illustrated in appended figure 1.
The upper part of figure 1 shows an uneven trace, which represents the speech energy received by a user terminal plotted against time. This prior art terminal transmits to a server at a constant high quality level H1. The transmission continues at level H1, even after the received speech energy has fallen to zero.
The system shown in the upper part of figure 1 would continue to use the high quality transmission level until, for example, the user of the terminal 'hung up' the call, thereby terminating the call.
The lower part of figure 1 shows the same trace of speech energy received at the user terminal as in the upper part of figure 1. However, transmission by the user terminal to the server only continues at high quality level H2 for a finite time. The transmission ceases a certain time after the cessation of speech. This transmission scheme is referred to as discontinuous transmission. The time between the cessation of speech and the cessation of transmission at level H2 is referred to as the 'hangover time'.
Summary of the Invention
In accordance with a first aspect of the present invention, there is provided a user terminal, as claimed in claim 1.
In accordance with a second aspect of the present invention, there is provided a method for transmission, as claimed in claim 15. Further aspects of the present invention are defined in the dependent claims.

Brief description of the drawings
Figure 1 illustrates the signalling schemes of two known prior art arrangements;
Figure 2 illustrates the general signalling scheme in accordance with the invention;

Figure 3 is a more detailed illustration of various signals that may be generated by a device in accordance with the invention;

Figure 4 illustrates a determination that may be made by an enhanced version of the invention;

Figure 5 is a flowchart illustrating a method in accordance with the invention;

Figure 6 illustrates a mobile radio communications device, which is one example of the user terminal 2 of the invention;

Figure 7 illustrates a communications system in accordance with the invention.
Detailed description of the preferred embodiment
The present invention alleviates the trade-off between quality and latency. This is done by regularly updating the configuration of the communication process, either at the application level or at the network level, using information about voice activity within a user's utterance.
Speech data is sent over wireless networks using the Real-time Transport Protocol (RTP). A sequence of RTP payloads is used to transport speech data to the recognition application. The speech data represents the user's utterance at the client terminal.

A signalling scheme in accordance with the invention is illustrated in figure 2. The apparatus of the invention comprises a user terminal 2, which will be described in more detail in relation to figure 6. The user terminal 2 is for use in a speech recognition system, and includes a client application, wherein, in use, the client application is connected to a server application 54 over a network 52. This arrangement is illustrated in figure 7.
The server application 54 performs speech recognition processing.

Communication between the client application of the user terminal 2 and the server application 54 depends on communication settings. These communication settings are dynamic, and their state at any particular time depends on the output of a voice activity detector. The voice activity detector is part of the user terminal 2. The voice activity detector provides an indication of the state of an utterance on a frame-by-frame basis. Voice activity detectors are themselves known, and therefore will not be described in further detail here.

The voice activity detector generates information that indicates which of a plurality of states is represented by user utterance data. The user terminal 2 is adapted to choose the communication settings, at any or all stages of the communication link between the client application and the server application, in dependence on the indicated state of the utterance data. The available communication settings comprise at least: (i) high quality transmission H3; and (ii) low quality transmission L1.

Figure 2 illustrates the high quality transmission, shown as H3. Figure 2 also illustrates the low quality transmission, shown as L1. Transmission at quality L1 commences at the end of the utterance, which transition will be indicated by the output of the voice activity detector transitioning at this point.
The low quality transmission L1 in figure 2 is a period in which transmission of data packets from the user terminal 2 to the server application 54 can still occur. This period L1 allows relatively rapid transmission, since the transmission is at low quality. The transmission at L1 will ensure that the system has caught up with all necessary packet transmissions by the time that an utterance is at an end. It is also advantageous over the scheme shown in the lower trace of figure 1, because the scheme of figure 2 does not completely stop transmission. If it did, a substantial time would be needed to re-commence transmission once again.
Figure 3 illustrates the energy, SD and AFE payload values that may be observed and generated in a user terminal in accordance with the invention.
The upper 'energy' trace of figure 3 shows that possible speech energy is identifiable between points a and b, d and e, and h and i of the input signal.

The 'SD' trace relates to speech detection. The detection of speech at the client device, the user terminal 2, classifies frames as belonging to speech or non-speech. Non-speech frames may comprise noise or quiet. This is the output of a signal processing algorithm, rather than the actual speech endpoints. In particular these positions may be different in high background noise. The speech detection algorithm includes any handling of intra-word gaps, and intra-word silence is marked as speech. Examples of intra-word gaps include stop gaps before plosives or unvoiced phonemes that may have a low energy in the input signal. This low energy may be due to reduced bandwidth, or being hidden in background noise.
At this point, it is worth defining an utterance segment and a speech segment, for the purposes of the present invention. An utterance segment is a group of one or more spoken words, grouped together based on their temporal proximity. This is defined by the constraint that the start of a word in an utterance is not separated from the end of the previous word by more than a specified duration. A speech segment is a group of one or more spoken words resulting from speech detection, plus additional frames at the start and end. A speech segment contains all the frames that are needed by the recogniser to achieve good recognition performance. Typically extra frames are needed before and after speech detection to compensate for SD overshoot or undershoot in background noise. These extra frames correspond to c-d, e-f, g-h and i-j in the lower trace of figure 3, and T1 and T2 in figure 4.

In figure 3, the resulting payload in the lower 'AFE payload' trace begins at a point c, before the start of speech point d. The point c is where the voice activity detector first indicates speech. This portion of speech continues to point e, but the voice activity detector will continue to indicate speech until point f. The time periods c to d and e to f are dealt with more thoroughly in connection with figure 4, below.
The 'zig-zag' line from e to h indicates a time for which the present invention may judge that one utterance is continuing, even though the voice activity detector ceases indicating speech at point f.

The present invention may judge the entire time period from c to the end of the zig-zag after point j as being one utterance. The invention can use: (i) high quality transmission for the periods c-f and g-j; and (ii) low quality transmission for at least the period between f and g, and for the period beyond point j until the end of the zig-zag. Figure 4 below explains more about how these judgements are made.
Figure 4 illustrates voice activity states. The upper trace of figure 4 shows that voice activity detection information may indicate one of the following states for the current frame of utterance data:
i) Speech T1, S, T2; or
ii) Intra-speech gap G.
In this case, the user terminal 2 will be adapted to choose:
(i) high quality transmission from the client application to the server application, when the voice activity detection indicates speech T1, S, T2; or
(ii) low quality transmission from the client application to the server application, when the voice activity detection indicates an intra-speech gap G.
In the upper trace of figure 4, speech occurs for a period S. This period is preceded by a short period T1 and followed by a short period T2.
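As a minimal illustrative sketch only (the function name and string labels are assumptions made for this example, not terms taken from the claims), the frame-state to transmission-setting choice described above can be expressed as:

```python
# Illustrative mapping of the figure 4 frame states to transmission
# settings. The state and setting labels are assumptions of this sketch.

def choose_transmission(vad_state):
    """Return the transmission setting for one frame's indicated state."""
    if vad_state in ("T1", "S", "T2"):
        # Speech, including the attack (T1) and decay (T2) margins:
        # emphasise recognition performance.
        return "HIGH_QUALITY"
    if vad_state == "G":
        # Intra-speech gap: emphasise low latency.
        return "LOW_QUALITY"
    raise ValueError("unknown voice activity state: %r" % (vad_state,))
```

For instance, a frame in period S would be sent at high quality, while a frame falling in a gap G would be sent at low quality.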
The voice activity detector is adapted to indicate the presence of speech whilst either speech S is received from the user of terminal 2, or within the first threshold period T1 before speech commences, or until the second threshold period T2 has elapsed since speech was last received.

The first threshold period T1 is commensurate with typical speech attack times, preferably about 50ms. The second threshold period T2 is commensurate with typical speech decay times, preferably about 150ms. The periods T1 and T2 can be viewed as delays within the voice activity detector circuitry.

As also illustrated in the upper part of figure 4, the voice activity detector can indicate the presence of an
intra-speech gap G. This occurs when the second threshold period T2 has elapsed since speech was last received, and until the start of the first threshold period T1 before speech commences. The voice activity detector is adapted to continue to indicate an intra-speech gap G whilst either silence or noise is received.
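The hangover behaviour of the period T2 can be sketched as follows. This is a simplified, assumption-laden illustration (frame-based, with T2 expressed as a frame count); the T1 look-ahead is omitted, since it would require buffering frames ahead of the detector's raw decision:

```python
# Simplified VAD hangover sketch: raw per-frame speech decisions are
# extended by t2_frames frames after speech ends (the T2 decay period).
# The T1 look-ahead before speech commences is omitted for brevity.

def apply_hangover(raw_decisions, t2_frames):
    """Return per-frame speech indications with a T2 hangover applied."""
    out = []
    hang = 0
    for is_speech in raw_decisions:
        if is_speech:
            hang = t2_frames      # (re)arm the hangover on every speech frame
            out.append(True)
        elif hang > 0:
            hang -= 1             # still within T2: keep indicating speech
            out.append(True)
        else:
            out.append(False)     # intra-speech gap G
    return out
```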
The arrangement of the present invention may employ discontinuous transmission. Such a transmission scheme would mean that the user terminal would cease even low quality transmission L1 under certain conditions. These conditions can be set by means of a third time threshold T3. If the voice activity detector indicates that an intra-speech gap G exceeds the threshold period T3, then the user terminal 2 may be arranged to cease transmission.
The lower trace of figure 4 shows an example of how the threshold T3 can operate. In the gap period G1, threshold T3 is not exceeded. Gap G1 might be the pause in an utterance where the speaker is drawing breath. In gap period G2, threshold T3 is exceeded. Gap G2 might be a gap of several seconds, during which a speaker is looking for a new page of notes from which to read.

In effect then, when the intra-speech gap G exceeds a threshold period T3, the voice activity detection information further indicates the end of the complete utterance for the current frame of utterance data. The user terminal is adapted to discontinue transmission from the client to the application server at this point.
The period T3 may typically be in the range of 1-3 seconds, preferably being about 1.5 seconds.
In accordance with the present invention, the user terminal 2 may be adapted to alter communication settings that control any or all of the following, in dependence on the voice activity detection information:
i) Application level protocol;
ii) Transmission quality of service; and
iii) Error mitigation scheme.
This control of the transmission quality of service may take the form of requesting or allowing a greater number of permitted retransmissions when a speech packet T1,S,T2 is indicated than when an intra-speech gap G packet is indicated.

Alternatively, the control of the transmission quality of service may comprise the assignment of different coding schemes, using a more robust coding scheme when a speech packet T1,S,T2 is indicated than when an intra-speech gap G packet is indicated.
In a further alternative, the control of the application level protocol may be achieved by the preference to use TCP when a speech packet T1,S,T2 is indicated, and the preference to use UDP when an intra-speech gap G packet is indicated.

Figure 5 shows a flowchart, which illustrates a method in accordance with the invention.
The method for transmission between the user terminal 2 and the server application 54 involves the user terminal 2, with its client application. In use, the client application is connected to the server application over the network 52, the server application performing speech recognition processing, and communication between the client application and the server application depending on communication settings. The flowchart of figure 5 shows a method of deriving those settings. In use, the voice activity detector of the user terminal generates information that indicates which of a plurality of states is represented by the user utterance data. The user terminal 2 chooses communication settings, at any or all stages of the communication link between the client application and the server application, in dependence on the indicated state of the utterance data, the available communication settings comprising at least: (i) high quality transmission; and (ii) low quality transmission.
In figure 5, signal 510 is provided to voice activity detector 512. If decision box 514 indicates that speech is present, then a clock is reset to zero, box 518. Decision box 514 indicates that speech is present during the periods T1, S and T2. This is the time for which the voice activity detector indicates that speech is present.

If decision box 514 indicates that speech is not present, then the clock is incremented by one, box 516.

If in box 520 the clock value is found not to be greater than zero, then the voice activity detector indicates that speech is present, see box 522, and the flowchart returns to box 514. If in box 520 the clock value is found to be greater than zero, then a check is made in box 524 as to whether the clock value exceeds a threshold E. If yes, then the method determines that the utterance has ended, see box 528. If the result of box 524 is no, then an indication can be made that there is an intra-speech gap, see box 526.
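The clock logic of boxes 514 to 528 can be sketched as follows. This is a hedged illustration (the function name and state labels are assumptions), with the threshold E expressed as a count of non-speech frames:

```python
# Sketch of the figure 5 flowchart: the clock is reset to zero on speech
# (box 518) and incremented on non-speech (box 516); a clock of zero
# indicates speech (box 522), a clock above threshold E indicates the end
# of the utterance (box 528), and anything in between is an intra-speech
# gap (box 526).

def classify_frames(speech_present, threshold_e):
    """Yield 'SPEECH', 'GAP' or 'END' for each per-frame speech decision."""
    clock = 0
    for is_speech in speech_present:
        if is_speech:
            clock = 0                      # box 518
        else:
            clock += 1                     # box 516
        if clock == 0:
            yield "SPEECH"                 # box 522
        elif clock > threshold_e:
            yield "END"                    # box 528
        else:
            yield "GAP"                    # box 526
```

With threshold_e chosen between the lengths of gaps G1 and G2 of figure 4, a gap like G1 would be classified as part of the utterance, while a gap like G2 would eventually be classified as its end.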
The indication of an intra-speech gap clearly corresponds to gap 'G' shown in figure 4. The clock value E can be set to determine how large a gap G is treated as being just part of one utterance, or is treated as being the break between different utterances. So the value E determines the threshold T3.
The value of E could, for example, correspond to a time greater than G1 in figure 4, but less than the time corresponding to gap G2. Thus the flowchart of figure 5 would classify gap G1 as simply part of one continuous utterance, see box 526 on figure 5. Gap G2 however would be the end of an utterance, box 528 on figure 5.

Clearly therefore the voice activity detection information from figure 5 may indicate, for the current frame of utterance data, either:
(i) Speech (T1, S, T2); or
(ii) Intra-speech gap (G).

The user terminal 2 then may choose communication settings that provide the following:
(i) high quality transmission from the client application to the server application, when the voice activity detection indicates speech (T1, S, T2); or
(ii) low quality transmission from the client application to the server application, when the voice activity detection indicates an intra-speech gap G.

When the clock value exceeds E, then intra-speech gap G exceeds a threshold period T3. When G exceeds T3, the voice activity detection information further indicates the end of the complete utterance for the current frame of utterance data. The user terminal can then discontinue transmission from the client to the application server. The period T3 may be in the range of 1-3 seconds, preferably being about 1.5 seconds.
Figure 6 illustrates a mobile radio communications device, which is one example of the user terminal 2 of the invention. The user terminal may for example be either a portable or a mobile radio.
The radio 2 of figure 6 can transmit speech from a user of the radio. The radio comprises a microphone 34, which provides a signal for transmission by the radio. The signal from the microphone is transmitted by transmission circuit 22. Transmission circuit 22 transmits via switch 24 and antenna 26.

The radio 2 also has a controller 30 and a read only memory (ROM) 32. Controller 30 may be a microprocessor. ROM 32 is a permanent memory, and may be a non-volatile Electrically Erasable Programmable Read Only Memory (EEPROM).
The radio 2 also comprises a display 42 and keypad 44, which serve as part of the user interface circuitry of the radio. At least the keypad 44 portion of the user interface circuitry is activatable by the user. Voice activation of the radio, or other means of interaction with a user, may also be employed.

Signals received by the radio are routed by the switch 24 to receiving circuitry 28. From there, the received signals are routed to controller 30 and audio processing circuitry 38. A loudspeaker 40 is connected to audio circuit 38. Loudspeaker 40 forms a further part of the user interface. Controller 30 performs the function of the voice activity detector of the present invention.
A data terminal 36 may be provided. Terminal 36 would provide a signal comprising data for transmission by transmitter circuit 22, switch 24 and antenna 26.
Figure 7 illustrates the relationship between the user terminal 2 of the present invention, and the network 52 and server application 54. The server application is either a Distributed Speech Recognition (DSR) application or an Automatic Speech Recognition (ASR) application.
The user terminal 2 may be adapted to communicate with the server via a packet-switched radio transmission network, the indicated state of an entire packet being determined by the indicated states of the data frames within the packet.
User terminal 2 may take the form of a portable or mobile radio, a wirelessly linked lap-top PC, Personal Digital Assistant or personal organiser, or a mobile telephone.
Network 52 and one or more user terminals 2 comprise a communication system.
Discussion of the invention and its effects

In the enhanced arrangement of the invention explained in connection with figure 4, there is no transmission during long pauses between utterances.

The total length of the data segment to be transmitted consists of a whole utterance. Each whole utterance is made up of both speech and the gaps within the speech, provided that those gaps do not exceed period T3. The length of the gap determines the segmentation: up to threshold T3 for the duration of the gaps, speech instances are categorised as being part of the same utterance. Speech instances are categorised as part of a new utterance if the gap is longer than this threshold.
There is a period after the end of the last word spoken when the system at the terminal 2 will wait to see if there is further speech that is part of the same utterance, or whether this utterance is complete, based on the segmentation threshold E, T3. Consequently, a complete utterance consists of the actual speech together with intra-speech gaps of up to (for example) 1.5 seconds between words, and a final gap of typically 1.5 seconds at the end.
The impact of transmission errors on quality is much higher during speech frames than during intra-speech gaps. These errors are such as to adversely affect a user's perception of speech, or the performance of a speech recognition system.

Frames designated 'speech' in this scheme may include a number of frames preceding/following actual detected speech to form a buffer. The respective number of frames would be commensurate with typical speech attack and decay times; typically 50ms for attack and 150ms for decay, but varying with the vocabulary used on the system. The preceding speech buffer would require a small delay. In an alternative embodiment, the voice activity detector may indicate confidence in these states, and/or sub-categorise the states, for example sub-categorising speech as voiced and unvoiced speech.
In the preferred embodiment, the indicated states of each frame in the utterance are used to control the trade-off between recognition performance and latency. This is done by selecting communication settings emphasising recognition performance during speech (T1,S,T2), and selecting communication settings emphasising low latency during intra-speech gaps (G). This selection must be done as permitted by current transmission conditions, such as packet data size (i.e. a single packet of data may span both conditions), or service availability. For communication settings operating on whole packets, a state for the whole packet can be determined from the packet content.

In the case of a packet spanning several states, in a preferred embodiment communication settings affecting the whole packet would be made to emphasise quality if speech was indicated in the packet.
In an alternative embodiment, simple rules can be employed to determine more sophisticated decisions for the situation of a packet spanning several states. An example would be deciding whether the amount of speech in a packet is significant depending on the percentage of speech frames within the packet, and/or whether they are contiguous frames. Clearly, a person skilled in the art could construct rules appropriate to the circumstances.
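One such simple rule might be sketched as below; the 50% significance threshold is an assumption chosen for illustration, not a value given in the text:

```python
# Illustrative packet-level rule: a packet spanning several frame states
# is treated as speech if a significant fraction of its frames are speech.
# The 0.5 threshold is an assumed, tunable parameter.

def packet_state(frame_states, significant_fraction=0.5):
    """Return 'SPEECH' or 'GAP' for a whole packet of frame states."""
    speech_frames = sum(1 for s in frame_states if s in ("T1", "S", "T2"))
    if speech_frames >= significant_fraction * len(frame_states):
        return "SPEECH"
    return "GAP"
```

A contiguity test on the speech frames could be added in the same spirit.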
Hereinafter, the 'indicated state' refers to either the indicated state of the data frame or the indicated state of the data packet as appropriate.

In one embodiment, an example of application level protocol control would be to choose between the TCP or UDP protocols depending on the indicated state of the packet. This would involve using the TCP protocol for the speech components of the utterance, which would guarantee their transmission but can incur latency. It would conversely involve using UDP for the intra-speech gaps, which would risk their loss in the network, but reduce overall latency. Clearly, any appropriate protocols available on a given network may be used in a similar manner, if they exhibit similar trade-offs.

An example of transmission quality of service control would be to define the number of permitted retransmissions for a
packet depending on the indicated state of the packet. A packet containing a significant amount of speech would be permitted more retransmissions than one predominantly comprising an intra-speech gap.
An additional example of transmission quality of service control would be to exploit encoding properties of the host network. For example, GPRS provides four coding levels, CS1 through to CS4. At one end of the range, CS1 is robust to channel errors but contains relatively little data. At the other end of the range, CS4 is not very robust to channel errors but contains a relatively large amount of data.

Using the more robust coding schemes for speech and the less robust coding schemes for intra-speech gaps would increase the overall payload for a given bandwidth, without compromising the protection given to the speech data, and so reduce latency. The coding decision could be either based on the indicated state of the packet, or the indicated state of the constituent data frames. This would depend on the relative size of the RTP packet and the GPRS transmission blocks.
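A hedged sketch of this coding choice follows; the text leaves the mapping of the intermediate levels CS2 and CS3 open, so this example uses only the two extremes, which is an assumption of the sketch:

```python
# Illustrative GPRS coding-level selection: robust CS1 for packets whose
# indicated state is speech, higher-throughput CS4 for intra-speech gaps.
# Restricting the choice to the two extreme levels is an assumption here.

def select_gprs_coding(indicated_state):
    """Pick a GPRS coding scheme for a packet from its indicated state."""
    if indicated_state == "SPEECH":
        return "CS1"   # most robust to channel errors, least data per block
    return "CS4"       # least robust, most data: lower latency during gaps
```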
The effect of the control provided by the present invention is to minimise latency whilst preserving the robustness of the speech within an utterance, thereby providing a more effective means of balancing the recognition performance versus latency trade-off for distributed speech recognition systems.

The above mechanisms of the present invention are employed within the user's terminal 2. However, if the voice activity indication is transmitted to or derivable by the server, one may employ state-dependent schemes at the server also.
An example of error mitigation scheme control based on the transmitted
voice activity indication from the user terminal would be to select different schemes depending on the indicated state of the data frames. For intra-speech gaps, low latency but relatively poor methods could be used. Such a method would be, for example, copy-forward error correction. For speech, higher latency methods, that require both the last and next good packet, could be employed. In addition to packet error mitigation, the selection of different schemes could also be used for other recognition server based tasks, such as frame error mitigation and/or the adjustment of recognition complexity parameters (such as beamwidth) within the recogniser itself.
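The server-side selection described above might be sketched as follows. The scheme names `copy_forward` and `interpolate` are hypothetical labels for the two classes of method (copy-forward correction versus methods that use both the last and next good packet); they are not identifiers from the patent.

```python
def select_mitigation(frame_state):
    """Pick a packet-error mitigation method by indicated frame state."""
    if frame_state == "speech":
        # Higher latency but better quality: interpolate between the
        # last good packet and the next good packet.
        return "interpolate"
    # Low latency, lower quality: simply repeat the last good packet
    # forward over the gap.
    return "copy_forward"
```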
Claims (18)
1. A user terminal (2), for use in a speech recognition system, the user terminal (2) comprising a client application, wherein, in use, the client application is connected to a server application (54) over a network (52), the server application performing speech recognition processing, communication between the client application and the server application depending on communication settings, wherein:
a) the user terminal (2) comprises a voice activity detector, the voice activity detector generating information that indicates which of a plurality of states (T1, S, T2) is represented by user utterance data; and
b) the user terminal (2) is adapted to choose the communication settings, at any or all stages of the communication link between the client application and the server application, in dependence on the indicated state of the utterance data, the available communication settings comprising at least: (i) high quality transmission (H3); and (ii) low quality transmission (L1).
2. A user terminal (2) in accordance with claim 1, wherein:
a) the voice activity detection information indicates one of the following states for the current frame of utterance data: (i) speech (T1, S, T2); or (ii) intra-speech gap (G); and
b) the user terminal (2) is adapted to choose: (i) high quality transmission (H3) from the client application to the server application (54), when the voice activity detection indicates speech (T1, S, T2); or (ii) low quality transmission (L1) from the client application to the server application (54), when the voice activity detection indicates an intra-speech gap (G).
3. A user terminal (2) in accordance with either claim 1 or claim 2, wherein, when the intra-speech gap (G) exceeds a threshold period T3: (i) the voice activity detection information further indicates the end of the complete utterance for the current frame of utterance data; and (ii) the user terminal is adapted to discontinue transmission from the client to the application server; the period T3 being in the range of 1-3 seconds, preferably being about 1.5 seconds.
4. A user terminal (2) in accordance with claim 2, wherein the voice activity detector is adapted to continue to indicate an intra-speech gap (G) whilst either silence or noise is received.
5. A user terminal (2) in accordance with claim 2, wherein the voice activity detector is adapted to: indicate the presence of speech whilst either speech (S) is received, or within a first threshold period (T1) before speech commences, or until a second threshold period (T2) has elapsed since speech was last received; and indicate the presence of an intra-speech gap (G) only when the second threshold period (T2) has elapsed since speech was last received, and until either the start of the first threshold period (T1) before speech commences, or the duration of the intra-speech gap (G) exceeds the threshold period T3.
6. A user terminal in accordance with claim 5, wherein the first threshold period (T1) is commensurate with typical speech attack times, preferably about 50 ms, and the second threshold period (T2) is commensurate with typical speech decay times, preferably about 150 ms.
7. A user terminal (2) in accordance with any previous claim, wherein the user terminal is adapted to alter communication settings that control any or all of the following, in dependence on the voice activity detection information: i) Application level protocol; ii) Transmission quality of service; and iii) Error mitigation scheme.
8. A user terminal (2) in accordance with claim 7, wherein the control of the transmission quality of service is characterized by requesting or allowing a greater number of permitted retransmissions when a speech (T1;S;T2) packet is indicated than when an intra-speech gap (G) packet is indicated.
9. A user terminal (2) in accordance with claim 7, wherein the control of the transmission quality of service comprises the assignment of different coding schemes, using a more robust coding scheme when a speech (T1;S;T2) packet is indicated than when an intra-speech gap (G) packet is indicated.
10. A user terminal (2) in accordance with claim 7, wherein the control of the application level protocol is characterized by the preference to use TCP when a speech (T1;S;T2) packet is indicated and the preference to use UDP when an intra-speech gap (G) packet is indicated.
11. A user terminal (2) in accordance with any previous claim, wherein the server application (54) is either a Distributed Speech Recognition (DSR) application or an Automatic Speech Recognition (ASR) application.
12. A user terminal (2) in accordance with any previous claim, adapted to communicate with the server (54) via a packet-switched radio transmission network (52), the indicated state of an entire packet being determined by the indicated states of the data frames within the packet.
13. A portable or mobile radio, a wirelessly linked laptop PC, Personal Digital Assistant or personal organiser, or a mobile telephone, comprising a user terminal (2) according to any previous claim.
14. A communication system comprising one or more user terminals (2) in accordance with any previous claim.
15. A method for transmission between a user terminal (2) and a server application (54) of a speech recognition system, the user terminal (2) comprising a client application, wherein, in use, the client application is connected to the server application over a network (52), the server application performing speech recognition processing, communication between the client application and the server application depending on communication settings, wherein, in use:
a) a voice activity detector of the user terminal generates information that indicates which of a plurality of states (T1, S, T2) is represented by user utterance data; and
b) the user terminal (2) chooses communication settings, at any or all stages of the communication link between the client application and the server application (54), in dependence on the indicated state of the utterance data, the available communication settings comprising at least: (i) high quality transmission (H3); and (ii) low quality transmission (L1).
16. The method of claim 15, wherein:
a) the voice activity detection information indicates one of the following states for the current frame of utterance data: (i) speech (T1, S, T2); (ii) intra-speech gap (G); and
b) the user terminal is adapted to choose communication settings that provide the following: (i) high quality transmission (H3) from the client application to the server application (54), when the voice activity detection indicates speech (T1, S, T2); or (ii) low quality transmission (L1) from the client application to the server application (54), when the voice activity detection indicates an intra-speech gap (G).
17. The method of any of claims 15-16, wherein, when the intra-speech gap (G) exceeds a threshold period T3: (i) the voice activity detection information further indicates the end of the complete utterance for the current frame of utterance data; and (ii) the user terminal (2) is adapted to discontinue transmission from the client to the application server; the period T3 being in the range of 1-3 seconds, preferably being about 1.5 seconds.
18. A method in accordance with the arrangement of any of figures 2-7 of the drawings, and/or the description thereof.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0228765A GB2396271B (en) | 2002-12-10 | 2002-12-10 | A user terminal and method for voice communication |
PCT/EP2003/050686 WO2004053837A1 (en) | 2002-12-10 | 2003-10-03 | A user terminal and method for distributed speech recognition |
AU2003282110A AU2003282110A1 (en) | 2002-12-10 | 2003-10-03 | A user terminal and method for distributed speech recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0228765A GB2396271B (en) | 2002-12-10 | 2002-12-10 | A user terminal and method for voice communication |
Publications (3)
Publication Number | Publication Date |
---|---|
GB0228765D0 GB0228765D0 (en) | 2003-01-15 |
GB2396271A true GB2396271A (en) | 2004-06-16 |
GB2396271B GB2396271B (en) | 2005-08-10 |
Family
ID=9949410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB0228765A Expired - Fee Related GB2396271B (en) | 2002-12-10 | 2002-12-10 | A user terminal and method for voice communication |
Country Status (3)
Country | Link |
---|---|
AU (1) | AU2003282110A1 (en) |
GB (1) | GB2396271B (en) |
WO (1) | WO2004053837A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8451823B2 (en) | 2005-12-13 | 2013-05-28 | Nuance Communications, Inc. | Distributed off-line voice services |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0680034A1 (en) * | 1994-04-28 | 1995-11-02 | Oki Electric Industry Co., Ltd. | Mobile radio communication system using a sound or voice activity detector and convolutional coding |
US5754554A (en) * | 1994-10-28 | 1998-05-19 | Nec Corporation | Telephone apparatus for multiplexing digital speech samples and data signals using variable rate speech coding |
WO2002043262A1 (en) * | 2000-11-22 | 2002-05-30 | Tait Electronics Limited | Improvements relating to duplex transmission in mobile radio systems |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
ES2225321T3 (en) * | 1991-06-11 | 2005-03-16 | Qualcomm Incorporated | APPARATUS AND PROCEDURE FOR THE MASK OF ERRORS IN DATA FRAMES. |
US7941313B2 (en) * | 2001-05-17 | 2011-05-10 | Qualcomm Incorporated | System and method for transmitting speech activity information ahead of speech features in a distributed voice recognition system |
-
2002
- 2002-12-10 GB GB0228765A patent/GB2396271B/en not_active Expired - Fee Related
-
2003
- 2003-10-03 WO PCT/EP2003/050686 patent/WO2004053837A1/en not_active Application Discontinuation
- 2003-10-03 AU AU2003282110A patent/AU2003282110A1/en not_active Abandoned
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006077626A1 (en) | 2005-01-18 | 2006-07-27 | Fujitsu Limited | Speech speed changing method, and speech speed changing device |
EP1840877A1 (en) * | 2005-01-18 | 2007-10-03 | Fujitsu Ltd. | Speech speed changing method, and speech speed changing device |
EP1840877A4 (en) * | 2005-01-18 | 2008-05-21 | Fujitsu Ltd | Speech speed changing method, and speech speed changing device |
US7912710B2 (en) | 2005-01-18 | 2011-03-22 | Fujitsu Limited | Apparatus and method for changing reproduction speed of speech sound |
WO2006082288A1 (en) * | 2005-02-04 | 2006-08-10 | France Telecom | Method of transmitting end-of-speech marks in a speech recognition system |
FR2881867A1 (en) * | 2005-02-04 | 2006-08-11 | France Telecom | METHOD FOR TRANSMITTING END-OF-SPEECH MARKS IN A SPEECH RECOGNITION SYSTEM |
Also Published As
Publication number | Publication date |
---|---|
GB2396271B (en) | 2005-08-10 |
AU2003282110A1 (en) | 2004-06-30 |
GB0228765D0 (en) | 2003-01-15 |
WO2004053837A1 (en) | 2004-06-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7246057B1 (en) | System for handling variations in the reception of a speech signal consisting of packets | |
KR100575193B1 (en) | A decoding method and system comprising an adaptive postfilter | |
TWI390505B (en) | Method for discontinuous transmission and accurate reproduction of background noise information | |
US9047863B2 (en) | Systems, methods, apparatus, and computer-readable media for criticality threshold control | |
EP1017042B1 (en) | Voice activity detection driven noise remediator | |
US6898566B1 (en) | Using signal to noise ratio of a speech signal to adjust thresholds for extracting speech parameters for coding the speech signal | |
EP2055055B1 (en) | Adjustment of a jitter memory | |
US20060217976A1 (en) | Adaptive noise state update for a voice activity detector | |
KR101121212B1 (en) | Method of transmitting data in a communication system | |
EP1838066A2 (en) | Jitter buffer controller | |
KR20030048067A (en) | Improved spectral parameter substitution for the frame error concealment in a speech decoder | |
US7573907B2 (en) | Discontinuous transmission of speech signals | |
EP3815082B1 (en) | Adaptive comfort noise parameter determination | |
CA2408890C (en) | System and methods for concealing errors in data transmission | |
US8631295B2 (en) | Error concealment | |
JP2002237785A (en) | Method for detecting sid frame by compensation of human audibility | |
US7231348B1 (en) | Tone detection algorithm for a voice activity detector | |
US8391313B2 (en) | System and method for improved use of voice activity detection | |
KR101002405B1 (en) | Controlling a time-scaling of an audio signal | |
RU2445737C2 (en) | Method of transmitting data in communication system | |
US8112273B2 (en) | Voice activity detection and silence suppression in a packet network | |
US20080103765A1 (en) | Encoder Delay Adjustment | |
GB2396271A (en) | A user terminal and method for voice communication | |
US11070666B2 (en) | Methods and devices for improvements relating to voice quality estimation | |
US8204753B2 (en) | Stabilization and glitch minimization for CCITT recommendation G.726 speech CODEC during packet loss scenarios by regressor control and internal state updates of the decoding process |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PCNP | Patent ceased through non-payment of renewal fee | Effective date: 20071210 |