MX2007015190A - Robust decoder - Google Patents

Robust decoder

Info

Publication number
MX2007015190A
Authority
MX
Mexico
Prior art keywords
signal
concealment
frame
available
frames
Prior art date
Application number
MXMX/A/2007/015190A
Other languages
Spanish (es)
Inventor
Xiaoqin Sun
Tian Wang
Hosam A. Khalil
Kazuhito Koishida
Wei-Ge Chen
Original Assignee
Microsoft Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corporation
Publication of MX2007015190A

Abstract

Techniques and tools related to delayed or lost coded audio information are described. For example, a concealment technique for one or more missing frames is selected based on one or more factors that include a classification of each of one or more available frames near the one or more missing frames. As another example, information from a concealment signal is used to produce substitute information that is relied on in decoding a subsequent frame. As yet another example, a data structure having nodes corresponding to received packet delays is used to determine a desired decoder packet delay value.

Description

ROBUST DECODER

TECHNICAL FIELD
The described tools and techniques relate to audio codecs, and particularly to sub-band coding, codebooks, and/or redundant coding.
BACKGROUND
With the advent of digital wireless telephone networks, streaming audio over the Internet, and Internet telephony, digital processing and delivery of speech has become commonplace. Engineers use a variety of techniques to process speech efficiently while still maintaining quality. To understand these techniques, it helps to understand how audio information is represented and processed in a computer.

1. Representation of Audio Information in a Computer
A computer processes audio information as a series of numbers representing the audio. A single number can represent an audio sample, which is an amplitude value at a particular time. Several factors affect audio quality, including sample depth and sampling rate. Sample depth (or precision) indicates the range of numbers used to represent a sample. More possible values per sample typically yield higher quality output because more subtle variations in amplitude can be represented. An 8-bit sample has 256 possible values, while a 16-bit sample has 65,536 possible values. The sampling rate (usually measured as the number of samples per second) also affects quality. The higher the sampling rate, the higher the quality, because more frequencies of sound can be represented. Some common sampling rates are 8,000, 11,025, 22,050, 32,000, 44,100, 48,000, and 96,000 samples/second (Hz). Table 1 shows several audio formats with different quality levels, along with the corresponding raw bitrate costs.
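For illustration only (this is not part of the patent text), the raw-bitrate arithmetic behind Table 1 follows directly from the sampling rates and sample depths named above; a minimal sketch:

```python
# Raw (uncompressed) bitrate arithmetic from the discussion above:
# bitrate = sample_rate * sample_depth * channels.
def raw_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Raw bitrate in bits per second for uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels

# 8 kHz, 8-bit mono telephone-quality speech: 64,000 bits/second.
print(raw_bitrate_bps(8_000, 8))          # 64000
# 44.1 kHz, 16-bit stereo CD audio: 1,411,200 bits/second.
print(raw_bitrate_bps(44_100, 16, 2))     # 1411200
```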
TABLE 1
Bitrates for Different Quality Audio

As shown in Table 1, the cost of high-quality audio is a high bitrate. High-quality audio information consumes large amounts of computer storage and transmission capacity. Many computers and computer networks lack the resources to process raw digital audio. Compression (also called encoding or coding) decreases the cost of storing and transmitting audio information by converting the information into a lower-bitrate form. Compression can be lossless (in which quality does not suffer, but the bitrate reduction is limited) or lossy (in which quality suffers, but the bitrate reduction is more dramatic). Decompression (also called decoding) extracts a reconstructed version of the original information from the compressed form. A codec is an encoder/decoder system.
2. Speech Encoders and Decoders
One goal of audio compression is to represent an audio signal digitally to provide maximum signal quality for a given number of bits. Stated differently, this goal is to represent the audio signal with the fewest bits for a given level of quality. Other goals, such as resilience to transmission errors and limiting the total delay due to encoding/transmission/decoding, apply in some scenarios.

Different kinds of audio signals have different characteristics. Music is characterized by large ranges of frequencies and amplitudes, and often includes two or more channels. On the other hand, speech is characterized by smaller ranges of frequencies and amplitudes, and is commonly represented in a single channel. Certain codecs and processing techniques are adapted for music and general audio; other codecs and processing techniques are adapted for speech.

One conventional type of speech codec uses linear prediction to achieve compression. The speech encoding includes several stages. The encoder finds and quantizes coefficients for a linear prediction filter, which is used to predict sample values as linear combinations of preceding sample values. A residual signal (represented as an "excitation" signal) indicates parts of the original signal not predicted exactly by the filtering. At some stages, the speech codec uses different compression techniques for voiced segments (characterized by vocal cord vibration), unvoiced segments, and silent segments, since different kinds of speech have different characteristics. Voiced segments typically exhibit highly repetitive voicing patterns, even in the residual domain. For voiced segments, the encoder achieves further compression by comparing the current residual signal with previous residual cycles and encoding the current residual signal in terms of delay or lag information relative to the previous cycles. The encoder handles other discrepancies between the original signal and the predicted, encoded representation (from the linear prediction and delay information) using specially designed codebooks.

Although some speech codecs described above have good overall performance for many applications, they have several drawbacks. In particular, several drawbacks surface when the speech codecs are used in conjunction with dynamic network resources. In such scenarios, the encoded speech may be lost because of a temporary bandwidth shortage or other problems.
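As a reading aid (not part of the patent text), here is a minimal sketch of the short-term linear prediction described above; the two-tap coefficients and the toy signal are illustrative assumptions, not values from any real LP analysis:

```python
# A minimal sketch of short-term linear prediction: each sample is
# predicted as a linear combination of preceding samples, and the
# residual ("excitation") is what the predictor misses.
def lp_residual(samples, lpc):
    order = len(lpc)
    residual = []
    for n in range(len(samples)):
        predicted = sum(lpc[k] * samples[n - 1 - k]
                        for k in range(order) if n - 1 - k >= 0)
        residual.append(samples[n] - predicted)
    return residual

signal = [0.0, 0.5, 0.9, 1.0, 0.7, 0.2, -0.3, -0.8]
print(lp_residual(signal, [1.2, -0.4]))  # residual the codebooks must encode
```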
A. Inefficient Concealment Techniques
When one or more packets of encoded speech are missing, such as where they are lost, delayed, corrupted, or otherwise made unusable in transit or elsewhere, decoders often try to conceal the missing packets in some way. For example, some decoders simply repeat packets that have already been received. If there are significant packet losses, however, this technique quickly results in degraded quality of the decoded speech output. Some codecs use more sophisticated concealment techniques, such as the waveform similarity overlap-add method ("WSOLA"). This technique extends the decoded audio signal to conceal missing packets by generating new pitch cycles through weighted averages of existing pitch cycles. This method can be more effective at concealing missing packets than merely repeating previous packets. However, it may not be ideal for all situations. Moreover, it can produce undesirable sound artifacts (such as a mechanical or ringing sound) if it is used to extend a signal for too long. Additionally, many frames rely on memory of decoded characteristics of previous frames (such as the excitation signal history) for decoding. When no such memory exists (such as where the packets that would have been used to produce the memory are lost, delayed, etc.), the signal quality may be degraded even for received frames following the missing frames.
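A rough sketch (not from the patent) of the pitch-cycle extension idea behind WSOLA-style concealment, under the simplifying assumptions that the pitch period is known and fixed and that plain 0.5/0.5 averaging stands in for windowed overlap-add:

```python
# New pitch cycles are generated as weighted averages of existing
# pitch cycles, so the signal can be extended across missing packets.
# Requires len(samples) >= 2 * pitch_period; a real implementation
# estimates the pitch and uses overlapping analysis windows.
def extend_by_pitch_cycles(samples, pitch_period, cycles_needed):
    out = list(samples)
    for _ in range(cycles_needed):
        last = out[-pitch_period:]                     # most recent cycle
        prev = out[-2 * pitch_period:-pitch_period]    # the one before it
        new_cycle = [0.5 * a + 0.5 * b for a, b in zip(last, prev)]
        out.extend(new_cycle)
    return out
```

Averaging adjacent cycles rather than copying one verbatim is what reduces the buzzy, mechanical quality of simple repetition, though as the text notes it still degrades if the extension runs too long.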
B. Inefficient or Ineffective Desired Packet Delay Calculations
As packets of encoded audio information are transported to a decoder application, each packet may experience a different delay due to, for example, network variations. This can even result in packets arriving in a different order than the one in which they were sent. An application or decoder may compute delay statistics to determine a desired decoder buffer delay that is expected to be long enough to allow a sufficient number of packets to arrive at the decoder in time to be decoded and used. Of course, a countervailing concern is the total delay in the system, especially for real-time, interactive applications such as telephony. One approach for calculating the optimal delay is to observe the maximum delay of previous packets and use that delay value as a guide. The delay of a packet is typically determined by calculating the difference between a sent timestamp applied on the encoder side when the packet is sent and a received timestamp applied on the decoder side when the packet is received. However, there are sometimes outlier values, which cause the system to adapt to non-representative packets.
In addition, it is sometimes better to allow some packets to arrive too late (and thus be missing) than to impose a delay long enough to receive those late outlier packets. An alternative is to calculate the desired delay based on formulas such as running averages and variance calculations. However, many parameters need to be optimized in such calculations, and it is difficult to find the right trade-off between calculation and response speed on the one hand, and basing the calculations on a representative population of historical values on the other. Another approach is to analyze the packet delay distribution directly. For example, a histogram of packet delays can be maintained. The width of a bin in the delay time histogram represents the desired precision with which the optimal delay will be calculated. Decreasing the bin size improves the precision. The shape of the histogram roughly reflects the underlying packet delay distribution. When a new packet arrives, its delay is mapped to the corresponding bin, and the count of packets falling into that bin is increased. To reflect the aging of older packets, the counts in all the other bins are scaled down in a procedure called "aging." To find the new desired delay, the decoder sets a desired loss rate; typical values range between one percent and five percent. The histogram is analyzed to determine the desired delay value needed to achieve the desired loss rate. One problem with this approach is that some parameters need to be tuned, such as the bin width and the aging factors. In addition, all old packets are treated alike in the aging procedure, and the aging approach itself plays a very significant role in the overall performance of the technique. Furthermore, a clock drift situation may occur. Clock drift occurs when the clock rates of different devices are not the same. If clock drift occurs between the encoder-side devices that apply sent timestamps and the decoder-side devices that apply received timestamps, the overall delay has a positive or negative trend. This can cause the histogram to drift along the delay timeline even when the histogram should be static.
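For concreteness (not part of the patent text), a minimal sketch of the histogram approach just described; the bin width, bin count, and aging factor are illustrative assumptions:

```python
# Bin packet delays, scale old counts down by an aging factor on each
# arrival, and pick the smallest delay whose cumulative share of
# packets meets the desired loss rate.
class DelayHistogram:
    def __init__(self, bin_width_ms=5.0, num_bins=200, aging=0.999):
        self.bin_width = bin_width_ms
        self.counts = [0.0] * num_bins
        self.aging = aging

    def add(self, delay_ms):
        # "Aging": old observations gradually lose weight.
        self.counts = [c * self.aging for c in self.counts]
        b = min(int(delay_ms / self.bin_width), len(self.counts) - 1)
        self.counts[b] += 1.0

    def desired_delay(self, loss_rate=0.02):
        total = sum(self.counts)
        if total == 0:
            return 0.0
        acc = 0.0
        for b, c in enumerate(self.counts):
            acc += c
            if acc >= (1.0 - loss_rate) * total:
                return (b + 1) * self.bin_width  # upper edge of this bin
        return len(self.counts) * self.bin_width
```

The tuning problems the text mentions are visible here: the answer depends directly on `bin_width_ms` and `aging`, and a steady clock drift would slide all delays (and hence the histogram mass) along the bin axis.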
BRIEF DESCRIPTION OF THE INVENTION
In summary, the detailed description is directed to various techniques and tools for audio codecs, and specifically to tools and techniques related to dealing with audio information that is missing for whatever reason. The described embodiments implement one or more of the described techniques and tools, including but not limited to the following:

In one aspect, if one or more missing frames are encountered while processing a bitstream for an audio signal, then a concealment technique is selected from among multiple available signal-dependent concealment techniques, based at least in part on one or more factors. The selected concealment technique is performed and a result is produced.

In another aspect, when one or more missing frames of a bitstream for an audio signal are encountered, a concealment signal is generated based at least in part on pitch cycles from one or more previous frames, including pitch jitter.

In another aspect, one or more missing frames of a bitstream for an audio signal are encountered and a concealment signal is produced. A subsequent frame is encountered that relies, at least in part, on information from the one or more missing frames for decoding. Substitute information from the concealment signal is produced and relied on in place of the information from the one or more missing frames to decode the subsequent frame.

In another aspect, a bitstream for an audio signal is processed. When one or more missing frames of the bitstream are encountered, the processing includes generating a concealment signal including an extension signal contribution based at least in part on one or more values associated with an available frame and, after a threshold duration of the concealment signal, adding a noise contribution to the concealment signal. When one or more missing frames of the bitstream are encountered, the processing may further include gradually decreasing the energy of the extension signal contribution along at least part of the audio signal, and gradually increasing the energy of the noise contribution along at least part of the audio signal. Moreover, gradually decreasing the energy of the extension signal contribution may include gradually decreasing the energy until the extension signal is imperceptible. Also, the energy of the extension signal contribution may be gradually decreased and the energy of the noise contribution gradually increased until the concealment signal consists essentially of a predetermined level of background noise.

In another aspect, a bitstream for an audio signal is processed. When one or more missing frames of the bitstream are encountered, the processing includes identifying plural available segments of an available frame and, for each of the plural available segments, using one or more characteristics of the available segment to generate a derived segment. A merged signal is formed using the plural available segments and the plural derived segments. The available frame can be a voiced frame, and the one or more characteristics can include the energy of the available segment.
The plural available segments may include first and second available segments, the plural derived segments may include first and second derived segments, and forming the merged signal may include merging the first available segment with the first derived segment, merging the first derived segment with the second derived segment, and merging the second derived segment with the second available segment. The plural available segments may include more than two available segments, and the plural derived segments may include more than two derived segments. Also, the processing may further include using the merged signal in place of the available frame and the one or more missing frames.

In another aspect, a data structure is maintained. The data structure includes a group of plural nodes corresponding to packets in a group of received packets. Each node of the group of plural nodes includes a delay value for receipt of a corresponding packet from the group of received packets, a higher-value pointer indicating a node of the group of plural nodes with a next higher delay value, and a lower-value pointer indicating a node of the group of plural nodes with a next lower delay value. A desired decoder packet delay value is determined based at least in part on the data structure. When a new packet is received, maintaining the data structure may include replacing a delay value of an oldest packet of the group of received packets with a delay value of the new packet, updating the higher-value pointer of one or more nodes of the group of plural nodes, and updating the lower-value pointer of one or more nodes of the group of plural nodes. Further, determining the desired decoder packet delay value can include locating a node of the group of plural nodes having a maximum delay value, searching nodes of the group of plural nodes with successively lower delay values until a desired number of nodes has been searched, and using the delay value of the last node searched as the desired decoder packet delay. Also, the desired number of nodes may correspond to a predetermined desired packet loss rate.

The various techniques and tools can be used in combination or independently. Additional features and advantages will be made apparent from the following detailed description of different embodiments, which proceeds with reference to the accompanying drawings.
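A sketch (not from the patent) of the delay-tracking data structure just summarized: a fixed window of recent packets whose nodes are doubly linked in delay order, with a new packet reusing the oldest packet's node, and the desired delay found by walking down from the maximum-delay node. The window size and loss rate are illustrative assumptions:

```python
class Node:
    __slots__ = ("delay", "higher", "lower")
    def __init__(self, delay):
        self.delay = delay
        self.higher = None   # node with next higher delay
        self.lower = None    # node with next lower delay

class DelayTracker:
    """Fixed window of recent packet delays, linked in delay order."""
    def __init__(self, window=100):
        self.window = window
        self.ring = []       # nodes in arrival order, oldest replaced first
        self.head = None     # maximum-delay node
        self.next_slot = 0

    def _unlink(self, node):
        if node.higher: node.higher.lower = node.lower
        else: self.head = node.lower
        if node.lower: node.lower.higher = node.higher
        node.higher = node.lower = None

    def _link(self, node):
        prev, cur = None, self.head
        while cur and cur.delay > node.delay:
            prev, cur = cur, cur.lower
        node.higher, node.lower = prev, cur
        if prev: prev.lower = node
        else: self.head = node
        if cur: cur.higher = node

    def add(self, delay):
        if len(self.ring) < self.window:
            node = Node(delay)
            self.ring.append(node)
        else:
            node = self.ring[self.next_slot]   # oldest packet's node
            self.next_slot = (self.next_slot + 1) % self.window
            self._unlink(node)                 # update neighbors' pointers
            node.delay = delay
        self._link(node)

    def desired_delay(self, loss_rate=0.02):
        node = self.head                       # maximum-delay node
        for _ in range(int(loss_rate * len(self.ring))):
            if node.lower is None:
                break
            node = node.lower                  # follow lower-value pointers
        return node.delay if node else 0.0
```

Because updates touch only the replaced node and its neighbors, no bins, aging factors, or re-sorting are needed, which is the advantage over the histogram approach criticized in the background.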
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram of a suitable computing environment in which one or more of the described embodiments may be implemented.
Figure 2 is a block diagram of a network environment in conjunction with which one or more of the described embodiments may be implemented.
Figure 3 is a block diagram of a real-time speech frame encoder in conjunction with which one or more of the described embodiments may be implemented.
Figure 4 is a flow diagram illustrating the determination of codebook parameters in one implementation.
Figure 5 is a block diagram of a real-time speech frame decoder in conjunction with which one or more of the described embodiments may be implemented.
Figure 6 is a block diagram illustrating the general flow of audio information in an illustrative decoder-side VoIP system.
Figure 7 is a block diagram illustrating sample buffering in a decoder-side buffering technique.
Figure 8 is a graph of an illustrative packet delay distribution.
Figure 9 is a diagram of an illustrative packet delay data structure.
Figure 10 is a flow chart illustrating an example of determining an appropriate concealment technique.
Figure 11 is a schematic diagram illustrating a technique for concealment of unvoiced audio information.
Figure 12 is a flow diagram illustrating an example of a decoder memory recovery technique.
DETAILED DESCRIPTION
The described embodiments are directed to techniques and tools for processing audio information in encoding and decoding. With these techniques, the quality of speech derived from a speech codec, such as a real-time speech codec, is improved. Such improvements may result from the use of various techniques and tools separately or in combination. Such techniques and tools may include choosing a concealment technique based on characteristics of the audio signal, and/or using pitch jitter in conjunction with pitch extension concealment techniques. The techniques may also include encoding some or all of the signal resulting from concealment techniques and using the encoded information to regenerate the memory used in decoding future packets. Additionally, the techniques may include calculating a desired packet delay value using a data structure adapted to tracking and ordering packet delays.

Although operations for the various techniques are described in a particular, sequential order for the sake of presentation, it should be understood that this manner of description encompasses minor rearrangements in the order of operations, unless a particular ordering is required. For example, operations described sequentially can in some cases be rearranged or performed concurrently. Moreover, for the sake of simplicity, flow charts may not show the various ways in which particular techniques can be used in conjunction with other techniques.

While particular computing environment features and audio codec features are described below, one or more of the tools and techniques may be used with various different types of computing environments and/or various different types of codecs. For example, one or more of the robust decoder techniques may be used with codecs that do not use the CELP coding model, such as adaptive differential pulse code modulation codecs, transform codecs, and/or other types of codecs. As another example, one or more of the robust decoder techniques may be used with single-band codecs or sub-band codecs. As another example, one or more of the robust decoder techniques may be applied to an individual band of a multi-band codec and/or to a synthesized or unencoded signal that includes multi-band contributions of a multi-band codec.
I. Computing Environment
Figure 1 illustrates a generalized example of a suitable computing environment (100) in which one or more of the described embodiments may be implemented. The computing environment (100) is not intended to suggest any limitation as to the scope of use or functionality of the invention, as the present invention may be implemented in diverse general-purpose or special-purpose computing environments.
With reference to Figure 1, the computing environment (100) includes at least one processing unit (110) and memory (120). In Figure 1, this most basic configuration (130) is included within a dashed line. The processing unit (110) executes computer-executable instructions and may be a real or a virtual processor. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power. The memory (120) may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or some combination of the two. The memory (120) stores software (180) implementing sub-band coding, multi-stage codebook, and/or redundant coding techniques for a speech encoder or decoder.

The computing environment (100) may have additional features. In Figure 1, the computing environment (100) includes storage (140), one or more input devices (150), one or more output devices (160), and one or more communication connections (170). An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing environment (100). Typically, operating system software (not shown) provides an operating environment for software executing in the computing environment (100) and coordinates activities of the components of the computing environment (100).

The storage (140) may be removable or non-removable, and may include magnetic disks, magnetic tapes or cassettes, CD-ROMs, CD-RWs, DVDs, or any other medium that can be used to store information and that can be accessed within the computing environment (100). The storage (140) stores instructions for the software (180).

The input device(s) (150) may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, a network adapter, or another device that provides input to the computing environment (100). For audio, the input device(s) (150) may be a sound card, microphone, or other device that accepts audio input in analog or digital form, or a CD/DVD reader that provides audio samples to the computing environment (100). The output device(s) (160) may be a display, printer, speaker, CD/DVD writer, network adapter, or another device that provides output from the computing environment (100).

The communication connection(s) (170) enable communication over a communication medium to another computing entity. The communication medium conveys information such as computer-executable instructions, compressed speech information, or other data in a modulated data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired or wireless techniques implemented with an electrical, optical, RF, infrared, acoustic, or other carrier.

The invention can be described in the general context of computer-readable media. Computer-readable media are any available media that can be accessed within a computing environment. By way of example, and not limitation, with the computing environment (100), computer-readable media include memory (120), storage (140), communication media, and combinations of any of the above.
The invention can be described in the general context of computer-executable instructions, such as those included in program modules, being executed in a computing environment on a real or virtual target processor. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or split among program modules as desired in various embodiments. Computer-executable instructions for program modules may be executed within a local or distributed computing environment.

For the sake of presentation, the detailed description uses terms like "determine," "generate," "adjust," and "apply" to describe computer operations in a computing environment. These terms are high-level abstractions for operations performed by a computer, and should not be confused with acts performed by a human being. The actual computer operations corresponding to these terms vary depending on the implementation.
II. Generalized Network Environment and Real-Time Speech Encoding and Decoding
Figure 2 is a block diagram of a generalized network environment (200) in conjunction with which one or more of the described embodiments may be implemented. A network (250) separates various encoder-side components from various decoder-side components. The primary functions of the encoder-side and decoder-side components are speech encoding and decoding, respectively. On the encoder side, an input buffer (210) accepts and stores speech input (202). The speech encoder (230) takes speech input (202) from the input buffer (210) and encodes it.

Specifically, a frame splitter (212) splits the samples of the speech input (202) into frames. In one implementation, the frames are uniformly 20 ms long (160 samples for 8 kHz input and 320 samples for 16 kHz input). In other implementations, the frames have different durations, are non-uniform or overlapping, and/or the sampling rate of the input (202) is different. The frames may be organized in a superframe/frame, frame/sub-frame, or other configuration for different stages of encoding and decoding.

A frame classifier (214) classifies the frames according to one or more criteria, such as signal energy, zero-crossing rate, long-term prediction gain, gain differential, and/or other criteria for sub-frames or whole frames. Based on the criteria, the frame classifier (214) classifies the different frames into classes such as silent, unvoiced, voiced, and transition (e.g., unvoiced to voiced). Additionally, the frames may be classified according to the type of redundant coding, if any, that is used for the frame. The frame class affects the parameters that will be computed to encode the frame. In addition, the frame class may affect the resolution and loss resilience with which the parameters are encoded, so as to provide more resolution and loss resilience to more important frame classes and parameters. For example, silent frames are typically coded at a very low rate, are very simple to recover by concealment if lost, and may not need protection against loss. Unvoiced frames are typically coded at a slightly higher rate, are reasonably simple to recover by concealment if lost, and are not significantly protected against loss. Voiced and transition frames are usually encoded with more bits, depending on the complexity of the frame as well as the presence of transitions. Voiced and transition frames are also more difficult to recover if lost, and so are more significantly protected against loss. Alternatively, the frame classifier (214) uses other and/or additional frame classes.

In Figure 2, each frame is encoded, as by a frame encoding component (232). Such frame encoding is described in more detail below with reference to Figure 3. The resulting encoded speech is provided to software for one or more networking layers (240) through a multiplexer ("MUX") (236). The networking layer(s) (240) process the encoded speech for transmission over the network (250). For example, the network layer software packages frames of encoded speech information into packets following the RTP protocol, which are relayed over the Internet using UDP, IP, and various physical layer protocols. Alternatively, other and/or additional layers of software or networking protocols are used. The network (250) is a wide-area, packet-switched network such as the Internet. Alternatively, the network (250) is a local area network or another kind of network.
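For illustration (not part of the patent text), a toy frame classifier using two of the criteria named above, frame energy and zero-crossing rate; the thresholds are assumptions, and transition detection (which would compare adjacent frames) is omitted:

```python
# Classify a frame of normalized float samples as silent, unvoiced,
# or voiced. Noise-like unvoiced speech changes sign often; periodic
# voiced speech changes sign rarely relative to its energy.
def classify_frame(frame):
    energy = sum(x * x for x in frame) / len(frame)
    zcr = sum(1 for a, b in zip(frame, frame[1:])
              if (a < 0) != (b < 0)) / len(frame)
    if energy < 1e-4:          # assumed silence threshold
        return "silent"
    if zcr > 0.3:              # assumed zero-crossing threshold
        return "unvoiced"
    return "voiced"
```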
On the decoder side, software for one or more networking layers (260) receives and processes the transmitted data. The network, transport, and higher-layer protocols and software in the decoder-side networking layer(s) (260) usually correspond to those in the encoder-side networking layer(s) (240). The networking layer(s) provide the encoded speech information to the speech decoder (270) through a demultiplexer ("DEMUX") (276). The decoder (270) decodes each frame, as illustrated in the decoding module (272). The decoded speech output (292) may also be passed through one or more post-filters (284) to improve the quality of the resulting filtered speech output (294). A generalized real-time speech band decoder is described below with reference to Figure 5, but other speech decoders may be used instead. Additionally, some or all of the described tools and techniques may be used with other types of audio encoders and decoders, such as music encoders and decoders, or general-purpose audio encoders and decoders.
In addition to these primary encoding and decoding functions, the components may also share information (shown in dashed lines in Figure 2) to control the rate, quality, and/or loss resilience of the encoded speech. The rate controller (220) considers a variety of factors, such as the complexity of the current input in the input buffer (210), the buffer fullness of output buffers in the encoder (230) or elsewhere, the desired output rate, the current network bandwidth, network congestion/noise conditions, and/or the decoder loss rate. The decoder (270) feeds decoder loss rate information back to the rate controller (220). The networking layer(s) (240, 260) collect or estimate information about the current network bandwidth and congestion/noise conditions, which is fed back to the rate controller (220). Alternatively, the rate controller (220) considers other and/or additional factors.

The rate controller (220) directs the speech encoder (230) to change the rate, quality, and/or loss resilience with which the speech is encoded. The encoder (230) may change rate and quality by adjusting quantization factors for parameters or by changing the resolution of the entropy codes representing the parameters. Additionally, the encoder may change loss resilience by adjusting the rate or type of redundant coding. Thus, the encoder (230) may change the allocation of bits between primary encoding functions and loss resilience functions depending on network conditions. Alternatively, the rate is controlled in some other manner, such as where the encoder operates at a fixed rate.

Figure 3 is a block diagram of a generalized speech frame encoder (300) in conjunction with which one or more of the described embodiments may be implemented. The frame encoder (300) generally corresponds to the frame encoding component (232) in Figure 2. The frame encoder (300) accepts frame input (302) from the frame splitter and produces encoded frame output (392).

The LP analysis component (330) computes linear prediction coefficients (332). In one implementation, the LP filter uses ten coefficients for 8 kHz input and sixteen coefficients for 16 kHz input, and the LP analysis component (330) computes one set of linear prediction coefficients per frame. Alternatively, the LP analysis component (330) computes two sets of coefficients per frame for each band, one for each of two windows centered at different locations, or computes a different number of coefficients per frame.

The LPC processing component (335) receives and processes the linear prediction coefficients (332). Typically, the LPC processing component (335) converts LPC values to a different representation for more efficient quantization and encoding. For example, the LPC processing component (335) converts LPC values to a line spectral pair (LSP) representation. The LSP values may be intra-coded or predicted from other LSP values. Various representations, quantization techniques, and encoding techniques are possible for LPC values. The LPC values are provided in some form as part of the encoded frame output (392) for packetization and transmission (along with any quantization parameters and other information needed for reconstruction). For subsequent use in the encoder (300), the LPC processing component (335) reconstructs the LPC values.
The LPC processing component (335) may perform interpolation of LPC values (such as, equivalently, in LSP representation or another representation) to smooth the transitions between different sets of LPC coefficients, or between the LPC coefficients used for different sub-frames of frames.

The synthesis (or "short-term prediction") filter (340) accepts the reconstructed LPC values (338) and incorporates them into the filter. The synthesis filter (340) receives an excitation signal and produces an approximation of the original signal. For a given frame, the synthesis filter (340) may store a number of reconstructed samples (e.g., ten for a ten-tap filter) from the previous frame for the start of the prediction.

The perceptual weighting components (350, 355) apply perceptual weighting to the original signal and to the modeled output of the synthesis filter (340) so as to selectively de-emphasize the formant structure of the speech signals and thereby make the auditory systems less sensitive to quantization errors. The perceptual weighting components (350, 355) exploit psychoacoustic phenomena such as masking. In one implementation, the perceptual weighting components (350, 355) apply weights based on the original LPC values (332) received from the LP analysis component (330). Alternatively, the perceptual weighting components (350, 355) apply other and/or additional weights.

Following the perceptual weighting components (350, 355), the encoder (300) computes the difference between the perceptually weighted original signal and the perceptually weighted output of the synthesis filter (340) to produce a difference signal (334). Alternatively, the encoder (300) uses a different technique to compute the speech parameters.

The excitation parameterization component (360) seeks to find the best combination of adaptive codebook indices, fixed codebook indices, and gain codebook indices in terms of minimizing the difference between the perceptually weighted original signal and the synthesized signal (in terms of weighted mean squared error or another measure). Many parameters are computed per sub-frame, but more generally the parameters may be per superframe, frame, or sub-frame. As discussed above, the parameters for different bands of a frame or sub-frame may be different. Table 2 shows the available types of parameters for different frame classes in one implementation.
TABLE 2
Parameters for Different Frame Classes

In Figure 3, the excitation parameterization component (360) divides the frame into sub-frames and computes codebook indices and gains for each sub-frame as appropriate. For example, the number and type of codebook stages to be used, and the resolutions of codebook indices, may initially be determined by an encoding mode, where the mode can be dictated by the rate control component discussed above. A particular mode may also dictate encoding and decoding parameters other than the number and type of codebook stages, for example, the resolution of the codebook indices. The parameters of each codebook stage are determined by optimizing the parameters to minimize the error between a target signal and the contribution of that codebook's signal to the synthesized signal. (As used herein, the term "optimize" means finding a suitable solution under applicable constraints such as distortion reduction, parameter search time, parameter search complexity, parameter bitrate, etc., as opposed to performing a full search of the parameter space. Similarly, the term "minimize" should be understood in terms of finding a suitable solution under applicable constraints.) For example, the optimization can be done using a modified mean squared error technique. The target signal for each stage is the difference between the residual signal and the sum of the contributions of the previous codebook stages, if any, to the synthesized signal. Alternatively, other optimization techniques may be used.

Figure 4 shows a technique for determining codebook parameters according to one implementation. The excitation parameterization component (360) performs the technique, potentially in conjunction with other components such as a rate controller. Alternatively, another component in an encoder performs the technique.

Referring to Figure 4, for each sub-frame in a voiced or transition frame, the excitation parameterization component (360) determines (410) whether an adaptive codebook can be used for the current sub-frame. (For example, the rate control may dictate that no adaptive codebook is to be used for a particular frame.) If the adaptive codebook is not to be used, then an adaptive codebook switch will indicate that no adaptive codebooks are to be used (435). For example, this can be done by setting a one-bit flag at the frame level indicating that no adaptive codebooks are to be used in the frame, by specifying a particular encoding mode at the frame level, or by setting a one-bit flag for each sub-frame indicating that no adaptive codebook is to be used in that sub-frame.

Still referring to Figure 4, if an adaptive codebook can be used, then the component (360) determines the adaptive codebook parameters. Those parameters include an index, or pitch value, that indicates a desired segment of the excitation signal history, as well as a gain to apply to the desired segment. In Figures 3 and 4, the component (360) performs a closed-loop pitch search (420). This search begins with the pitch determined by the optional open-loop pitch search component (325) in Figure 3. An open-loop pitch search component (325) analyzes the weighted signal produced by the weighting component (350) to estimate its pitch.
Starting with this estimated pitch, the closed-loop pitch search (420) optimizes the pitch value to decrease the error between the target signal and the weighted synthesized signal generated from an indicated segment of the excitation signal history. The adaptive codebook gain value is also optimized (425). The adaptive codebook gain value indicates a multiplier to apply to the pitch-predicted values (the values from the indicated segment of the excitation signal history), to adjust the scale of the values. The gain multiplied by the pitch-predicted values is the adaptive codebook contribution to the excitation signal for the current frame or sub-frame. The gain optimization (425) and the closed-loop pitch search (420) produce a gain value and an index value, respectively, that minimize the error between the target signal and the weighted synthesized signal from the adaptive codebook contribution.

If the component (360) determines (430) that the adaptive codebook is to be used, then the adaptive codebook parameters are signaled (440) in the bitstream. If not, then it is indicated that no adaptive codebook is used for the sub-frame (435), such as by setting a one-bit sub-frame-level flag, as discussed above. This determination (430) may include determining whether the adaptive codebook contribution for the particular sub-frame is significant enough to be worth the number of bits required to signal the adaptive codebook parameters. Alternatively, some other basis may be used for the determination. Moreover, although Figure 4 shows signaling after the determination, alternatively, signals are batched until the technique finishes for a frame or superframe.

The excitation parameterization component (360) also determines (450) whether a pulse codebook is used. In one implementation, the use or non-use of the pulse codebook is indicated as part of an overall coding mode for the current frame, or it may be indicated or determined in other ways. A pulse codebook is a type of fixed codebook that specifies one or more pulses to be contributed to the excitation signal. The pulse codebook parameters include pairs of indices and signs (gains can be positive or negative). Each pair indicates a pulse to be included in the excitation signal, with the index indicating the position of the pulse and the sign indicating the polarity of the pulse. The number of pulses included in the pulse codebook and used to contribute to the excitation signal can vary depending on the coding mode. Additionally, the number of pulses may depend on whether or not an adaptive codebook is being used.

If the pulse codebook is used, then the pulse codebook parameters are optimized (455) to minimize the error between the contribution of the indicated pulses and a target signal. If an adaptive codebook is not used, then the target signal is the weighted original signal. If an adaptive codebook is used, then the target signal is the difference between the weighted original signal and the contribution of the adaptive codebook to the weighted synthesized signal. At some point (not shown), the pulse codebook parameters are then signaled in the bitstream.

The excitation parameterization component (360) also determines (465) whether any random fixed codebook stages are to be used.
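For illustration (not from the patent), a simplified sketch of the closed-loop lag search and adaptive-codebook gain optimization described above: for each candidate lag, take that segment of the excitation history, compute the least-squares gain, and keep the lag minimizing the residual error. The lag range is the common 20-147 samples for 8 kHz speech; a real codec would search around the open-loop estimate, pass candidates through the weighted synthesis filter, and handle lags shorter than the sub-frame by repeating the segment, all omitted here:

```python
def pitch_search(history, target, min_lag=20, max_lag=147):
    """Return (lag, gain) minimizing ||target - gain * history_segment||^2."""
    n = len(target)
    best_lag, best_gain, best_err = None, 0.0, float("inf")
    for lag in range(max(min_lag, n), max_lag + 1):  # need a full segment
        start = len(history) - lag
        if start < 0:                                # history too short
            break
        seg = history[start:start + n]
        energy = sum(s * s for s in seg)
        if energy == 0.0:
            continue
        # Least-squares optimal gain for this candidate lag.
        gain = sum(t * s for t, s in zip(target, seg)) / energy
        err = sum((t - gain * s) ** 2 for t, s in zip(target, seg))
        if err < best_err:
            best_lag, best_gain, best_err = lag, gain, err
    return best_lag, best_gain
```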
The number (if any) of random codebook stages is indicated as part of the overall coding mode for the current frame, although it may be indicated or determined in other ways. A random codebook is a type of fixed codebook that uses a pre-defined signal model for the values it encodes. The codebook parameters may include the starting point for an indicated segment of the signal model and a sign that can be positive or negative. The length or range of the indicated segment is typically fixed, and is therefore not typically signaled, but alternatively a length or extent of the indicated segment is signaled. A gain is multiplied by the values in the indicated segment to produce the random codebook contribution to the excitation signal.

If at least one random codebook stage is used, then the codebook stage parameters for that codebook stage are optimized (470) to minimize the error between the contribution of the random codebook stage and a target signal. The target signal is the difference between the weighted original signal and the sum of the contributions to the weighted synthesized signal of the adaptive codebook (if any), the pulse codebook (if any), and the previously determined random codebook stages (if any). At some point (not shown), the random codebook parameters are then signaled in the bitstream.

The component (360) then determines (480) whether any more random codebook stages are to be used. If so, then the parameters of the next random codebook stage are optimized (470) and signaled as described above. This continues until all the parameters for the random codebook stages have been determined. All the random codebook stages can use the same signal model, although they will likely indicate different segments of the model and have different gain values. Alternatively, different signal models can be used for different random codebook stages.

Each excitation gain may be quantized independently, or two or more gains may be quantized together, as determined by the rate controller and/or other components.

While a particular order has been set forth here for optimizing the various codebook parameters, other orders and optimization techniques may be used. For example, all random codebooks could be optimized simultaneously. Thus, although Figure 4 shows sequential computation of the different codebook parameters, alternatively, two or more different codebook parameters are optimized together (e.g., by jointly varying the parameters and evaluating the results according to some non-linear optimization technique). Additionally, other configurations of codebooks or other excitation signal parameters could be used. The excitation signal in this implementation is the sum of any contributions of the adaptive codebook, the pulse codebook, and the random codebook stage(s). Alternatively, the component (360) may compute other and/or additional parameters for the excitation signal.

Referring now to Figure 3, the codebook parameters for the excitation signal are signaled or otherwise provided to a local decoder (365) (enclosed by dashed lines in Figure 3) as well as to the frame output (392). Thus, for each band, the encoder output (392) includes the output of the LPC processing component (335) discussed above, as well as the output of the excitation parameterization component (360).
The bitrate of the output (392) depends in part on the parameters used by the codebooks, and the encoder (300) may control the bitrate and/or quality by switching between different sets of codebook indices, using embedded codes, or using other techniques. Different combinations of codebook types and stages can yield different encoding modes for different frames, bands, and/or sub-frames. For example, an unvoiced frame may use only one random codebook stage. An adaptive codebook and a pulse codebook may be used for a low-rate voiced frame. A high-rate frame may be encoded using an adaptive codebook, a pulse codebook, and one or more random codebook stages. The rate control module can determine or influence the mode for each frame.

Still referring to Figure 3, the output of the excitation parameterization component (360) is received by codebook reconstruction components (370, 372, 374, 376) and gain application components (380, 382, 384, 386) corresponding to the codebooks used by the parameterization component (360). The codebook stages (370, 372, 374, 376) and the corresponding gain application components (380, 382, 384, 386) reconstruct the contributions of the codebooks. Those contributions are summed to produce an excitation signal (390), which is received by the synthesis filter (340), where it is used together with the "predicted" samples from which subsequent linear prediction occurs. Delayed portions of the excitation signal are used as an excitation history signal by the adaptive codebook reconstruction component (370) to reconstruct subsequent adaptive codebook contributions (e.g., pitch contributions), and by the parameterization component (360) in computing subsequent adaptive codebook parameters (e.g., pitch index and pitch gain values).

Referring back to Figure 2, the band output for each band is accepted by the MUX (236), along with other parameters. Such other parameters can include, among other information, frame class information (222) from the frame classifier (214) and frame encoding modes. The MUX (236) constructs application-layer packets to pass to other software, or the MUX (236) puts data in the payloads of packets that follow a protocol such as RTP. The MUX may buffer parameters so as to allow selective repetition of the parameters for forward error correction in later packets. In one implementation, the MUX (236) packs into a single packet the primary encoded speech information for one frame, along with the forward error correction information for all or part of one or more previous frames.

The MUX (236) provides feedback such as current buffer fullness for rate control purposes. More generally, various components of the encoder (230) (including the frame classifier (214) and MUX (236)) may provide information to a rate controller (220) such as the one shown in Figure 2.

The bitstream DEMUX (276) of Figure 2 accepts encoded speech information as input and parses it to identify and process parameters. The parameters may include frame class, some representation of LPC values, and codebook parameters. The frame class may indicate which other parameters are present for a given frame. More generally, the DEMUX (276) uses the protocols used by the encoder (230) and extracts the parameters the encoder (230) packs into packets.
For packets received over a dynamic packet-switched network, the DEMUX (276) includes a jitter buffer to smooth out short-term fluctuations in packet rate over a given period of time. In some cases, the decoder (270) regulates buffer delay and manages when packets are read out of the buffer so as to integrate delay, quality control, concealment of missing frames, etc. into decoding. In other cases, an application-layer component manages the jitter buffer, and the jitter buffer is filled at a variable rate and depleted by the decoder (270) at a constant or relatively constant rate.

The DEMUX (276) may receive multiple versions of parameters for a given segment, including a primary encoded version and one or more secondary error correction versions. When error correction fails, the decoder (270) uses concealment techniques such as parameter repetition or estimation based upon information that was correctly received.

Figure 5 is a block diagram of a generalized real-time speech frame decoder (500) in conjunction with which one or more of the described embodiments may be implemented. The frame decoder (500) generally corresponds to any one of the band decoding components (272) of Figure 2. The frame decoder (500) accepts encoded speech information (592) as input and produces a reconstructed output (502) after decoding. The components of the decoder (500) have corresponding components in the encoder (300), but overall the decoder (500) is simpler since it lacks components for perceptual weighting, the excitation processing loop, and rate control.

The LPC processing component (535) receives information representing LPC values in the form provided by the frame encoder (300) (as well as any quantization parameters and other information needed for reconstruction). The LPC processing component (535) reconstructs the LPC values (538) using the inverse of the conversion, quantization, encoding, etc. previously applied to the LPC values. The LPC processing component (535) may also perform interpolation of LPC values (in LPC representation or another representation such as LSP) to smooth the transitions between different sets of LPC coefficients.

The codebook stages (570, 572, 574, 576) and gain application components (580, 582, 584, 586) decode the parameters of any of the corresponding codebook stages used for the excitation signal and compute the contribution of each codebook stage that is used. More generally, the configuration and operations of the codebook stages (570, 572, 574, 576) and gain components (580, 582, 584, 586) correspond to the configuration and operations of the codebook stages (370, 372, 374, 376) and gain components (380, 382, 384, 386) in the encoder (300). The contributions of the used codebook stages are summed, and the resulting excitation signal (590) is fed into the synthesis filter (540). Delayed values of the excitation signal (590) are also used as an excitation history by the adaptive codebook (570) in computing the contribution of the adaptive codebook for subsequent portions of the excitation signal.

The synthesis filter (540) accepts reconstructed LPC values (538) and incorporates them into the filter. The synthesis filter (540) stores previously reconstructed samples for processing. The excitation signal (590) is passed through the synthesis filter to form an approximation of the original speech signal.
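As an illustration (not from the patent) of the short-term synthesis step just described, a minimal all-pole synthesis filter that turns the summed excitation back into a speech approximation, keeping the last `order` outputs as the cross-frame memory the text mentions; the coefficients passed in would come from LPC decoding:

```python
# y[n] = e[n] + sum_k lpc[k] * y[n-1-k]: the inverse of the encoder's
# short-term prediction. `memory` holds past outputs, newest first.
def synthesize(excitation, lpc, memory=None):
    order = len(lpc)
    mem = list(memory) if memory else [0.0] * order
    out = []
    for e in excitation:
        y = e + sum(lpc[k] * mem[k] for k in range(order))
        out.append(y)
        mem = [y] + mem[:-1]    # shift in the newest output sample
    return out, mem             # return memory for the next frame
```

The returned `mem` is exactly the kind of decoder state that is lost when frames go missing, which motivates the memory recovery techniques of Figure 12.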
The relationships shown in Figures 2-5 indicate general flows of information; other relationships are not shown for the sake of simplicity. Depending on the implementation and the type of compression desired, components can be added, omitted, split into multiple components, combined with other components, and/or replaced with like components. For example, in the environment (200) shown in Figure 2, the rate controller (220) can be combined with the speech encoder (230). Potential added components include a multimedia encoding (or playback) application that manages the speech encoder (or decoder) as well as other encoders (or decoders), collects network and decoder condition information, and performs adaptive error correction functions. In alternative embodiments, other combinations and configurations of components process speech information using the techniques described herein.
III. Robust Decoding Techniques and Tools
With various robust decoding techniques and tools, the quality of speech derived from a speech codec, such as a real-time speech codec, is improved. The various robust decoding techniques and tools can be used separately or in combination.

While particular computing environment features and codec features are described above, one or more of the tools and techniques may be used with various different types of computing environments and/or various different types of codecs. For example, one or more of the robust decoder techniques may be used with codecs that do not use the CELP coding model, such as adaptive differential pulse code modulation codecs, transform codecs, and/or other types of codecs. As another example, one or more of the robust decoder techniques may be used with single-band codecs or sub-band codecs. As another example, one or more of the robust decoder techniques may be applied to an individual band of a multi-band codec and/or to a synthesized or unencoded signal including multi-band contributions of a multi-band codec.
A. Decoder-Side VoIP
Voice over Internet Protocol ("VoIP") applications are one possible use for the described codec. VoIP applications typically operate in either so-called "push mode" or so-called "pull mode." The described techniques can be used in either mode, or in some other manner. However, because many current VoIP applications use pull-mode decoding, Figure 6 generally illustrates a possible arrangement of some decoder-side pull-mode operations.

In general, the VoIP application (610) passes all packets received from a network source (600) immediately to the decoder (620), which decides on the desired buffering delay. In Figure 6, the decoder (620) performs the concealment techniques described below, rather than delegating the task of concealing missing portions of the signal to the receiving application (610). Alternatively, the application (610) or some other component performs the concealment techniques described below.

In pull mode, the renderer (630), an audio playback component, requests (650) additional decoded information when it needs it for playback. The VoIP application (610) retrieves (655) the available information from the network source (600) (not only for the requested playback period) and pushes (660) the information to the decoder (620). The decoder (620) decodes some portion of the information (e.g., samples for the requested playback period) and returns (665) the decoded information to the VoIP application (610), which provides (670) the information to the renderer (630). If one or more frames are missing, the decoder (620) performs one or more concealment techniques and returns the resulting concealment information (e.g., as audio samples) to the VoIP application (610) instead of actual samples for the missing frame(s). Because the renderer (630) requests samples as needed, timing relies only on the local clock used by the renderer (630). Thus, this design can automatically take care of the timing drift problems that can occur in other designs, such as push-mode decoders that may rely on multiple clocks with slightly different rates.

As illustrated in Figure 7, the packets received by the VoIP application from a network source (710) are immediately delivered to the decoder, which places them, according to their sequential order in the audio signal, in a jitter buffer (720). In Figure 7, a "1" in the jitter buffer (720) indicates that the corresponding packet is present, and a "0" indicates that the corresponding packet is absent. In addition, there is a sample buffer (730), which holds decoded output samples (740) available for playback. Whenever the renderer (760) wants to play some audio, it pulls the desired samples (750) from the decoder's sample buffer (730). If the sample buffer (730) does not have enough samples to satisfy the request, the decoder checks the jitter buffer (720) to see whether the next packet is available. If the next packet is available, it is decoded, the decoded samples are placed in the sample buffer (730), and the requested samples are passed to the renderer (760) through the VoIP application. On the other hand, if the packet is not available, the concealment module is invoked to generate replacement samples. The replacement samples are placed in the sample buffer (730), and the request is satisfied.
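A hedged sketch (not part of the patent) of the pull-mode flow in Figure 7; `pop_next`, `decode_frame`, and `conceal_frame` are hypothetical stand-ins for the jitter buffer's and codec's real routines:

```python
# The renderer asks for `count` samples; the decoder drains its
# sample buffer, decodes the next packet from the jitter buffer when
# one is present, and invokes concealment when it is not.
def pull_samples(count, sample_buffer, jitter_buffer,
                 decode_frame, conceal_frame):
    while len(sample_buffer) < count:
        packet = jitter_buffer.pop_next()      # assumed: None if missing
        if packet is not None:
            sample_buffer.extend(decode_frame(packet))
        else:
            sample_buffer.extend(conceal_frame())
    # Hand the requested samples to the renderer, keep the remainder.
    out, sample_buffer[:] = sample_buffer[:count], sample_buffer[count:]
    return out
```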
In either case, the jitter buffer is shifted, and the jitter buffer head information (725) is adjusted accordingly.
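To make the pull-mode flow above concrete, the following minimal sketch (in Python; not part of the patent text) shows how a decoder might serve renderer pulls from a jitter buffer and a sample buffer, falling back to concealment for absent packets. All names here (PullModeDecoder, decode_frame, conceal) are illustrative assumptions, and the two stubs stand in for the real decoding and concealment modules.

```python
def decode_frame(frame):
    # Stand-in for the real frame decoder.
    return list(frame)

def conceal(n):
    # Stand-in for the concealment module of Section C (here: silence).
    return [0.0] * n

class PullModeDecoder:
    def __init__(self, frame_size):
        self.frame_size = frame_size
        self.jitter = {}    # sequence number -> encoded frame, if received
        self.head = 0       # jitter buffer head: next frame to decode
        self.samples = []   # decoded output samples awaiting playback

    def on_packet(self, seq, frame):
        # Arriving packets are slotted by their order in the audio signal.
        if seq >= self.head:
            self.jitter[seq] = frame

    def pull(self, count):
        # The renderer pulls samples; decode or conceal until satisfied.
        while len(self.samples) < count:
            frame = self.jitter.pop(self.head, None)
            if frame is not None:
                self.samples.extend(decode_frame(frame))
            else:
                self.samples.extend(conceal(self.frame_size))
            self.head += 1  # shift the jitter buffer head accordingly
        out, self.samples = self.samples[:count], self.samples[count:]
        return out
```

In this arrangement, a renderer requesting, say, 160 samples per 20-millisecond tick would simply call pull(160) on its own schedule, so timing depends only on its local clock, as discussed above.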
B. Packet Delay Calculation

In some robust decoding techniques and tools, the decoder tracks the packet loss rate and estimates the network delay by checking packet reception times. Figure 8 illustrates an example packet delay distribution. The history of received packets can be used to determine any delay trend and to calculate a desired delay (D_desired, 820). If the desired delay (820) changes, or if the packet delay distribution changes (e.g., due to network fluctuations), the decoder can dynamically change the buffering delay and decode using time compression/stretching techniques (or variable playout), which modify the playback time of audio while maintaining the correct pitch. Using such techniques, when packet delay is observed to increase, the samples in the sample buffer can be stretched in time so that subsequent packets can be accommodated, to an extent acceptable for playback delay, interactivity, etc. For voiced frames, this can be done using the WSOLA techniques described below. Similarly, if the delay is observed to decrease, the samples in the buffer can be compressed in time (played faster), so that the sample buffer does not become overly full.

Traditional approaches to calculating the desired delay include simply using the maximum delay of previous packets as a guide. However, there can be outliers, which cause the system to adapt to unrepresentative packet delay values. In addition, it is sometimes better to allow some packets to arrive too late for playback than to impose too long a delay in order to receive and use the outlier packets. Some codecs calculate the desired delay based on formulas such as running averages and variance calculations. An example is described in R. Ramjee, J. Kurose, D. Towsley, and H. Schulzrinne, "Adaptive Playout Mechanisms for Packetized Audio Applications in Wide-Area Networks," Proceedings of IEEE INFOCOM '94, pages 680-688, April 1994. However, many parameters need to be optimized in such calculations, and it is difficult to find the right trade-off between responsiveness on the one hand and basing calculations on a representative population of historical values on the other.

Another approach is to directly analyze a packet delay distribution (810), such as by using an implementation of the Concord algorithm described in C. J. Sreenan, J.-C. Chen, P. Agrawal, and B. Narendran, "Delay Reduction Techniques for Playout Buffering," IEEE Transactions on Multimedia, vol. 2, no. 2, pages 88-100, June 2000. In that method, a histogram of packet delays is maintained. The width of a bin in the histogram represents the precision with which the optimal delay can be calculated. Decreasing the bin size improves accuracy but increases tracking complexity. The shape of the histogram roughly reflects the underlying packet delay distribution. When a new packet arrives, its delay is mapped to the corresponding bin, and the count of packets falling into that bin is incremented. To age out older packets, the counts in all the other bins are scaled down in a procedure called "aging." To find the new desired delay, the decoder establishes a desired loss rate (L, 830); typical desired loss rate values range between 1% and 5%. The histogram is then analyzed to determine where the desired delay needs to be to achieve that loss rate.
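As a rough illustration of that last step, the following sketch (the bin layout, names, and 2% default are assumptions, not taken from the patent) walks a delay histogram down from the highest bin until the permitted fraction of late packets is used up, and returns the corresponding delay value.

```python
def desired_delay_from_histogram(counts, bin_width, loss_rate=0.02):
    # counts[b] = number of history packets whose delay fell in bin b,
    # i.e., delays from b*bin_width up to (b+1)*bin_width. Walk down
    # from the top bin, accumulating "late" packets, until the allowed
    # fraction is reached.
    total = sum(counts)
    allowed_late, late = total * loss_rate, 0
    for b in range(len(counts) - 1, -1, -1):
        late += counts[b]
        if late > allowed_late:
            return (b + 1) * bin_width  # top edge of bin b
    return 0.0
```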
The problem with this approach is that several parameters need to be tuned, such as the bin width and aging factors. In addition, all older packets are treated similarly, and the aging approach plays a very significant role in overall performance. Moreover, when a clock drift situation occurs, the overall delay has a positive or negative trend, which can further complicate the algorithm, especially when the histogram is intended to remain static along the delay time axis.

In one robust decoding technique presented here, a history of delays for a set number of packets is maintained. For example, the number of packets included in the history may be 500, 1000, or some other suitable number. With each incoming packet, the oldest packet delay is replaced with the newest packet delay; that is, the oldest packet in the history is replaced with the newest. In that way, only the oldest packet is removed from the calculations. Referring to Figure 8, a packet delay distribution (810) histogram is generated, with the desired bin width, from the history of all packet delays in the record. The optimal desired delay (820) is determined as the bin delay value corresponding to the desired loss rate (830). This technique does not rely on aging methods, so the problems that result from clock drift situations are diminished. However, the computational requirements for regenerating a new histogram with each new packet are typically high. For example, in one implementation, the histogram calculation consumed approximately 10% of the total decoder complexity. It may be possible to decrease this complexity by reusing a previous histogram, rather than reconstructing it from scratch, but if a clock drift situation occurs, this may allow the delay histogram to drift along the delay time axis even when it should remain static.

In another robust decoding technique, a data structure is used to determine the optimal buffering that adapts to varying packet delays. In a particular implementation, a doubly linked list of delay nodes placed linearly in memory (for example, in an array) is used for updates as well as for determining the desired delay. The history at the newest received packet is defined by History[n] = {Dn, Dn-1, Dn-2, ..., Dn-N+1}, where the size of the history is N packets. The delay values for the packets represented in the history are placed in a doubly-linked-list type of data structure. An example of such a data structure (910) with four nodes (920, 922, 924, 926) in an array is illustrated in Figure 9. Of course, the data structure can include many more nodes, such as 500 to 1000 nodes; four nodes are shown for the sake of simplicity in explaining the data structure. Each node includes a delay value (940) corresponding to the delay of a received packet. The nodes (920, 922, 924, 926) can be placed linearly in memory, and an index P (930) can be cycled through all the nodes. In that way, the index P (930) tracks the packet that is currently the oldest in the array. Each node (920, 922, 924, 926) has a next pointer (950) (here, simply a field of the node) and a previous pointer (960) (here, simply a field of the node). Each next pointer (950) references the node with the next higher delay value. Each previous pointer (960) references the node with the next lower delay value.
Thus, in the data structure (910) illustrated in Figure 9, Dn-1 (in P3) > Dn (in P4) > Dn-2 (in P2) > Dn-3 (in P1). Initially, the oldest packet (P1) is at the beginning of the array. As each new packet is received, it is added to the structure (910) until the history is full (for example, 500 tracked packets). From that point on, when a new packet is received, its node takes the place of the oldest packet being replaced (for example, packet 1 is replaced with packet 501, and the oldest packet is then packet 2; next, packet 2 is replaced with packet 502, and the oldest packet is then packet 3). When the old node information is replaced, the previous and next pointers are updated. The correct place in the ordered list of delay values is found by traversing the pointers, and the new delay value is "inserted" into the ordering of delay values by updating the next and previous pointers. The linked list also maintains a reference to the highest delay in the list.

To find the optimal desired delay for a corresponding desired loss rate, the decoder starts at the highest delay and follows the previous pointers toward lower delays until a count of nodes corresponding to the desired loss rate has been passed. This can be done each time a new packet is received, or at some other interval. Finding the desired delay is a simple procedure because only a few nodes are traversed (for example, 1% to 5% of the nodes for desired loss rates of 1% to 5%). Moreover, this can provide a natural solution to the clock drift problem because no actual histogram is formed; the technique is thus not confined to the precision specified by bin widths, and the calculations can be finer and more accurate than typical histogram techniques. Alternatively, instead of being implemented as an array in linear memory, the doubly linked list is implemented as some other type of dynamic data structure.
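A minimal sketch of such an array-backed doubly linked delay list follows, assuming -1 as the "no link" sentinel; DelayTracker and its method names are illustrative implementation choices, not anything specified above.

```python
class DelayTracker:
    # Index `p` cycles through the array replacing the oldest packet's
    # delay; the next/prev links keep all delays in sorted order.
    def __init__(self, capacity):
        self.delay = [0.0] * capacity
        self.next = [-1] * capacity     # toward the next higher delay
        self.prev = [-1] * capacity     # toward the next lower delay
        self.top = -1                   # node holding the highest delay
        self.count = 0
        self.p = 0                      # oldest node, replaced next

    def _unlink(self, i):
        if self.prev[i] >= 0: self.next[self.prev[i]] = self.next[i]
        if self.next[i] >= 0: self.prev[self.next[i]] = self.prev[i]
        if self.top == i: self.top = self.prev[i]

    def _insert_sorted(self, i):
        if self.top < 0 or self.delay[i] >= self.delay[self.top]:
            self.prev[i], self.next[i] = self.top, -1
            if self.top >= 0: self.next[self.top] = i
            self.top = i
            return
        j = self.top                    # walk down to the insertion point
        while self.prev[j] >= 0 and self.delay[self.prev[j]] > self.delay[i]:
            j = self.prev[j]
        self.prev[i], self.next[i] = self.prev[j], j
        if self.prev[j] >= 0: self.next[self.prev[j]] = i
        self.prev[j] = i

    def add(self, d):
        i = self.p
        if self.count == len(self.delay):
            self._unlink(i)             # evict the oldest packet's delay
        else:
            self.count += 1
        self.delay[i] = d
        self._insert_sorted(i)
        self.p = (self.p + 1) % len(self.delay)

    def desired_delay(self, loss_rate=0.02):
        # Follow prev links down from the top until the allowed number
        # of late packets has been skipped.
        if self.top < 0:
            return 0.0
        i = self.top
        for _ in range(int(self.count * loss_rate)):
            if self.prev[i] < 0:
                break
            i = self.prev[i]
        return self.delay[i]
```

Here add costs a walk through the sorted order in the worst case but is cheap when consecutive delays are similar, while desired_delay touches only the few nodes at the tail of the distribution, consistent with the 1% to 5% figure above.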
C. Signal-Dependent Concealment

When one or more packets are missing, one of several different approaches can be used to produce a signal that conceals the loss. An appropriate concealment technique can be chosen depending on the nature of the signal and the number of missing packets or frames. An example of such a technique determination is illustrated in Figure 10 (a summary sketch appears at the end of this section). Alternatively, other determinations are used. Referring to Figure 10, one or more missing frames are encountered (1010). A concealment technique is selected (1020) based on one or more factors, such as the nature of one or more adjacent frame(s) and the number of missing frames. Alternatively, other and/or additional factors are used for the selection. A concealment signal is then produced (1030) using the selected technique. Several example signal-dependent concealment techniques follow.

According to a first type of signal-dependent concealment technique, unidirectional concealment is used for voiced speech. Unidirectional concealment can also be used when future information is available, if the gap in the available audio information is too long for bi-directional concealment to be practical, to keep complexity low, or for other reasons. The waveform similarity overlap-add ("WSOLA") technique can be used for unidirectional concealment. WSOLA is described in Y. J. Liang, N. Färber, and B. Girod, "Adaptive Playout Scheduling and Loss Concealment for Voice Communication over IP Networks," IEEE Transactions on Multimedia, vol. 5, no. 4, pages 532-543, December 2003. The WSOLA technique extends a previous signal by generating new cycles through weighted averages of existing pitch cycles. A signal of any desired length can be generated by this method until the next packet is received. However, if the pitch extension becomes too extensive, such as where many packets are lost, the WSOLA method may produce a mechanical sound. Thus, in the present robust decoding technique, noise can be added if the pitch extension becomes too long. The noise can be generated, for example, by passing random white noise through a linear prediction coefficient filter fitted to the last received signal. The noise energy can be increased as the extension grows. As discussed below, the energy of the pitch extension can be gradually decreased until only the generated background noise remains.

The extended signal is merged with a subsequent signal produced from a later received packet. This may involve a signal-matching routine that maximizes the correlation between the two signals. The signal-matching routine can be a simple correlation routine, such as one that multiplies corresponding sample amplitudes of the overlapping signals and sums the resulting products to produce a correlation value; the lag of the later signal is adjusted to maximize that correlation value (see the sketch below). Alternatively, other signal-matching routines can be used. The signals can then be merged using weighting windows, such as weighting windows designed to keep the signal energy approximately constant across the merged region.
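A sketch of such a matching-and-merging routine follows (illustrative, not the patent's implementation; the inputs are assumed to be NumPy arrays, with the later signal carrying at least max_lag + overlap samples). The score is the simple multiply-and-sum correlation described above, and a linear cross-fade stands in for the energy-preserving weighting windows.

```python
import numpy as np

def match_and_merge(extended, later, overlap, max_lag=40):
    # Try trimming 0..max_lag samples from the start of `later`, score
    # each candidate alignment with a product-sum correlation, then
    # cross-fade over `overlap` samples at the best alignment.
    tail = extended[-overlap:]
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_lag + 1):
        head = later[lag:lag + overlap]
        corr = float(np.dot(tail, head))  # multiply-and-sum correlation
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    head = later[best_lag:]
    w = np.linspace(1.0, 0.0, overlap)    # stand-in weighting window
    fused = tail * w + head[:overlap] * (1.0 - w)
    return np.concatenate([extended[:-overlap], fused, head[overlap:]])
```

The same routine could serve the bi-directional case described later in this section, with the future-side extension playing the role of the later signal.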
According to a second type of signal-dependent concealment technique, if the signal to be extended is unvoiced in nature (i.e., clear voicing or pitch information cannot be found in the signal), then a different algorithm is used to extend the signal. In such cases, the signal to be extended is analyzed for its linear prediction coefficients, and random noise is passed through a linear prediction coefficient filter using those coefficients to generate a synthetic unvoiced signal to merge with the signal being extended. The energy of the synthesized signal is derived from the signal used for the extension. In general, this derivation can involve dividing an unvoiced speech frame in half, and extending and merging each half; a sketch of the derivation appears after the Figure 11 example below.

In the example shown in Figure 11, an unvoiced speech signal (1110) is available for a frame, and a signal (1120) is missing for a subsequent frame. The unvoiced speech signal is divided into two overlapping segments of available samples, or signals (1130, 1140). The left available samples (1130) comprise the left half of the available signal (1110) plus some subsequent samples from the right half for smoothing; for example, forty or eighty subsequent samples may be used. Similarly, the right available samples (1140) comprise the right half of the available signal (1110) plus some preceding samples for smoothing.

The left derived samples (1150) are created based on the energy of the left available samples (1130). Specifically, the left samples (1130) are analyzed to produce corresponding linear prediction coefficients. Random noise is generated and passed through a linear prediction coefficient filter using those coefficients, and the resulting signal is scaled so that its energy equals the energy of the available samples (1130), thereby generating a synthetic unvoiced signal, the left derived samples (1150). Likewise, the right derived samples (1160) are created based on the energy of the right available samples (1140): the right samples (1140) are analyzed to produce corresponding linear prediction coefficients, random noise is generated and passed through a linear prediction coefficient filter using those coefficients, and the resulting signal is scaled so that its energy equals the energy of the available samples (1140), thereby generating a synthetic unvoiced signal, the right derived samples (1160).

As illustrated in Figure 11, the four sample groups are arranged from left to right as follows: left available samples (1130), left derived samples (1150), right derived samples (1160), and right available samples (1140). The overlapping signals (1130, 1140, 1150, 1160) are merged using weighting windows (1170) to produce a resulting signal (1180). The duration of the resulting signal (1180) is sufficient to cover the playback time of the available signal (1110) and the missing signal (1120). The resulting signal (1180) can also be repeated to further extend the signal if subsequent frames are missing as well. It is believed that this technique can perform well even in many situations where the energy within the available unvoiced frame increases or decreases significantly within the frame. Alternatively, the decoder divides the available information into more segments for derivation, merging, and so forth.
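The derivation of each half might look like the following sketch, assuming the autocorrelation (Levinson-Durbin) method for the linear prediction coefficients; all names are illustrative, and a real implementation would use windowed autocorrelations and efficient filtering.

```python
import numpy as np

def lpc_coeffs(x, order):
    # Levinson-Durbin recursion on the autocorrelation of x.
    # Returns [1, a1, ..., a_order] for the analysis filter A(z).
    r = [float(np.dot(x[:len(x) - k], x[k:])) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0] if r[0] > 0 else 1e-12
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a = [a[j] + k * a[i - j] if 1 <= j < i else a[j]
             for j in range(order + 1)]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def synthetic_unvoiced(segment, length, order=10):
    # Shape white noise with the all-pole filter 1/A(z) fitted to
    # `segment`, then scale the result to match the segment's energy.
    a = lpc_coeffs(segment, order)
    noise = np.random.randn(length)
    out = np.zeros(length)
    for n in range(length):
        past = sum(a[k] * out[n - k] for k in range(1, order + 1) if n >= k)
        out[n] = noise[n] - past
    gain = np.sqrt(np.mean(np.square(segment)) /
                   max(np.mean(np.square(out)), 1e-12))
    return out * gain
```

In the Figure 11 flow, this would be applied once to the left available samples (1130) and once to the right available samples (1140), with the results cross-faded by the weighting windows (1170).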
According to a third type of signal-dependent concealment technique, if both past and future audio information are available, then bi-directional concealment is used. Specifically, if the gap in the available audio information is not too large (for example, three missing frames or fewer), then information from both sides can be used to fill the gap. Past information is extended to a bit past the midpoint between the past and future available frames, and future information is extended backward to a bit past the midpoint. For example, past and future information may each be extended approximately one pitch length past the midpoint for voiced audio, or each may be extended by some predetermined fixed amount, such as forty or eighty samples past the midpoint. One or both signals can be extended using the WSOLA method, the unvoiced concealment technique described above, and/or some other method. For example, if the past information is voiced and the future information unvoiced, then the first technique described above can be used to extend the past information and the second technique can be used to extend the future information. For voiced signals, a signal-matching routine such as the one described above can be used to ensure that the signals merge at the best locations toward the past and future ends of the midpoint region. Whether the signals are voiced, unvoiced, or both, they can then be merged using weighting windows, such as weighting windows designed to keep the signal energy approximately constant across the merged region. Noise is typically not added to the signal when bi-directional concealment is used.

In the first and second types of signal-dependent concealment technique above, where there is an extended period without received packets, the energy of the extended voiced or unvoiced speech signal created using one of the techniques described above is gradually decreased, and generated background noise is gradually increased, until the resulting signal reaches a predetermined background-noise energy level. This helps reduce artificial effects, such as ringing and mechanical sounds, that can result from signals being extended for long periods of time. The technique can be particularly useful in applications that use voice activity detection without any comfort noise generation. In such applications, when voice activity stops (for example, during pauses between sentences, or while the other speaker is talking), the sending application simply stops sending packets for the silent period. Although the decoder may then encounter missing frames for such periods, it can fade to background noise, which is the desired response during periods of silence. In one implementation, noise begins to be introduced gradually after the signal has been extended by approximately two frames, and the noise typically becomes perceptible to a listener after approximately four frames. By the time the signal has been extended by approximately six or seven frames, only white noise remains. These times can be adjusted based on perceived audio quality for listeners in different circumstances.
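Pulling the section together, one plausible reading of the selection step (1020) of Figure 10 is sketched below; the classifications, the three-frame gap limit, and the names are illustrative assumptions rather than a definitive decision rule.

```python
def select_concealment(prev_class, next_class, missing_count, max_gap=3):
    # prev_class / next_class: "voiced", "unvoiced", or None when that
    # side is unavailable; missing_count: consecutive missing frames.
    if next_class is not None and missing_count <= max_gap:
        return "bidirectional"            # third technique above
    if prev_class == "voiced":
        return "voiced_unidirectional"    # WSOLA extension, noise if long
    return "unvoiced_unidirectional"      # LPC-shaped noise extension
```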
D. Pitch Jitter

Using the WSOLA technique or other voiced speech extension techniques, a signal can be produced that uses the pitch characteristics of a previous signal segment, such as a previous frame or frames. The resulting signal segment is often reused to extend the signal, sometimes to conceal the loss of many packets. Such repeated extension of a signal through pitch repetition (as with the WSOLA techniques) can result in ringing or mechanical sounds. The weighting used in the WSOLA technique can itself help reduce such effects, and noise addition techniques such as those described above can reduce them further.

Ringing and mechanical sounds can also be reduced by adding a random, pseudorandom, or other adjustment factor to the pitch size/offset of subsequent samples as the signal is extended. Thus, instead of extending the signal in pitch cycles of exactly equal size, a factor is added to the pitch size so that the signal sounds more like natural speech, which rarely has an exactly fixed pitch rate. As the length of the extension increases, the factor can be increased in magnitude. This technique can be implemented using a factor table or list that includes a sequence of factors, with the range of the factors gradually increasing in magnitude. As an example, the offset adjustment factor can be zero samples for the first pitch cycle, negative one for the second cycle, positive one for the third cycle, positive two for the fourth cycle, negative one for the fifth cycle, and so on (a sketch appears at the end of this section). Or, the adjustment factor can be selected randomly or pseudorandomly from the applicable range during decoding. In one implementation, the maximum range of the random factors is negative three samples to positive three samples; the magnitude of the range can be adjusted based on listening tests. Alternatively, other techniques may be used to produce the adjustment factors, and/or other ranges of adjustment factors may be used.

The three techniques described above for reducing the undesirable sound effects of repeated pitch cycles (WSOLA weighting, noise addition, and pitch jitter) can be used separately or in combination. Tuning the parameters of these techniques can produce more desirable results in particular implementations.
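The factor sequence mentioned above might be produced as in the following sketch; the growth schedule and the plus-or-minus three-sample cap are the illustrative values discussed in this section, not fixed requirements.

```python
import random

def jittered_pitch_lengths(base_pitch, cycles, max_spread=3):
    # Perturb each repeated pitch cycle by a small random offset whose
    # range widens as the extension grows, up to +/- max_spread samples.
    lengths = []
    for i in range(cycles):
        spread = min(max_spread, i)   # widen the range as we extend
        lengths.append(base_pitch + random.randint(-spread, spread))
    return lengths
```

Each returned length would then be used for one repeated pitch cycle of the extension.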
E. Decoder Memory Recovery

Codecs with limited or no internal memory are reasonably simple to use in VoIP applications. In such cases, packet losses do not put the encoder and decoder out of synchronization, because there is no significant memory dependency (past history) on prior information when decoding current information. However, such speech codecs are usually not very efficient with respect to bit rate. In codecs where the bit rate is lower, a strong memory dependency is typically introduced in order to improve quality at low bit rates. This may not be a problem in some applications, such as compression for archival purposes. However, when such codecs are used in other applications, such as VoIP applications, packets are often lost. Memory dependency, such as the adaptive codebook's memory dependency in producing an excitation signal, can then cause distortions in the decoder beyond the lost packet.

The following technique, which can be used to produce substitute information for use as a historical signal by subsequent packets, is generally illustrated in Figure 12. Alternatively, other and/or additional techniques are used. First, the decoder generates (1210) a concealment signal for one or more missing frames. In this way, the missing frame(s) are concealed as part of the signal made available for the next received packet. The signal-dependent techniques above can be used for such concealment, for example. Alternatively, other concealment techniques may be used.
The decoder then performs at least partial encoding (and reconstruction) (1220) on the concealment signal. For example, the decoder passes the concealment signal produced by one or more concealment techniques through a simplified fake encoder (residing in the decoder) to regenerate internal decoder state. This typically speeds up recovery from the missing frame(s) in subsequent decoding.

In a specific implementation, the concealment signal is processed to produce a corresponding residual signal that can be substituted for the missing excitation signal history. This can be done by passing the concealment signal through the inverse of a linear prediction synthesis filter (in other words, a linear prediction analysis filter); a sketch of this step appears below. In the synthesis filter technique (as in component (540) described above), the filter processes a reconstructed residual signal using reconstructed linear prediction coefficients to produce a combined, or synthesized, signal. In contrast, the linear prediction analysis filter of this technique processes a decoder-generated signal (here, the concealment signal) using known linear prediction coefficients to produce a residual signal. The linear prediction coefficients of the previous frame can be used in the linear prediction analysis filter. Alternatively, new linear prediction coefficients can be calculated by analyzing the concealment signal; this can be done using the techniques described above with reference to the linear prediction analysis component (330). It is believed that better results are obtained by calculating new linear prediction coefficients for the concealment signal, although it is simpler to use the linear prediction coefficients of the previous frame. Alternatively, the decoder uses some combination of such linear prediction coefficients, or some other method such as bandwidth expansion.

The signal resulting from this technique can be used by the decoder as the memory for subsequent adaptive codebook calculations, in place of the excitation history of the missing frame(s). Because the concealment signal is usually not perfect, the substituted residual signal only approximates the actual memory, or history, that would have been created had the missing frame(s) been decoded. Even so, the decoder is able to recover the correct state much faster than in circumstances where the excitation history for the missing frames is nonexistent. It is thus believed that this technique can significantly improve the quality of the decoded signal following lost packets.

This technique can also be used in the service of bi-directional concealment if the frame(s) following the missing frame(s) have significant memory dependencies. In that case, bi-directional concealment can be achieved as follows: first, perform unidirectional concealment for the missing frame(s); second, use the concealment to regenerate an approximate memory for the following frame; third, decode the following frame using that regenerated memory; and fourth, perform bi-directional concealment of the missing frame(s) using the following frame and the frame preceding the missing frame(s). It is believed that such bi-directional concealment can produce better results than simply using unidirectional concealment, or bi-directional concealment without memory for the following frame.
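A minimal sketch of the analysis-filter step follows; the coefficient array a is assumed to hold [1, a1, ..., ap], taken from the previous frame or re-derived from the concealment signal as discussed above, and the function name is illustrative.

```python
import numpy as np

def regenerate_excitation(concealment, a):
    # Pass the concealment signal through the LPC analysis filter
    # A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p (the inverse of the
    # synthesis filter) to approximate the lost residual history.
    p = len(a) - 1
    residual = np.zeros(len(concealment))
    for n in range(len(concealment)):
        residual[n] = concealment[n] + sum(
            a[k] * concealment[n - k] for k in range(1, p + 1) if n >= k)
    return residual  # substitute adaptive codebook (excitation) history
```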
Having described and illustrated the principles of our invention with reference to the described embodiments, it will be recognized that the described embodiments can be modified in arrangement and detail without departing from such principles. It should be understood that the programs, processes, or methods described herein are not related or limited to any particular type of computing environment, unless indicated otherwise. Various types of general-purpose or specialized computing environments may be used with, or perform operations in accordance with, the teachings described herein. Elements of the described embodiments shown in software may be implemented in hardware, and vice versa. In view of the many possible embodiments to which the principles of our invention may be applied, we claim as our invention all such embodiments as may come within the scope and spirit of the following claims and equivalents thereto.

Claims (20)

1. A computer-implemented method comprising: processing a bit stream for an audio signal, including, if one or more missing frames are encountered while processing the bit stream: selecting a concealment technique from among multiple available signal-dependent concealment techniques, based at least in part on one or more factors; and performing the selected concealment technique; and producing a result.
2. The method according to claim 1, wherein the one or more factors comprise a classification of an available frame preceding the one or more missing frames in the audio signal, and wherein the classification is one of a group of possible classifications comprising voiced and unvoiced.
3. The method according to claim 2, wherein: if the classification of the available previous frame is voiced, then the selected concealment technique comprises using a pitch extension technique to generate a concealment signal; and if the classification of the available previous frame is unvoiced, then the selected concealment technique comprises generating the concealment signal based at least in part on the energy of the available previous frame.
4. The method according to claim 2, wherein the one or more factors further comprise a count of the one or more missing frames, the count indicating how many consecutive frames are missing.
5. The method according to claim 4, wherein: if the classification of the available previous frame and a classification of an available frame following the one or more missing frames in the audio signal are both voiced, and the count of missing frames is less than a threshold value, then the selected concealment technique comprises bi-directional concealment; and if the classification of the available previous frame is voiced and the count of the one or more missing frames is more than the threshold value, then the selected concealment technique comprises unidirectional concealment.
6. The method according to claim 2, wherein, if the classification of the available previous frame is voiced, then the selected concealment technique comprises adding noise to a concealment signal if the concealment signal is longer than a threshold duration.
7. The method according to claim 2, wherein, if the classification of the available previous frame is unvoiced, then the selected concealment technique comprises: identifying multiple segments of the available previous frame; using the multiple previous-frame segments to generate multiple concealment signal segments; and merging the multiple previous-frame segments with the multiple concealment segments to generate a concealment signal.
8. The method according to claim 1, wherein the selected concealment technique comprises: generating an extension signal; and adding noise to the extension signal to produce a concealment signal.
9. The method according to claim 8, wherein the selected concealment technique further comprises: gradually decreasing the energy of the extension signal along at least part of the audio signal; and gradually increasing the energy of the noise along at least part of the audio signal.
10. The method according to claim 1, wherein the multiple available signal-dependent concealment techniques include a unidirectional concealment technique for voiced content, a unidirectional concealment technique for unvoiced content, a bi-directional concealment technique, and a fading concealment technique.
11. The method according to claim 1, wherein the one or more factors include a classification of an available previous frame, a classification of an available following frame, and a count of missing frames between the available previous frame and the available following frame in the audio signal.
12. A computer-implemented method comprising: upon encountering one or more missing frames of a bit stream for an audio signal, generating a concealment signal based at least in part on pitch cycles in one or more previous frames, including introducing pitch jitter; and producing a result.
13. The method according to claim 12, wherein introducing pitch jitter includes adding random or pseudorandom factors to a pitch of the concealment signal.
14. The method according to claim 13, wherein the random or pseudorandom factors include positive and negative values.
15. The method according to claim 13, wherein a separate random or pseudorandom factor is applied to each of plural parts of the concealment signal.
16. The method according to claim 12, wherein the introducing includes increasing the pitch jitter as distance from the one or more previous frames increases.
17. A computer-implemented method comprising: encountering one or more missing frames of a bit stream for an audio signal; producing a concealment signal; encountering a subsequent frame that relies in part on information from the one or more missing frames for decoding; producing substitute information from the concealment signal; and relying on the substitute information, instead of the information from the one or more missing frames, in decoding the subsequent frame.
18. The method according to claim 17, wherein an adaptive codebook index of the subsequent frame indicates a segment of the one or more missing frames to be used in calculating an adaptive codebook contribution for at least a portion of an excitation signal of the subsequent frame.
19. The method according to claim 17, wherein producing the substitute information comprises at least partially encoding at least a portion of the concealment signal.
20. The method according to claim 17, wherein producing the substitute information comprises constructing information representing a residual signal based on the concealment signal.
MXMX/A/2007/015190A 2005-05-31 2007-11-30 Robust decoder MX2007015190A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11142602 2005-05-31

Publications (1)

Publication Number Publication Date
MX2007015190A true MX2007015190A (en) 2008-10-03


Similar Documents

Publication Publication Date Title
US7590531B2 (en) Robust decoder
US7734465B2 (en) Sub-band voice codec with multi-stage codebooks and redundant coding
US5886276A (en) System and method for multiresolution scalable audio signal encoding
US7707034B2 (en) Audio codec post-filter
US20050228651A1 (en) Robust real-time speech codec
KR20140005277A (en) Apparatus and method for error concealment in low-delay unified speech and audio coding
MX2007015190A (en) Robust decoder
Marks Joint source/channel coding for mobile audio streaming