WO2010103855A1 - Speech Decoding Device and Speech Decoding Method
- Publication number: WO2010103855A1
- Application: PCT/JP2010/001793
- Authority: WIPO (PCT)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—… using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/0212—… using orthogonal transformation
- G10L19/04—… using predictive techniques
- G10L19/16—Vocoder architecture
- G10L19/167—Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
- G10L19/18—Vocoders using multiple modes
- G10L19/24—Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
Definitions
- the present invention relates to a speech decoding apparatus and speech decoding method, and particularly to a speech decoding apparatus and speech decoding method used for a scalable codec having a hierarchical structure.
- IP Internet Protocol
- ITU-T International Telecommunication Union-Telecommunication Standardization Sector
- MPEG Motion Picture Experts Group
- CELP Code Excited Linear Prediction
- MPEG-standard AAC, MP3, etc.
- In recent years, speech and music coding technology has made great progress, enabling high-quality communication and music listening.
- In step with the move toward all-IP, seamless, broadband networks, the development and standardization (in ITU-T SG16 WP3) of a scalable codec covering everything from speech to general audio has been promoted.
- This coding technique is a codec with a layered structure: the frequency band of the transmitted speech is organized hierarchically, and each upper layer encodes the quantization error left by the layer below it.
- Patent Document 1 discloses the basic invention of hierarchical coding, in which lower-layer quantization errors are encoded in an upper layer, and also discloses a method of encoding a progressively wider frequency band from the lower layers to the upper layers using sampling-rate conversion.
- In the ITU-T, the five-layer scalable speech codec G.718 was recommended (for example, Non-Patent Document 1).
- When transmitting the code of each layer of the scalable codec, it is conceivable to send each layer in a separate packet.
- In that case, the reception order and reception timing of the per-layer packets of each frame may vary on the decoder side.
- In voice communication, however, the decoder must keep outputting decoded speech without interruption.
- One conceivable algorithm is therefore to provide a fluctuation (jitter) absorbing buffer, store the packets that make up a frame in it, wait until all of those packets have arrived, and then decode the frame synchronously.
- Patent Documents 2 to 5 disclose inventions that deal with this “fluctuation” regarding voice data communication.
- JP-A-8-263096; JP-A-11-41287; JP 2003-87317 A; JP 2000-151694 A; JP 2007-235221 A
- In Patent Documents 2 to 5, however, the audio signal for a predetermined time is transmitted in a single packet, and there is no description of how the codes of a plurality of layers are handled in relation to the above “fluctuation”. That is, in Patent Documents 2 to 5, all layer codes of a frame are received and then decoded at the same time, so a waiting time until all layer codes arrive occurs for every frame. Even if decoding of each layer were started separately, a processing delay would still occur because of the time spent waiting for each layer's code to arrive for each frame. Consequently, in Patent Documents 2 to 5, the decoder's processor cannot execute another process that requires a long time.
- Furthermore, in Patent Documents 2 to 5, if decoding of a frame is only partway done when an interrupt is received from the outside, the synthesized sound of that partly decoded frame cannot be output. It is therefore important to perform the decoding process ahead of schedule and obtain the synthesized sound early.
- Clock delay is a phenomenon in which deviations between the transmission-side clock and the reception-side clock accumulate, eventually producing a large time offset and causing synchronization to be lost.
- In a conventional apparatus, if the receiving side runs ahead, one frame of synthesized sound is inserted during a silent period; if the receiving side lags behind, one frame of synthesized sound is discarded and the synthesized sound of the frame after the discarded one is output. Therefore, in the conventional apparatus as well, it is necessary to perform the decoding process ahead of schedule, generate the synthesized sound early, wait for a frame in a silent period, and then insert or discard a frame of synthesized sound.
- An object of the present invention is to perform decoding as early as possible and generate the synthesized sound ahead of time, so that the processor can be used for another application continuously for a predetermined time, and so that the synthesized sound can be generated without interruption even when the processor is taken over by an emergency interrupt.
- A speech decoding apparatus of the present invention comprises: receiving means that receives the code of each layer generated in a speech encoding apparatus composed of a plurality of layers and stores it over a plurality of frames; selecting means that selects the frame number and layer number corresponding to the code to be decoded next from among the codes of the respective layers that are in an undecoded state; and decoding means that decodes the code of each of the layers.
- The receiving means further stores a decoding state indicating whether the code of each layer is unreceived, undecoded, or decoded, and updates the decoding state when an update instruction is input. When the decoding state is stored or updated in the receiving means, the selecting means searches the decoding state to select the frame number and layer number corresponding to the code that is in the undecoded state and is to be decoded next after the storage or update, and outputs an update instruction for the decoding state to the receiving means.
- The decoding means decodes the code corresponding to the selected frame number and layer number.
- The speech decoding method of the present invention includes: a reception step of receiving the code of each layer generated in a speech encoding apparatus composed of a plurality of layers over a plurality of frames and storing it in a memory; a selection step of selecting the frame number and layer number corresponding to the code to be decoded next from among the codes of the respective layers that are in an undecoded state; and a decoding step of decoding the code of each of the layers.
- The reception step further stores in the memory a decoding state indicating whether the code of each layer is unreceived, undecoded, or decoded, and updates the decoding state of the memory when instructed to do so. When the decoding state is stored or updated in the memory, the selection step searches the decoding state to select the frame number and layer number corresponding to the code that is in the undecoded state and is to be decoded next after the storage or update, and instructs the memory to update the decoding state.
- The decoding step decodes the code corresponding to the selected frame number and layer number.
- According to the present invention, the decoding process is performed as early as possible and the synthesized sound is generated early, so that the processor can be used for another application continuously for a predetermined time, and even when the processor is taken over by an emergency interrupt, the synthesized sound can be generated without interruption.
- A flowchart showing the method of determining the frame number and layer number to be decoded according to Embodiment 1 of the present invention
- A diagram showing an example of the state matrix according to Embodiment 1 of the present invention
- FIG. 7 is a flowchart showing a method for determining a frame number and a layer number to be decoded according to the second embodiment of the present invention.
- A block diagram showing the configuration of the decoding section of the speech decoding apparatus according to Embodiment 2 of the present invention
- FIG. 1 is a block diagram showing a configuration of speech decoding apparatus 100 according to Embodiment 1 of the present invention.
- The speech decoding apparatus 100 is an example of a scalable decoder (a decoder of a scalable, i.e. multilayer, codec).
- In the corresponding speech encoding apparatus (not shown), each frame is composed of a plurality of layers; encoding is performed for each layer to generate a code, and a packet storing that code is generated. The codes are thus transmitted as per-layer packets.
- The speech decoding apparatus 100 mainly comprises a packet receiving unit 101, a frame number storage unit 102, a state/code storage unit 103, a layer selection unit 104, a decoding unit 105, a synthesized sound storage unit 106, a timer unit 107, a time limit determination unit 108, a synthesized sound verification unit 109, a compensation unit 110, a clock delay detection unit 111, a synthesis unit 112, and a speaker 113.
- The processing in the speech decoding apparatus 100 mainly consists of four processes: 150, 160, 170, and 180.
- The four processes 150, 160, 170, and 180 operate independently; however, process 170 has the highest priority, followed by processes 180 and 160, and process 150 has the lowest priority.
- When processes contend, they are executed in order of priority. Each component is described in detail below.
- the packet receiving unit 101 receives a packet from the transmission path and sends data (ACK) indicating that the packet has been received to the transmission path.
- The packet receiving unit 101 extracts a code by decompressing and decoding the received packet. That is, the packet receiving unit 101 receives the packets of each layer of a plurality of frames and extracts the received codes for each layer of those frames. If the code cannot be extracted normally, for example because a bit error is detected, the packet reception unit 101 discards the packet and requests the transmission path to retransmit the packet with the discarded frame number and layer number. Note that the packet receiving unit 101 may also give up acquiring the packet without requesting retransmission.
- The packet receiving unit 101 refers to the reference number stored in the frame number storage unit 102, that is, the frame number of the packet corresponding to the voice currently output from the speaker 113, and calculates the relative frame number of the received code.
- The packet reception unit 101 updates the state matrix and the code data matrix stored in the state/code storage unit 103 using the calculated relative frame number. For example, when the reference number is “761”, the frame number of the received code is “763”, and the layer number of the received code is “2”, the received code belongs to the second frame after the frame currently being synthesized, so the relative frame number “1” is calculated; the layer number “2” is used unchanged.
- The packet receiving unit 101 then extracts the state matrix from the state/code storage unit 103, sets the value of state(1, 2) to “1” (meaning the packet has arrived and the code has been extracted, but not yet decoded), and stores the rewritten state matrix in the state/code storage unit 103 again. The packet receiving unit 101 also extracts the code data matrix from the state/code storage unit 103, stores the code in code(1, 2), and stores the code data matrix back in the state/code storage unit 103. Note that since the frame number is expressed in 10 bits, it takes the values 0 to 1023 cyclically, so the reference number “1023” is followed by “0, 1, 2, ...”; this wraparound must be handled to determine correctly which frame numbers lie in the future.
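The relative-frame-number calculation with 10-bit wraparound described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name and the convention of returning `None` for late or already-synthesized frames are assumptions:

```python
FRAME_MOD = 1024  # frame numbers are 10-bit values, 0..1023, and wrap around

def relative_frame_number(reference, received):
    """Map a received frame number to a 0-based relative index, where
    relative frame 0 is the frame output right after the reference frame.
    Returns None when the code arrived too late to be synthesized."""
    diff = (received - reference) % FRAME_MOD
    if diff == 0 or diff > FRAME_MOD // 2:
        # Equal to, or (modulo wraparound) earlier than, the reference:
        # synthesis of that frame is already finished, so discard it.
        return None
    return diff - 1
```

With reference number 761, a code for frame 763 maps to relative frame 1 as in the example above, and a code for frame 760 is recognized as too late; the wraparound case 1023 → 0 is also handled.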
- The packet receiving unit 101 also discards unusable packets.
- An unusable packet is, in terms of the above example, one whose frame number is “760” when the reference number is “761”: synthesis of that frame has already been completed, so the code arrived too late to be of use. In this case, the packet receiving unit 101 discards codes whose frame number is not greater than the reference number without storing them. This eliminates wasted work in the subsequent decoding process.
- Strictly speaking, “unusable” here means that no synthesized sound can be created from the packet; the code can still be used to build the filter states and prediction states needed to decode subsequent frames. Since such states are important information for decoding, it is desirable to build them whenever a code is obtained.
- the frame number storage unit 102 stores the frame number of the packet corresponding to the voice currently output from the speaker 113 and input from the synthesis unit 112.
- The state/code storage unit 103 stores a state matrix indicating, for each layer of each frame, the communication state and whether the code has been decoded.
- The state matrix is a two-dimensional matrix whose entries take one of three values: “0” indicates that the packet has not reached the speech decoding apparatus 100; “1” indicates that the packet has arrived (that is, the packet reception unit 101 has already extracted the code from it) but the code (encoded information) has not been decoded; and “2” indicates that the code (encoded information) has been decoded.
- the state / code storage unit 103 stores the codes received by the packet receiving unit 101 as a code data matrix. The state matrix and the code data matrix will be described later.
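As a concrete illustration, the state matrix and the code data matrix can be held as plain two-dimensional arrays. The constant names, matrix sizes, and the `store_code` helper below are assumptions for illustration only:

```python
# Three-valued decoding states of the state matrix (names assumed)
NOT_ARRIVED, UNDECODED, DECODED = 0, 1, 2

NUM_FRAMES, NUM_LAYERS = 4, 5  # e.g. a 4-frame window of a 5-layer codec

# state(i, j): state of layer j of relative frame i
state = [[NOT_ARRIVED] * NUM_LAYERS for _ in range(NUM_FRAMES)]
# code(i, j): the extracted code payload of layer j of relative frame i
code = [[None] * NUM_LAYERS for _ in range(NUM_FRAMES)]

def store_code(i, j, payload):
    """What the packet receiving unit does once a code is extracted:
    record the payload and mark the entry as arrived-but-undecoded."""
    code[i][j] = payload
    state[i][j] = UNDECODED
```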
- The layer selection unit 104 refers to the time measured by the timer unit 107 and to the state matrix stored in the state/code storage unit 103, and determines the frame number (relative frame number) and layer number to be decoded next. The layer selection unit 104 then notifies the decoding unit 105 of the determined frame number and layer number. When the layer selection unit 104 receives a decoding end notification from the time limit determination unit 108, it ends the decoding process for the frames within the current predetermined time span (for example, four frames) and starts the decoding process for the frames within the next predetermined time span. The layer selection unit 104 also notifies the time limit determination unit 108 of the start of decoding whenever new decoding is started, and selects frames and layers while observing the decoding result of the synthesized sound input from the synthesis unit 112. The method of determining the frame number and layer number to be decoded is described later.
- The decoding unit 105 refers to the frame number and layer number notified from the layer selection unit 104, and decodes the code (encoded information) of the code data matrix entry code(i, j) stored in the state/code storage unit 103 according to a predetermined algorithm (in this embodiment, ITU-T standard G.718 decoding; the algorithm is described in Non-Patent Document 1 and is not repeated here), obtaining the time-series synthesized sound y_t or the frequency-spectrum synthesized sound z_f.
- The decoding unit 105 then refers to the frame number and writes the obtained synthesized sound y_t or z_f into the synthesized sound matrix syn(i, t) or spec(i, f) stored in the synthesized sound storage unit 106. This process is shown in equations (1) and (2).
- That is, decoding section 105 adds the decoding result (the time-series synthesized sound y_t or the frequency-spectrum synthesized sound z_f) obtained by decoding the code of the layer selected by layer selection section 104, in the frame selected by layer selection section 104 (frame number i in equations (1) and (2)), to the decoding results (the synthesized sound matrix syn(i, t) or spec(i, f)) of the other layers whose codes have already been decoded.
- the decoding unit 105 stores the synthesized sound matrix syn (i, t) or spec (i, f) into which the synthesized sound is written by the above process in the synthesized sound storage unit 106 again.
- the decoding unit 105 extracts the state matrix stored in the state / code storage unit 103, rewrites the frame number i and the layer number j from “1” to “2”, and rewrites the state matrix after rewriting. It is stored again in the state / code storage unit 103. As a result, by referring to the state matrix, it can be seen that the codes of the frame number i and the layer number j have been decoded.
- the decoding unit 105 notifies the time limit determination unit 108 of the end of decoding when the decoding process for a predetermined time is completed.
- the synthesized sound storage unit 106 stores a synthesized sound matrix that is sequentially rewritten in the decoding unit 105 as decoding progresses.
- the sum of layer synthesized sounds is the final synthesized sound, so the synthesized sound storage unit 106 has a synthesized sound buffer of one frame length for each frame.
- The time-series signal and the frequency-spectrum signal are stored in separate buffers. The reason is that in the transform-coding layers mainly used as upper layers, the synthesized sounds of the layers are added in the frequency-spectrum domain (for example, as MDCT (Modified Discrete Cosine Transform) spectra), and only at the end is the sum converted to a time series by an inverse transform (for example, the IDCT (Inverse Discrete Cosine Transform)).
- The timer unit 107 has a function of measuring time: it counts the numerical value T, which indicates the set time, down toward 0 accurately in step with real elapsed time.
- the time of the timer unit 107 can be viewed from the outside, and the time T can be reset.
- the decoding process is performed while the synthesized sound is output from the speaker 113.
- the timer unit 107 has a function of measuring the time until the next synthesis is performed.
- The time limit determination unit 108 refers to the numerical value T indicated by the timer unit 107. If T is equal to or greater than the lower limit value T_limit, the decoding process may continue, so the layer selection unit 104 is notified accordingly; that is, the decoding process continues until T reaches the lower limit value T_limit. When T is less than T_limit, the time limit determination unit 108 notifies the layer selection unit 104 of the end of the decoding process. Upon receiving a decoding-start notification from the layer selection unit 104, the time limit determination unit 108 begins comparing the value T indicated by the timer unit 107 with the lower limit value T_limit.
- The lower limit value T_limit is a predetermined constant. In the timer unit 107, the set time decreases toward 0; once T becomes smaller than a certain value, the next synthesized-sound output will not be ready in time unless processing moves on from decoding to generating the synthesized sound.
- The lower limit value T_limit is the constant expressing that threshold.
- Specifically, the lower limit value T_limit is “(time required for the processing of the synthesized sound verification unit 109) + (maximum time required by the compensation unit 110) + (time required by the synthesis unit 112 to output the synthesized sound to the speaker 113) + (maximum time required to decode one layer)”.
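The time-budget rule above amounts to comparing T with the sum of the four worst-case times. A minimal sketch, with the function and argument names assumed:

```python
def decoding_may_continue(T, t_verify, t_compensate, t_output, t_decode_layer):
    """Return True while there is still enough time left on the timer to
    decode one more layer and still verify, compensate, and output the
    next frame's synthesized sound on schedule."""
    T_limit = t_verify + t_compensate + t_output + t_decode_layer
    return T >= T_limit
```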
- The synthesized sound verification unit 109 examines the state matrix from layer 0 toward the upper layers. If some layer is not “2”, then none of the layers above it has been decoded either (their codes are missing), so compensation processing may be needed for the undecoded portion. Compensation is necessary when there is no synthesized sound at all, from layer 0 upward, or when the bandwidth changes at a layer, as in frequency-scalable configurations. In the remaining cases, the sound quality degradation from leaving the lower layer's coding distortion uncorrected (because the upper-layer code (encoded information) is absent) is less audible than the degradation that compensation would introduce, so compensation is not performed.
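The layer scan performed by the synthesized sound verification unit can be sketched as follows. This simplification checks only the "nothing decoded from layer 0" case and deliberately ignores the frequency-scalable bandwidth-change case mentioned above; the function name is an assumption:

```python
def needs_compensation(frame_state):
    """Scan one frame's layer states from layer 0 upward. The first layer
    whose state is not 2 ("decoded") marks where decoding stopped; only
    when even layer 0 is undecoded is concealment (compensation) needed."""
    for j, s in enumerate(frame_state):
        if s != 2:
            return j == 0  # no synthesized sound at all for this frame
    return False           # every layer decoded: nothing to compensate
```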
- the compensation unit 110 performs compensation processing on the synthesized sound input from the synthesized sound verification unit 109.
- A detailed description of the compensation processing is omitted.
- The clock delay detection unit 111 monitors the magnitude of the clock deviation between the voice encoding device (not shown) on the transmission side and the speech decoding apparatus 100 on the reception side, sets a flag according to the deviation, and instructs the synthesis unit 112 with that flag. Specifically, the clock delay detection unit 111 sends the flag “0” when there is no clock deviation, the flag “1” when the deviation is greater than a predetermined value but does not exceed one frame, and the flag “2” when the deviation exceeds one frame. In this way, the clock delay detection unit 111 reports the clock deviation by switching the flag and thereby instructs the synthesis unit 112.
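The three-valued flag can be derived from the measured drift as sketched below; the parameter names and the use of milliseconds are assumptions, not from the patent:

```python
def clock_drift_flag(drift_ms, frame_ms, threshold_ms):
    """Map the measured transmit/receive clock deviation to the flag
    sent to the synthesis unit: 0 = no significant deviation,
    1 = above the threshold but within one frame, 2 = over one frame."""
    magnitude = abs(drift_ms)
    if magnitude <= threshold_ms:
        return 0
    if magnitude <= frame_ms:
        return 1
    return 2
```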
- When the synthesized sound is input from the synthesized sound verification unit 109, the synthesizing unit 112 immediately transmits it to the output buffer of the speaker 113. After synthesizing one frame, the synthesizing unit 112 advances all states by one frame. Specifically, the synthesis unit 112 adds 1 to the reference number stored in the frame number storage unit 102; if the resulting value exceeds the upper limit, it stores 0 in the frame number storage unit 102, and otherwise it stores the resulting value.
- the state matrix and code data matrix stored in the state / code storage unit 103 and the synthesized sound matrix stored in the synthesized sound storage unit 106 are shifted and initialized.
- The synthesizing unit 112 stores the memory-shifted and initialized state matrix and code data matrix in the state/code storage unit 103 again, and stores the memory-shifted and initialized synthesized sound matrix in the synthesized sound storage unit 106 again.
- the method of memory shift and initialization is shown in equation (3).
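Equation (3) is not reproduced here, but the memory shift and initialization it describes can be sketched as dropping frame 0 from each matrix and appending a freshly initialized newest frame. The function and its in-place list representation are assumptions for illustration:

```python
def memory_shift(state, code, syn, num_layers):
    """Advance all matrices by one frame after frame 0 has been output:
    frame 1 becomes the new frame 0, and the slot for the newest frame is
    re-initialized (state "0" = not arrived, no code, no synthesized sound)."""
    state.pop(0)
    code.pop(0)
    syn.pop(0)
    state.append([0] * num_layers)
    code.append([None] * num_layers)
    syn.append(None)
```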
- The synthesis unit 112 also resets the time T of the timer unit 107 to the time required for the speaker 113 to output one frame of synthesized sound, and constantly monitors the signal sent from the clock delay detection unit 111. When an adjustment instruction is received from the clock delay detection unit 111, the synthesis unit 112 checks the output power of the synthesized sound before transmitting it to the output buffer of the speaker 113. When the synthesis unit 112 determines that output has advanced too far and that the power of the synthesized sound indicates a silent section (hereinafter “state 1”), it first sends one frame of silence to the speaker 113 and then sends the synthesized sound of the current frame.
- When the synthesis unit 112 determines that output is delayed, that the power of the synthesized sound indicates a silent section, and that two or more frames of synthesized sound have already been decoded in the synthesized sound storage unit 106 (hereinafter “state 2”), it discards the synthesized sound of the current frame without outputting it and sends the second frame's synthesized sound to the speaker 113.
- In state 2, the combining unit 112 then performs the memory shift processing by one more frame.
- If neither state 1 nor state 2 holds, the synthesis unit 112 continues to wait until state 1 or state 2 is reached, that is, until a silent period arrives, and performs the frame-output adjustment at that point.
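The adjustment decision in states 1 and 2 can be summarized as below. This is a schematic of the decision only (the real unit measures signal power and streams to the speaker's output buffer); the function name, boolean arguments, and the "silence" placeholder are assumptions:

```python
def adjust_output(receiver_ahead, is_silent, decoded_frames):
    """Return the list of frames to send to the speaker when a clock
    adjustment is requested. decoded_frames lists the already-decoded
    synthesized sounds, oldest first."""
    if receiver_ahead and is_silent:
        # State 1: insert one frame of silence, then the current frame.
        return ["silence"] + decoded_frames[:1]
    if not receiver_ahead and is_silent and len(decoded_frames) >= 2:
        # State 2: drop the current frame, output the following one.
        return decoded_frames[1:2]
    return []  # Neither state holds: keep waiting for a silent period.
```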
- The speaker 113 has an output buffer of two frames, and has a function of accepting input into one output buffer while outputting the other through the digital-to-analog (D/A) converter. If the output buffer is instead composed of a single toggle buffer slightly longer than the frame length, memory capacity can be saved: in speech decoding apparatus 100 according to the present embodiment, the next frame of synthesized sound is not supplied until the frame immediately before it has been output through the D/A converter, and exploiting this property reduces the required memory capacity.
- FIG. 2 is a flowchart showing a method for determining a frame number and a layer number to be decoded.
- the layer selection unit 104 selects the next frame (ST205) and performs the determination of ST202.
- Next, the layer selection unit 104 refers to the state matrix and determines whether the layer number j of the selected frame number i holds the number “1”, indicating that the code has arrived but has not yet been decoded (ST206). Immediately before referring to state(i, j), however, the contents of the state matrix in the state/code storage unit 103 must be re-read. This is because the state matrix may be rewritten at any time by the packet reception unit 101 of process 150, which runs with priority whenever a packet arrives at the packet reception unit 101.
- If the number “1” is present (ST206: YES), the layer selection unit 104 outputs the frame number i and the layer number j to the decoding unit 105.
- Otherwise, the layer selection unit 104 checks for the number “2”, indicating that the layer number j of the selected frame number i has already been decoded, and determines whether the number “2” is present (ST207).
- the layer selection unit 104 selects the next layer (ST208) and performs the determination of ST204.
- the layer selection unit 104 selects the next frame (ST205) and performs the determination of ST202.
- In this way, the layer selection unit 104 refers to the state matrix and, for each frame, keeps searching from the lower layers toward the upper layers for the number “1”, which indicates a code that has arrived but has not yet been decoded. If the layer selection unit 104 detects the number “0”, indicating that the packet has not arrived, it moves on to the next frame, because the higher layers of that frame cannot be decoded either. That is, the layer selection unit 104 searches for a specific layer (a layer whose state matrix value is “1”, and which need not be the lowest layer) such that the extracted codes of all lower layers of that frame have already been decoded (state matrix value “2”) while the extracted code of that layer itself has not been decoded.
- the layer selection unit 104 searches for a frame from frame 0 in the direction of time progress. That is, the layer selection unit 104 performs the search in order from the frame with the earliest time (that is, frame 0) among the plurality of frames.
- the layer selection unit 104 starts searching for the next frame when the layer number exceeds the number of layers, and returns to the first frame and continues the search when the frame number exceeds the number of frames.
- This process is theoretically an infinite loop, but when the numerical value T in the timer unit 107 of the preferentially processed process 170 becomes smaller than the lower limit value T limit, the next synthesized sound must be output as an interrupt process. Therefore, layer selection section 104 determines whether or not numerical value T in timer section 107 is smaller than lower limit value T limit (ST209). If numerical value T in timer section 107 is smaller than lower limit value T limit, an interrupt process is notified to the time limit determination unit 108.
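As a minimal sketch (in Python, with illustrative names; the real apparatus interleaves this loop with the packet-reception and output processes), the wrap-around search over the state matrix described above might look like the following, where the state values 0 / 1 / 2 mean not-arrived / arrived-but-undecoded / decoded:

```python
NUM_FRAMES = 4   # frames 0..3 held in the matrices
NUM_LAYERS = 5   # layers 0..4 of the scalable codec

def select_next(state, timer_value, t_limit):
    """Return (frame, layer) of the next code to decode, or None when the
    timer falls below its lower limit and output must be handled instead."""
    i, j = 0, 0
    while True:
        if timer_value() < t_limit:      # ST209: output deadline reached
            return None
        if j >= NUM_LAYERS:              # past the top layer: next frame
            i, j = i + 1, 0
            continue
        if i >= NUM_FRAMES:              # past the last frame: wrap around
            i, j = 0, 0
            continue
        s = state[i][j]                  # re-read the freshest state here
        if s == 1:                       # arrived but not yet decoded
            return (i, j)
        if s == 2:                       # already decoded: one layer up
            j += 1
        else:                            # 0: packet missing, skip this frame
            i, j = i + 1, 0
```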
- FIG. 3 is a diagram illustrating an example of a state matrix.
- frame 0 holds the state of each layer of the code of the synthesized sound to be output next.
- Frame 1 is the state of each layer of the code of the synthesized sound output after frame 0.
- the state matrix stores the state of the code of the synthesized sound to be output.
- FIG. 4 is a diagram illustrating an example of the code data matrix code (i, j).
- FIG. 4 shows the codes received in the case of the state matrix state of FIG. 3.
- frames and layers where packets reach the speech decoding apparatus 100 are shaded, and frames and layers where packets do not reach the speech decoding apparatus 100 are outlined.
- a synthesized sound (decoded sound) is obtained.
- frame 0 is the code of the synthesized sound to be output next.
- Frame 1 is a code of the synthesized sound output after frame 0. In this way, the code of the synthesized sound to be output is stored.
- FIG. 5 and 6 are diagrams showing an example of the synthesized sound matrix.
- FIG. 5 is a diagram showing the synthesized sound matrix syn (i, t) in the case of the code data matrix state of FIG. 4.
- FIG. 6 is a diagram showing the synthesized sound matrix spec (i, f) in the case of the code data matrix state of FIG. 4.
- frame 2 has no synthesized sound because the code of layer 0 has not yet arrived
- frame 3 has no synthesized sound because the code has arrived but has not been decoded yet.
- Frame 0 and frame 1 have a synthesized sound because the code of layer 0 is decoded.
- frame 0 is the synthesized sound to be output next.
- Frame 1 is a synthesized sound output after frame 0, and thus stores a synthesized sound to be output in the future. That is, in the present embodiment, it is possible to decode not only the most recently output frame 0 but also the synthesized sound of frame 1 output next to frame 0. Note that the speaker 113 continues to output one frame-length synthesized sound during all processes.
- the processor can be used for another purpose continuously for a predetermined time, and even when the processor is taken for another purpose by an emergency interrupt, the synthesized sound can be generated without interruption.
- the speech decoding apparatus further determines whether or not each frame is a silent section, and selects a frame and a layer to be decoded depending on whether or not each frame is a silent section.
- FIG. 7 is a block diagram showing a configuration of speech decoding apparatus 200 according to the present embodiment. Note that speech decoding apparatus 200 shown in FIG. 7 has the same basic configuration as speech decoding apparatus 100 (FIG. 1) shown in Embodiment 1, and the same components are assigned the same reference numerals. The description is omitted.
- the silence flag storage unit 201 stores the silence flag created by the decoding unit 203 as decoding progresses.
- the silence flag sflag (i) is a three-level numerical value indicating the state of the frame with frame number i. Specifically, “0” indicates that the code (encoded information) has not been decoded up to layer 2; “1” indicates that the code (encoded information) has been decoded up to layer 2, there is a synthesized sound, and the determination result of whether it is sound or silence (hereinafter referred to as “sound / silence determination”) indicates sound; and “2” indicates that the code (encoded information) has been decoded up to layer 2 and the determination result of the sound / silence determination of the synthesized sound is silence.
- the layer selection unit 202, like the layer selection unit 104 of the first embodiment, refers to the time measured by the timer unit 107, and determines the frame number (relative frame number) and the layer number to be decoded next with reference to the state matrix stored in the state / code storage unit 103 and the silence flag stored in the silence flag storage unit 201. Then, the layer selection unit 202 notifies the decoding unit 203 of the determined frame number and layer number. The method for determining the frame number and the layer number to be decoded in the layer selection unit 202 will be described later.
- the decoding unit 203, like the decoding unit 105 of the first embodiment, refers to the frame number and the layer number notified from the layer selection unit 202 and decodes the code (encoded information) of the code data matrix code (i, j) stored in the state / code storage unit 103 according to a predetermined algorithm (in this embodiment, ITU-T standard G.718; the algorithm is described in Non-Patent Document 1) to obtain a time-series synthesized sound y_t or a frequency spectrum synthesized sound z_f.
- the decoding unit 203 then refers to the frame number and writes the obtained synthesized sound y_t or z_f into the synthesized sound matrix syn (i, t) or spec (i, f) stored in the synthesized sound storage unit 106, according to the equations (1) and (2).
- In this embodiment, the five-layer (layer 0 to layer 4) scalable codec of ITU-T standard G.718 is used.
- layer 2 is the layer at which the synthesized sound changes from narrowband to wideband, and also the layer at which the synthesized sound changes from a time series to a frequency spectrum. Therefore, the decoding unit 203 writes the synthesized sounds of layers 0 and 1 into syn (i, t), the time-series synthesized sound matrix, and writes the synthesized sounds of layers 2 to 4 into spec (i, f), the frequency spectrum synthesized sound matrix. When the layer 2 synthesized sound is written, the matrix memory is cleared.
- the final synthesized sound for layer 2 and above is calculated by transforming the frequency spectrum spec (i, f) into a time series by the IMDCT (Inverse Modified Discrete Cosine Transform) and adding the transformed signal to the time-series synthesized sound matrix syn (i, t).
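As an illustrative sketch (not the actual G.718 transform; the scaling convention is an assumption, and the windowing and overlap-add between frames that a real codec performs are omitted), the IMDCT-and-add step can be written as:

```python
import numpy as np

def imdct(spectrum):
    """Naive O(N^2) inverse MDCT: N spectral bins -> 2N time samples."""
    n_bins = len(spectrum)
    n = np.arange(2 * n_bins)[:, None]
    k = np.arange(n_bins)[None, :]
    # Direct IMDCT basis; real implementations use an FFT-based fast form.
    basis = np.cos(np.pi / n_bins * (n + 0.5 + n_bins / 2) * (k + 0.5))
    return (2.0 / n_bins) * basis @ spectrum

def add_upper_layers(syn_row, spec_row):
    """Add the IMDCT of spec(i, f) into syn(i, t) (lengths assumed to match)."""
    time_sig = imdct(spec_row)[: len(syn_row)]
    return syn_row + time_sig
```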
- the decoding unit 203 stores the synthesized sound matrix syn (i, t) or spec (i, f) into which the synthesized sound has been written by the above process in the synthesized sound storage unit 106 again. Then, the decoding unit 203 extracts the state matrix stored in the state / code storage unit 103, rewrites the value at frame number i and layer number j from “1” to “2”, and stores the rewritten state matrix again in the state / code storage unit 103. As a result, it can be seen by referring to the state matrix that the code of frame number i and layer number j has been decoded.
- the decoding unit 203 determines whether or not the frame with the frame number i is a silent section (that is, performs the sound / silence determination of the synthesized sound). Then, the decoding unit 203 outputs a silence flag sflag (i) indicating the determination result of the sound / silence determination for the synthesized sound of the frame number i to the silence flag storage unit 201.
- A five-layer (layer 0 to layer 4) scalable codec of ITU-T standard G.718 is used, with layers 0 and 1 set to a narrow band (200 Hz to 3.4 kHz) and layers 2 to 4 set to a wide band (10 Hz to 7 kHz). Therefore, when decoding is performed in order from layer 0, the decoding unit 203 can obtain a wideband synthesized sound only after decoding up to layer 2. Accordingly, the decoding unit 203 can determine whether the frame is sound or silence once decoding up to layer 2 has been performed.
- Since the decoding unit 203 cannot detect the presence or absence of components in the high frequency band from layers 0 and 1 alone, it cannot perform the sound / silence determination of the frame at that point. Accordingly, the decoding unit 203 performs the sound / silence determination based on the synthesized sound obtained by decoding up to layer 2, that is, based on the time-series synthesized sound matrix syn (i, t) and the frequency spectrum synthesized sound matrix spec (i, f). Then, the decoding unit 203 expresses the silence flag indicating the determination result as a numerical value (“0” to “2”) for each frame (here, 4 frames (frames 0 to 3)). Details of the sound / silence determination processing in the decoding unit 203 will be described later.
- The spectrum extracted from spec (0, f) is inversely transformed by the IMDCT, and the synthesized sound obtained as the addition result (syn (0, t), 0 ≤ t < L) is output to the synthesis unit 205.
- the synthesized sound verification unit 204 refers to the states of the state matrix from layer 0 toward the upper layers. At this time, if there is a layer whose state matrix state is not “2”, none of the layers above it have been decoded because their codes are absent, so there is a possibility that compensation processing must be performed for the undecoded layers. Here, compensation processing is necessary when there is no synthesized sound from layer 0, or when the frequency band changes at layer 2, as in frequency-scalable coding.
- the sound quality deterioration that occurs when the coding distortion of the lower layer cannot be corrected, owing to the absence of the upper layer code (encoded information), is more audible than the sound quality deterioration due to compensation.
- Similar to the synthesis unit 112 of the first embodiment, the synthesis unit 205 immediately transmits the synthesized sound to the output buffer of the speaker 113 when the synthesized sound is input from the synthesized sound verification unit 204. Then, after synthesizing one frame, the synthesis unit 205 advances all states by one frame. As in the first embodiment, the memory shift and initialization of the state matrix and code data matrix stored in the state / code storage unit 103 and of the synthesized sound matrix stored in the synthesized sound storage unit 106 are performed according to the equation (3).
- the synthesizing unit 205 stores the state matrix and the code data matrix that have undergone the memory shift and initialization again in the state / code storage unit 103, and stores the synthesized sound matrix that has undergone the memory shift and initialization again in the synthesized sound storage unit 106. Further, the synthesizing unit 205 performs the memory shift and initialization of the silence flag stored in the silence flag storage unit 201, and stores the shifted and initialized silence flag again in the silence flag storage unit 201. The method of memory shift and initialization of the silence flag is shown in the equation (4).
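A minimal sketch of the one-frame memory shift and initialization (illustrative names; the equations (3) and (4) in the text define the actual operation): frame 0 is dropped, the remaining frames move up by one, and the newest frame is reset to the not-arrived / not-yet-determined state:

```python
def shift_frames(state, sflag, num_layers=5):
    """Advance the state matrix and the silence flags by one frame: discard
    frame 0, shift the remaining frames up, and initialize the newest frame
    to all zeros (no packet arrived, silence flag not yet determined)."""
    state.pop(0)
    state.append([0] * num_layers)
    sflag.pop(0)
    sflag.append(0)
    return state, sflag
```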
- the synthesizing unit 205 resets the time T of the timer unit 107 to the time required for the speaker 113 to output one frame of synthesized sound, similarly to the synthesizing unit 112 of the first embodiment. Further, the synthesis unit 205 constantly monitors the signal sent from the clock delay detection unit 111. When an adjustment instruction indicating “too advanced” is received from the clock delay detection unit 111 and the silence flag from the silence flag storage unit 201 is “2” (a silent section) (hereinafter referred to as “state 1”), the synthesis unit 205 first sends a silent section to the speaker 113 and sends the synthesized sound of the current frame thereafter.
- When the adjustment instruction indicates “delayed”, the silence flag from the silence flag storage unit 201 is “2”, and the next synthesized sound has already been decoded (hereinafter referred to as “state 2”), the synthesis unit 205 discards the synthesized sound of the current frame without outputting it and sends the second synthesized sound to the speaker 113. In the case of state 2, the synthesis unit 205 performs the memory shift processing by one additional frame. Further, when an adjustment instruction is received from the clock delay detection unit 111 but neither state 1 nor state 2 applies, the synthesis unit 205 keeps waiting until state 1 or state 2 is reached, that is, until a frame whose silence flag is “2” (a silent section) arrives, and performs the frame output adjustment when it becomes possible to do so.
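The branching on the adjustment instruction can be summarized as follows (a hedged sketch; the state names come from the text, while the function and argument names are illustrative):

```python
def adjust_output(instruction, sflag0, frames_ready):
    """Decide how to handle clock drift for the current frame.
    instruction: 'too_advanced' or 'delayed' (from the clock delay detector);
    sflag0: silence flag of the current frame (2 means a silent section);
    frames_ready: number of synthesized frames already decoded."""
    if instruction == 'too_advanced' and sflag0 == 2:          # state 1
        return 'insert_silence_then_play'                      # pad output first
    if instruction == 'delayed' and sflag0 == 2 and frames_ready >= 2:  # state 2
        return 'drop_frame_play_next'                          # catch up
    return 'wait_for_silent_section'                           # defer adjustment
```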
- FIG. 8 is a flowchart showing a method for determining the frame number and layer number of the code to be decoded.
- FIG. 8 includes the same processes as those in the flowchart shown in FIG. 2.
- The reason the contents of the silence flag are re-read at each determination is that the determination must use the latest contents of the silence flag, and the contents of the silence flag storage unit 201 may have been rewritten by the decoding unit 203 and the synthesizing unit 205.
- When the layer number is larger than “4”, or when the layer number j is larger than “2” and the silence flag sflag (i) is “2” (ST304: YES), the layer selecting unit 202 selects the next frame (ST305) and performs the determination of ST302.
- the layer selection unit 202 refers to the state matrix state (i, j) and determines whether the layer number j of the selected frame number i holds the number “1”, indicating that the packet has arrived (that is, the code (encoded information) has been extracted) but the code (encoded information) has not yet been decoded (ST306).
- At this time, immediately before the reference, the contents of the state matrix in the state / code storage unit 103 must be re-read, in the same manner as in the first embodiment (ST206 and 207 shown in FIG. 2).
- If there is a number “1” (ST306: YES), the layer selection unit 202 outputs the frame number i and the layer number j to the decoding unit 203.
- Next, the layer selection unit 202 searches for the number “2”, which indicates that the layer number j of the selected frame number i has been decoded, and determines whether or not there is a number “2” (ST307).
- the layer selection unit 202 selects the next layer (ST308), and performs the determination of ST304.
- the layer selection unit 202 selects the next frame (ST305) and performs the determination of ST302.
- the layer selection unit 202 refers to the state matrix and the silence flag and, for each frame, searches state (i, j) from the lower layers toward the upper layers for the number “1”, which indicates that a packet has arrived but the code (encoded information) extracted from the packet has not yet been decoded. At this time, if the layer selection unit 202 detects the number “0”, indicating that the packet has not arrived, it searches the next frame, because the frame in which “0” was detected cannot be decoded even if higher layers are searched.
- As the algorithm for determining the frame number and the layer number of the code to be decoded, the layer selection unit 202 uses, as shown in FIG. 8, two nearly identical algorithms connected in series (the algorithm of ST301 to 308 shown in FIG. 8 and the algorithm of ST201 to 208).
- In ST301 to 308 shown in FIG. 8, when layer selection section 202 determines that the frame being searched is a silent section (ST304: YES shown in FIG. 8), it stops searching that frame and starts searching the next frame.
- Specifically, when the layer number j is greater than 2 (layer 3 or higher) and the silence flag sflag (i) is “2” (a silent section), the layer selection unit 202 stops searching that frame and starts searching the next frame. That is, among the plurality of frames, the layer selection unit 202 searches for a layer with the state matrix number “1” in frames other than those whose silence flag sflag (i) indicates “2” (frames whose sound / silence determination result is a silent section).
- In other words, the layer selection unit 202 judges that the power of the synthesized sound obtained by decoding the codes of layers 3 and 4 would be very small, and therefore that the necessity of decoding the codes of layers 3 and 4 is low.
- If no layer having a code to be decoded is found in the frame search of ST301 to 308 (ST302: YES shown in FIG. 8), layer selection section 202 searches again for the number “1”, which indicates that a code has arrived but has not been decoded, referring only to the state matrix state (i, j) and without referring to sflag (i).
- That is, in ST301 to 308 shown in FIG. 8, layer selection section 202 refers to the silence flag to lower the decoding priority of the upper layers (layers 3 and 4) of frames that are silent sections (that is, to skip decoding of the upper-layer codes) and searches other frames. Thereafter, if no layer having a code to be decoded is found, layer selection section 202 searches all layers for a layer having a code to be decoded in ST201 to 208.
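A hedged sketch of this two-stage search (illustrative names; pass 1 corresponds to ST301 to 308 with the silence flag honored, pass 2 to ST201 to 208 ignoring it; the timer handling of the full algorithm is omitted):

```python
def find_decodable(state, sflag, num_frames=4, num_layers=5):
    """Return (frame, layer) of the next code to decode, or None.
    state[i][j]: 0 = not arrived, 1 = arrived/undecoded, 2 = decoded.
    sflag[i]: 2 marks a frame judged to be a silent section."""
    for use_flag in (True, False):       # pass 1 honors sflag, pass 2 does not
        for i in range(num_frames):
            for j in range(num_layers):
                if use_flag and j > 2 and sflag[i] == 2:
                    break                # silent frame: skip layers 3 and 4
                s = state[i][j]
                if s == 1:
                    return (i, j)
                if s == 0:
                    break                # packet missing: frame undecodable
    return None
```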
- FIG. 9 is a block diagram illustrating, within the internal configuration of decoding unit 203 according to the present embodiment, the components that perform the sound / silence determination processing.
- the amplitude search unit 231 extracts the time-series synthesized sound syn (i, t) and the frequency spectrum synthesized sound spec (i, f) from the synthesized sound storage unit 106. Then, the amplitude search unit 231 searches for the maximum amplitude of each of the synthesized sounds syn (i, t) and spec (i, f). Specifically, the amplitude search unit 231 finds the maximum amplitude of each by comparing the absolute values of the signal values of syn (i, t) and of spec (i, f).
- the maximum amplitude of syn (i, t) is set to maxsyn (i)
- the maximum amplitude of spec (i, f) is set to maxspec (i).
- the amplitude search unit 231 outputs maxsyn (i) and maxspec (i), which are search results, to the comparison unit 233.
- the constant storage unit 232 stores constants for the synthesized sounds syn (i, t) and spec (i, f).
- the constant for syn (i, t) is Msyn
- the constant for spec (i, f) is Mspec.
- the two constants Msyn and Mspec are set in advance to a sufficiently small value that can be determined to be audibly silent.
- the comparison unit 233 compares maxsyn (i) and maxspec (i) input from the amplitude search unit 231 with the constants Msyn and Mspec stored in the constant storage unit 232, respectively. That is, the comparison unit 233 compares maxsyn (i) with Msyn and compares maxspec (i) with Mspec. Then, as a result of the comparison, if maxsyn (i) is smaller than Msyn and maxspec (i) is smaller than Mspec, the comparison unit 233 determines that the frame of frame number i is silent, and sets it as a silence flag sflag (i). “2” is generated.
- Otherwise, the comparison unit 233 determines that the frame with the frame number i is sound and generates “1” as the silence flag sflag (i). Then, the comparison unit 233 outputs the generated silence flag sflag (i) to the silence flag storage unit 201.
- the decoding unit 203 determines that the frame is silent only when all the synthesized sounds of a certain frame are smaller than a preset amplitude (constant). In other words, the decoding unit 203 determines that the frame is sound if any one of the synthesized sounds of a certain frame has a predetermined amplitude (constant) or more. In addition, the decoding unit 203 individually performs the sound / silence determination of the time-series synthesized sound syn (i, t) and the frequency spectrum synthesized sound spec (i, f), and determines that both are silent. Only the frame with frame number i is determined to be silent.
- If the decoding unit 203 determines that either one of the time-series synthesized sound syn (i, t) and the frequency spectrum synthesized sound spec (i, f) is sound, the decoding unit 203 determines that the frame with the frame number i is sound.
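A minimal sketch of this decision, assuming thresholds for Msyn and Mspec (their actual values, "sufficiently small to be judged audibly silent", are not given in the text):

```python
# Assumed threshold values; the text only says they are set small enough
# that anything below them is audibly silent.
M_SYN, M_SPEC = 1e-3, 1e-3

def silence_flag(syn_row, spec_row):
    """Return 2 (silent) only if BOTH maximum amplitudes fall below their
    thresholds, otherwise 1 (sound)."""
    maxsyn = max(abs(x) for x in syn_row)
    maxspec = max(abs(x) for x in spec_row)
    return 2 if (maxsyn < M_SYN and maxspec < M_SPEC) else 1
```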
- the speech decoding apparatus 200 estimates the importance of the codes (encoding information) of the layers 3 and 4 by using the silence flag obtained when the decoding unit 203 decodes the layer 2. Specifically, speech decoding apparatus 200 estimates that the significance of the codes of layers 3 and 4 is small when the silence flag indicates a silence interval (in the case of “2”). This is because in the scalable codec, the encoding error (coding distortion) in the lower layer is encoded in the upper layer, and therefore the expected power value becomes smaller in the upper layer.
- Accordingly, when the silence flag indicates a silent section, even if the synthesized sound obtained by decoding the codes (encoded information) of layers 3 and 4, the layers above layer 2, is added to the lower-layer synthesized sound, the addition result is highly likely to be determined to be a silent section as well. Therefore, speech decoding apparatus 200 lowers the decoding priority for the codes of layers 3 and 4 of a frame whose silence flag is “2” (that is, a silent section), in other words skips decoding of the codes of layers 3 and 4, so that scalable codec decoding can be performed efficiently.
- FIG. 10 is a diagram illustrating an example of the silence flag sflag (i).
- FIG. 10 shows silence flags stored in the silence flag storage unit 201 in the state matrix state shown in FIG. 3 and the code data matrix state shown in FIG.
- For frame 0 shown in FIG. 10, the sound / silence determination is performed because the codes up to layer 2 have been decoded, as shown in FIG. 3. In FIG. 10, frame 0 is “1”, indicating that the frame is sound. On the other hand, for frames 1 to 3 shown in FIG. 10, the codes up to layer 2 have not been decoded, as shown in FIG. 3. Therefore, in FIG. 10, frames 1 to 3 are “0”, indicating that the codes up to layer 2 have not been decoded.
- When searching for a layer having a code to be decoded, the speech decoding apparatus, in the same manner as in Embodiment 1, does not search the layers above a layer whose packet has not arrived in each frame, but searches the next frame instead. Furthermore, when searching for a layer having a code to be decoded, if a certain layer of a frame is determined to be a silent section, the speech decoding apparatus does not search the layers above that layer, but searches the next frame. Therefore, according to the present embodiment, the decoding process is performed as early as possible and the synthesized sound is generated earlier than in Embodiment 1, so that the processor can be used for another purpose continuously for a predetermined time, and the synthesized sound can be generated without interruption even when the processor is taken for another purpose by an emergency interrupt.
- the present invention is not limited to this, and can be applied to scalable codecs having various numbers of layers.
- the number of frames may be changed in various ways for the convenience of the system. If a code data matrix covering many frames is used, there is room in the frame count even when packets arrive far apart, so the probability that high-quality decoding is performed using all of the transmitted code data increases, and packets are not wasted. If the packet processing delay must be reduced as much as possible because of the performance requirements of the system, the number of frames may be reduced to shorten the delay. That is, the present invention does not depend on the number of frames.
- the present invention is not limited to this; the present invention is also effective in a specification in which the maximum number of layers used by the speech decoding apparatus is fixed and a synthesized sound generated from the decoding results of the codes up to that maximum number of layers is output.
- the packet reception unit 101 may discard unnecessary higher layer packets. That is, the present invention does not depend on the difference in the number of layers between the speech decoding apparatus and the speech encoding apparatus.
- the function of the synthesis unit 112 (or the synthesis unit 205) performs a memory shift when updating the matrix stored in the state / code storage unit 103 and the synthesized sound storage unit 106.
- the present invention is not limited to this, and a configuration in which the memory of each matrix is cyclically used for the frame and the memory shift is not performed may be employed. As a result, the amount of calculation for memory shift can be saved.
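A hedged sketch of the cyclic alternative mentioned above (illustrative names): keep a base index into fixed storage so that advancing by one frame is O(1) bookkeeping instead of copying every matrix row:

```python
class CyclicFrames:
    """Fixed storage for per-frame rows, addressed cyclically so that no
    memory shift is needed when the output advances by one frame."""

    def __init__(self, num_frames, num_layers):
        self.num_frames = num_frames
        self.base = 0                                   # storage row of frame 0
        self.state = [[0] * num_layers for _ in range(num_frames)]

    def row(self, i):
        """Storage row holding relative frame number i."""
        return self.state[(self.base + i) % self.num_frames]

    def advance(self):
        """O(1) frame advance: recycle the old frame-0 row as the newest frame."""
        old = self.state[self.base]
        for j in range(len(old)):
            old[j] = 0                                  # reset the recycled row
        self.base = (self.base + 1) % self.num_frames
```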
- the present invention is not limited to this, and the present invention is also effective when the codes of several layers are transmitted together. This is because the matrices stored in the state / code storage unit 103 and the synthesized sound storage unit 106 may simply be read and written collectively. Moreover, even without collective reading and writing, the bundled codes can still be handled as the codes of the individual layers.
- the decoding results of the layer 3 and 4 packets cannot be used for synthesis.
- the present invention is not limited to this, and the decoding results of the packets of layers 3 and 4 can be used to create a filter or a prediction state to be used when decoding subsequent frames. Thereby, the decoding performance of subsequent frames can be ensured.
- the speech decoding apparatus searches for a layer having the state matrix number “1” in order from an earlier frame (that is, a frame having a smaller frame number) among a plurality of frames.
- the speech decoding apparatus may select a frame regardless of the frame number order.
- the speech decoding apparatus described in the above embodiments can be mounted on a communication terminal apparatus and a base station apparatus in a mobile communication system, thereby providing a communication terminal apparatus, a base station apparatus, and a mobile communication system having the same effects as described above.
- the present invention is not limited to this, and can be realized by software.
- the same functions as the speech decoding apparatus according to the present invention can be realized by describing the algorithm according to the present invention in a programming language, storing the program in a memory, and causing an information processing means to execute it.
- each functional block in the above embodiment is typically realized as an LSI which is an integrated circuit. These may be individually made into one chip, or may be made into one chip so as to include a part or all of them.
- the LSI may be referred to as an IC, a system LSI, a super LSI, or an ultra LSI depending on the degree of integration.
- the method of circuit integration is not limited to LSI, and implementation with a dedicated circuit or a general-purpose processor is also possible.
- An FPGA (Field Programmable Gate Array), or a reconfigurable processor in which the connections and settings of circuit cells inside the LSI can be reconfigured, may be used.
- the speech decoding apparatus is particularly suitable for application to a scalable codec having a hierarchical structure.
Abstract
Description
FIG. 1 is a block diagram showing the configuration of speech decoding apparatus 100 according to Embodiment 1 of the present invention. Speech decoding apparatus 100 is an example of a scalable decoder (a decoder of a scalable (multi-layer) codec). In the communication system according to this embodiment, each frame is composed of a plurality of layers, encoding is performed for each layer to generate a code, and a packet storing that code is generated. The code of each layer of the scalable codec is thereby transmitted.
In this embodiment, the speech decoding apparatus further determines whether or not each frame is a silent section, and selects the frame and the layer to be decoded depending on whether or not it is a silent section.
Claims (6)
- Receiving means for receiving, over a plurality of frames, and storing the codes of the respective layers generated in a speech encoding apparatus composed of a plurality of layers; and
decoding means for decoding the codes of the respective layers,
in a speech decoding apparatus comprising these,
selection means for selecting, from among the codes of the respective layers in an undecoded state, a frame number and a layer number corresponding to the code to be decoded first,
the apparatus further comprising this, wherein
the receiving means
further stores a decoding state indicating whether the code of each layer is unreceived, undecoded, or decoded, and updates the decoding state when an update instruction is input,
the selection means
selects, by searching the decoding states, the frame number and the layer number corresponding to a code whose decoding state is undecoded at the time the decoding state is stored or updated in the receiving means and which is to be decoded first after the storage or update, and outputs the update instruction for the decoding state to the receiving means, and
the decoding means
decodes the code corresponding to the frame number and the layer number.
A speech decoding apparatus. - The selection means,
in the search of the decoding states, for one frame, takes as the code to be decoded first the code of the layer, among the layers whose decoding state is undecoded, located one above a layer whose decoding state is decoded, or the code of the lowest layer in the one frame, and selects the frame number and the layer number corresponding to the code to be decoded first, and
when the selection cannot be made for the one frame, continues the search in the next frame.
The speech decoding apparatus according to claim 1. - Determination means for determining, for each frame, whether or not the frame is a silent section; and
determination result storage means for storing the determination result of the determination means for each frame,
the apparatus further comprising these, wherein
the selection means
performs the search of the decoding states using the determination result together with the decoding state.
The speech decoding apparatus according to claim 1. - The selection means
performs the search of the decoding states while skipping, among the plurality of frames, a frame whose determination result is a silent section.
The speech decoding apparatus according to claim 3. - Synthesis means for generating, for one frame, a synthesized sound by combining a decoded signal generated by the decoding means decoding the code of the frame number and the layer number with decoded signals of other already-decoded layers, the apparatus further comprising this.
The speech decoding apparatus according to claim 1. - A receiving step of receiving, over a plurality of frames, the codes of the respective layers generated in a speech encoding apparatus composed of a plurality of layers and storing them in a memory; and
a decoding step of decoding the codes of the respective layers,
in a speech decoding method comprising these,
a selection step of selecting, from among the codes of the respective layers in an undecoded state, a frame number and a layer number corresponding to the code to be decoded first,
the method further comprising this, wherein
the receiving step
further stores in the memory a decoding state indicating whether the code of each layer is unreceived, undecoded, or decoded, and updates the decoding state in the memory when an update is instructed,
the selection step
selects, by searching the decoding states, the frame number and the layer number corresponding to a code whose decoding state is undecoded at the time the decoding state is stored or updated in the memory and which is to be decoded first after the storage or update, and instructs the memory to update the decoding state, and
the decoding step
decodes the code corresponding to the frame number and the layer number.
A speech decoding method.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/255,982 US20120041761A1 (en) | 2009-03-13 | 2010-03-12 | Voice decoding apparatus and voice decoding method |
CN201080011371XA CN102349102A (zh) | 2009-03-13 | 2010-03-12 | 语音解码装置及语音解码方法 |
JP2011503738A JPWO2010103855A1 (ja) | 2009-03-13 | 2010-03-12 | 音声復号装置及び音声復号方法 |
EP10750611A EP2407962A1 (en) | 2009-03-13 | 2010-03-12 | Voice decoding apparatus and voice decoding method |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009-060792 | 2009-03-13 | ||
JP2009060792 | 2009-03-13 | ||
JP2009166796 | 2009-07-15 | ||
JP2009-166796 | 2009-07-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010103855A1 true WO2010103855A1 (ja) | 2010-09-16 |
Family
ID=42728151
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2010/001793 WO2010103855A1 (ja) | 2009-03-13 | 2010-03-12 | 音声復号装置及び音声復号方法 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20120041761A1 (ja) |
EP (1) | EP2407962A1 (ja) |
JP (1) | JPWO2010103855A1 (ja) |
CN (1) | CN102349102A (ja) |
WO (1) | WO2010103855A1 (ja) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9021327B2 (en) * | 2013-02-19 | 2015-04-28 | Harris Corporation | Dynamic packet redundancy for a free space optical communication link |
JP2014230159A (ja) * | 2013-05-23 | 2014-12-08 | Megachips Corporation | Image processing device |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08263096A (ja) | 1995-03-24 | 1996-10-11 | Nippon Telegr & Teleph Corp <Ntt> | Acoustic signal encoding method and decoding method |
JPH1141287A (ja) | 1997-07-22 | 1999-02-12 | Oki Electric Ind Co Ltd | Voice communication fluctuation absorbing method |
JP2000151694A (ja) | 1998-11-05 | 2000-05-30 | Hitachi Commun Syst Inc | Voice fluctuation correction control method, voice reproducing apparatus, and voice relay apparatus |
JP2001100791A (ja) * | 1999-09-28 | 2001-04-13 | Sanyo Electric Co Ltd | Decoding method, storage medium storing program for causing computer to execute the method, and decoding apparatus |
JP2003087317A (ja) | 2001-09-12 | 2003-03-20 | Nec Corp | Voice packet delay fluctuation absorbing apparatus and absorbing method |
JP2004138789A (ja) * | 2002-10-17 | 2004-05-13 | Nippon Telegr & Teleph Corp <Ntt> | Digital signal encoding method, decoding method, encoder, decoder, and programs therefor |
WO2005106848A1 (ja) * | 2004-04-30 | 2005-11-10 | Matsushita Electric Industrial Co., Ltd. | Scalable decoding apparatus and enhancement layer erasure concealment method |
JP2007235221A (ja) | 2006-02-27 | 2007-09-13 | Fujitsu Ltd | Fluctuation absorbing buffer apparatus |
JP2009060792A (ja) | 2007-09-04 | 2009-03-26 | Fujifilm Corp | Food in tablet or capsule form. |
JP2009166796A (ja) | 2008-01-21 | 2009-07-30 | Mazda Motor Corp | Vehicle suspension device |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100335611B1 (ko) * | 1997-11-20 | 2002-10-09 | Samsung Electronics Co., Ltd. | Bit-rate adjustable stereo audio encoding/decoding method and apparatus |
DE60130180T2 (de) * | 2000-04-14 | 2008-05-15 | Sony Corp. | Method for encoding and decoding, recording medium and program |
US20070282607A1 (en) * | 2004-04-28 | 2007-12-06 | Otodio Limited | System For Distributing A Text Document |
-
2010
- 2010-03-12 WO PCT/JP2010/001793 patent/WO2010103855A1/ja active Application Filing
- 2010-03-12 JP JP2011503738A patent/JPWO2010103855A1/ja active Pending
- 2010-03-12 US US13/255,982 patent/US20120041761A1/en not_active Abandoned
- 2010-03-12 EP EP10750611A patent/EP2407962A1/en not_active Withdrawn
- 2010-03-12 CN CN201080011371XA patent/CN102349102A/zh active Pending
Also Published As
Publication number | Publication date |
---|---|
US20120041761A1 (en) | 2012-02-16 |
CN102349102A (zh) | 2012-02-08 |
EP2407962A1 (en) | 2012-01-18 |
JPWO2010103855A1 (ja) | 2012-09-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7805297B2 (en) | Classification-based frame loss concealment for audio signals | |
JP5072835B2 (ja) | Robust decoder | |
US7996217B2 (en) | Method for adaptive codebook pitch-lag computation in audio transcoders | |
US8428959B2 (en) | Audio packet loss concealment by transform interpolation | |
JP6574820B2 (ja) | Method for predicting high frequency band signal, encoding device, and decoding device | |
RU2408089C9 (ru) | Decoding predictively coded data using buffer adaptation | |
US9123328B2 (en) | Apparatus and method for audio frame loss recovery | |
US9361904B2 (en) | Method for predicting bandwidth extension frequency band signal, and decoding device | |
US8055499B2 (en) | Transmitter and receiver for speech coding and decoding by using additional bit allocation method | |
JP2009219151A (ja) | Method for decompressing compressed audio signal, decompression apparatus, and electronic apparatus | |
JP2003501675A (ja) | Speech synthesis method and apparatus for synthesizing speech from pitch prototype waveforms by time-synchronous waveform interpolation | |
EP1617417A1 (en) | Voice coding/decoding method and apparatus | |
WO2010103855A1 (ja) | Speech decoding apparatus and speech decoding method | |
JPWO2006009075A1 (ja) | Speech encoding apparatus and speech encoding method | |
US20140334484A1 (en) | System, device, and method of voice-over-ip communication | |
WO2009122757A1 (ja) | Stereo signal transforming apparatus, stereo signal inverse transforming apparatus, and methods thereof | |
US20040019480A1 (en) | Speech encoding device having TFO function and method | |
JP2005173215A (ja) | Speech recognition system | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 201080011371.X Country of ref document: CN |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 10750611 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2011503738 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2010750611 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13255982 Country of ref document: US |