WO1997016821A1 - Method and system for compressing a speech signal using nonlinear prediction - Google Patents
- Publication number
- WO1997016821A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech data
- speech
- energy
- subsequence
- predictive coding
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/0018—Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/06—Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
- G10L19/08—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
Definitions
- This invention relates generally to speech coding and, more particularly, to speech data compression.
- The speech is converted to an analog speech signal with a transducer such as a microphone.
- The speech signal is periodically sampled and converted to speech data by, for example, an analog-to-digital converter.
- The speech data can then be stored by a computer or other digital device.
- The speech data can also be transferred among computers or other digital devices via a communications medium.
- The speech data can be converted back to an analog signal by, for example, a digital-to-analog converter, to reproduce the speech signal.
- The reproduced speech signal can then be amplified to a desired level to play back the original speech.
- In order to provide a recognizable, quality reproduced speech signal, the speech data must represent the original speech signal as accurately as possible. This typically requires frequent sampling of the speech signal, and thus produces a high volume of speech data which may significantly hinder data storage and transfer operations. For this reason, various methods of speech compression have been employed to reduce the volume of the speech data. As a general rule, however, the greater the compression ratio achieved by such methods, the lower the quality of the reproduced speech signal. Thus, a more efficient means of compression is desired which achieves both a high compression ratio and high quality in the reproduced speech signal.
- FIG. 1 is a flowchart of the speech compression process performed in a preferred embodiment of the invention.
- FIG. 2 is a flowchart of the speech parameter generation process of the preferred embodiment of the invention.
- FIG. 3 is a block diagram of the speech compression system of the preferred embodiment of the invention.
- FIG. 4 is an illustration of the sequence of speech data in the preferred embodiment of the invention.
- FIG. 5 is a block diagram of the speech parameter generator of the preferred embodiment of the invention.
- A method and system are provided for compressing a speech signal into compressed speech data.
- A sampler initially samples the speech signal to form a sequence of speech data.
- A segmenter then segments the sequence of speech data into at least one subsequence of segmented speech data, called herein a segment.
- A speech parameter generator generates speech parameters by fitting each segment to a nonlinear predictive coding equation.
- The nonlinear predictive coding equation includes a linear predictive coding equation having linear terms.
- The nonlinear predictive coding equation also includes at least one cross term that is proportional to a product of two or more of the linear terms.
- The speech parameters are generated as the compressed speech data for each segment. Inclusion of the cross term provides the advantage of more accurate speech compression with a minimal addition of compressed speech data.
- The energy in each segment is determined and compared to an energy threshold.
- The compressed speech data further includes an energy flag indicating whether the energy is greater than the energy threshold. If the energy is greater than the energy threshold, a sinusoidal term is included in the nonlinear predictive coding equation, and the compressed speech data further includes a sinusoidal coefficient, an amplitude and a frequency of the sinusoidal term. This provides greater accuracy in the speech data for the voiced segment, which requires more description for accurate reproduction of the speech signal than an unvoiced segment. If the energy of the segment is not greater than the energy threshold, a noise term is included in the nonlinear predictive coding equation instead of the sinusoidal term. This provides a sufficiently accurate model of the speech signal for the segment while allowing for greater compression of the speech data.
- The nonlinear predictive coding equation is used to decompress the compressed speech data when the speech signal is reproduced.
- FIG. 1 is a flowchart of the speech compression process performed in a preferred embodiment of the invention. It is noted that the flowcharts in the description of the preferred embodiment do not necessarily correspond directly to lines of software code or separate routines and subroutines, but are provided as illustrative of the concepts involved in the relevant process so that one of ordinary skill in the art will best understand how to implement those concepts in the specific configuration and circumstances at hand.
- The speech compression method and system described herein may be implemented as software executing on a computer.
- Alternatively, the speech compression method and system may be implemented in digital circuitry, such as one or more integrated circuits designed in accordance with the description of the preferred embodiment.
- One possible embodiment of the invention includes a polynomial processor designed to perform the polynomial functions which will be described herein, such as the polynomial processor described in "Neural Network and Method of Using Same", having serial number 08/076,601, which is herein incorporated by reference.
- One of ordinary skill in the art will readily implement the method and system that is most appropriate for the circumstances at hand based on the description herein.
- A speech signal is sampled periodically to form a sequence of speech data.
- The sequence of speech data is segmented into at least one subsequence of segmented speech data, called herein a segment.
- Step 120 includes segmenting the sequence of speech data into overlapping segments.
- Each segment and a sequentially adjacent subsequence of segmented speech data, called herein an adjacent segment, overlap so that both the segment and the adjacent segment include a segment overlap component representing one or more of the same sampling points of the speech signal.
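The overlapping segmentation described above can be sketched in Python. This is a hypothetical helper, not the patent's implementation; `core_len` and `overlap` are illustrative parameters chosen to match the 160-sample segments with 2-sample overlap components described later in the specification.

```python
def segment_overlapping(samples, core_len=160, overlap=2):
    """Split a sample sequence into segments that each carry `overlap`
    extra samples on each end, so adjacent segments share `overlap`
    sampling points (a sketch of the segmenter described in the text)."""
    seg_len = core_len + 2 * overlap      # 164 in the patent's example
    hop = seg_len - overlap               # adjacent segments share `overlap` samples
    segments = []
    start = 0
    while start + seg_len <= len(samples):
        segments.append(samples[start:start + seg_len])
        start += hop
    return segments
```

With small illustrative parameters (`core_len=4`, `overlap=1`), sixteen samples yield three segments, each sharing its boundary sample with its neighbour.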
- Speech parameters are then generated for the segment based on the speech data, as described in the flowchart in FIG. 2.
- Speech coefficients are generated by fitting the segment to a nonlinear predictive coding equation.
- The speech coefficients are generated using a curve-fitting technique such as a least-squares method or a matrix-inversion method.
- The nonlinear predictive coding equation includes a linear predictive coding equation with linear terms.
- The nonlinear predictive coding equation further includes at least one cross term that is proportional to a product of two or more of the linear terms. The inclusion of the cross term provides for significantly greater accuracy than the linear predictive coding equation alone.
- The nonlinear predictive coding equation will be described in detail later in the specification.
- In step 220, it is determined whether the segment is voiced or unvoiced.
- The energy of the segment is determined and compared to an energy threshold. If the energy in the segment is greater than the energy threshold, the segment is determined to be voiced, and steps 240 and 250 are performed.
- In step 240, sinusoidal parameters are generated for the voiced segment. Specifically, a sinusoidal term is included in the nonlinear predictive coding equation, and a sinusoidal coefficient, an amplitude and a frequency of the sinusoidal term are generated. The sinusoidal term is used for a voiced portion of the speech signal because more accuracy is required in the speech data to represent voiced speech than unvoiced speech.
- In step 250, an energy flag is generated indicating that the energy is greater than the energy threshold, thus identifying the segment as voiced.
- In step 260, a noise term is included in the nonlinear predictive coding equation for an unvoiced segment.
- The noise term is used because less accuracy is required in the speech data to represent unvoiced speech, and thus greater compression can be realized.
- In step 270, an energy flag is generated indicating that the energy is not greater than the energy threshold, thus identifying the segment as unvoiced.
- In step 280, the speech coefficients, the energy flag and, for a voiced segment, the sinusoidal parameters are included as speech parameters in the compressed speech data for the segment.
- Upon decompression, the nonlinear predictive coding equation will include either the sinusoidal term or the noise term, depending on whether the energy flag indicates that the segment is voiced or unvoiced, and the compressed speech data will be converted accordingly.
- Steps 120 and 130 are repeated for each additional segment as long as the sequence of speech data contains more speech data. When the sequence of speech data contains no more speech data, the process ends.
- FIG. 3 is a block diagram of the speech compression system of the preferred embodiment of the invention.
- The preferred embodiment may be implemented in hardware or in software, as a matter of choice for one of ordinary skill in the art.
- In a hardware embodiment, the system of FIG. 3 is implemented as one or more integrated circuits specifically designed to implement the preferred embodiment of the invention as described herein.
- The integrated circuits include a polynomial processor circuit as described above, designed to perform the polynomial functions of the preferred embodiment of the invention.
- The polynomial processor is included as part of the speech parameter generator described below.
- Alternatively, in a software embodiment, the system of FIG. 3 is implemented as software executing on a computer, in which case the blocks refer to software functions realized in the digital circuitry of the computer.
- A sampler 310 receives the speech signal and samples it periodically to produce a sequence of speech data.
- The speech signal is an analog signal which represents actual speech.
- The speech signal is, for example, an electrical signal produced by a transducer, such as a microphone, which converts the acoustic energy of sound waves produced by the speech into electrical energy.
- The speech signal may also be produced from speech previously recorded on any appropriate medium.
- The sampler 310 periodically samples the speech signal at a sampling rate sufficient to accurately represent the speech signal in accordance with the Nyquist theorem.
- The frequency of detectable speech falls within a range from 100 Hz to 3400 Hz. Accordingly, in an actual embodiment, the speech signal is sampled at a sampling frequency of 8000 Hz.
- Each sampling produces an 8-bit sampling value representing the amplitude of the speech signal at a corresponding sampling point of the speech signal.
- The sampling values become part of the sequence of speech data in the order in which they are sampled.
- The sampler is implemented by, for example, a conventional analog-to-digital converter.
- A segmenter 320 receives the sequence of speech data from the sampler 310 and divides the sequence of speech data into segments. Because the preferred embodiment of the invention employs curve-fitting techniques, the speech signal is compressed more efficiently in separate segments. In the preferred embodiment, the segmenter divides the sequence of speech data into overlapping segments as shown in FIG. 4.
- The sequence of speech data 400 is divided into segments 410.
- Each segment 410 includes a segment overlap component 420 on each end.
- In the preferred embodiment, each segment 410 contains 164 one-byte sampling values: 160 sampling values plus the two segment overlap components 420, one on each end, each having 2 sampling values. Because each segment 410 and its adjacent segment share a segment overlap component 420, a smoother transition between segments can be accomplished when the speech signal is reproduced. This is done by averaging the overlap components of each segment and its adjacent segment, and replacing the sampling values with the resulting averages.
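The boundary-averaging step just described might be sketched as follows. This is a hypothetical helper, not the patent's implementation; the usage below uses a one-sample overlap purely for legibility, whereas the preferred embodiment uses 2-sample overlap components.

```python
def merge_with_overlap_average(segments, overlap=2):
    """Rejoin overlapping segments, averaging the shared samples so the
    transition between reconstructed segments is smoothed (a sketch of
    the averaging the specification describes)."""
    out = list(segments[0])
    for seg in segments[1:]:
        # average the trailing overlap of `out` with the leading overlap of `seg`
        for i in range(overlap):
            out[-overlap + i] = (out[-overlap + i] + seg[i]) / 2.0
        out.extend(seg[overlap:])
    return out
```

For example, merging `[1, 2, 3, 4]` and `[6, 5, 7, 8]` with a one-sample overlap replaces the shared boundary sample with the average of 4 and 6.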
- One of ordinary skill in the art will readily implement the segmenter based on the description herein.
- A speech parameter generator 330 receives the segments from the segmenter 320.
- The speech parameter generator 330 of the preferred embodiment is shown in FIG. 5.
- Each segment is received by a speech coefficient generator 510.
- The speech coefficient generator 510 generates the speech coefficients by fitting the speech data in the segment to a nonlinear predictive coding equation.
- The speech coefficient generator 510 generates the speech coefficients using a curve-fitting technique such as a least-squares method or a matrix-inversion method.
- The nonlinear predictive coding equation includes a linear predictive coding equation with linear terms. Linear predictive coding is well known to those of ordinary skill in the art and is described in "Voice Processing" by Gordon E. Pelton, on pp.
- The nonlinear predictive coding equation further includes at least one cross term that is proportional to a product of two or more of the linear terms.
- The speech coefficient generator 510 generates the speech coefficients by fitting the speech data in the segment to y(k) such that:

    y(k) = Σ(i=1..n) a_i · y(k−i) + a_(n+1) · y(k−1) · y(k−2)

- where y(k) is the sampling value described above at each sampling point k, taken over the n past samples y(k−i), and the a_i are the speech coefficients;
- Σ(i=1..n) a_i · y(k−i) is the linear predictive coding equation; and
- a_(n+1) · y(k−1) · y(k−2) is the cross term.
- The speech coefficient generator 510 generates the speech coefficients a_i and includes the speech coefficients in the compressed speech data for the segment. For example, the numeric values of the speech coefficients are assigned to a portion of a data structure allocated to contain the compressed speech data.
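A least-squares fit of a segment to the nonlinear predictive coding equation, including the cross term, can be sketched as below. This is a minimal illustration under stated assumptions, not the patent's implementation: it solves the normal equations directly with Gaussian elimination, whereas a production coder would likely use a numerically safer solver.

```python
def fit_nlpc(y, n=2):
    """Least-squares fit of the model
        y(k) = sum_{i=1..n} a_i * y(k-i)  +  a_{n+1} * y(k-1) * y(k-2)
    to a segment `y`.  Returns [a_1, ..., a_n, a_{n+1}]."""
    rows, targets = [], []
    for k in range(n, len(y)):
        feats = [y[k - i] for i in range(1, n + 1)]  # linear terms
        feats.append(y[k - 1] * y[k - 2])            # the cross term
        rows.append(feats)
        targets.append(y[k])
    m = n + 1
    # Normal equations: (X^T X) a = X^T t
    A = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    b = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    a = [0.0] * m
    for r in range(m - 1, -1, -1):
        a[r] = (b[r] - sum(A[r][c] * a[c] for c in range(r + 1, m))) / A[r][r]
    return a
```

As a sanity check, fitting a sequence generated exactly by y(k) = −y(k−2) recovers a_1 ≈ 0, a_2 ≈ −1 and a cross-term coefficient ≈ 0.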
- An energy detector 520 determines the energy of the speech signal for the segment by integrating all of the points in the segment, and compares the energy determined, that is, the average value of the integration, to an energy threshold. The energy detector 520 sets an energy flag indicating whether the energy is greater than the energy threshold. Specifically, in the preferred embodiment, the energy detector 520 sets a voiced bit to 1 when the energy determined is greater than the energy threshold, indicating that the segment is voiced. The energy detector 520 sets the voiced bit to 0 when the energy is not greater than the energy threshold, indicating that the segment is unvoiced.
- The energy detector 520 generates the voiced bit and includes the voiced bit in the compressed speech data for the segment.
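The voiced/unvoiced decision can be sketched as below. Mean squared amplitude stands in here for the averaged integration the specification describes, and the threshold value in the usage is an assumption; both are implementation choices, not the patent's stated formula.

```python
def voiced_bit(segment, energy_threshold):
    """Return 1 (voiced) if the segment's average energy exceeds the
    threshold, else 0 (unvoiced).  Mean squared amplitude is used as
    the energy measure in this sketch."""
    energy = sum(s * s for s in segment) / len(segment)
    return 1 if energy > energy_threshold else 0
```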
- A sinusoidal parameter generator 530 is invoked by the energy detector 520 when the energy detector 520 determines that the energy is greater than the energy threshold for the segment. That is, the sinusoidal parameter generator 530 is invoked when the segment is voiced.
- The sinusoidal parameter generator 530 generates the sinusoidal parameters to be included in the compressed speech data for the voiced segment.
- The sinusoidal parameter generator 530 includes a sinusoidal term in the nonlinear predictive coding equation such that:

    y(k) = Σ(i=1..n) a_i · y(k−i) + a_(n+1) · y(k−1) · y(k−2) + b · sin(w · k / K)

- where b · sin(w · k / K) is the sinusoidal term;
- b is a sinusoidal coefficient of the sinusoidal term (also referred to in the art as gain);
- w is a frequency of the sinusoidal term (also referred to in the art as pitch); and
- K is a constant.
- The sinusoidal parameter generator 530 generates the sinusoidal coefficient, the amplitude and the frequency of the sinusoidal term as the sinusoidal parameters, and includes the sinusoidal parameters in the compressed speech data for the segment along with the speech coefficients in the manner described above.
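One plausible way to generate the sinusoidal parameters is to fit b·sin(w·k/K) to the prediction residual by searching a grid of candidate frequencies, with the best gain computed in closed form for each candidate. The residual-based formulation and the candidate grid are assumptions for this sketch, not the patent's stated method.

```python
import math

def fit_sinusoid(residual, K=160.0, w_candidates=range(1, 41)):
    """Grid-search the frequency w and solve the least-squares gain b
    for the model b * sin(w * k / K) against a residual sequence.
    Returns (b, w)."""
    best, best_err = (0.0, 0), float("inf")
    for w in w_candidates:
        s = [math.sin(w * k / K) for k in range(len(residual))]
        denom = sum(v * v for v in s)
        if denom == 0:
            continue
        b = sum(r * v for r, v in zip(residual, s)) / denom  # closed-form gain
        err = sum((r - b * v) ** 2 for r, v in zip(residual, s))
        if err < best_err:
            best_err, best = err, (b, w)
    return best
```

Fitting a residual that is exactly 2.5·sin(7k/32) recovers w = 7 and b ≈ 2.5.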
- One of ordinary skill in the art will readily implement the sinusoidal parameter generator 530 based on the description herein.
- A white noise generator 540 is invoked by the energy detector 520 when the energy detector 520 determines that the energy is not greater than the energy threshold for the segment. That is, the white noise generator 540 is invoked when the segment is unvoiced.
- The white noise generator 540 includes a noise term in the nonlinear predictive coding equation such that:

    y(k) = Σ(i=1..n) a_i · y(k−i) + a_(n+1) · y(k−1) · y(k−2) + n(k)

- where n(k) is the noise term.
- n(k) can be represented as c · N(k), where c is the energy of the noise and N(k) is the normalized white noise.
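The noise term n(k) = c·N(k) can be sketched as follows. Only the energy c needs to travel in the compressed data; the decoder synthesizes statistically equivalent, not sample-identical, noise. The seeding is for reproducibility in this sketch only.

```python
import random

def noise_term(c, length, seed=None):
    """Generate n(k) = c * N(k) for an unvoiced segment, where N(k) is
    zero-mean, unit-variance Gaussian white noise scaled by the noise
    energy c."""
    rng = random.Random(seed)
    return [c * rng.gauss(0.0, 1.0) for _ in range(length)]
```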
- Upon decompression of the compressed speech data, the voiced bit will indicate whether to include the noise term in the nonlinear predictive coding equation when applying the equation to produce the decompressed speech data for the segment.
- In the preferred embodiment, the noise term is a Gaussian white noise term.
- However, one of ordinary skill in the art may use other noise models as appropriate for the objectives of the speech compression system, and will readily implement the white noise generator 540 based on the description herein.
- Decompression is essentially the reversal of the compression process described above and will be easily accomplished by one of ordinary skill in the art.
- The speech parameters are converted back into speech data using the nonlinear predictive coding equation for each segment. If the segment is voiced, as determined by the voiced bit, the sinusoidal term has been included in the nonlinear predictive coding equation used to reproduce the speech data. This provides greater accuracy in the speech data for the voiced segment, which requires more description for accurate reproduction of the speech signal than an unvoiced segment. If the segment is unvoiced, as determined by the voiced bit, the noise term has been included in the nonlinear predictive coding equation. This provides a sufficiently accurate model of the speech signal while allowing for greater compression of the speech data.
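Decompression of one segment can be sketched as a forward run of the nonlinear predictive coding equation: past outputs feed the linear terms and the cross term, plus either the sinusoid (voiced) or noise (unvoiced). The argument names and the seed-history convention are illustrative assumptions, not taken from the patent.

```python
import math
import random

def decompress_segment(a, length, voiced, b=0.0, w=0, K=160.0, c=0.0,
                       history=(0.0, 0.0), seed=None):
    """Run the nonlinear predictive coding equation forward to reproduce
    a segment of `length` samples from its compressed parameters.
    `history` must supply at least n past samples, where n = len(a) - 1."""
    rng = random.Random(seed)
    y = list(history)           # seed values for y(k-1), y(k-2), ...
    h = len(y)
    n = len(a) - 1              # linear terms; a[-1] is the cross-term coefficient
    for k in range(length):
        val = sum(a[i] * y[-1 - i] for i in range(n))   # linear terms
        val += a[n] * y[-1] * y[-2]                     # cross term y(k-1)*y(k-2)
        if voiced:
            val += b * math.sin(w * k / K)              # sinusoidal term (voiced)
        else:
            val += c * rng.gauss(0.0, 1.0)              # noise term (unvoiced)
        y.append(val)
    return y[h:]
```

With coefficients a = [0, −1, 0], zero noise energy and history (1, 2), the recursion y(k) = −y(k−2) reproduces the alternating sequence −1, −2, 1, 2.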
- The segment overlap components 420 in each segment 410 are averaged with the segment overlap components 420 of the adjacent segment, and the segment overlap components 420 are replaced by the averaged values.
- The segments are then joined until all of the segments have been aggregated back into a decompressed sequence of speech data.
- The decompressed sequence of speech data can then be converted to an analog speech signal and played or recorded as desired.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU74812/96A AU7481296A (en) | 1995-10-31 | 1996-10-30 | Method and system for compressing a speech signal using nonlinear prediction |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US55072395A | 1995-10-31 | 1995-10-31 | |
US08/550,723 | 1995-10-31 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1997016821A1 (fr) | 1997-05-09 |
Family
ID=24198350
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1996/017308 WO1997016821A1 (fr) | 1995-10-31 | 1996-10-30 | Procede et systeme de compression d'un signal vocal par prediction non lineaire |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU7481296A (fr) |
WO (1) | WO1997016821A1 (fr) |
1996
- 1996-10-30 WO PCT/US1996/017308 patent/WO1997016821A1/fr — active (Application Filing)
- 1996-10-30 AU AU74812/96A patent/AU7481296A/en — not active (Abandoned)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0504627A2 (fr) * | 1991-02-26 | 1992-09-23 | Nec Corporation | Méthode et dispositif de codage de paramètres de voix |
Non-Patent Citations (3)
Title |
---|
BIGLIERI E: "Theory of Volterra processors and some applications", PROCEEDINGS OF ICASSP 82. IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, PARIS, FRANCE, 3-5 MAY 1982, 1982, NEW YORK, NY, USA, IEEE, USA, pages 294 - 297 vol.1, XP002025605 * |
GERSHO A: "ADVANCES IN SPEECH AND AUDIO COMPRESSION", PROCEEDINGS OF THE IEEE, vol. 82, no. 6, 1 June 1994 (1994-06-01), pages 900 - 918, XP000438340 * |
SHIHUA WANG ET AL: "PERFORMANCE OF NONLINEAR PREDICTION OF SPEECH", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING (ICSLP), KOBE, NOV. 18 - 22, 1990, vol. 1 OF 2, 18 November 1990 (1990-11-18), ACOUSTICAL SOCIETY OF JAPAN, pages 29 - 32, XP000503306 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0896321A1 (fr) * | 1997-08-08 | 1999-02-10 | Nec Corporation | Méthode et système pour la compression et la décompression de son |
US6098045A (en) * | 1997-08-08 | 2000-08-01 | Nec Corporation | Sound compression/decompression method and system |
Also Published As
Publication number | Publication date |
---|---|
AU7481296A (en) | 1997-05-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AM AT AU BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE HU IL IS JP KE KG KP KR KZ LK LR LT LU LV MD MG MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TT UA UG US UZ VN |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
NENP | Non-entry into the national phase |
Ref country code: JP Ref document number: 97517476 Format of ref document f/p: F |
|
122 | Ep: pct application non-entry in european phase | ||
NENP | Non-entry into the national phase |
Ref country code: CA |