WO1991006945A1 - Speech compression system - Google Patents

Info

Publication number
WO1991006945A1
Authority
WO
WIPO (PCT)
Prior art keywords
persyl
perception
speech
condensing
units
Application number
PCT/US1990/006437
Other languages
French (fr)
Inventor
Iben Browning
William B. Dress
Original Assignee
Summacom, Inc.
Application filed by Summacom, Inc. filed Critical Summacom, Inc.
Publication of WO1991006945A1 publication Critical patent/WO1991006945A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/032Quantisation or dequantisation of spectral components
    • G10L19/038Vector quantisation, e.g. TwinVQ audio

Definitions

  • the invention pertains generally to speech compression, processing and transmission systems and is more particularly directed to those speech compression, processing and transmission systems that can be used to transmit compressed speech in units that, when expanded, can be perceived by the human ear as intelligible speech.
  • a spectrally pure signal may sound like a tone when presented to the ear for a certain time longer than a frequency-dependent minimum, but like a click if presented for a shorter time. This is thought to be due to a physical hysteresis associated with the tissues of the hearing system and to an averaging property of sensory nerve cells.
  • this upper bound may be verified by noting that a tone is perceived only to become increasingly louder as its time of presentation to the ear is increased up to this upper bound; beyond it, the tone is perceived to be both louder and to last longer as the presentation time is increased.
  • the presence of this upper time bound may also be verified by noting that two tones of the same frequency but having intensities differing by, say, √2 are perceived to be the same, provided that the less intense tone lasts for twice as long as the more intense one, and that both tones last longer than the lower time bound and less than the upper time bound.
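A compact reading of the trade-off just described, offered as an interpretation rather than a formula stated in the specification, is that two tones are judged equal when an energy-like quantity matches:

```latex
% Interpretive sketch: equal perception when the energy-like quantity
% I^2 t matches for both tones.
\[
  I_1^{2}\, t_1 \;=\; I_2^{2}\, t_2 .
\]
% With intensities differing by a factor of \sqrt{2}, i.e. I_1 = \sqrt{2}\, I_2:
\[
  2\, I_2^{2}\, t_1 = I_2^{2}\, t_2
  \quad\Longrightarrow\quad
  t_2 = 2\, t_1 ,
\]
% so the less intense tone must last twice as long to be perceived as the
% same, consistent with the observation above.
```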
  • This upper time bound may be termed the "sound fusion" time where the ear integrates specific sounds into an inseparable whole.
  • This realm of constant tone perception can be used to define a perception unit having components that are selected from frequency bands in an analyzed sound wave.
  • This perception unit will be termed a "persyl" to indicate both its perceptual nature and an analogy to taking a "slice" of sound, albeit not in constant time, but in constant perception.
  • a time sequence of persyls would advantageously encode the information in human speech that is relevant to the perception of phonemes, words, and phrases.
  • Vector encoding is a conventional method of speech compression where a code book of values representing speech is stored. Speech patterns are then formed by analysis of actual speech, for example, by linear predictive coding, and matched against the stored code book. The closest code from the book is used to represent a particular speech pattern. Compression takes place when the number of code values is less than the possible actual patterns analyzed. See Robert M. Gray, "Vector Quantization", IEEE ASSP Magazine, pp. 4-29, April 1984. Because of the already compacting nature of the persyl unit, it would be highly advantageous to increase this factor by vector encoding persyl units. This technique would be even more beneficial if the unique properties of the persyl unit could be used in forming the code book.
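Conventional vector quantization as summarized above can be sketched as follows; the codebook entries, dimensions, and distance metric here are illustrative assumptions, not values from the patent:

```python
def vq_encode(vectors, codebook):
    """Map each analysis vector to the index of the nearest code word
    (squared Euclidean distance)."""
    def nearest(v):
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        return dists.index(min(dists))
    return [nearest(v) for v in vectors]

def vq_decode(indices, codebook):
    """Reconstruction is a simple table lookup: one code word per index."""
    return [codebook[i] for i in indices]

# Toy 4-entry codebook of 2-dimensional "speech patterns" (made-up values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
samples = [(0.1, 0.1), (0.9, 0.2), (0.2, 0.8)]
codes = vq_encode(samples, codebook)        # -> [0, 1, 2]
reconstructed = vq_decode(codes, codebook)  # nearest stored patterns
```

Compression arises because a transmitted index costs only log2(len(codebook)) bits, however many dimensions each code word has.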
  • the invention provides a unique speech condensation and compression system that significantly reduces the amount of information that must be transmitted to describe and faithfully reproduce segments of speech.
  • the system is based on the concept of a persyl unit, which is the unit of constant perception that a human hearing system uses in deciphering actual speech from sound pressure waves.
  • the system comprises an analyzer at an encoder station where the speech is analyzed and encoded based on the persyl unit stream it contains.
  • a vector coded representation of the persyl unit content of the speech stream is then transferred to a decoder station as a time sequence of codes. Between the time of coding and transferral, the persyl units can be processed, stored, directly transferred or transmitted.
  • a vector decoder transforms the coded information back into persyl units, which are thereafter assembled into a digital bit stream that can be converted back into intelligible sound.
  • the encoder station includes an analyzer having a plurality of persyl component detectors, each having a tunable filter and an associated integrator.
  • the filters are tunable based on both the frequency and the bandwidth necessary to detect adequately a particular component of a persyl unit.
  • the integrator integrates the amplitude of the detected component over a sampling time consistent with the region of constant perception to yield an average amplitude for that component.
  • the persyl components are chosen to have time durations in excess of the minimum perception time for the associated frequency and less than sound fusion time for that frequency.
  • the persyl components are assembled in an assembler for a particular sampling time to become persyl units.
  • the raw persyl units can be normalized for sounds which vary because of a particular scaling factor, such as phase, amplitude, pitch, duration, etc.
  • An extractor receives an output of the assembler and forms descriptors that indicate the scaling factors. The descriptors are used to modify the persyl units so that the assembler can pass along normalized persyl units. This completes the process of condensation of the speech signal. Compression takes place when the normalized persyl units are then vector encoded by a persyl unit encoder.
  • the result of the persyl unit encoding is a sequential stream of fixed or variable bit length codes, each representing one of a multiplicity of code words which describe the set of persyl units found in human speech.
  • the descriptor for each persyl code is then attached to form a digital representation of a speech sequence.
  • the speech has been condensed and compressed to a very low bit total.
  • the speech is also in a very convenient form, a digital code, which can be used in a variety of ways.
  • the codes can be advantageously stored because they will consume much less memory than normally digitized speech.
  • the codes can be advantageously transmitted because in their compressed form they consume much less bandwidth and have a lower bit rate than normally digitized speech, thus requiring both less time and channel resources.
  • the codes can be advantageously processed because in their compressed form they require fewer processor cycles for the same procedure than for normally digitized speech. Moreover, any of these processes can be done in any order, or in any combination.
  • the decoder station receives the digital codes and separates the descriptors from the persyl unit codes.
  • the persyl unit codes are then applied to a decoder which transforms the codes to normalized persyl units.
  • the persyl units are then applied to a constructor, which using the descriptors and the normalized persyl units, creates digital frequency components of a predetermined or descriptor-specified time duration.
  • the digital frequency components are then converted into an analog speech signal which is used to power a transducer which converts the analog signal into sounds that can be perceived.
  • FIG. 1 is a system block diagram of a speech compression system constructed in accordance with the invention;
  • FIG. 2 is a detailed system flow chart of the speech compression process for the system illustrated in FIG. 1;
  • FIG. 3 is a graphical representation of the constant perception region for the human ear in cycle number (period) as a function of frequency over the speech range;
  • FIG. 4 is a tabular representation of a subset of the preferred frequency and cycle number assignments for the perception components of persyl units used in the speech compression system illustrated in FIG. 1;
  • FIG. 5 is a histogram representation of a persyl unit illustrating the various components as a function of frequency and illustrating a set of associated descriptors;
  • FIG. 6 is a schematic representation of a persyl space showing a preferred partitioning and the assignment of binary codes to several particular partitions;
  • FIG. 7 is a pictorial representation of a "retina" buffer for the n-tuple pattern recognition process used by the system illustrated in FIG. 1;
  • FIG. 8 depicts a part of the n-tuple pattern recognition process used by the system illustrated in FIG. 1;
  • FIG. 9 is a detailed functional block diagram of a preferred embodiment of the speech compression system illustrated in FIG. 1;
  • FIG. 10 is a detailed schematic block diagram of the encoder station of the speech compression system illustrated in FIG. 9;
  • FIG. 11 is a detailed schematic block diagram of the decoder station of the speech compression system illustrated in FIG. 9;
  • FIG. 12 is a detailed functional diagram of one of the filter-integrator pairs of the analyzer illustrated in FIG. 10; and FIG. 13 is a detailed functional diagram of one of the resonator-filter pairs of the synthesizer illustrated in FIG. 11.
  • A system for condensing and compressing speech, constructed in accordance with the invention, is shown in FIG. 1.
  • the system comprises generally an encoder station 8 and a decoder station 9 which can be connected together by an intermediate functional link 6 or a direct (immediate) functional link 7.
  • the encoder station 8 functions to reduce the bit rate required for representing the essential information content of speech (input voice signal) by the compaction and compression encoding processes of the invention which are based on constant human perception units or persyl units. Accordingly, the speech is transduced from an input voice signal by a physical to electrical transducer, a microphone 10. The analog speech signal from the output of microphone 10 is then converted to a series of digital samples by an A/D converter 12, producing a relatively high bit rate.
  • the high bit rate signal is converted to a lower bit rate signal by extracting persyl units from it with a persyl encoder 2.
  • the persyl encoder 2 transfers the persyl units to a vector encoder 3 in a second encoding step (compression) for a further significant bit rate reduction to yield a very low bit rate signal.
  • the low bit-rate encoded signal is passed through an intermediate link 6, which can include transmission, storage, or processing, before being received at the decoder station 9.
  • the functions of the intermediate link 6 can occur in any order or combination, and should be broadly construed.
  • transmission can include direct communication, as by connecting the encoder station 8 and decoder station 9 back-to-back at the same location. Transmission also contemplates all forms of digital communication such as wire (cable), wireless, over-the-air, and others.
  • processing can include error correcting, protocol conversion, encryption, and other processes.
  • Storage can occur prior to or after processing, and prior to transmission; can be on line or off line; and can use volatile memory, non-volatile memory, or other forms of memory.
  • the decoder station 9 after receipt of the persyl unit codes, uses a vector decoder 4 to convert the codes back into persyl units of the lower bit rate.
  • the persyl units are then applied to a persyl decoder 5 which converts them back into complex sounds.
  • the output of the persyl decoder is at the high bit rate, similar to the output of the A/D converter 12.
  • This high bit rate signal is then converted by a D/A converter 25 into an analog signal that can be transduced by an electrical-to-physical transducer (speaker) 27 into an output sound signal (reconstructed voice signal).
  • the system includes other advantageous implementations, including an immediate functional link 7 by which the persyl units from the persyl encoder 2 can be directly transferred to the persyl decoder 5.
  • This implementation eliminates the vector encoding and decoding steps but at the cost of a higher bit rate.
  • the immediate link 7 could additionally include transmission, storage, or processing functions.
  • the analog-to-digital converter 12 could be used after an analog persyl encoder 2 or after an analog vector encoder 3.
  • the digital-to-analog converter 25 can be used either before an analog vector decoder 4 or an analog persyl decoder 5.
  • the preferred implementation is the digital embodiment as is shown.
  • the digital implementation which will be more fully described hereinafter, should be taken only as exemplary and not limiting to overall system configuration.
  • the overall method of the digital implementation of the invention is shown more particularly in the flow chart of FIG. 2 where specific blocks illustrate the functional operations on the physical sound signal.
  • the physical sound signal is transduced into an electrical signal, which can be manipulated and processed by the system.
  • the physical signal would be from a microphone, or the like, but could be an analog recording or some other input, such as from another communication system providing a substitute signal for the transducer signal and permitting further processing.
  • the next step, in block A12 is to convert the analog electrical signal into digital samples.
  • Blocks A10 and A12 are bypassed should the signal already be in digital form from a source other than the microphone 10. Thereafter, in block A14, the electrical signal can be separated into spectral persyl components.
  • the separation of the electrical signal into persyl components can be done, of course, either in the analog or digital domain, by the implementation of an analyzer more fully described hereinafter, or by many other methods.
  • the next step in the process is to integrate the several persyl spectral components in the time domain in a manner consistent with the perception time required by the physiology of hearing as explained above.
  • This time-analysis process, accomplished by simple signal averaging or integration, produces a group of intensity components for each frequency being analyzed.
  • the totality of the frequencies analyzed in this way produces a persyl unit, which is then an ordered set of numbers or intensities representing the particular constant perception unit obtained over the perceptual sampling time.
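The analysis just described (per-frequency filtering followed by averaging over a perception-dependent window) can be sketched roughly as below; the band list, sampling rate, and the crude single-bin correlation standing in for a tuned filter are all assumptions for illustration:

```python
import math

def persyl_unit(signal, rate, bands):
    """Sketch of persyl analysis: for each (freq, cycles) band, average the
    magnitude of that band over its own perception-dependent window.
    Band centre frequencies and cycle counts are illustrative assumptions."""
    unit = []
    for freq, cycles in bands:
        window = int(rate * cycles / freq)       # samples in this band's window
        window = max(1, min(window, len(signal)))
        # Crude single-bin correlation standing in for a tuned filter.
        re = sum(signal[i] * math.cos(2 * math.pi * freq * i / rate)
                 for i in range(window))
        im = sum(signal[i] * math.sin(2 * math.pi * freq * i / rate)
                 for i in range(window))
        unit.append(math.hypot(re, im) / window)  # averaged intensity
    return unit  # ordered set of intensities: one persyl unit

# 200 Hz tone sampled at 8 kHz; bands loosely modelled on FIG. 4.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / rate) for i in range(800)]
bands = [(200, 7), (400, 8), (800, 9)]
unit = persyl_unit(tone, rate, bands)
# The 200 Hz component dominates the resulting persyl unit.
```

Note how each band is integrated over its own number of cycles, so lower frequencies use longer windows, mirroring the constant-perception region of FIG. 3.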
  • the persyl unit is then vector coded in block A18 before being stored in a memory or processed in block A20 or, alternatively, being transmitted in block A22.
  • the representation of the persyl unit can be that illustrated in FIG. 5, where each Pj has the form (p1, p2, ..., pn), and where n ranges from 3 to 10 in the example shown in FIG. 5, but is not generally restricted.
  • the code which has been transmitted or communicated, is received in block A24.
  • the received code may either be stored in, or retrieved from, a memory, and/or processed in block A26.
  • the coded samples are then decoded into persyl unit form in block A28.
  • the formation of the persyl units from the codes is just the opposite from the encoding process, except that there is a one-to-one relationship between the codes and the persyl units when decoding and no similarity search process need be undertaken.
  • the persyl units are then used to regenerate the spectral and temporal persyl components based on the decoded values in block A30.
  • the persyl unit components are then concatenated or combined in a time sequence to form sequential speech samples in block A32, where the various components are matched in amplitude so as not to introduce high-frequency noise in the reconstructed signal.
  • a digital-to-analog conversion is performed in block A34.
  • the reconstituted analog signal can be converted by a transducer into a physical signal, namely sound which can be heard.
  • the two-step process of condensation and compression produces a substantially reduced-bandwidth signal without loss of fidelity or information, and one that can be decoded with facility by the two-step expansion process.
  • the first step of condensing speech into persyl units extracts from the possible information in the digitized representation of the pressure wave only those sounds that can be humanly perceived as speech.
  • the condensation step provides a substantial reduction to the broad information content of the input signal without the loss of the desired information (speech) .
  • the second step, that of compression by vector or matrix encoding, substantially reduces the bandwidth and storage requirements of the condensed information without any loss of the relevant speech content.
  • the particular vector encoding technique uses the persyl unit and is based on the unique characteristics obtained in considering the nature of speech perception by the ear. This technique further increases the amount of compression that can be obtained.
  • the two-step process of reconstructing the persyl units from the codes and then generating speech from the decoded persyl units provides a facile method of producing only those sounds that can be heard and understood as speech, while disregarding component sounds not originally in the speech database, which is described below.
  • This process uses a subset of persyl units that describe human speech, in a particular form, to a high degree of fidelity. The generation of speech in this manner reduces the complexity and processing requirements of the equipment necessary for the decoding.
  • a persyl component recognition graph is a physiological chart for the speech-production range (approximately 100 Hz to 5000 Hz) indicating as a lower bound the number of cycles necessary to perceive a particular frequency and as an upper bound the number of cycles that create sound fusion.
  • the number of cycles necessary for tone perception varies slowly over the lower frequencies, and then increases rapidly above 4500 Hz.
  • a persyl frequency component is defined as the intensity obtained at a particular discrete or bandwidth-limited frequency in the wave-number range lying between the minimum number of cycles that is required to perceive a tone at that frequency and the maximum boundary for sound fusion.
  • These persyl components may be taken at any frequency in the bounded region and in any number, but preferably each is obtained from discrete segments having a specific width or bandpass, and at integral wave numbers.
  • an optimized curve is determined within the range.
  • the optimized curve can be determined empirically based on a number of different parameters including perceived quality of sound, actual fidelity for a given usage, etc. Alternatively, the curve can be determined analytically, based on minimization of bandwidth, compatibility with hardware characteristics, etc.
  • This specific optimized curve is to be contrasted with several typical Fourier transforms, which appear as straight lines in a frequency wave-number plane. It is noted particularly that, based on the defined constant perception range, the Fourier transform generally provides too much information for higher frequencies (above 2000 to 4000 Hz) and is thus wasteful; and too little information for frequencies lying below about 500 Hz, and is thus inadequate.
  • the entire speech-production frequency range (from about 100 Hz to greater than 5000 Hz) can be represented by a single generic persyl unit having as few as ten components. It is these particular persyl units that represent the perceivable elementary sounds of human speech and are given codes by the vector encoder 3 in FIG. 1.
  • the invention may be variable in the perception probability requirement where different perception unit subsets are stored and the system uses one subset for one particular transmission requirement and another subset for another particular transmission requirement.
  • One purpose of changing between these subsets would be for traffic-related requirements, where the minimum perception curve and the longest sampling times would be used to decrease the information rate to a minimum. This technique could be used until the encoded voice traffic was barely perceptible as speech.
  • As traffic requirements decrease, the higher-probability perception curves and shorter sampling times could be used, at the cost of a higher information rate, to increase the intelligibility and fidelity.
  • FIG. 5 is a pictorial representation, in the form of a set of histograms, of an assembled persyl unit comprising ten frequency components, each comprising from three to ten intensity entries.
  • the ten frequency components are those of the generic persyl unit chosen for illustration purposes and the numerical intensity is that which would be determined as measured by the component analysis system described in more detail hereinafter.
  • the frequency axis lies horizontally in the plane of the paper. It is labeled by a Frequency Index corresponding to the various analyzing filters discussed below.
  • Index 1 corresponds to the highest frequency, and has accordingly the highest number of entries.
  • Index 10 corresponds to the lowest frequency, and has the smallest number of entries.
  • the axis indicating depth into the page is a sample number axis and is labeled by an Entry Index, index 1 being the first entry obtained during analysis and assembling of a persyl unit, and index 10, the last.
  • the curve lying in the frequency-period plane and labeled "isochronal line" indicates the locus of those points each having a time equal to the duration time over which the persyl is analyzed, as measured from the start of the analysis process, i.e., the sample time.
  • a persyl sample is determined over different times (periods) for each frequency.
  • each component sample has the same time of measurement.
  • the persyl component labeled by frequency index 10 consists of three samples, each measured for 7 cycles at 200 Hz corresponding to the table in FIG. 4; and the component with frequency index 1 consists of 10 samples, each measured for
  • the representation shown in FIG. 5 has been normalized by modifying the measured numerical intensities of the samples of each component according to one or more normalization factors. It is believed human speech can be quantized with a much smaller persyl unit subset when normalization for certain factors has been undertaken.
  • the information extracted in the normalization process is attached to the persyl vector as an ordered set of descriptors, shown in the left quadrant of the representation in FIG. 5 along a Descriptor Index, where the descriptor number identifies the type of normalization factor and the numeric intensity defines the degree of variance for that factor. For the purposes of the invention, four descriptors will be described, which relate to intensity, pitch, phase, and duration. Each of the normalization factors, its use, and method of extraction will now be discussed.
  • An intensity normalization factor (Descriptor Index 1) is obtained by measuring the energy content of each persyl unit by summing the squares of all samples in the persyl unit. To normalize, each persyl sample is divided by the square root of the number so calculated and multiplied by 2^b, where b is the number of bits of resolution desired in the normalized components.
  • the intensity descriptor is the divisor used above, normalized to the same number of bits by dividing by the maximum intensity obtainable (as computed from the number of bits used to digitize the analog signal) and likewise multiplying by 2^b.
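The normalization and descriptor computation described in the two points above can be sketched directly; the parameter names and the absence of rounding are assumptions:

```python
def normalize_intensity(persyl, b=8, digitizer_bits=8):
    """Intensity normalization as described above: divide each sample by the
    square root of the summed squares, scale to b bits, and keep the divisor
    (itself scaled to b bits) as the intensity descriptor."""
    energy = sum(s * s for s in persyl)   # sum of squares of all samples
    divisor = energy ** 0.5
    normalized = [s / divisor * (2 ** b) for s in persyl]
    max_intensity = 2 ** digitizer_bits - 1   # from the A/D resolution
    descriptor = divisor / max_intensity * (2 ** b)
    return normalized, descriptor

unit = [3.0, 4.0]                       # toy two-component persyl
norm, desc = normalize_intensity(unit)  # divisor = sqrt(9 + 16) = 5.0
# norm == [153.6, 204.8]; desc == 5 / 255 * 256
```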
  • a pitch normalization factor (Descriptor Index 2) is employed to shift the entire persyl unit down (or up) in frequency to the point where the lowest (or highest) component or frequency list contains samples of significant measure. The amount, as measured by the number of steps in frequency, is the pitch descriptor.
  • a shift step is determined by the frequency difference from one filter to an adjacent filter, and the shift is accomplished based on the actual response function of the filters, taking into consideration the degree of overlap between two adjacent filters. Pitch shifting can also take place in small steps by interpolating between filters.
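Whole-step pitch shifting of a persyl's component list, as described above, can be sketched as a simple slide along the frequency index; interpolation between filters is omitted, and the component values are made up:

```python
def pitch_shift(components, steps):
    """Pitch normalization sketch: slide the component list by a whole
    number of filter steps, padding vacated slots with zeros.  The number
    of steps used becomes the pitch descriptor."""
    if steps >= 0:
        return components[steps:] + [0.0] * steps      # shift toward index 0
    return [0.0] * -steps + components[:steps]         # shift away from index 0

unit = [0.0, 0.0, 5.0, 7.0, 2.0]   # energy sits two filter steps up
shifted = pitch_shift(unit, 2)     # -> [5.0, 7.0, 2.0, 0.0, 0.0]; descriptor = 2
```

At reconstruction time the stored descriptor would drive the opposite shift, restoring the original placement.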
  • a phase descriptor (Descriptor Index 3) is simply a delay, measured in digital samples or other suitable units, that a particular persyl is to be delayed before reconstruction is started. The phase descriptor is obtained by measuring the difference between the last sample incorporated into the previous persyl unit and the first sample to be incorporated into the next persyl unit during the analysis stage. The phase descriptor can be used to represent long periods of silence.
  • voiced vowels have a duration between 50 and 400 milliseconds in normal speech.
  • a vowel can consist of one or more nearly identical persyls, so a savings in bits transmitted may be made by noting the time period over which a persyl remains roughly constant, especially within a vowel. This additional processing takes place in the integrators discussed below, and represents a time-correlation across the entire set of integrators. When this correlation disappears, the current persyl has ended, and a new one must be started.
  • a duration descriptor (Descriptor Index 4) is obtained similarly to the phase descriptor; however, it is a measure of the number of samples, or the duration time, between the first sample incorporated into a persyl and the last sample, when that period is variable. Any given descriptor need not be present; its presence or absence may be indicated by the presence or absence of the corresponding index in the descriptor map.
  • Reconstruction of the persyl unit from its normalized components and the set of descriptors comprises a set of processes inverse to the ones described above. Thus, during reconstruction, a normalized persyl unit is first shifted by the pitch descriptor, if any, and then amplitude adjusted by the intensity descriptor, if any.
  • the output is then delayed the number of sample intervals described by the phase descriptor, if any, and the output continues for the number of sample intervals described by the duration descriptor, if any.
  • the vector coding of the persyl units will now be described with reference to FIGS. 6, 7, and 8. Note that if the digitizer shown in FIG. 1 provides a resolution of B_d bits, then the number of bits encoding a sample of the speech can be no more than B_d bits each. This establishes a minimum granularity of the discrete space that contains all possible persyl units having 68 components, e.g., the number of histogram bars in the period-frequency plane shown in FIG. 5 and used herein as a particular example for purposes of explanation only.
  • any possible persyl unit can be represented in a persyl space having 68 dimensions and having a granularity of 1 part in 256 along each of 68 axes, since only positive values and zero are considered as valid for the intensity components of the persyl unit.
  • the procedure for obtaining a suitable vector code is to partition this space (precisely, the first octant thereof) into an optimal number and distribution of partitions. This partitioning scheme is known as a "covering". Suppose that there are 1024 such partitions that adequately and optimally cover the space.
  • one essential step of obtaining very high compression ratios while maintaining high speech quality lies in the choice of physiologically plausible persyl units on which to base the decomposition, analysis, and reconstruction synthesis.
  • the structure of a persyl unit is designed specifically to optimize the encoding of speech information present in the human voice. This fact is used to obtain the necessary data for construction of codes giving high compression ratios while maintaining high speech quality.
  • persyl sample units must first be derived.
  • the persyl encoder 2 in FIG. 1 can be employed as the front end to a device that produces and records a range of persyl sample units that span the range of the human speech under consideration.
  • a partition of the data (thought of as a large cloud of points in the persyl space in a geometrical visualization in FIG. 6) is made.
  • This partition is obtained adaptively from actual human speech data represented as perception units, not from an arbitrary selection of possible sounds. It is this severe restriction of the number and position of the possible sounds in the 68-dimensional binary vector space to those represented by the actual persyl units that provides a "preselection" of the region of interest. This preselection limits both the range of interest and the type of partition that is required to cover the persyl "cloud" to the degree required for a given fidelity of speech reconstruction.
  • the speech is reconstructed from a sequence of codes in the center of the partitions and not directly from the actual measured persyl units lying within those partitions.
  • compression ratios of the order of 160 or 320 to 1 for telephone quality speech and even higher rates of the order of 500:1 or 600:1 for lower quality speech can be obtained.
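The quoted ratios can be sanity-checked with back-of-envelope arithmetic; every concrete rate below is an assumption for illustration, not a figure from the specification:

```python
# Back-of-envelope check of the quoted compression ratios.
raw_bps = 8000 * 8             # telephone-band speech: 8 kHz, 8-bit samples
code_bits = 10                 # 1024 partitions -> 10 bits per persyl code
descriptor_bits = 4 * 8        # four 8-bit descriptors per persyl (assumed)
persyls_per_second = 10        # say, ~100 ms per persyl on average (assumed)
coded_bps = persyls_per_second * (code_bits + descriptor_bits)
ratio = raw_bps / coded_bps    # ~152:1, in the 160:1 neighbourhood
```

Longer effective persyl durations (via the duration descriptor) or fewer partitions push the ratio toward the higher quoted figures.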
  • a much smaller set of partitions is needed to cover this low-fidelity space than is needed for a telephone-quality space, indicating that the compression ratio obtained is much higher.
  • the preferred embodiment is based upon a statistical n-tuple pattern matching method wherein a pattern of binary bits is placed into a fixed-form buffer, analogous to placing a pattern on a retina. See W.W. Bledsoe and I. Browning, "Pattern Recognition and Reading by Machine", 1959 Proc. Eastern Joint Computer Conf., pp. 225-32, December 1959.
  • a retina buffer consists of 68 columns, each of 8 bits, as shown in FIG. 7.
  • the persyl components are placed into this buffer, as if for storage, represented by the dots.
  • the dot pattern shown is one of several possible ways to represent the persyl components of the histogram having index 1 shown in FIG. 5.
  • a fixed, randomly chosen partition of the retina buffer's 544 binary cells into 544/n discrete n-tuples, each consisting of n of the retina-buffer cells, is used to map the effective 544-bit binary number, represented by the dots on the retina, into a much larger binary space consisting of 2^n × (544/n) dimensions, where n is the tuple size chosen for the retina buffer (typically, but not restricted to, 2 or 4).
  • Information is stored as bits set in as many of the vectors in the larger space as there are partitions to be obtained in the covering process.
  • a pattern on the retina buffer matches a memory vector of that or similar patterns exactly or partially depending on how many of the 272 partitions in the pattern vector have a matching bit in the corresponding memory vector.
  • memory vector 300 matches the pattern vector 302 in 3 places shown, whereas memory vector 304 matches only in 2 places.
  • memory vector 300 has a more perfect match, while memory vector 304 is merely close.
  • the pattern vector 302 is more similar to the group of patterns stored in memory vector 300 than it is to those stored in memory vector 304.
  • a pattern vector has exactly one bit set in each of its 272 partitions, while a memory vector has at least one bit set in each of its partitions.
  • An automatic process used in the preferred embodiment is to partition the persyl space with the aid of the n-tuple pattern-matching method.
  • the method comprises selecting a threshold number, typically 80% of the number of partitions in the binary n-tuple vector (e.g., 218 in the example), and demanding that any persyl pattern having a score of 218 or higher be incorporated into the memory vector producing that score. Should the score be less than 218, a new memory vector is automatically generated having as its first memory bit pattern exactly those bits set in the pattern to be assigned a partition in the persyl space. At the end of the process, after each of the persyl units in the human speech database has been treated, there will be a certain number of memory vectors. These memory vectors are the binary representations of the partitions of persyl space required by the persyl coding step.
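A minimal sketch of this adaptive partitioning, assuming a 544-cell retina split into 272 2-tuples and the 218-score threshold quoted above. The random cell-to-tuple assignment, the set-based memory-vector representation, and all function names are illustrative assumptions, not the patent's actual implementation:

```python
import random

N = 2                       # tuple size n
CELLS = 544                 # 68 columns x 8 bits in the retina buffer
NUM_TUPLES = CELLS // N     # 272 partitions of the retina for n = 2
THRESHOLD = 218             # roughly 80% of 272, as stated in the text

# Hypothetical fixed, randomly chosen assignment of retina cells to n-tuples.
random.seed(0)
_order = random.sample(range(CELLS), CELLS)
TUPLES = [_order[i * N:(i + 1) * N] for i in range(NUM_TUPLES)]

def encode(retina):
    """Map 544 retina bits to one state (0 .. 2**N - 1) per n-tuple; this
    is the single set bit per partition that a pattern vector carries."""
    states = []
    for cells in TUPLES:
        s = 0
        for c in cells:
            s = (s << 1) | retina[c]
        states.append(s)
    return states

def score(states, memory):
    """Count partitions whose pattern bit is also set in the memory vector."""
    return sum(1 for i, s in enumerate(states) if s in memory[i])

def partition_persyl_space(retinas):
    """Each persyl pattern merges into the best-matching memory vector if
    its score reaches the threshold; otherwise it seeds a new one."""
    memories = []   # one memory vector = list of per-partition state sets
    for retina in retinas:
        states = encode(retina)
        best = max(memories, key=lambda m: score(states, m), default=None)
        if best is not None and score(states, best) >= THRESHOLD:
            for i, s in enumerate(states):
                best[i].add(s)                      # incorporate the pattern
        else:
            memories.append([{s} for s in states])  # new partition prototype
    return memories
```

Two identical patterns fold into one memory vector, while a pattern differing in more than about 20% of its tuples starts a new one; the returned memory vectors play the role of the partition prototypes used in the persyl coding step.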
  • FIG. 9 illustrates a more detailed functional diagram of a preferred implementation of the general digital condensation and compression system illustrated in FIG. 1.
  • the system includes the encoder station 8 and the decoder station 9.
  • the encoder station 8 comprises the microphone 10 and a linear amplifier 11 which converts real-time speech from a person or other source into an analog electronic signal. This electronic signal is then converted to a time sequence of digital samples at a periodic sampling rate by A/D converter 12. The sampling rate may vary depending upon the application and the frequencies in the original signals which are to be saved.
  • the Nyquist criterion requires that the speech be sampled at twice the highest frequency, i.e., at 14 KHz.
  • the sampling rate is chosen to be 8 KHz to cover the conventional bandwidth as explained previously.
  • frequencies above 1 KHz may be ignored, and a digitizing rate of 2000 Hz or less may be used.
  • the digital speech signal is applied to an analyzer 13 for the start of the condensation process.
  • the analyzer 13 takes the digital samples of the speech signal and filters from them a plurality of component perception frequencies, which can be heard, and measures the intensity of each.
  • the perception frequencies which are found by the analyzer 13 are integrated for the appropriate time periods as explained previously, and are then assembled into persyl units in an assembler 14 and applied to a vector encoder means 16 which identifies the pattern of persyl components as shown in FIG. 5, and assigns a persyl identifier code to them.
  • a descriptor (e.g., pitch, phase, intensity, duration) extractor means 15 also receives the persyl components and provides any needed descriptor to the persyl units and adjusts their form or intensity in assembler 14 before encoding and transmission by the transmitter 17.
  • the output of the transmitter 17 is a time sequence of digital numbers representing the descriptors and the codes.
  • the coded voice messages are transmitted and received with conventional digital transmitting and receiving equipment 17 and 20, respectively, or can be stored or applied directly to the receiver 20.
  • a separator 21 separates the coded message into the persyl codes and the descriptors.
  • the descriptors are passed to an instruction generator 22 and a constructor means 23, while the persyl codes are applied to a decoder means 24 which generates normalized persyl units from the codes.
  • the persyl units are then applied to the constructor 23 which has as another input the output from the instruction generator 22.
  • Timing signals signifying the arrival of the persyl codes are extracted by the separator 21 and applied to the instruction generator 22. From these timing signals the instruction generator produces synchronization signals for recombining the descriptors and persyl units into a unified digital speech signal with the intensity, pitch, phase, and duration of the persyls extracted from the original speech signal.
  • the persyl units in each perception slice are then added to signals of the same frequencies in adjacent perception slices in a manner such that there are no discontinuities.
  • the smoothing effect takes place in constructor 23 which has the necessary memory allocation for storing the persyl components from the immediate past.
  • the different persyl units are then concatenated to produce the final reconstructed sound signal.
  • a final volume adjustment by the descriptors for total volume can be made by instructions from the instruction generator 22 controlling the output of a linear audio amplifier 26.
  • a transducer comprising the speaker 27, earphone, or the like, is used to convert the assembled electronic sound signal back into sound energy which can be heard.
  • Referring to FIG. 10, there is shown a detailed circuit implementation of the encoder station 8.
  • the analog signal from the microphone 10 and linear amplifier 11 is followed by a digitizer-filter means 144 (including an optional anti-aliasing filter before digitization and a lowpass filter following digitization) to convert the analog signal to a stream of digitized numbers occurring at the sampling rate of the digitizer.
  • These separate channels communicate with the assembler 143 which takes their output and assembles the amplitudes of the plurality of outputs into perception units.
  • persyl units are then encoded by the encoder 146 into code numbers and subsequently sent to the transmitter 147.
  • the extractor 148 forms the descriptors from the components of the assembler 143 and controls the assembler 143 to output normalized persyls based on the extracted descriptors.
  • the extractor 148 also provides the descriptors to the transmitter 147.
  • the code numbers and descriptor numbers are combined and then transmitted sequentially by a transmitter 147.
  • the encoded numbers can be stored in a message memory 145 prior to their transmission.
  • a microprocessor based control and timing circuit 142 provides timing and control signals to the analyzer 149, the encoder 146, the transmitter 147, the digitizer 144, and the message memory 145 to regulate the flow of data and the timing of the particular steps in the process.
  • each filter and integrator pair, for example filter 100 and integrator 120, is used to detect a particular set of persyl components belonging to the frequency of the filter.
  • the filter is frequency sensitive to a particular bandpass frequency and the integrator 120 to a set duration of cycles such that a unique separation of a persyl component from the incoming analog signal occurs within the bandpass of said filter.
  • the output of the filter 100 is a representation of the detection of a particular frequency and intensity, and the integrator 120 then integrates that output over the perception duration (as obtained from FIG. 4) into an indication of the number of quantizing units of the persyl components corresponding to that filter and frequency band.
  • the result is a number of digital samples for each component frequency, each representing the intensity for perception time and duration.
  • FIG. 5 is a pictorial representation of the output of the plurality of integrators 120-138 as it is assembled by the assembler 143. Each output is uniquely associated with a persyl component, represented as the histogram bars at a particular frequency number.
  • When taken together, the components form a perception unit or persyl unit having the same number of frequencies as the filter units of the analyzer 149.
  • the frequency information is stored as positional information accessed by the Frequency Index; and the period information within a frequency list is likewise stored as positional information and accessed by the Component Index.
  • the assembler 143 contains memory to store the persyl unit as it is being developed by the analyzer 149.
  • once the list has been fully developed, it is shifted to a buffer storage in the assembler 143 where it can be processed by the extractor 148 to determine the descriptors and to reload the buffer list as a normalized persyl unit. While the buffer list is being normalized by the extractor 148, the current memory list is being updated with new persyl unit samples from the analyzer 149.
  • the filters 100-118 are preferably recursive in nature and can comprise either analog or digital infinite impulse response type filters.
  • the control and timing circuit 142 may vary the frequency and time-duration sensitivity of the filters via the control lines 140. This permits an adaptive analysis which can change as the input changes. For example, the analyzer 149 may be made less sensitive to low-frequency persyl components as the overall amplitude of the signal rises, for noise-suppression purposes.
  • the control and timing circuit 142 may control the integration constants of the integrators 120-138 over control lines 141 to provide a more exacting detection of a particular perception component, equivalent to altering the persyl curve of FIG. 3.
  • FIG. 12 illustrates a detailed functional diagram of one of the filter and integrator combinations. Even though the figure will be described from a digital processing standpoint, it will be evident to one skilled in the art of signal processing that equivalent analog circuits are available. Both a generalized filter function and a particular integration method are shown in the figure.
  • the filter function 300 receives the stream of input samples, x, from the digitizer means 144 and produces an output stream of filtered data, y.
  • the filter function 300 is shown as recursive where the new input samples x are summed with weighting coefficients A and added to old output samples y summed with weighting coefficients B to produce new output samples y'.
  • the frequency and bandwidth, along with the other factors of the filter, such as quality factor, are determined by selection of the coefficients A and B, as is conventional.
  • These weighting coefficients are selectable by the control and timing circuit 142 via line 301.
  • the output stream, y, is multiplied via multiplier means 310 by a sensitivity factor, α, on a line 303 as selected by the control and timing circuit 142.
  • the result, a persyl component frequency stream, is applied to an integrator function 320 whose output is z.
  • the integrator function 320 includes a readout and reset line 322 provided from the control and timing circuit 142. Periodically the readout line 322 is strobed at the perception rate of that frequency to output the digital samples of the persyl component. The strobe also provides a reset signal to the integrator function 320 for the next integration interval. The time constant, τ, of the integration function 320 is adjusted as required by the control and timing circuit 142.
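A minimal digital sketch of one such filter-integrator channel. The pole radius, gain normalization, and class name are illustrative assumptions (the patent does not give coefficient values); α stands for the sensitivity factor and τ for the integrator time constant:

```python
import math

class PersylChannel:
    """Sketch of one filter-integrator pair (FIG. 12): a two-pole recursive
    bandpass followed by a resettable integrator strobed at the perception
    rate of this frequency. Coefficients here are illustrative only."""

    def __init__(self, freq_hz, sample_rate, alpha=1.0, tau=1.0):
        r = 0.98                                   # pole radius (sets bandwidth/Q)
        w = 2 * math.pi * freq_hz / sample_rate
        self.b1, self.b2 = 2 * r * math.cos(w), -r * r   # feedback (B) terms
        self.a0 = 1 - r                            # feedforward (A) gain
        self.alpha, self.tau = alpha, tau          # sensitivity and time constant
        self.y1 = self.y2 = 0.0                    # old output samples y
        self.z = 0.0                               # integrator state

    def step(self, x):
        """y' = A*x + B*y recursion, then accumulate component energy."""
        y = self.a0 * x + self.b1 * self.y1 + self.b2 * self.y2
        self.y2, self.y1 = self.y1, y
        self.z += self.alpha * abs(y) / self.tau
        return y

    def strobe(self):
        """Read out the persyl component and reset for the next interval."""
        z, self.z = self.z, 0.0
        return z
```

Driving a channel tuned to 500 Hz with a 500 Hz tone yields a far larger strobed value than driving it with a 3000 Hz tone, which is the separation property the analyzer relies on.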
  • the assembler 143 takes the perception components at the perception sample rate and subjects the persyl unit to a normalization process.
  • the normalization process can be used for a number of different reasons, the main purpose being to determine overall speech amplitude and possibly perception slice subset selection.
  • the descriptor numbers for each persyl unit are attached based on this normalization process. Once the persyl has been normalized, it is used to determine which of the prototypical partition persyls it is most like by a similarity-matching process in the encoder 146.
  • Appropriate means of similarity matching include a least-mean-squares algorithm operating on the individual components, a geometrical method that chooses the partition vector making the smallest angle with the persyl under investigation, or the statistical n-tuple pattern-matching described herein.
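The first two alternatives named above can be sketched as follows; the function names and the treatment of persyls as plain component lists are illustrative assumptions:

```python
import math

def lms_distance(persyl, prototype):
    """Least-mean-squares over individual components (smaller = more similar)."""
    return sum((a - b) ** 2 for a, b in zip(persyl, prototype)) / len(persyl)

def angle(persyl, prototype):
    """Angle between the persyl and a partition vector (smaller = more similar)."""
    dot = sum(a * b for a, b in zip(persyl, prototype))
    na = math.sqrt(sum(a * a for a in persyl))
    nb = math.sqrt(sum(b * b for b in prototype))
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

def nearest(persyl, prototypes, metric):
    """Index of the partition prototype most similar under the given metric."""
    return min(range(len(prototypes)), key=lambda i: metric(persyl, prototypes[i]))
```

Note the two metrics can disagree: the angle method ignores overall intensity (a scaled persyl maps to the same prototype), while least-mean-squares does not, which is one reason the normalization step precedes matching.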
  • the preferred embodiment of similarity matching is a particular version of the n-tuple pattern matching algorithm.
  • the binary memory vectors corresponding to each partition obtained in the partitioning of the persyl speech databases are used in the method.
  • the persyl unit to be encoded is placed in the retina buffer as shown in FIG. 7 and a similarity score is obtained for each of the memory vectors as shown in FIG. 8.
  • the memory vector having the highest score was necessarily constructed from actual persyl units most similar to the one under investigation.
  • the code for the most similar partition prototype (n-tuple binary memory vector) is then obtained from a simple look-up table.
  • once the encoder 146 determines the persyl code, it is assembled in sequential order according to the desired transmission rate and output to the transmitter 147 or stored in the message memory 145.
  • the transmission scheme is to provide sequential words of either a set number or specifiable number of bits having two fields which contain the descriptor bits and the persyl code bits, respectively.
  • a detailed circuit implementation of the preferred decoder station 9 is shown to advantage in FIG. 11.
  • the receiver 228, which can include demodulation or other processing techniques, provides at its output a digital sample stream that is identical to the transmitted coded stream.
  • This stream is processed in separator 225 to separate the descriptor codes from the persyl unit codes.
  • the persyl codes are decoded in decoder 226 to generate persyl units of the type encoded at the encoder 16.
  • the decoder 226 separates the frequency components of the persyl units and applies each, after modification by a modifier 221, to an associated resonator circuit 200-218 in the constructor 225.
  • the descriptors from separator 225 are applied to the instruction generator 224 and to the modifier 221 of the constructor 225.
  • Each resonator circuit 200-218, according to the value of the persyl component applied, will generate a digital representation of a persyl component, i.e., digital samples at a particular frequency and intensity for a predetermined number of cycles.
  • the output of the decoder 226 is a list of samples arranged in component frequencies describing a normalized persyl unit as shown in FIG. 5. This list is stored in the modifier 221 in order to be modified according to the descriptors attached to the code word used to decode it.
  • the modifier 221 provides processing to vary the sample lists according to the pitch descriptor and to apply the component samples to the resonators according to the phase and duration descriptors.
  • the intensity descriptor is processed in the modifier 221 and used to generate an intensity factor β that is used to control the amplitude of the resonators 200-218.
  • the outputs of the resonators 200-218 are digital perception streams, one for each component, which represent the persyl components being applied to the modifier 221 from the decoder 226.
  • the perception streams from all of the resonator circuits 200-218 are mixed in an adder 223 which is under control of the instruction generator 224.
  • the digital speech stream exiting the adder 223 can be directly applied to the digital-to-analog converter 229 to provide a reconstructed analog speech signal.
  • the analog speech signal is amplified by amplifier 230 according to a time-varying input from the instruction generator 224 to provide an overall amplitude correction to the speech signal, such as for an operator selectable volume control.
  • the speech signal thereafter is converted back to sound by the speaker 231.
  • A detailed functional diagram of an implementation of one of the resonator-filter pairs of the constructor 225 is illustrated in FIG. 13.
  • the reconstructed samples of a persyl component, y₀, are applied to a resonator function 350 as the initial value of the recursive filter function. Without additional inputs, such as x, the output y would be a pure sine wave with intensity corresponding to the persyl component, y₀.
  • Input samples, x, from the harmonic generator 220 and/or the random noise generator 222 may be applied as required by the reconstruction process (and for other additional reasons such as creating special effects) to the resonator function 350 thus modifying the pure sine wave.
  • the resultant output, y is one of the component frequencies of the generic persyl unit.
  • the output y may be further modified by multiplication with an intensity factor β in a multiplier 360 via line 361 from modifier 221.
  • the output of the resonator, βy, on line 363 is a sequence of digital samples representing that specific component frequency for the duration of the period of the component in the particular frequency sample list for the persyl unit undergoing synthesis to a corresponding sound-perception fragment.
  • the resonator acts both as a pure sine-wave generator with programmable amplitude, phase and duration, and as a filter that adjusts its output according to the spectral content of the input waveform, x.
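A minimal sketch of such a resonator. The two-tap recurrence and its seeding are one conventional way to realize a recursive sine generator with initial value y₀; the patent gives no coefficient values, and β here stands in for the intensity factor applied by the output multiplier:

```python
import math

def resonate(y0, freq_hz, sample_rate, n_samples, beta=1.0, x=None):
    """Sketch of resonator function 350: a two-tap recurrence that, with no
    input x, produces a pure sine of amplitude y0; optional input samples
    (harmonics, noise) perturb the pure tone. beta is the intensity factor."""
    w = 2 * math.pi * freq_hz / sample_rate
    k = 2 * math.cos(w)                        # recursion coefficient
    y_prev2 = 0.0
    y_prev1 = y0 * math.sin(w)                 # seed so peak amplitude is y0
    out = [beta * y_prev1]
    for n in range(1, n_samples):
        drive = x[n] if x is not None else 0.0
        y = k * y_prev1 - y_prev2 + drive      # y[n] = 2cos(w)*y[n-1] - y[n-2]
        y_prev2, y_prev1 = y_prev1, y
        out.append(beta * y)
    return out
```

With x omitted the recurrence yields exactly y0·sin((n+1)w), i.e., the programmable-amplitude sine-wave role; feeding harmonic-generator or noise samples through x exercises its filter role.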
  • the persyl unit decoded and applied to constructor 225 may have more or fewer bits than the one which was encoded. Preferably, they have the same number of bits so that the same code book of persyl memory vectors can be used for the encoding and decoding processes, but this is not necessary to the operation of the implementation.

Abstract

A speech compression system including a persyl encoder (2) and vector encoder (3) at a transmission station where speech is analyzed based on the perceptual information it contains. A vector-coded digital representation of the persyl content (3) is then transmitted (6) to a receiver station. At the receiver station, a vector decoder (4) and a persyl decoder (5) transform the coded information into a speech sequence and then convert it into intelligible sound (25, 27).

Description

Speech Compression System
Field of the Invention
The invention pertains generally to speech compression, processing and transmission systems and is more particularly directed to those speech compression, processing and transmission systems that can be used to transmit compressed speech in units that, when expanded, can be perceived by the human ear as intelligible speech.
Background of the Invention The increasingly sophisticated methods of communication developed in the past have fueled a tremendous growth in the amount of voice and information traffic that is carried each year on the world's communication systems. However, despite their sophistication, these methods have not taken into consideration the way a human ear and brain perceives sound. No matter how sophisticated the communication process is, if the human ear and brain cannot efficiently perceive the final output as speech, then the technology has been wasted and a more efficient method of information transfer is still possible.
The basic approach to the concept of using electronic means to transmit human speech has to be reexamined. In the past, the approach has been to sample an analog signal at discrete intervals into digital intensity codes and then transmit the digital numbers, one intensity code for each time slice that the analog signal is sampled. This pulse code modulation (PCM) allows for effective storage, transmission, error correction, possible encryption, and exact and precise decoding of voice messages; but the method requires excessive bandwidth and, depending upon the degree of fidelity necessary in the reproduced voice message, is relatively memory and computation intensive. As an example, in ordinary digital telephone transmission having a bandwidth of approximately 200 Hz to 3200 Hz, the analog signal is sampled 8000 times per second to extract 8-bit intensity values. This results in 8 bits × 8000 samples per second, or 64,000 bits per second, being transmitted for the communication of a nominal 3000 Hz bandwidth speech signal. Higher fidelity speech necessarily increases the bit rate by requiring both more samples per second and more bits per sample. Another approach has been to model the human vocal tract mathematically by a series of filters, e.g., linear predictive coding (LPC). This is a quite complicated procedure and does not adequately provide for pauses, prosody, and other natural expressions in speech, although it does improve considerably on the bit rate required to transmit a telephone-quality speech signal. Acceptable bit-rate compression obtained by this method is of the order of 13 times, yielding 4800 bits per second. However, this method has a considerable computational cost.
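The bit-rate figures above follow from simple arithmetic; this sketch merely reproduces the numbers quoted in the text:

```python
# PCM baseline and LPC compression figures from the paragraph above,
# reproduced as plain arithmetic.
bits_per_sample = 8
samples_per_second = 8_000
pcm_bps = bits_per_sample * samples_per_second   # 64,000 bit/s

lpc_bps = 4_800
lpc_ratio = pcm_bps / lpc_bps                    # on the order of 13x
print(pcm_bps, round(lpc_ratio, 1))
```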
These prior methods have viewed speech encoding as requiring a physics and mathematics solution implemented by manipulating an electronic signal or its mathematical representation rather than by deciphering only that information contained in actual sound (longitudinal pressure waves) which has meaning to the human ear and brain. The correct approach is that the problem should be viewed in the domain of human physiology, not simply physics and mathematics.
It is known that, while the human hearing system can perceive a vast and complex range of sounds - from the whisper of the wind in nature, to a 120-piece symphony orchestra, to the overpowering noise of a nearby freight train - only a small number of these possible sounds are relevant to intelligible human speech. Thus, there exists the opportunity to condense speech by determining which portion of the overall information present in a sound-pressure wave can be perceived so as to determine which portion is necessary for speech understanding.
However, while the human auditory system is sensitive to sound frequencies from about 20 Hz to 18,000 Hz, there is not a uniform relationship to the perception of all frequencies, either as to intensity sensitivity or time sensitivity (the minimum number of cycles that it takes to perceive a particular frequency). Thus, a spectrally pure signal may sound like a tone when presented to the ear for a certain time longer than a frequency-dependent minimum, but like a click if presented for a shorter time. This is thought to be due to a physical hysteresis associated with the tissues of the hearing system and to an averaging property of sensory nerve cells. Also, if two or more tones are presented for a time longer than the perceptual minimum but within an upper time bound for discrimination of differences, sound fusion of the tones can occur, resulting in a singly perceived, but more complex, sound. This composite complex sound is generally shorter in time than the usual phone unit (the sound stream produced in articulating a phoneme). Intelligibility of these phone subunits is a matter of amplitude-frequency discrimination by the auricular system. It is believed that, when the human auricular and perception system is actually processing spoken words, it is decoding these time sequences of amplitude-frequency information in order to extract phoneme, word, phrase, and sentence meanings. Other information in the physical sound signal is either redundant or nonessential to this deciphering process.
Just as there is a lower time bound for perception of a pure tone (which can be expressed by the number of waves, or cycles, that the tone is presented to the ear), there is an upper bound in the perception process, as mentioned above.
One way to determine this upper bound is to note that a tone is only perceived to become increasingly louder as its time of presentation to the ear is increased up to this upper bound; but then the tone is perceived to be both louder and last longer as the presentation time is increased beyond this upper bound. The presence of this upper time bound may also be verified by noting that two tones of the same frequency but having intensities differing by, say, √2 are perceived to be the same, provided that the less intense tone lasts for twice as long as the more intense one, and that both tones last longer than the lower time bound and less than the upper time bound. This upper time bound may be termed the "sound fusion" time, where the ear integrates specific sounds into an inseparable whole. This is believed to be a property of the response-time physiology of the nerve cells in the auditory system. These two bounds define a frequency-dependent region over which the perception of a pure tone remains effectively constant while the parameters of duration and intensity are varied inversely to each other.
This realm of constant tone perception can be used to define a perception unit having components that are selected from frequency bands in an analyzed sound wave. This perception unit will be termed a "persyl" to indicate both its perceptual nature and an analogy to taking a "slice" of sound, albeit not in constant time, but in constant perception. A time sequence of persyls would advantageously encode the information in human speech that is relevant to the perception of phonemes, words, and phrases.
Vector encoding is a conventional method of speech compression where a code book of values representing speech is stored. Speech patterns are then formed by analysis of actual speech, for example, by linear predictive coding, and matched against the stored code book. The closest code from the book is used to represent a particular speech pattern. Compression takes place when the number of code values is less than the possible actual patterns analyzed. See Robert M. Gray, "Vector Quantization", IEEE ASSP Magazine, pp. 4-29, April 1984. Because of the already compacting nature of the persyl unit, it would be highly advantageous to increase this factor by vector encoding persyl units. This technique would be even more beneficial if the unique properties of the persyl unit could be used in forming the code book.
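A toy sketch of the vector-quantization idea just described, using a hypothetical two-component pattern and a four-entry code book (the actual code books in this system hold persyl patterns with many more components):

```python
def quantize(pattern, codebook):
    """Vector quantization: transmit the index of the closest code word
    rather than the pattern itself. Squared distance is one common choice."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(pattern, codebook[i]))

# Illustrative code book: 4 codes, so each pattern compresses to 2 bits.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx = quantize([0.9, 0.2], codebook)
```

Compression arises because only the index is stored or transmitted; fidelity then depends on how well the code book covers the patterns that actually occur, which is exactly what the persyl-based partitioning is designed to exploit.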
Summary of the Invention
The invention provides a unique speech condensation and compression system that significantly reduces the amount of information that must be transmitted to describe and faithfully reproduce segments of speech. The system is based on the concept of a persyl unit, which is the unit of constant perception that a human hearing system uses in deciphering actual speech from sound pressure waves.
In its general form the system comprises an analyzer at an encoder station where the speech is analyzed and encoded based on the persyl unit stream it contains. A vector coded representation of the persyl unit content of the speech stream is then transferred to a decoder station as a time sequence of codes. Between the time of coding and transferral, the persyl units can be processed, stored, directly transferred or transmitted. At the decoder station, a vector decoder transforms the coded information back into persyl units, which are thereafter assembled into a digital bit stream that can be converted back into intelligible sound.
In one preferred embodiment, the encoder station includes an analyzer having a plurality of persyl component detectors, each having a tunable filter and an associated integrator. The filters are tunable based on both the frequency and the bandwidth necessary to detect adequately a particular component of a persyl unit. The integrator integrates the amplitude of the detected component over a sampling time consistent with the region of constant perception to yield an average amplitude for that component. Preferably, the persyl components are chosen to have time durations in excess of the minimum perception time for the associated frequency and less than sound fusion time for that frequency. The persyl components are assembled in an assembler for a particular sampling time to become persyl units. According to another feature of the invention, the raw persyl units can be normalized for sounds which vary because of a particular scaling factor, such as phase, amplitude, pitch, duration, etc. An extractor receives an output of the assembler and forms descriptors that indicate the scaling factors. The descriptors are used to modify the persyl units so that the assembler can pass along normalized persyl units. This completes the process of condensation of the speech signal. Compression takes place when the normalized persyl units are then vector encoded by a persyl unit encoder. The result of the persyl unit encoding is a sequential stream of fixed or variable bit length codes, each representing one of a multiplicity of code words which describe the set of persyl units found in human speech. The descriptor for each persyl code is then attached to form a digital representation of a speech sequence.
Once the persyl codes have been formed, the speech has been condensed and compressed to a very low bit total. The speech is also in a very convenient form, a digital code, which can be used in a variety of ways. The codes can be advantageously stored because they will consume much less memory than normally digitized speech. Alternatively, the codes can be advantageously transmitted because in their compressed form they consume much less bandwidth and have a lower bit rate than normally digitized speech, thus requiring both less time and channel resources. Further, the codes can be advantageously processed because in their compressed form they require fewer processor cycles for the same procedure than for normally digitized speech. Moreover, any of these processes can be done in any order, or in any combination.
The decoder station, in a preferred implementation, receives the digital codes and separates the descriptors from the persyl unit codes. The persyl unit codes are then applied to a decoder which transforms the codes to normalized persyl units. The persyl units are then applied to a constructor which, using the descriptors and the normalized persyl units, creates digital frequency components of a predetermined or descriptor-specified time duration. The digital frequency components are then converted into an analog speech signal which is used to power a transducer which converts the analog signal into sounds that can be perceived.
These and other objects, features, and aspects of the invention will be more fully understood and better described when a reading of the detailed description is undertaken in conjunction with the appended drawings wherein: Brief Description of the Drawings
FIG. 1 is a system block diagram of a speech compression system constructed in accordance with the invention;
FIG. 2 is a detailed system flow chart of the speech compression process for the system illustrated in FIG. 1;
FIG. 3 is a graphical representation of the constant perception region for the human ear in cycle number (period) as a function of frequency over the speech range;
FIG. 4 is a tabular representation of a subset of the preferred frequency and cycle number assignments for the perception components of persyl units used in the speech compression system illustrated in FIG. 1;
FIG. 5 is a histogram representation of a persyl unit illustrating the various components as a function of frequency and illustrating a set of associated descriptors;
FIG. 6 is a schematic representation of a persyl space showing a preferred partitioning and the assignment of binary codes to several particular partitions;
FIG. 7 is a pictorial representation of a "retina" buffer for the n-tuple pattern recognition process used by the system illustrated in FIG. 1;
FIG. 8 depicts a part of the n-tuple pattern recognition process used by the system illustrated in FIG. 1;
FIG. 9 is a detailed functional block diagram of a preferred embodiment of the speech compression system illustrated in FIG. 1;
FIG. 10 is a detailed schematic block diagram of the encoder station of the speech compression system illustrated in FIG. 9;
FIG. 11 is a detailed schematic block diagram of the decoder station of the speech compression system illustrated in FIG. 9;
FIG. 12 is a detailed functional diagram of one of the filter-integrator pairs of the analyzer illustrated in FIG. 10; and
FIG. 13 is a detailed functional diagram of one of the resonator-filter pairs of the synthesizer illustrated in FIG. 11.
Detailed Description of the Preferred Embodiment
A system for condensing and compressing speech, which is constructed in accordance with the invention, is shown in FIG. 1. The system comprises generally an encoder station 8 and a decoder station 9 which can be connected together by an intermediate functional link 6 or a direct (immediate) functional link 7. The encoder station 8 functions to reduce the bit rate required for representing the essential information content of speech (input voice signal) by the compaction and compression encoding processes of the invention, which are based on constant human perception units or persyl units. Accordingly, the speech is transduced from an input voice signal by a physical-to-electrical transducer, a microphone 10. The analog speech signal from the output of microphone 10 is then converted to a series of digital samples by an A/D converter 12, producing a relatively high bit rate. In a first encoding step (condensation), the high bit rate signal is converted to a lower bit rate signal by extracting persyl units from it with a persyl encoder 2. The persyl encoder 2 transfers the persyl units to a vector encoder 3 in a second encoding step (compression) for a further significant bit rate reduction to yield a very low bit rate signal.
The low bit-rate encoded signal is passed through an intermediate link 6, which can include transmission, storage, or processing, before being received at the decoder station 9. The functions of the intermediate link 6 can occur in any order or combination, and should be broadly construed. For example, transmission can include direct communication, as by connecting the encoder station 8 and decoder station 9 back-to-back at the same location. Transmission also contemplates all forms of digital communication such as wire (cable), wireless, over-the-air, and others. Moreover, processing can include error correcting, protocol conversion, encryption, and other processes. Storage can occur prior to, or after, processing, and prior to transmission; can be on line or off line; and can use volatile memory, non-volatile memory, or other forms of memory. The decoder station 9, after receipt of the persyl unit codes, uses a vector decoder 4 to convert the codes back into persyl units at the lower bit rate. The persyl units are then applied to a persyl decoder 5 which converts them back into complex sounds. The output of the persyl decoder is at the high bit rate, similar to the output of the A/D converter 12. This high bit rate signal is then converted by a D/A converter 25 into an analog signal that can be transduced by an electrical-to-physical transducer (speaker) 27 into an output sound signal (reconstructed voice signal). The system includes other advantageous implementations, including an immediate functional link 7 by which the persyl units from the persyl encoder 2 can be directly transferred to the persyl decoder 5. This implementation eliminates the vector encoding and decoding steps, but at the cost of a higher bit rate. It is evident that the immediate link 7 could additionally include transmission, storage, or processing functions. In addition, the analog-to-digital converter 12 could be used after an analog persyl encoder 2 or after an analog vector encoder 3.
Similarly, the digital-to-analog converter 25 can be used either before an analog vector decoder 4 or before an analog persyl decoder 5. However, the preferred implementation is the digital embodiment as shown. Thus, the digital implementation, which will be more fully described hereinafter, should be taken only as exemplary and not limiting to the overall system configuration. The overall method of the digital implementation of the invention is shown more particularly in the flow chart of FIG. 2, where specific blocks illustrate the functional operations on the physical sound signal. In block A10, the physical sound signal is transduced into an electrical signal, which can be manipulated and processed by the system. Normally, for a real-time conversion system, the physical signal would be from a microphone, or the like, but could be an analog recording or some other input, such as from another communication system providing a substitute signal for the transducer signal and permitting further processing. The next step, in block A12, is to convert the analog electrical signal into digital samples. Blocks A10 and A12 are bypassed should the signal already be in digital form from a source other than the microphone 10. Thereafter, in block A14, the electrical signal can be separated into spectral persyl components. The separation of the electrical signal into persyl components can be done, of course, either in the analog or digital domain, by the implementation of an analyzer more fully described hereinafter, or by many other methods.
The next step in the process, as disclosed in block A16, is to integrate the several persyl spectral components in the time domain in a manner consistent with the perception time required by the physiology of hearing as explained above. This time-analysis process, accomplished by simple signal averaging or integration, produces a group of intensity components for each frequency being analyzed. The totality of the frequencies analyzed in this way produces a persyl unit, which is then an ordered set of numbers or intensities representing the particular constant perception unit obtained over the perceptual sampling time. The persyl unit is then vector coded in block A18 before being stored in a memory or processed in block A20 or, alternatively, being transmitted in block A22. The representation of the persyl unit can be that illustrated in FIG. 5 or other forms of representation for compression and coding purposes as discussed herein. For example, a list representation such as (d1, d2, d3, d4; P1, ..., P10) can be used, wherein each di is a single number representing one of the descriptors and each Pi is a list of numbers representing components for the ith frequency. Accordingly, each Pi has the form (p1, p2, ..., pn), where n ranges from 3 to 10 in the example shown in FIG. 5, but is not generally restricted. On the receiving side, the code, which has been transmitted or communicated, is received in block A24. The received code may either be stored in, or retrieved from, a memory, and/or processed in block A26. The coded samples are then decoded into persyl unit form in block A28. The formation of the persyl units from the codes is just the opposite of the encoding process, except that there is a one-to-one relationship between the codes and the persyl units when decoding and no similarity search process need be undertaken. The persyl units are then used to regenerate the spectral and temporal persyl components based on the decoded values in block A30.
The persyl unit components are then concatenated or combined in a time sequence to form sequential speech samples in block A32, where the various components are matched in amplitude so as not to introduce high-frequency noise in the reconstructed signal. After the combination of the persyl components into speech samples, a digital-to-analog conversion is performed in block A34. Subsequently in block A36, the reconstituted analog signal can be converted by a transducer into a physical signal, namely sound which can be heard.
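The list representation described above, (d1, d2, d3, d4; P1, ..., P10), can be sketched as a simple data structure. This is a hypothetical Python rendering; the class name, field names, and example values are illustrative and do not come from the patent.

```python
from dataclasses import dataclass

@dataclass
class PersylUnit:
    """List form (d1, d2, d3, d4; P1, ..., P10) of a persyl unit: four
    descriptor numbers plus one intensity list per analyzed frequency."""
    descriptors: list   # [intensity, pitch, phase, duration]
    components: list    # [P1, ..., P10]; each Pi = [p1, ..., pn], 3 <= n <= 10

unit = PersylUnit(
    descriptors=[128, 0, 3, 160],              # illustrative values only
    components=[[12] * 10, [9] * 8, [5] * 3],  # three example frequency lists
)
assert all(3 <= len(p) <= 10 for p in unit.components)
```

A full unit in the FIG. 5 example would carry ten component lists; only three are shown here for brevity.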
On the encoding side, the two-step process of condensation and compression produces a substantially reduced bandwidth signal without the loss of fidelity or information, and one that can be decoded with facility in the two-step expansion process. The first step of condensing speech into persyl units extracts from the possible information in the digitized representation of the pressure wave only those sounds that can be humanly perceived as speech. The condensation step provides a substantial reduction of the broad information content of the input signal without the loss of the desired information (speech). The second step, that of compression by vector or matrix encoding, substantially reduces the bandwidth and storage requirements of the condensed information without any loss of the relevant speech content. The particular vector encoding technique, more fully described hereinafter, uses the persyl unit and is based on the unique characteristics obtained in considering the nature of speech perception by the ear. This technique further increases the amount of compression that can be obtained.
On the decoding side, the two-step process of reconstructing the persyl units from the codes and then generating speech from the decoded persyl units provides a facile method of producing only those sounds that can be heard and understood as speech, while disregarding component sounds not originally in the speech database, which is described below. This process uses a subset of persyl units that describe human speech, in a particular form, to a high degree of fidelity. The generation of speech in this manner reduces the complexity and processing requirements of the equipment necessary for the decoding. FIG. 3 illustrates a persyl component recognition graph, which is a physiological chart for the speech-production range (approximately 100 Hz to 5000 Hz) indicating as a lower bound the number of cycles necessary to perceive a particular frequency and as an upper bound the number of cycles that create sound fusion. The number of cycles necessary for tone perception varies slowly over the lower frequencies, and then increases rapidly above 4500 Hz. A persyl frequency component is defined as the intensity obtained at a particular discrete or bandwidth-limited frequency in the wave-number range lying between the minimum number of cycles that is required to perceive a tone at that frequency and the maximum boundary for sound fusion. These persyl components may be taken at any frequency in the bounded region and in any number, but preferably each is obtained from discrete segments having a specific width or bandpass, and at integral wave numbers.
To select a specific number of cycles for a particular frequency, an optimized curve is determined within the range. The optimized curve can be determined empirically, based on a number of different parameters including perceived quality of sound, actual fidelity for a given usage, etc. Alternatively, the curve can be determined analytically, based on minimization of bandwidth, compatibility with hardware characteristics, etc. This specific optimized curve is to be contrasted with several typical Fourier transforms, which appear as straight lines in a frequency wave-number plane. It is noted particularly that, based on the defined constant perception range, the Fourier transform generally provides too much information for higher frequencies (above 2000 to 4000 Hz) and is thus wasteful, and too little information for frequencies lying below about 500 Hz, and is thus inadequate. Thus, Fourier transform methods of speech compression are valid only over a restricted range of frequencies in the constant perception space. They become computationally expensive if this limitation is overcome by employing a plurality of Fourier transforms, each covering a different part of the perception space for only a portion of the transformed information, as illustrated in FIG. 3, where three different Fourier transforms are shown, each possessing a factor of 2 fewer points in proceeding from left to right across the figure. For the preferred embodiment, twenty-four persyl frequency components are chosen from the optimized curve of FIG. 3. A tabular representation of these components (for purposes of illustration, only ten frequency components are used in the example) is shown in FIG. 4 for the range normally found in speech analysis. The numbers illustrated can be read directly off the curves in FIG. 3.
In this manner, the entire speech-production frequency range (from about 100 Hz to greater than 5000 Hz) can be represented by a single generic persyl unit having as few as ten components. It is these particular persyl units that represent the perceivable elementary sounds of human speech and are given codes by the vector encoder 3 in FIG. 1.
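The mismatch between a fixed-window Fourier analysis and the constant perception curve can be illustrated numerically. The sketch below uses a hypothetical 20 ms analysis window and the two example cycle counts quoted elsewhere in this description (7 cycles at 200 Hz, 45 cycles at 4500 Hz); the comparison, not the exact values, is the point.

```python
# With a fixed Fourier window T, the number of cycles captured at frequency f
# is f * T: a straight line in the frequency/cycle-number plane.  Comparing
# this against the two example perception values (7 cycles at 200 Hz,
# 45 cycles at 4500 Hz) for a 20 ms window:
T = 0.020                                    # 20 ms analysis window
needed = {200: 7, 4500: 45}                  # Hz -> perception cycle count

for f, k in needed.items():
    fourier_cycles = f * T
    verdict = "too little" if fourier_cycles < k else "too much"
    print(f"{f} Hz: window holds {fourier_cycles:.0f} cycles, "
          f"perception wants {k} -> {verdict} information")
```

The low frequency receives too few cycles (4 versus 7) and the high frequency far too many (90 versus 45), matching the text's observation that a single Fourier transform is simultaneously inadequate and wasteful.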
As discussed previously, different physiological perception curves, other than the one shown in FIG. 3, can be drawn for optimum perception as long as the curves remain within the bounds of the graph. Likewise, the audio frequency range and those frequencies which are identified with a particular persyl component may be changed as needed for further perception requirements, such as fidelity optimization, or for a particular application, such as telephony. As another example, certain languages may use particular persyl units more frequently than others and the range and component assignment would be chosen for maximum perception of a speaker in a certain language. In addition, there may be a subset of components within a particular language which lends itself to a particular perception subset. It is evident that such subsets can be chosen for many different reasons.
Likewise, the invention may be variable in the perception probability requirement, where different perception unit subsets are stored and the system uses one subset for one particular transmission requirement and another subset for another particular transmission requirement. One purpose of changing between these subsets would be for traffic-related requirements, where the minimum perception curve and the longest sampling times are used to decrease the information rate to a minimum. This technique could be used until the encoded voice traffic was barely perceptible as speech. When traffic requirements decrease, the higher probability perception curves and shorter sampling times could be used, at the cost of a higher information rate, to increase the intelligibility and fidelity. FIG. 5 is a pictorial representation, in the form of a set of histograms, of an assembled persyl unit comprising ten frequency components, each comprising from three to ten intensity entries. The ten frequency components are those of the generic persyl unit chosen for illustration purposes, and the numerical intensity is that which would be determined as measured by the component analysis system described in more detail hereinafter. In the figure, the frequency axis lies horizontally in the plane of the paper. It is labeled by a Frequency Index corresponding to the various analyzing filters discussed below. Index 1 corresponds to the highest frequency, and accordingly has the highest number of entries. Index 10 corresponds to the lowest frequency, and has the smallest number of entries. The axis indicating depth into the page is a sample number axis and is labeled by an Entry Index, index 1 being the first entry obtained during analysis and assembling of a persyl unit, and index 10, the last.
The curve lying in the frequency-period plane and labeled "isochronal line" indicates the locus of those points each having a time equal to the duration time over which the persyl is analyzed, as measured from the start of the analysis process, i.e., the sample time.
Since perception times are different for different frequencies, a persyl sample is determined over different times (periods) for each frequency. Within a given frequency band (as determined by a particular analyzing filter), each component sample has the same time of measurement. Thus, the persyl component labeled by frequency index 10 consists of three samples, each measured for 7 cycles at 200 Hz, corresponding to the table in FIG. 4; and the component with frequency index 1 consists of 10 samples, each measured for 45 cycles at 4500 Hz. These components assume the same slice, or sampling, time of 20 msec. It may happen that the last entry in some of the frequency components is not integrated for the full amount of time as determined by the value selected and shown in the last column in FIG. 4. The remainder of that entry will appear in the first entry of the following persyl unit, so nothing will be lost and no discontinuities are introduced.
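The per-entry integration times implied by these example assignments follow directly from the cycle count divided by the frequency. A sketch, using only the two (frequency, cycle count) pairs quoted above:

```python
# Each intensity entry integrates for (cycle count) / (frequency) seconds;
# the two assignments quoted from FIG. 4 give the entry durations directly.
assignments = [(200, 7, 3), (4500, 45, 10)]  # (Hz, cycles, entries per persyl)

for freq_hz, cycles, entries in assignments:
    entry_ms = 1000.0 * cycles / freq_hz     # 35 ms at 200 Hz, 10 ms at 4500 Hz
    print(f"{freq_hz} Hz: {entries} entries of {entry_ms:.0f} ms each")
```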
The representation shown in FIG. 5 has been normalized by modifying the measured numerical intensities of the samples of each component according to one or more normalization factors. It is believed that human speech can be quantized with a much smaller persyl unit subset when normalization for certain factors has been undertaken. The information extracted in the normalization process is attached to the persyl vector as an ordered set of descriptors, shown on the left quadrant of the representation in FIG. 5 along a Descriptor Index, where the descriptor number identifies the type of normalization factor and the numeric intensity defines the degree of variance for that factor. For the purposes of the invention, four descriptors will be described, which relate to intensity, pitch, phase, and duration. Each of the normalization factors, its use, and method of extraction will now be discussed.
An intensity normalization factor (Descriptor Index 1) is obtained by measuring the energy content of each persyl unit by summing the squares of all samples in the persyl unit. To normalize, each persyl sample is divided by the square root of the number so calculated and multiplied by 2^b, where b is the number of bits of resolution desired in the normalized components. The intensity descriptor is the divisor used above, normalized to the same number of bits by dividing by the maximum intensity obtainable (as computed from the number of bits used to digitize the analog signal) and likewise multiplying by 2^b.
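The normalization just described can be sketched as follows. This is hypothetical Python; the function name, parameter names, and rounding policy are assumptions, not part of the patent.

```python
import math

def normalize_intensity(samples, b=8, adc_bits=16):
    """Intensity normalization sketch: divide each sample by the
    root-sum-of-squares energy of the persyl unit and rescale to b bits;
    the descriptor is the divisor rescaled against the largest intensity
    the digitizer can produce.  (Rounding policy is an assumption.)"""
    divisor = math.sqrt(sum(s * s for s in samples))
    scale = 1 << b                                   # 2**b
    normalized = [round(s / divisor * scale) for s in samples]
    max_intensity = (1 << adc_bits) - 1
    descriptor = round(divisor / max_intensity * scale)
    return normalized, descriptor

norm, desc = normalize_intensity([3, 4])
assert norm == [154, 205]   # 3/5 and 4/5 of the 2**8 scale, rounded
```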
A pitch normalization factor (Descriptor Index 2) is employed to shift the entire persyl unit down (or up) in frequency to the point where the lowest (or highest) component or frequency list contains samples of significant measure. The amount, as measured by the number of steps in frequency, is the pitch descriptor. A shift step is determined by the frequency difference from one filter to an adjacent filter, and the shift is accomplished based on the actual response function of the filters, taking into consideration the degree of overlap between two adjacent filters. Pitch shifting can also take place in small steps by interpolating between filters. A phase descriptor (Descriptor Index 3) is simply a delay, measured in digital samples or other suitable units, by which a particular persyl is to be delayed before reconstruction is started. The phase descriptor is obtained by measuring the difference between the last sample incorporated into the previous persyl unit and the first sample to be incorporated into the next persyl unit during the analysis stage. The phase descriptor can be used to represent long periods of silence.
It is noted that voiced vowels have a duration between 50 and 400 milliseconds in normal speech. (See Douglas O'Shaughnessy, "Speech Communication", page 64, Addison-Wesley, 1989.) A vowel can consist of one or more nearly identical persyls, so a savings in bits transmitted may be made by noting the time period over which a persyl remains roughly constant, especially within a vowel. This additional processing takes place in the integrators discussed below, and represents a time-correlation across the entire set of integrators. When this correlation disappears, the current persyl has ended, and a new one must be started.
A duration descriptor (Descriptor Index 4) is obtained similarly to the phase descriptor; however, it is a measure of the number of samples, or duration time, between the first sample incorporated into a persyl and the last sample when that period is variable. Any given descriptor need not be present; its presence or absence may be indicated by the presence or absence of the corresponding index in the descriptor map. Reconstruction of the persyl unit from its normalized components and the set of descriptors comprises a set of processes inverse to the ones described above. Thus, during reconstruction, a normalized persyl unit is first shifted by the pitch descriptor, if any, and then amplitude adjusted by the intensity descriptor, if any. The output is then delayed the number of sample intervals described by the phase descriptor, if any, and the output continues for the number of sample intervals described by the duration descriptor, if any. The vector coding of the persyl units will now be described with reference to FIGS. 6, 7, and 8. Note that if the digitizer shown in FIG. 1 provides a resolution of Bd bits, then the number of bits encoding a sample of the speech can be no more than Bd bits each. This establishes a minimum granularity of the discrete space that contains all possible persyl units having 68 components, i.e., the number of histogram bars in the period-frequency plane shown in FIG. 5 and used herein as a particular example for purposes of explanation only. It is well established that 16 bits (resulting in a dynamic range of 0 to 65,535 for each intensity component) is more than adequate for high-fidelity reproduction; indeed, the so-called CD player uses 16-bit resolution for music and voice reproduction. Modern telephony uses 8 bits (dynamic range of 0 to 255) for good-quality digital voice transmission, while 3 bits (dynamic range of 0 to 7) may be all that is necessary for the persyl components in certain restricted applications.
Suppose, for example, 8 bits are selected, consistent with digital telephony, for each of the persyl components. Then any possible persyl unit can be represented in a persyl space having 68 dimensions and having a granularity of 1 part in 256 along each of 68 axes, since only positive values and zero are considered as valid for the intensity components of the persyl unit. Each specific persyl unit can then be represented as a coordinate point in the 68-dimensional space, or equivalently, as a vertex of a binary hypercube having 68 X 8 = 544 dimensions. The procedure for obtaining a suitable vector code is to partition this space (precisely, the first octant thereof) into an optimal number and distribution of partitions. This partitioning scheme is known as a "covering". Suppose that there are 1024 such partitions that adequately and optimally cover the space. Then we can assign a binary number of 10 bits, ranging from 0000000000 to 1111111111, to describe any one of the 1024 partitions uniquely. It is these binary codes that represent any given and possible persyl to within the quality set by the granularity of the covering and resolution of the persyl space; and it is these codes, or an encrypted or compressed version thereof, that are transmitted, processed, or stored as indicated in FIG. 1. Up to this point, the description of the coding process is consistent with the conventional practice of vector coding and the building of code books for representing speech signals. Assuming that telephone quality speech requires a transmission rate of 64,000 bits per second as explained above, a standard method (linear-predictive coding) has produced compression ratios of 13:1 for telephone quality speech (4800 bits per second) and 80:1 for the much lower vocoder quality speech (800 bits per second).
However, in accordance with the invention, one essential step in obtaining very high compression ratios while maintaining high speech quality lies in the choice of physiologically plausible persyl units on which to base the decomposition, analysis, and reconstruction synthesis. The structure of a persyl unit is designed specifically to optimize the encoding of speech information present in the human voice. This fact is used to obtain the necessary data for construction of codes giving high compression ratios while maintaining high speech quality. To begin the code construction, persyl sample units must first be derived. The persyl encoder 2 in FIG. 1 can be employed as the front end to a device that produces and records a range of persyl sample units that span the range of the human speech under consideration. For example, if we restrict the system to a particular application such as American English, we require sufficient samples of speech from the major dialects of American English (midwestern, southern, western, New England, etc.) as well as from representative speakers (men, women, children, youths, senior citizens, etc.).
Once these data are obtained, or even while they are being obtained, depending on the analysis method chosen, a partition of the data (thought of as a large cloud of points in the persyl space in a geometrical visualization in FIG. 6) is made. This partition is obtained adaptively from actual human speech data represented as perception units, not from an arbitrary selection of possible sounds. It is this severe restriction of the number and position of the possible sounds in the 68-dimensional, binary vector space to those represented by the actual persyl units that provides a "preselection" for the region of interest. This preselection limits both the range of interest and the type of partition that is required to cover the persyl "cloud" to the degree required for a given fidelity of speech reconstruction. The speech is reconstructed from a sequence of codes in the center of the partitions and not directly from the actual measured persyl units lying within those partitions. In this manner, compression ratios of the order of 160 or 320 to 1 for telephone quality speech, and even higher ratios of the order of 500:1 or 600:1 for lower quality speech, can be obtained. Here, the number of quantizing bits may be reduced to perhaps 3 per persyl component instead of 8 or 16, and the number of persyl components to perhaps 20. This reduces the dimensionality of the equivalent binary space from 68 X 8 = 544 to a mere 3 X 20 = 60 dimensions. Clearly, a much smaller set of partitions is needed to cover this low-fidelity space than is needed for a telephone-quality space, indicating that the compression ratio obtained is much higher.
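The bit-count arithmetic in this and the preceding paragraph can be checked directly. A sketch; the 400 bit/s figure is inferred here from the 160:1 ratio and is not stated in the text.

```python
import math

# Numbers used in the text: 10-bit codes suffice for a 1024-partition
# covering, and the quantization choices set the persyl-space dimensionality.
def code_bits(partitions):
    """Bits needed to name one region of a covering with that many partitions."""
    return math.ceil(math.log2(partitions))

assert code_bits(1024) == 10

telephone_dims = 68 * 8   # 68 components at 8 bits each
low_fi_dims = 20 * 3      # 20 components at 3 bits each
assert (telephone_dims, low_fi_dims) == (544, 60)

# A 160:1 ratio against 64,000 bit/s telephone speech implies 400 bit/s.
assert 64000 // 160 == 400
```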
There are several methods of partitioning the persyl space as described above; e.g., clustering analysis, methods involving Bayesian probability analysis, and methods involving the Voronoi partition. For reasons of performance, and for consistency with the similarity-finding process discussed herein, the preferred embodiment is based upon a statistical n-tuple pattern matching method wherein a pattern of binary bits is placed into a fixed-form buffer, analogous to placing a pattern on a retina. See W. W. Bledsoe and I. Browning, "Pattern Recognition and Reading by Machine", 1959 Proc. Eastern Joint Computer Conf., pp. 225-232, December 1959.
As an example, suppose the persyl unit has 68 components each with a range of 0 to 255 as above. Then a retina buffer consists of 68 columns each of 8 bits, as shown in FIG. 7 (wherein only one of the lists is shown for clarity). The persyl components are placed into this buffer, as if for storage, represented by the dots. The dot pattern shown is one of several means able to represent the persyl components of the histogram having index 1 shown in FIG. 5. Instead of storage, however, a fixed, randomly chosen partition of the retina buffer's 544 binary cells into 544/n discrete parts, each part consisting of n of the retina-buffer cells, is used to map the effective 544-bit binary number, represented by the dots on the retina, into a much larger binary space consisting of 2^n X 544/n dimensions, where n is the tuple size chosen for the retina buffer (typically n is, but is not restricted to, 2 or 4). For example, if n = 2, then there will be 2^2 X 544/2 = 1088 binary components in a typical vector in the larger space. Note that there are 544/n = 272 partitions, each comprising 2^n = 4 bits in each vector. Information is stored as bits set in as many of the vectors in the larger space as there are partitions to be obtained in the covering process. A pattern on the retina buffer matches a memory vector of that or similar patterns exactly or partially depending on how many of the 272 partitions in the pattern vector have a matching bit in the corresponding memory vector. In FIG. 8, memory vector 300 matches the pattern vector 302 in the 3 places shown, whereas memory vector 304 matches in only 2 places. Thus memory vector 300 has a more perfect match, while memory vector 304 is merely close. From the viewpoint of the set of patterns assumed to make up memory vector 300 and a different set making up memory vector 304, the pattern vector 302 is more similar to the group of patterns stored in memory vector 300 than it is to those stored in memory vector 304. Note that a pattern vector has exactly one bit set in each of its 272 partitions, while a memory vector has at least one bit set in each of its partitions.
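A minimal sketch of this n-tuple scheme, together with the threshold-based partition creation described next, might look as follows. This is illustrative Python only: representing a memory vector as one set of observed addresses per partition, and the specific helper names, are implementation choices not taken from the patent.

```python
import random

N = 2          # tuple size
CELLS = 544    # retina bits (68 components x 8 bits)

rng = random.Random(0)
cells = list(range(CELLS))
rng.shuffle(cells)
GROUPS = [cells[i:i + N] for i in range(0, CELLS, N)]  # 272 fixed random pairs

def pattern_vector(retina):
    """One address (0..2**N - 1) per group: exactly one 'bit set' per partition."""
    return [sum(retina[c] << k for k, c in enumerate(g)) for g in GROUPS]

def score(pattern, memory):
    """Partitions whose observed address already appears in the memory vector."""
    return sum(1 for p, m in zip(pattern, memory) if p in m)

def assign_code(pattern, memories, threshold=218):
    """Merge into the best-scoring memory vector, or open a new partition."""
    scores = [score(pattern, m) for m in memories]
    if scores and max(scores) >= threshold:
        best = scores.index(max(scores))
        for p, m in zip(pattern, memories[best]):
            m.add(p)                               # incorporate the pattern
        return best
    memories.append([{p} for p in pattern])        # seed a new memory vector
    return len(memories) - 1

memories = []
retina = [rng.randint(0, 1) for _ in range(CELLS)]
first = assign_code(pattern_vector(retina), memories)   # opens code 0
again = assign_code(pattern_vector(retina), memories)   # perfect 272 score
assert (first, again) == (0, 0) and len(memories) == 1
```

Replaying the same retina pattern scores 272 out of 272 and is merged back into the memory vector that created it, which is the behavior the similarity search relies on.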
An automatic process used in the preferred embodiment is to partition the persyl space with the aid of the n-tuple pattern-matching method. The method comprises selecting a threshold number, typically 80% of the number of partitions in the binary n-tuple vector, e.g., 218 in the example, and demanding that any persyl pattern having a score of 218 or higher be incorporated into the memory vector producing that score. Should the score be less than 218, a new memory vector is automatically generated having as its first memory bit pattern exactly those bits set in the pattern to be assigned a partition in the persyl space. At the end of the process, after having treated each of the persyl units in the human speech data base, there will be a certain number of memory vectors. These memory vectors are the binary representations of the partitions of persyl space required by the persyl coding step. Each of these memory vectors is now assigned a code, for example, an integer representing the order in which they were created. These codes are then used by the persyl encoder to produce binary numbers representing the actual persyl units measured in the speech.
FIG. 9 illustrates a more detailed functional diagram of a preferred implementation of the general digital condensation and compression system illustrated in FIG. 1. The system includes the encoder station 8 and the decoder station 9. The encoder station 8 comprises the microphone 10 and a linear amplifier 11, which convert real-time speech from a person or other source into an analog electronic signal. This electronic signal is then converted to a time sequence of digital samples at a periodic sampling rate by A/D converter 12. The sampling rate may vary depending upon the application and the frequencies in the original signals which are to be saved. For the speech-production range of frequencies, approximately 100 - 7,000 Hz, the Nyquist criterion is that the speech be sampled at twice the highest frequency, i.e., 14 kHz.
However, for a common transmission system, such as a telephone, the sampling rate is chosen to be 8 kHz to cover the conventional bandwidth as explained previously. For very low bit-rate transmission, where low fidelity and loss of speaker recognition are acceptable, frequencies above 1 kHz may be ignored, and a digitizing rate of 2000 Hz or less may be used.
The digital speech signal is applied to an analyzer 13 for the start of the condensation process. The analyzer 13 takes the digital samples of the speech signal and filters from them a plurality of component perception frequencies, which can be heard, and measures the intensity of each. The perception frequencies which are found by the analyzer 13 are integrated for the appropriate time periods as explained previously, and are then assembled into persyl units in an assembler 14 and applied to a vector encoder means 16 which identifies the pattern of persyl components as shown in FIG. 5, and assigns a persyl identifier code to them. A descriptor (e.g., pitch, phase, intensity, duration) extractor means 15 also receives the persyl components and provides any needed descriptor to the persyl units and adjusts their form or intensity in assembler 14 before encoding and transmission by the transmitter 17. The output of the transmitter 17 is a time sequence of digital numbers representing the descriptors and the codes.
The coded voice messages are transmitted and received with conventional digital transmitting and receiving equipment 17 and 20, respectively, or can be stored or applied directly to the receiver 20. A separator 21 separates the coded message into the persyl codes and the descriptors. The descriptors are passed to an instruction generator 22 and a constructor means 23, while the persyl codes are applied to a decoder means 24 which generates normalized persyl units from the codes. The persyl units are then applied to the constructor 23, which has as another input the output from the instruction generator 22. Timing signals signifying the arrival of the persyl codes are extracted by the separator 21 and applied to the instruction generator 22. From these timing signals the instruction generator produces synchronization signals for recombining the descriptors and persyl units into a unified digital speech signal with the intensity, pitch, phase, and duration of the persyls extracted from the original speech signal.
The persyl units in each perception slice are then added to signals of the same frequencies in adjacent perception slices in a manner such that there are no discontinuities. The smoothing effect takes place in constructor 23 which has the necessary memory allocation for storing the persyl components from the immediate past. The different persyl units are then concatenated to produce the final reconstructed sound signal. A final volume adjustment by the descriptors for total volume can be made by instructions from the instruction generator 22 controlling the output of a linear audio amplifier 26. A transducer, comprising the speaker 27, earphone, or the like, is used to convert the assembled electronic sound signal back into sound energy which can be heard.
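The smoothing step just described can be illustrated with a short sketch. The patent requires only that adjacent perception slices join without discontinuities; the linear crossfade and overlap length below are assumptions chosen for illustration, not a formula given in the specification.

```python
# Hypothetical sketch of the constructor's smoothing: adjacent slices are
# joined with a short linear crossfade so same-frequency components do not
# produce discontinuities at slice boundaries.

def concatenate_slices(slices, overlap):
    """Overlap adjacent sample lists with a linear crossfade."""
    out = list(slices[0])
    for nxt in slices[1:]:
        tail, head = out[-overlap:], nxt[:overlap]
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)          # ramp from old slice to new
            out[-overlap + i] = (1 - w) * tail[i] + w * head[i]
        out.extend(nxt[overlap:])
    return out

a = [1.0] * 8
b = [0.0] * 8
joined = concatenate_slices([a, b], overlap=4)
assert len(joined) == 12                 # 8 + 8 - 4 overlapped samples
assert joined[0] == 1.0 and joined[-1] == 0.0
```

The constructor's stored persyl components from the immediate past supply the `tail` samples in such a scheme.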
With respect now to FIG. 10, there is shown a detailed circuit implementation of the encoder station 8. The analog signal from the microphone 10 and linear amplifier 11 is followed by a digitizer-filter means 144 (including an optional anti-aliasing filter before digitization and a lowpass filter following digitization) to convert the analog signal to a stream of digitized numbers occurring at the sampling rate of the digitizer. This assumes a digital filter method is employed for the filters 100-118; should analog filters be employed, means 144 is not used and the digitization of the plurality of analog signals from the filters 100-118 takes place following integration, with the digitizing devices residing in the devices 120-138. These separate channels communicate with the assembler 143 which takes their output and assembles the amplitudes of the plurality of outputs into perception units. These persyl units are then encoded by the encoder 146 into code numbers and subsequently sent to the transmitter 147. The extractor 148 forms the descriptors from the components of the assembler 143 and controls the assembler 143 to output normalized persyls based on the extracted descriptors. The extractor 148 also provides the descriptors to the transmitter 147. The code numbers and descriptor numbers are combined and then transmitted sequentially by a transmitter 147. Alternatively, the encoded numbers can be stored in a message memory 145 prior to their transmission. A microprocessor based control and timing circuit 142 provides timing and control signals to the analyzer 149, the encoder 146, the transmitter 147, the digitizer 144, and the message memory 145 to regulate the flow of data and the timing of the particular steps in the process. In the analyzer, each filter and integrator pair, for example filter 100 and integrator 120, is used to detect a particular set of persyl components belonging to the frequency of the filter.
The filter is frequency sensitive to a particular bandpass frequency and the integrator 120 to a set duration of cycles such that a unique separation of a persyl component from the incoming analog signal occurs within the bandpass of said filter. The output of the filter 100, therefore, is a representation of the detection of a particular frequency and intensity, and the integrator 120 then integrates that output over the perception duration (as obtained from FIG. 4) into an indication of the number of quantizing units of the persyl components corresponding to that filter and frequency band. The result is a number of digital samples for each component frequency, each representing the intensity for perception time and duration. FIG. 5 is a pictorial representation of the output of the plurality of integrators 120-138 as it is assembled by the assembler 143. Each output is uniquely associated with a persyl component, represented as the histogram bars at a particular frequency number. When taken together, the components form a perception unit or persyl unit having the same number of frequencies as the filter units of the analyzer 149. The frequency information is stored as positional information accessed by the Frequency Index; and the period information within a frequency list is likewise stored as positional information and accessed by the Component Index.
The assembler 143 contains memory to store the persyl unit as it is being developed by the analyzer 149. A memory space having the same number of locations as the samples of a persyl unit, i.e., the number of frequency components times the number of samples per component, is reserved for a current persyl unit. As each sample from a frequency component is developed, it is stored in its particular place in the memory such that a list is formed describing the components of the persyl unit for the particular sampling time. When a sampling time has expired, the list has been fully developed and is shifted to a buffer storage in the assembler 143 where it can be processed by the extractor 148 to determine the descriptors and to reload the buffer list as a normalized persyl unit. While the buffer list is being normalized by the extractor 148, the current memory list is being updated with new persyl unit samples from the analyzer 149.
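The double buffering of the assembler 143 described above can be sketched as follows. The class and method names are hypothetical, and the normalization shown (peak extraction) illustrates just one of the descriptor computations the extractor might perform.

```python
# Illustrative sketch of the assembler's double buffering: the analyzer
# writes components into a "current" list while the extractor normalizes
# the previously completed list held in a separate buffer.

class Assembler:
    def __init__(self, n_components):
        self.current = [0.0] * n_components   # being filled by the analyzer
        self.buffer = None                    # being normalized by the extractor

    def store(self, index, value):
        self.current[index] = value

    def end_of_slice(self):
        """Shift the finished list to the buffer and start a fresh one."""
        self.buffer, self.current = self.current, [0.0] * len(self.current)

    def normalize(self):
        """Extract the intensity descriptor and return a normalized persyl."""
        peak = max(self.buffer) or 1.0
        return peak, [c / peak for c in self.buffer]

asm = Assembler(4)
for i, v in enumerate([0.5, 2.0, 1.0, 0.0]):
    asm.store(i, v)
asm.end_of_slice()
intensity, persyl = asm.normalize()
assert intensity == 2.0 and max(persyl) == 1.0
```

While `normalize` runs on the buffered list, new samples can be stored into `current`, matching the concurrency described in the text.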
The filters 100-118 are preferably recursive in nature and can comprise either analog or digital infinite impulse response type filters. The control and timing circuit 142 may vary the frequency and time duration sensitivity of the filters via the control lines 140. This permits an adaptive analysis which can change as the input changes. For example, the variability may be used to make the analyzer 149 less sensitive to low-frequency persyl components as the overall amplitude of the signal rises, for noise-suppression purposes. Similarly, the control and timing circuit 142 may control the integration constants of the integrators 120-138 over control lines 141 to provide a more exacting detection of a particular perception component, equivalent to altering the persyl curve of FIG. 3. In this regard, the control and timing circuit 142 has access to the persyl units and descriptors through a feedback loop comprising the extractor 148, the encoder 146, and the message memory 145. FIG. 12 illustrates a detailed functional diagram of one of the filter and integrator combinations. Even though the figure will be described from a digital processing standpoint, it will be evident to one skilled in the art of signal processing that equivalent analog circuits are available. Both a generalized filter function and a particular integration method are shown in the figure. The filter function 300 receives the stream of input samples, x, from the digitizer means 144 and produces an output stream of filtered data, y. The filter function 300 is shown as recursive, where the new input samples x are summed with weighting coefficients A and added to old output samples y summed with weighting coefficients B to produce new output samples y'. The frequency and bandwidth, along with the other factors of the filter (such as quality factor), are determined by selection of the coefficients A and B, as is conventional.
These weighting coefficients are selectable by the control and timing circuit 142 via line 301. The output stream, y, is multiplied via multiplier means 310 by a sensitivity factor, μ, on a line 303 as selected by the control and timing circuit 142. The result, a persyl component frequency stream, is applied to an integrator function 320 whose output is z. The integrator function 320 includes a readout and reset line 322 provided from the control and timing circuit 142. Periodically the readout line 322 is strobed at the perception rate of that frequency to output the digital samples of the persyl component. The strobe also provides a reset signal to the integrator function 320 for the next integration interval. The time constant, β, of the integration function 320 is adjusted as required by the control and timing circuit 142.
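One channel of FIG. 12 can be sketched as below. The sketch is illustrative only: it uses a second-order recursive (IIR) resonator with feedback coefficients chosen from a pole radius `r` (an assumed design parameter, not a value from the patent), an input weighting A of simply [1], a sensitivity factor μ, and an integrator that accumulates |μ·y| until a readout/reset strobe.

```python
import math

# Hypothetical sketch of one filter/integrator channel: recursive filter
# y' = x + b1*y[n-1] + b2*y[n-2], followed by a resettable integrator.

def make_channel(center_hz, sample_rate, r=0.98, mu=1.0):
    w = 2 * math.pi * center_hz / sample_rate
    b1, b2 = 2 * r * math.cos(w), -r * r        # feedback (B) coefficients
    y1 = y2 = 0.0
    acc = 0.0
    def step(x):
        nonlocal y1, y2, acc
        y = x + b1 * y1 + b2 * y2               # A = [1]; recursion on past outputs
        y2, y1 = y1, y
        acc += abs(mu * y)                      # integrator input: mu * y
        return y
    def readout():                              # strobe: read and reset
        nonlocal acc
        z, acc = acc, 0.0
        return z
    return step, readout

step, readout = make_channel(500, 8000)
for i in range(400):                            # 50 ms of a 500 Hz tone
    step(math.sin(2 * math.pi * 500 * i / 8000))
on_band = readout()
for i in range(400):                            # same duration at 1,500 Hz
    step(math.sin(2 * math.pi * 1500 * i / 8000))
off_band = readout()
assert on_band > off_band                       # channel selects its own band
```

Varying `b1`, `b2` (line 301), `mu` (line 303), and the strobe interval corresponds to the adjustments the control and timing circuit 142 makes over its control lines.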
Referring now to FIG. 10 again, the assembler 143 takes the perception components at the perception sample rate and subjects the persyl unit to a normalization process. The normalization process can be used for a number of different reasons, the main purpose being to determine overall speech amplitude and possibly perception slice subset selection. The descriptor numbers for each persyl unit are attached based on this normalization process. Once the persyl has been normalized, it is used to determine which of the prototypical partition persyls it is most like by a similarity-matching process in the encoder 146. Appropriate means of similarity matching include a least-mean-squares algorithm operating on the individual components, a geometrical method that chooses the partition vector making the smallest angle with the persyl under investigation, or the statistical n-tuple pattern-matching described herein. The preferred embodiment of similarity matching is a particular version of the n-tuple pattern-matching algorithm. In this version, the binary memory vectors corresponding to each partition obtained in the partitioning of the persyl speech databases are used in the method. The persyl unit to be encoded is placed in the retina buffer as shown in FIG. 7 and a similarity score is obtained for each of the memory vectors as shown in FIG. 8. The memory vector having the highest score is necessarily the one constructed from actual persyl units most similar to the one under investigation. The code for the most similar partition prototype (n-tuple binary memory vector) is then obtained from a simple look-up table. Once the encoder 146 determines the persyl code, it is assembled in sequential order according to the desired transmission rate and output to the transmitter 147 or stored in the message memory 145.
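The encoding decision itself reduces to a best-match lookup, which can be sketched as follows. The bitwise-overlap score below is a simplification standing in for the patent's full n-tuple scoring; the 8-bit patterns are illustrative.

```python
# Hypothetical sketch of the encoder's similarity match: a binary persyl
# pattern is scored against every stored memory vector and the code of the
# best-scoring partition prototype (its creation-order index) is emitted.

def encode_persyl(pattern: int, memory_vectors: list) -> int:
    """Return the code (index) of the memory vector with the highest score."""
    scores = [bin(pattern & m).count("1") for m in memory_vectors]
    return max(range(len(scores)), key=scores.__getitem__)

memory = [0b11110000, 0b00001111, 0b10101010]   # three partition prototypes
assert encode_persyl(0b00001101, memory) == 1   # overlaps prototype 1 most
assert encode_persyl(0b11100000, memory) == 0
```

Because the code is just the prototype's index, the "simple look-up table" mentioned in the text is the list itself.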
The transmission scheme is to provide sequential words of either a set number or specifiable number of bits having two fields which contain the descriptor bits and the persyl code bits, respectively. A detailed circuit implementation of the preferred decoder station 9 is shown to advantage in FIG. 11. The receiver 228, which can include demodulation or other processing techniques, provides at its output a digital sample stream that is identical to the transmitted coded stream. This stream is processed in separator 225 to separate the descriptor codes from the persyl unit codes. The persyl codes are decoded in decoder 226 to generate persyl units of the type encoded at the encoder 16. The decoder 226 separates the frequency components of the persyl units and applies each, after modification by a modifier 221, to an associated resonator circuit 200-218 in the constructor 225. The descriptors from separator 225 are applied to the instruction generator 224 and to the modifier 221 of the constructor 225. Each resonator circuit 200-218, according to the value of the persyl component applied, will generate a digital representation of a persyl component, i.e., digital samples at a particular frequency and intensity for a predetermined number of cycles. The output of the decoder 226 is a list of samples arranged in component frequencies describing a normalized persyl unit as shown in FIG. 5. This list is stored in the modifier 221 in order to be modified according to the descriptors attached to the code word used to decode it. The modifier 221 provides processing to vary the sample lists according to the pitch descriptor and to apply the component samples to the resonators according to the phase and duration descriptors. The intensity descriptor is processed in the modifier 221 and used to generate an intensity factor μ that is used to control the amplitude of the resonators 200-218.
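The two-field transmission word described at the start of this passage can be sketched with simple bit operations. The field widths below (6 descriptor bits, 10 code bits, giving a 16-bit word) are assumptions for illustration; the patent allows either a set or a specifiable number of bits.

```python
# Hypothetical packing of one transmitted word into two fields:
# descriptor bits in the high field, persyl-code bits in the low field.

DESC_BITS, CODE_BITS = 6, 10    # assumed widths: one 16-bit transmission word

def pack(descriptor: int, code: int) -> int:
    assert descriptor < (1 << DESC_BITS) and code < (1 << CODE_BITS)
    return (descriptor << CODE_BITS) | code

def unpack(word: int):
    return word >> CODE_BITS, word & ((1 << CODE_BITS) - 1)

word = pack(descriptor=0b101001, code=0b1100110011)
assert unpack(word) == (0b101001, 0b1100110011)
assert word < (1 << 16)         # fits the assumed 16-bit word
```

The separator 225 performs the role of `unpack`, routing the descriptor field to the instruction generator and modifier and the code field to the decoder.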
The outputs of the resonators 200-218 are digital perception streams, one for each component, representing the persyl components applied to the modifier 221 from the decoder 226.
The perception streams from all of the resonator circuits 200-218 are mixed in an adder 223 which is under control of the instruction generator 224. The digital speech stream exiting the adder 223 can be directly applied to the digital-to-analog converter 229 to provide a reconstructed analog speech signal. The analog speech signal is amplified by amplifier 230 according to a time-varying input from the instruction generator 224 to provide an overall amplitude correction to the speech signal, such as for an operator selectable volume control. The speech signal thereafter is converted back to sound by the speaker 231.
A detailed functional diagram of an implementation of one of the resonator-filter pairs of the constructor 225 is illustrated in FIG. 13. The reconstructed samples of a persyl component, y0, are applied to a resonator function 350 as the initial value of the recursive filter function. Without additional inputs, such as x, the output y would be a pure sine wave with intensity corresponding to the persyl component, y0. Input samples, x, from the harmonic generator 220 and/or the random noise generator 222 may be applied as required by the reconstruction process (and for other additional reasons such as creating special effects) to the resonator function 350, thus modifying the pure sine wave. The resultant output, y, is one of the component frequencies of the generic persyl unit. The output y may be further modified by multiplication with an intensity factor μ in a multiplier 360 via line 361 from modifier 221. The output of the resonator, μy, on line 363 is a sequence of digital samples representing that specific component frequency for the duration of the period of the component in the particular frequency sample list for the persyl unit undergoing synthesis to a corresponding sound-perception fragment. In this fashion, the resonator acts both as a pure sine-wave generator with programmable amplitude, phase and duration, and as a filter that adjusts its output according to the spectral content of the input waveform, x. Therefore, there has been described a means to reconstruct persyl units into intelligible speech sounds. It is noted that the persyl unit decoded and applied to constructor 225 may have more or fewer bits than the ones which were encoded. Preferably, they have the same number of bits such that the same code book of persyl memory vectors can be used for the encoding and decoding process, but this is not necessary to the operation of the implementation.
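The free-running behavior of the resonator function 350 (a pure sine wave seeded with the component value y0, scaled by μ) can be sketched as below. A coupled-form recursive oscillator is used here as an illustrative stand-in; the harmonic-generator and noise-generator inputs x are omitted from this sketch.

```python
import math

# Hypothetical sketch of the resonator of FIG. 13 with no external input x:
# a recursive oscillator seeded with the persyl component value y0 generates
# a sine at the component frequency, with amplitude scaled by mu.

def resonate(y0, freq_hz, sample_rate, n, mu=1.0):
    """Generate n samples of a sine at freq_hz with peak amplitude mu * y0."""
    w = 2 * math.pi * freq_hz / sample_rate
    c, s = math.cos(w), math.sin(w)
    y, q = 0.0, y0                   # quadrature pair keeps the amplitude stable
    out = []
    for _ in range(n):
        y, q = c * y + s * q, -s * y + c * q
        out.append(mu * y)
    return out

samples = resonate(y0=1.0, freq_hz=1000, sample_rate=8000, n=8)
peak = max(abs(v) for v in samples)
assert abs(peak - 1.0) < 1e-9        # one full cycle reaches the seeded amplitude
```

Summing one such stream per component frequency, as the adder 223 does, yields the reconstructed persyl waveform.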
While the preferred embodiments of the invention have been illustrated, it will be obvious to those skilled in the art that various modifications and changes may be made thereto without departing from the spirit and scope of the invention as hereinafter defined in the appended claims.

Claims

What is Claimed is:
1. A method for condensing and compressing speech comprising the steps of: analyzing an electronic speech signal to determine its persyl component content; assembling said persyl component content into persyl units; assigning a persyl code to each persyl unit contained in said electronic speech signal; and assembling said persyl codes into an ordered set from which the time sequence of said persyl units can be decoded as speech.
2. A method for condensing and compressing speech as set forth in Claim 1 which further includes the step of: assigning a unique descriptor to each persyl unit in said ordered set which is representative of at least one parameter of said persyl unit in said electronic speech signal.
3. A method for condensing and compressing speech as set forth in Claim 2 wherein said step of assigning a unique descriptor includes: assigning a descriptor indicative of the intensity of the persyl unit.
4. A method for condensing and compressing speech as set forth in Claim 2 wherein said step of assigning a unique descriptor includes: assigning a descriptor indicative of the phase of the persyl unit.
5. A method for condensing and compressing speech as set forth in Claim 3 wherein said step of assigning a unique descriptor includes: assigning a descriptor indicative of the pitch of the persyl unit.
6. A method for condensing and compressing speech as set forth in Claim 3 wherein said step of assigning a unique descriptor includes: assigning a descriptor indicative of the duration of the persyl unit.
7. A method for condensing and compressing speech as set forth in Claim 2 wherein said step of assigning a unique descriptor includes: assigning a descriptor indicative of the intensity, phase, pitch, and duration of the persyl unit.
8. A method for condensing and compressing speech as set forth in Claim 1 which further includes the step of: storing said ordered set of persyl units.
9. A method for condensing and compressing speech as set forth in Claim 7 which further includes the step of: transmitting said ordered set of persyl units.
10. A method for condensing and compressing speech as set forth in Claim 1 which further includes the step of: transmitting said ordered set of persyl units.
11. A method for condensing and compressing speech as set forth in Claim 1 which further includes the step of: modifying or altering said ordered set of persyl units before or during reconstruction.
12. A method for condensing and compressing speech as set forth in Claim 1 wherein said step of analyzing said electronic speech signal includes: analyzing said electronic signal into its digital persyl component content.
13. A method for condensing and compressing speech as set forth in Claim 12 wherein said step of assigning further includes: assigning digital identifiers to each different persyl.
14. A method for condensing and compressing speech as set forth in Claim 13 which further includes: storing said ordered set of persyl units sequentially in digital storage means.
15. A method for condensing and compressing speech as set forth in Claim 1 which further includes the step of: transmitting said ordered set of persyl units.
16. A method for condensing and compressing speech as set forth in Claim 15 wherein said step of transmitting includes: transmitting said ordered set of persyl components digitally as a matrix.
17. A method for decoding condensed and compressed speech comprising the steps of: receiving an ordered set of persyl unit codes; generating persyl units according to the time sequence of said ordered set of persyl unit codes; generating persyl components from said persyl units; and assembling said generated persyl components into an electronic speech signal.
18. A method of encoding a longitudinal pressure wave into a time sequence of digital samples, said encoding method comprising the steps of: sampling the longitudinal pressure wave at a predetermined sampling frequency to produce a timed sequence of digital samples; filtering said digital samples with a plurality of digital filters, each of said digital filters being adapted to detect a particular frequency and particular number of cycles of said particular frequency to form perception component samples; sampling said perception component samples at a predetermined perception slice frequency to determine the number of perception units detected by each of said digital filters and the average amplitude for the predetermined period of time; and assigning a code to each perception slice.
19. A method of encoding a longitudinal pressure wave as defined in Claim 18 wherein: said sampling frequency is between 2,000 Hz and 40,000 Hz.
20. A method of encoding a longitudinal pressure wave as defined in Claim 18 wherein: said digital samples contain between 3 and 16 bits per sample.
21. A method of encoding a longitudinal pressure wave as defined in Claim 18 wherein: said perception slice frequency is between 10 Hz and 20,000 Hz.
22. A method of encoding a longitudinal pressure wave as defined in Claim 18 wherein said step of assigning a code further includes the steps of: providing a predetermined number of perception vectors defining a perception database which describes a plurality of parameters for said persyl units wherein each of said perception vectors is associated with a predetermined digital code; identifying the perception vector that most closely correlates with each detected perception unit; and assigning the digital code of the identified perception vector to the detected perception unit.
23. A method of encoding a longitudinal pressure wave as defined in Claim 22 which further includes the step of: storing said assigned digital codes in the same time sequence as the detected perception units.
24. A method of encoding a longitudinal pressure wave as defined in Claim 22 wherein the step of providing a perception database further includes the step of: providing a perception database based on identifiable characteristics of a human voice.
25. A method of encoding a longitudinal pressure wave as defined in Claim 24 wherein the step of providing a perception database further includes the step of: providing a perception database based on an identifiable language.
26. A method of encoding a longitudinal pressure wave as defined in Claim 25 wherein: said identifiable language is English.
27. A method of encoding a longitudinal pressure wave as defined in Claim 24 wherein the step of providing a perception database further includes the step of: providing a perception database based on sex.
28. A method of encoding a longitudinal pressure wave as defined in Claim 27 wherein the step of providing a perception database based on sex includes the step of: providing a perception database based on the perception vectors most frequently used by males.
29. A method of encoding a longitudinal pressure wave as defined in Claim 27 wherein the step of providing a perception database based on sex includes the step of: providing a perception database based on the perception vectors most frequently used by females.
30. A method of encoding a longitudinal pressure wave as defined in Claim 24 wherein the step of providing a perception database further includes the step of: providing a perception database based upon the identity of a particular person.
31. A method of encoding a longitudinal pressure wave as defined in Claim 30 wherein the step of providing a perception database based on the identity of a particular person includes the step of: providing a perception database based on the perception vectors most frequently used by that particular person.
PCT/US1990/006437 1989-11-06 1990-11-06 Speech compression system WO1991006945A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US43265989A 1989-11-06 1989-11-06
US432,659 1989-11-06

Publications (1)

Publication Number Publication Date
WO1991006945A1 true WO1991006945A1 (en) 1991-05-16

Family

ID=23717081

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1990/006437 WO1991006945A1 (en) 1989-11-06 1990-11-06 Speech compression system

Country Status (2)

Country Link
AU (1) AU6757790A (en)
WO (1) WO1991006945A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0673013A1 (en) * 1994-03-18 1995-09-20 Mitsubishi Denki Kabushiki Kaisha Signal encoding and decoding system
US5864794A (en) * 1994-03-18 1999-01-26 Mitsubishi Denki Kabushiki Kaisha Signal encoding and decoding system using auditory parameters and bark spectrum
US5845251A (en) * 1996-12-20 1998-12-01 U S West, Inc. Method, system and product for modifying the bandwidth of subband encoded audio data
US5864820A (en) * 1996-12-20 1999-01-26 U S West, Inc. Method, system and product for mixing of encoded audio signals
US5864813A (en) * 1996-12-20 1999-01-26 U S West, Inc. Method, system and product for harmonic enhancement of encoded audio signals
US6463405B1 (en) 1996-12-20 2002-10-08 Eliot M. Case Audiophile encoding of digital audio data using 2-bit polarity/magnitude indicator and 8-bit scale factor for each subband
US6782365B1 (en) 1996-12-20 2004-08-24 Qwest Communications International Inc. Graphic interface system and product for editing encoded audio data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3097349A (en) * 1961-08-28 1963-07-09 Rca Corp Information processing apparatus
US3287649A (en) * 1963-09-09 1966-11-22 Research Corp Audio signal pattern perception device
US3610831A (en) * 1969-05-26 1971-10-05 Listening Inc Speech recognition apparatus
US4720861A (en) * 1985-12-24 1988-01-19 Itt Defense Communications A Division Of Itt Corporation Digital speech coding circuit
US4809331A (en) * 1985-11-12 1989-02-28 National Research Development Corporation Apparatus and methods for speech analysis
US4815135A (en) * 1984-07-10 1989-03-21 Nec Corporation Speech signal processor


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PROCEEDINGS OF THE IEEE, Volume 73, No. 11, November 1985, (MAKHOUL et al.), "Vector Quantization in Speech Coding". *


Also Published As

Publication number Publication date
AU6757790A (en) 1991-05-31


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU BR CA JP KR

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU NL SE

NENP Non-entry into the national phase

Ref country code: CA