EP0484455A1 - Method and apparatus for language and speaker identification - Google Patents

Method and apparatus for language and speaker identification

Info

Publication number
EP0484455A1
Authority
EP
European Patent Office
Prior art keywords
spectral
determining
distributions
distribution
spectral distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP90913683A
Other languages
German (de)
English (en)
Other versions
EP0484455A4 (en)
Inventor
Stephen J. Guerreri
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Publication of EP0484455A1 (fr)
Publication of EP0484455A4 (en)
Current status: Withdrawn

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building

Definitions

  • the present invention defines a method and apparatus for recognizing aspects of sound. More specifically, the invention allows recognition by pre-storing a histogram of occurrences of spectral vectors of all the aspects and building an occurrence table of these spectral vectors for each known aspect. Pattern recognition is used to recognize the closest match to this occurrence table to recognize the aspect.
  • This aspect may include identifying a language being spoken, identifying a particular speaker, identifying a device, such as a helicopter or airplane, and a type of the device, and identifying a radar signature, for instance.
  • a user may have a tape recording of information which the user needs to understand. If this information is in a foreign language, it may need to be translated. However, without knowing what language the information is in, it will be difficult for the user to choose a proper translator.
  • the English language for example, has thirty-eight phonetic sounds that make up every single word. In average English continuous speech, there are approximately ten phonetic sounds which are uttered every second. Other languages are composed of other phonetic sounds.
  • Prior techniques for recognizing languages have attempted to identify a number of these phonetic sounds. When a determined number of phonetic sounds are identified, a match to the particular language which has these phonetic sounds is established. However, this technique takes a long time to determine the proper language, and may allow errors in the language determination. The inventor of the present invention has recognized that one reason for this is that certain phonetic sounds are found in more than one language. Therefore, it would take a very long time to recognize any particular language, as many of the phonetic sounds, some of which are infrequently uttered, will have to be recognized before a positive language match can be determined.
  • the present invention makes use of this property of languages in a new way which is independent of the actual phonetic sounds which are being uttered.
  • the present invention obviates all these problems which have existed in the prior art by providing a new technique for recognizing aspects of sound. According to the present invention, these aspects can include identifying a language being spoken, identifying a particular speaker, a device, a radar signature, or any other aspect. Identifying the language being spoken will be used herein as an example.
  • One aspect of the invention creates energy distribution diagrams for known speech. In the preferred embodiment, this is done by using an initial learning phase, during which histograms for each of the languages to be recognized are formed. This learning phase uses the two pass technique described below. A first pass enters a number of samples of speech, and each of these samples of speech is continually processed.
  • each sample of speech is Fast Fourier Transformed (FFT) to create a spectrum showing frequency content of the speech at that instant of time (a spectral vector).
  • This frequency content represents a sound at a particular instant.
  • the frequency content is compared with frequency contents which have been stored. If the current spectral vector is close enough to a previously stored spectral vector, a weighted average between the two is formed, and a weight indicating the frequency of occurrence of that spectral vector is incremented. If the current value is not similar to one which has been previously stored, it is stored with an initial weight of "1".
  • the end result of this first pass is a plurality of frequency spectra for the language, one for each of a plurality of instants of time, and a number of occurrences of each of these frequency spectra.
  • the most common frequency spectra, determined as those with the highest numbers of occurrences, are selected for each language to form a basis set for the language.
  • These frequency spectra for each of the languages are grouped together to form a composite basis set. This composite basis set therefore includes the most commonly occurring frequency spectra for each of the many languages which can be recognized.
  • a second pass then puts a sample of sounds, which may be the same sounds or different sounds than the previously obtained sounds, through the Fast Fourier Transform to again obtain frequency spectra.
  • the obtained frequency spectra are compared against all of the pre-stored frequency spectra in the composite basis set, and a closest match is determined. A count of occurrences of each frequency spectrum in the composite basis set is maintained.
  • In this way, a number of occurrences of each of the frequency spectra for each of the languages is obtained.
  • This information is used to form a histogram relating the various spectra of the composite basis set to the number of occurrences of each of these spectra.
  • This histogram is used during the recognition phase to determine a closest fit between an unknown language which is currently being spoken and one of the known languages which have been represented in terms of histograms during the learning phase.
  • During recognition, the unknown language is Fast Fourier Transformed in the same way, and a histogram is formed from the number of occurrences of each element of the composite basis set. This histogram of the unknown language is compared against all of the histograms for all of the known languages, and a closest fit is determined.
  • inter-language dependencies come from the composite basis set including the most common spectral distributions from each of the languages to be determined and not just from the one particular language.
  • spectral distributions at predetermined instants of time ensure that all phonetic sounds, and not just those which are the easiest to recognize using machine recognition, enter into the recognition process.
  • FIGURE 1 shows a block diagram of the hardware used according to the present invention
  • FIGURES 2A and 2B respectively show summary flowcharts of the learning and recognition phases of the present invention
  • FIGURE 3 shows a flowchart used by the first pass of the learning of the present invention, in which the composite basis vector set is formed
  • FIGURE 4 shows a flowchart of the second pass of the learning operation of the present invention in which the histograms for each of a plurality of languages are formed
  • FIGURE 5 shows the recognition phase of the present invention in which an unknown language to be determined is compared against the pre-stored references;
  • FIGURE 6 shows a summary flowchart using the concepts of FIGURES 3-5 but applied to speaker identification
  • FIGURES 7A-7C show representative histograms for English, Russian, and Chinese.
  • FIGURE 1 shows an overview of the hardware configuration of the recognition system of the present invention.
  • the initial data comes from an audio source 100 which can be a tape recorder, a radio, a radar device, a microphone or any other source of sound.
  • the information is first amplified by amplifier 102, and then is band pass filtered by band pass filter 104.
  • Band pass filter 104 limits the pass band of the filter to telephone bandwidths, approximately 90 Hz to 3800 Hz. This is necessary to prevent the so-called aliasing or frequency folding in the sampling process. It would be understood by those of skill in the art that the aliasing filter may not be necessary for other than speech applications.
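  • For illustration only, a rough software analogue of such a 90 Hz to 3800 Hz pass band is sketched below using SciPy (hypothetical helper; in the patent the anti-aliasing filter is an analog stage placed ahead of the A-D converter, so a digital filter applied after sampling cannot itself prevent aliasing).

```python
# Illustrative sketch only: the patent's band-pass filter 104 is an analog
# stage ahead of A-D converter 108.  This digital Butterworth filter merely
# approximates the same 90 Hz - 3800 Hz pass band on already-sampled data.
import numpy as np
from scipy.signal import butter, lfilter

def bandpass_90_3800(signal, fs=8000.0, order=4):
    # Band edges taken from the pass band stated in the text.
    b, a = butter(order, [90.0, 3800.0], btype="bandpass", fs=fs)
    return lfilter(b, a, signal)
```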
  • the band pass filtered signal 105 is coupled to first processor 106.
  • First processor 106 includes an A-D converter 108 which digitizes the band pass filtered sounds 105 at 8 kHz to produce a 14 bit signal 110.
  • the digitized signal 110 is coupled to a digital signal processor 112 which processes the language recognition according to the invention as will be described later with reference to the flowcharts.
  • User interface is accomplished using an 80286-80287 microprocessor pair which is coupled to a user interface 116.
  • the actual operation of the present invention is controlled by the signal processor 112, which in this embodiment is a TI TMS320C25.
  • the code for the C25 in this embodiment was written in TI assembler language and assembled using a TI XASM25 assembler. This code will be described in detail herein.
  • the first embodiment of the invention recognizes a language which is being spoken, from among a plurality of languages.
  • the language recognition system of the present invention typically operates using pre-stored language recognition information. The general operation of the system is shown by the flowcharts of FIGURE 2.
  • FIGURE 2A begins at step 200 with a learning mode which is done off line.
  • a known language is entered at step 202.
  • This known language is converted into basis vectors or a set of representative sounds at step 204.
  • the basis vectors are combined into a composite basis vector, and a histogram of occurrences of the elements of the composite basis vector is created at step 206.
  • FIGURE 2B shows the recognition mode which is the mode normally operating in digital signal processor 112.
  • FIGURE 2B begins with step 220, in which the unknown language is entered at step 222.
  • the unknown language is compared with the basis vector to build a histogram. Euclidean distance to each of the basis vectors in the composite basis vector is determined at step 224 to recognize a language.
  • the learning mode is a mode in which the reference basis vectors and histograms, used to recognize the spoken language, are created. Once these vectors are created, they are user-transparent, and are stored in memory 122. Depending on the amount of memory available, many different basis vectors may be created and stored. For instance, different basis vectors can be created for all known languages, as well as all known dialects of all known languages. Alternately, only the most common ones may be created, if desired.
  • the technique used to create the basis vectors will now be described in detail. This technique uses a two pass system of learning. In summary, the first pass determines all possible spectral contents of all languages, and the second pass determines the occurrences of each of these spectral contents.
  • FIGURE 3 shows the first pass of the learning mode of the present invention.
  • the learning mode exposes the computer system to a known language such that the computer system, using the unique technique of the present invention, can produce the basis vectors used for later recognition of this known language. Using pattern recognition parlance, this is doing a "feature selection".
  • the technique of the present invention arranges these features in a sequence and uses them in a process called vector quantization which will be described herein.
  • the first pass of the embodiment of the present invention creates a first bank of information for each language.
  • the first pass uses at least five speakers, each of which speak for at least five minutes. A better distribution may be obtained by using five male and five female speakers. However, the actual number of speakers and time of speaking can obviously be changed without changing the present invention.
  • the data is entered into the system at step 300, where the A-D converter 108 digitizes the sound at 8 kHz, so that each 128-sample frame spans 16 ms.
  • a 128 point butterfly Fast Fourier Transform (FFT) is done after 128 samples are taken. This equivalently creates information which represents the energy in each of a plurality of frequency cells.
  • the 128 point FFT results in sixty-four indications of energy, each indicating an energy in one spectral range.
  • each of these numbers is represented in the computer by a word, and each of the sixty-four words represent the energy in one of the cells.
  • the cells are evenly spaced from 0 to 3800 hertz, and therefore are each separated by approximately 60 Hz. Therefore, the sixty-four numbers represent energy in 60 hertz cells over the spectral range extending from 0 to 3800 Hz.
  • the 128 point FFT gives us 64 numbers representing these 64 cells. Therefore, for instance, cell 1 covers from 0 through approximately 60 hertz (this should always be zero due to the bandpass filtering below 90 hertz). Cell 2 covers approximately 60 through approximately 120 hertz. ... Cell 64 covers approximately 3740 through 3800 hertz. Each of the cells is represented by two 8-bit bytes or one computer word.
  • the 64 word array therefore represents a spectral analysis of the entered sound at a snapshot of time.
  • the 64 computer words, taken as a whole, are called the SPECTRA vector. At any given time, this vector represents the energy distribution of the spoken sound.
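  • A minimal sketch of how such a 64-word SPECTRA vector might be computed is shown below, assuming a NumPy FFT in place of the patent's 128-point butterfly FFT on the TMS320C25; the use of squared magnitude as the per-cell energy measure is an assumption, and the names are illustrative.

```python
import numpy as np

SAMPLE_RATE = 8000   # Hz, the rate stated for A-D converter 108
FRAME_SIZE = 128     # samples per FFT frame (16 ms at 8 kHz)

def spectra_vector(frame):
    """Return 64 energy values, one per roughly 60 Hz cell from 0 to about 3800 Hz."""
    assert len(frame) == FRAME_SIZE
    spectrum = np.fft.rfft(frame)        # 65 complex bins for a 128-sample frame
    return np.abs(spectrum[:64]) ** 2    # keep 64 cells; energy taken as |X|^2
```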
  • the process gives us 64 words of data.
  • This data is then stored in an array called SPECTRA, which has 64 memory locations. Since this information is also obtained every period of time, the array in which it is stored must also have a second dimension for holding the information obtained at each period of time.
  • At step 306, the contents of the array SPECTRA at position N are obtained. While step 306 shows the value (N,64), it should be understood that this is shorthand for SPECTRA (N,1-64), and is intended to denote the contents of the entire SPECTRA vector from position 1 through position 64.
  • Once SPECTRA (N,64) is obtained, the current values are compared with this stored SPECTRA (N,64) using a dot product technique. This dot product technique will be described in detail later on. To summarize, however, the dot product produces an angle indicative of a vector difference between the vector formed by the current values and the vector formed by SPECTRA (N,64), which is from 0 to 90°. This embodiment considers the two vectors to be similar if the angle of difference is less than 2.5°.
  • An array of weights is stored as WEIGHT (N), in which the number of values which have been averaged into the array SPECTRA at position N is maintained. This value WEIGHT (N) is obtained and stored in a first temporary position T1. The value of the array SPECTRA (N,64) at position N is multiplied by T1 (the number of values making up the weighted value) and maintained at a second temporary position T2. A third temporary position T3 gets the value of T2 added to the current values, to produce a new weighted value in position T3.
  • WEIGHT (N) is then incremented to indicate one additional value stored in SPECTRA (N,64), and the new weighted average value of SPECTRA (N,64) is stored in the proper position by dividing the value of T3 by the incremented weight.
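  • As a sketch of this arithmetic only (the T1, T2, T3 names follow the text; this is an illustration of the update, not the actual TMS320C25 assembler):

```python
def merge_into_basis(stored_spectra, weight, current):
    """Weighted-average update of one stored SPECTRA vector and its WEIGHT."""
    t1 = weight                         # number of values already averaged in
    t2 = stored_spectra * t1            # undo the previous averaging
    t3 = t2 + current                   # add the newly observed spectral vector
    new_weight = t1 + 1                 # WEIGHT(N) is incremented
    return t3 / new_weight, new_weight  # re-normalised weighted average
```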
  • A flag is also set to 0 indicating that the current value has been stored, and the loop is ended in any appropriate way, depending upon the programming language which is being used.
  • If the result at step 310 is no (the angle is not less than 2.5°), the loop is incremented to the next N value at step 314. This is done until the last N value has been tested and therefore all of the values of the SPECTRA array have been tested. If the angle is greater than 2.5° for all values already stored at the end of the loop, this means that no previously stored value is sufficiently close to the current values to do a weighted average, and the current values therefore need to be stored as a new value. Therefore, step 350 is executed in which the current values are stored in the array SPECTRA (I,64) at position I. Step 354 sets WEIGHT (I) of the weight matrix to 1, indicating that one value is stored in position I of SPECTRA.
  • the value i (the pointer) is then incremented at step 356, and control then passes to position A in FIGURE 3. Position A returns to step 300 where another sound is digitized.
  • the loop is ended either by an external timer interrupt, or by the operator.
  • a typical pass of information would be five minutes of information for five different speakers of each sex. This creates a set of features from the five speakers which indicates average spectral distributions of sound across these five people.
  • Each set of 64 values obtained from the FFT can be considered as a vector having magnitude and direction (in 64 dimensions).
  • The dot product of two such vectors A and B is A · B = |A| |B| cos θ, where |A| and |B| are the magnitudes of the vectors.
  • the desired end result of the dot product is the value of the angle θ, which is the correlation angle between the two vectors. Conceptually, this angle indicates the similarity in directions between the two vectors.
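  • A sketch of that correlation-angle test, assuming the angle is simply recovered from the normalised dot product (the 2.5° threshold is the one stated in the text; function names are illustrative):

```python
import numpy as np

def correlation_angle_deg(a, b):
    """Angle between two SPECTRA vectors, from cos(theta) = (A . B) / (|A| |B|)."""
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    cos_theta = np.clip(cos_theta, -1.0, 1.0)   # guard against rounding error
    return np.degrees(np.arccos(cos_theta))

def is_similar(a, b, threshold_deg=2.5):
    # The embodiment treats two vectors as the same basis sound when the
    # angle between them is less than 2.5 degrees.
    return correlation_angle_deg(a, b) < threshold_deg
```

Because the spectral energies are non-negative, the angle always falls between 0° and 90°, as the text notes.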
  • After pass 1 is completed, a number of basis vectors are obtained, and each one has a weight which indicates the number of occurrences of that vector.
  • the basis vectors created, along with the weights, are further processed in pass 2. It is understood that pass 1 should be processed in real time, to minimize the amount of memory used. However, with unlimited storage, both pass 1 and pass 2 could be performed as a single sample is taken. Alternately, with a sufficient amount of processor capability, both pass 1 and pass 2 could simultaneously be processed while the data is being obtained.
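  • Putting these pieces together, pass 1 can be sketched as the codebook-building loop below. This is a hedged reconstruction of FIGURE 3 reusing the spectra_vector and correlation_angle_deg helpers sketched above; list-based storage stands in for the fixed SPECTRA and WEIGHT arrays of the embodiment.

```python
def learn_pass1(frames, angle_threshold_deg=2.5):
    """Pass 1: build (SPECTRA, WEIGHT) pairs for one known language."""
    spectra = []   # weighted-average basis vectors (the SPECTRA array)
    weights = []   # frames merged into each basis vector (the WEIGHT array)
    for frame in frames:
        current = spectra_vector(frame)
        for n, stored in enumerate(spectra):
            if correlation_angle_deg(current, stored) < angle_threshold_deg:
                # Close enough: form the weighted average and bump the count.
                spectra[n] = (stored * weights[n] + current) / (weights[n] + 1)
                weights[n] += 1
                break
        else:
            # No stored vector within 2.5 degrees: store as a new basis
            # vector with an initial weight of 1 (steps 350-356).
            spectra.append(current)
            weights.append(1)
    return spectra, weights
```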
  • the pass 2 operation creates a histogram using information from the basis sets which have already been created in pass 1.
  • This histogram represents the frequency of occurrence for each basis sound for each language or speaker.
  • the key point of the present invention is that the histogram which is created, is an occurrence vector of each basis set among all basis sets for all languages to be recognized, and does not represent the basis sounds themselves. This will be described in detail with reference to FIGURE 4 which represents the pass 2 technique.
  • What is obtained at the end of pass 1 is an average of the spectral content of all occurrences of the sounds which have been detected in the language, and the weight (number of times of occurrence) for each spectrum.
  • Each spectrum represents one basis vector, and each basis vector has a weight dependent on its frequency of occurrence.
  • At the end of pass 1, we therefore have enough information to prepare a histogram between the different basis vectors in the language and the frequency of occurrence of each of these basis vectors. This would be sufficient to prepare a histogram which would enable the different languages to be recognized.
  • pass 2 adds additional inter-language dependency to this technique which enables the recognition process to converge faster.
  • Pass 2 can be conceptually explained as follows.
  • Each language, as discussed above, consists of a number of phonetic sounds which are common to the language. By determining the frequency of occurrence of these phonetic sounds, the language could be recognized. However, different languages share common phonetic sounds.
  • phonetic sound x may be common to English, French and German. It may even have a relatively high frequency of occurrence in all three languages.
  • Phonetic sound y may also be common to English, French and German, but may have a high frequency of occurrence in the English language. In the other languages, phonetic sound y may have a low frequency of occurrence.
  • Another problem with prior recognition systems is that some phonetic sounds are sub-vocalized, and therefore hard to recognize.
  • the inventor of the present invention has recognized that the inter-language dependencies (that is, phonetic sounds which are common to multiple languages) enable ready recognition of the various languages.
  • the inventor has also recognized that spectral distributions calculated at all times obviate the problem of difficulty of detecting sub-vocalized sounds.
  • Pass 2 calculates the histograms by using all the values determined in pass 1 for all languages, to add inter-language dependencies between the various languages.
  • Step 400 gets the x most common SPECTRA values (those with the highest weights) for each of y languages to be recognized and stores this in the CBASIS array.
  • the preferred value of x is 15. If, for example, there are ten languages to be recognized, this yields 150 x 64 entries in the array CBASIS.
  • Each of these 150 entries represents a basis vector which has been found as having a high occurrence in one of the languages to be recognized.
  • Each of these basis vectors has a high frequency of occurrence in one language, but may also occur in the other languages. By counting occurrences against the entire composite set, the inter-language dependencies of the various sounds (SPECTRA (64)) in each of the languages can be determined, not just their occurrences in the one language in which each vector was originally found.
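  • Step 400 might therefore be sketched as follows, selecting the x = 15 most heavily weighted basis vectors per language and stacking them into the composite basis set (names hypothetical; per_language_codebooks is assumed to hold the learn_pass1 output for each language):

```python
import numpy as np

def build_cbasis(per_language_codebooks, x=15):
    """CBASIS: the x most common basis vectors from each of the y languages."""
    cbasis = []
    for spectra, weights in per_language_codebooks:
        order = np.argsort(weights)[::-1][:x]          # highest weights first
        cbasis.extend(spectra[i] for i in order)
    return np.array(cbasis)   # shape (x*y, 64), e.g. 150 x 64 for ten languages
```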
  • Step 402 begins the second pass in which new sounds from the language to be recognized are obtained. These sounds are digitized and fast Fourier transformed in the same way as steps 300 and 302 of FIGURE 3.
  • the next step for each sound which is entered is to form the histogram for each known language.
  • a for loop is set up between steps 404 and 406, which increments between 1 and (x*y) (which is all of the various basis vectors).
  • each element of the composite vector array CBASIS is compared with the current SPECTRA which has been obtained at step 410.
  • the comparison is actually a measurement of Euclidean distance, comparing the incoming SPECTRA (64) with each vector in the composite basis set CBASIS (n,64).
  • Step 412 determines if this distance is less than 20,000. This value has been empirically determined as sufficiently close to represent a "hit". If the value is less than 20,000, it is compared against a previous lowest answer which has been previously stored.
  • a very large "previous answer" is initially stored as an initial value. If the current answer is greater than the previous answer, flow passes to step 406, which increments the loop without changing the current stored minimum. If the answer at step 412 is less than 20,000, and the answer at 414 is less than the previous answer, that means that this pass of the loop has produced a lower answer than any previous pass of the loop. Accordingly, the current answer becomes the previous answer at step 416, and the current count of the closest match, which is kept in a temporary location T1, becomes N (the current loop count). The loop is then incremented again at step 406. The temporary location T1 keeps the number of the lowest answer, and therefore the closest match. Accordingly, as long as an answer less than 20,000 has been determined at step 412, the histogram address of T1 is incremented at step 420.
  • the histogram array or vector is successively incremented through its different values as the loop is executed.
  • Each of the different values of the histogram represents one specific sound or SPECTRA, among the set of sounds or SPECTRA making up the most common spectral distributions of each of the known languages.
  • the effect is to find an average distribution for the particular language. This average distribution also includes the effect of inter-language dependency.
  • Pass 2 therefore provides us with a histogram in which each of a plurality of sounds or SPECTRA from each of the languages are plotted to show their number of occurrences.
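  • Pass 2 (steps 402-420 of FIGURE 4) can then be sketched as counting, for each incoming SPECTRA vector, the nearest CBASIS entry lying within the empirically chosen Euclidean-distance threshold of 20,000. This is a hedged reconstruction, with helper names carried over from the earlier sketches.

```python
import numpy as np

def learn_pass2(frames, cbasis, distance_threshold=20_000.0):
    """Pass 2: histogram of nearest-CBASIS-entry occurrences for one language."""
    histogram = np.zeros(len(cbasis), dtype=int)
    for frame in frames:
        current = spectra_vector(frame)
        distances = np.linalg.norm(cbasis - current, axis=1)   # Euclidean distance
        nearest = int(np.argmin(distances))
        if distances[nearest] < distance_threshold:            # a "hit" (step 412)
            histogram[nearest] += 1                            # step 420
    return histogram
```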
  • These reference histograms are used during the recognition phase, which will be described in detail with reference to FIGURE 5.
  • The FIGURE 5 flowchart shows the steps used by the present invention to recognize one of the plurality of languages, and therefore is the one that is normally executed by the hardware assembly shown in FIGURE 1.
  • the learning modes will typically have been done prior to the final operation and are therefore transparent to the user.
  • histograms for each of the languages of interest have therefore been previously produced.
  • the objective of the recognition mode is to find the histogram vector, among the set of known histogram vectors, which is closest to the histogram vector created for the unknown language. This is done by determining the Euclidean distances to the known language histogram vectors. If the nearest Euclidean distance is sufficiently close, this is assumed to be a match, and therefore indicates a recognition. For purposes of explanation, the Euclidean distance will now be described. So-called Euclidean distance is the distance between two vector points in free space. Using the terminology that the two vectors are A and B, each with n elements, the Euclidean distance is d = sqrt( (A1-B1)² + (A2-B2)² + ... + (An-Bn)² ), that is, the square root of the sum of the squared differences between corresponding elements.
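  • A minimal sketch of that distance measure (illustrative helper name):

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between corresponding elements."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))
```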
  • Step 500 is the initial step of the recognition phase, and could equally well be a part of the preformed data.
  • Step 500 first loads the composite basis array CBASIS where the array has xy elements: the x most common SPECTRA values for each of the y languages to be recognized.
  • Step 500 also loads y histograms, and using the CBASIS array and the y histograms forms a reference array.
  • This reference array has a correlation between each of the y histograms, each of the xy SPECTRA in each of the histograms, and the values of the xy SPECTRA.
  • Step 502 gets the sounds of the language to be analyzed and digitizes and FFTs these sounds, similar to the way this is done in steps 300 and 302 of FIGURE 3.
  • Step 504 compares the input sounds against silence. This is done according to the present invention by summing all of the SPECTRA cells. If these add up to forty or less, the SPECTRA is labeled as silence and is appropriately ignored. If the SPECTRA is determined not to be silence in step 504, a histogram for the language to be analyzed is created at step 506. This histogram is created in the same way as the histogram created in steps 404-420 of FIGURE 4, using all of the spectral categories for all of the languages to be analyzed.
  • Step 508 compares the histogram for the language to be analyzed to all elements of the reference array 1 through y where y is the number of languages being analyzed. This comparison yields a euclidian distance for each of the values 1 through y.
  • Step 510 determines the minimum among these Euclidean distances and determines if this minimum is less than 20,000. If the minimum distance is not less than 20,000, step 512 updates the histogram for the language to be analyzed, and returns control to step 508 to redo the test. At this point, we assume that the analysis has not "converged". However, if the result is positive at step 510, and the minimum distance is less than 20,000, then the minimum distance language is determined to be the proper one at step 512, thus ending the recognition phase.
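  • The recognition phase of FIGURE 5 can be sketched the same way: keep a running histogram for the unknown speech, skipping frames classified as silence when the 64 cell energies sum to forty or less, and stop once some reference histogram lies within the 20,000 threshold. This is a hedged reconstruction reusing the earlier helpers; the text states no normalisation of the histograms, so none is applied here.

```python
import numpy as np

def recognize(frames, cbasis, reference_histograms, distance_threshold=20_000.0,
              silence_level=40.0):
    """Return the index of the matching reference, or None if never converged."""
    unknown = np.zeros(len(cbasis), dtype=int)
    for frame in frames:
        current = spectra_vector(frame)
        if current.sum() <= silence_level:      # step 504: ignore silence
            continue
        nearest = int(np.argmin(np.linalg.norm(cbasis - current, axis=1)))
        unknown[nearest] += 1                   # step 506: update the histogram
        # Steps 508-512: compare against every reference histogram.
        distances = [np.linalg.norm(unknown - ref) for ref in reference_histograms]
        best = int(np.argmin(distances))
        if distances[best] < distance_threshold:
            return best                         # recognition has converged
    return None
```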
  • the present invention enables a quicker determination of a proper language.
  • Although a phonetic sound may be present in two or more languages, typically this phonetic sound will sound slightly different in different languages.
  • the closest possible fit is determined. Therefore, even if there are many similar sounds, the closest one will be chosen, thereby choosing the proper language even when the sounds are similar for different languages. This enables the recognition to converge faster.
  • An additional nuance of the system averages all the language histograms and creates a null language. This null language is loaded as one of the y histograms. Whenever the system recognizes this null language as being the closest match, this is determined as a rejection of the language.
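  • A sketch of that rejection mechanism, assuming the null language is simply the element-wise average of the known-language histograms (the text gives no further detail):

```python
import numpy as np

def add_null_language(reference_histograms):
    """Append the averaged 'null language' histogram used for rejection."""
    null_histogram = np.mean(reference_histograms, axis=0)
    return list(reference_histograms) + [null_histogram]
```

If recognize() then returns the index of this last entry, the input is rejected rather than identified as any real language.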
  • a second embodiment of the invention operates similarly to the first embodiment, but the aspect to be determined is optimized for speaker identification, as compared with language identification.
  • Language identification identifies the language which is being spoken.
  • Speaker identification identifies the specific speaker who is speaking the language. The techniques and concepts are much the same as the first embodiment.
  • This second embodiment is shown in the flowchart of FIGURE 6 in somewhat summary form. Step 600 executes pass 1 for each of the speakers to be recognized. Pass 1 is executed for each speaker for five minutes, or for some other user selectable amount of time. This creates a set of basis vectors for each of the z speakers to be recognized.
  • the pass 2 system is executed at step 602 where the x most common SPECTRA values for each of the z speakers to be recognized is first determined to form CBASIS, or composite basis vector just as in pass 2 shown in FIGURE 4.
  • Step 604 then executes the rest of pass 2 with the only exception that step 412 in FIGURE 4 is replaced with a comparison with 15,000 as the Euclidean distance instead of the comparison with 20,000. This is because the match for speaker recognition is required to be closer than the necessary match for language recognition.
  • At this point, the histograms for each of the speakers to be analyzed have been formed.
  • Step 606 begins the recognize phase, and executes all elements of the recognize flowchart of FIGURE 5 with the exception of step 510 in which the value to be compared with is 15,000.
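  • In terms of the earlier sketches, the speaker-identification variant would differ only in the tighter thresholds; the call below is purely illustrative, with per_speaker_training_frames and unknown_frames standing in for real audio data.

```python
# Hypothetical usage of the earlier sketches with the 15,000 threshold the
# text gives for speaker matching (both in pass 2 and in recognition).
speaker_histograms = [learn_pass2(frames, cbasis, distance_threshold=15_000.0)
                      for frames in per_speaker_training_frames]
who = recognize(unknown_frames, cbasis, speaker_histograms,
                distance_threshold=15_000.0)
```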
  • the system is operated by use of a plurality of user friendly menus which enable the user to perform various functions.
  • the main menu allows the user to choose between building new basis sets, looking at previously stored language histograms, or entering the recognized language's menu.
  • Some sub-menus allow changing the rate of sampling, the number of points of FFT transformation, and the different ways in which the data is being distributed.
  • A sample set of reference histograms for English, Chinese, and Russian are shown in FIGURES 7A-7C. These histograms show the sound indicated by numbers on the x axis, and show the number of occurrences on the y axis. These examples use only approximately 68 different sounds as the possible sounds, but it is understood that many more than these are possible to be used. Many modifications in the above program and technique are possible. For instance, as stated above, it would be quite feasible to operate the entire learning phase in a single pass, assuming that sufficient processing speed and power and sufficient memory were available. This would obviate the need for two different entries of data. Of course, the various empirical values which have been described herein could be modified by users. In addition, any number of languages could be used by this system, limited only by the amount of available memory space.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)

Abstract

An audio source (100) is amplified (102), filtered (104) and then digitized (108) so that a Fourier transform can be performed by a digital signal processor (112). The frequency elements of interest are then formed into a histogram over a period of about 5 minutes, with the speech being sampled every 16 ms. The histogram and the identification of the audio source are carried out by a computer (114) controlled by a suitably programmed algorithm.
EP19900913683 1989-07-28 1990-07-20 A method and apparatus for language and speaker recognition Withdrawn EP0484455A4 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US38642589A 1989-07-28 1989-07-28
US386425 1989-07-28

Publications (2)

Publication Number Publication Date
EP0484455A1 true EP0484455A1 (fr) 1992-05-13
EP0484455A4 EP0484455A4 (en) 1993-03-10

Family

ID=23525511

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19900913683 Withdrawn EP0484455A4 (en) 1989-07-28 1990-07-20 A method and apparatus for language and speaker recognition

Country Status (3)

Country Link
EP (1) EP0484455A4 (fr)
CA (1) CA2063723A1 (fr)
WO (1) WO1991002347A1 (fr)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5548507A (en) 1994-03-14 1996-08-20 International Business Machines Corporation Language identification process using coded language words
US6009382A (en) * 1996-08-19 1999-12-28 International Business Machines Corporation Word storage table for natural language determination
US6023670A (en) * 1996-08-19 2000-02-08 International Business Machines Corporation Natural language determination using correlation between common words
US5913185A (en) * 1996-08-19 1999-06-15 International Business Machines Corporation Determining a natural language shift in a computer document
US6002998A (en) * 1996-09-30 1999-12-14 International Business Machines Corporation Fast, efficient hardware mechanism for natural language determination
AU2002300314B2 (en) * 2002-07-29 2009-01-22 Hearworks Pty. Ltd. Apparatus And Method For Frequency Transposition In Hearing Aids

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3673331A (en) * 1970-01-19 1972-06-27 Texas Instruments Inc Identity verification by voice signals in the frequency domain
GB2033637A (en) * 1978-10-10 1980-05-21 Philips Nv Method of verifying a speaker
EP0273615A1 (fr) * 1986-12-17 1988-07-06 BRITISH TELECOMMUNICATIONS public limited company Identification du locuteur

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE2536585C3 (de) * 1975-08-16 1981-04-02 Philips Patentverwaltung Gmbh, 2000 Hamburg Anordnung zur statistischen Signalanalyse
JPS58130396A (ja) * 1982-01-29 1983-08-03 株式会社東芝 音声認識装置
US4720863A (en) * 1982-11-03 1988-01-19 Itt Defense Communications Method and apparatus for text-independent speaker recognition
JPS6057475A (ja) * 1983-09-07 1985-04-03 Toshiba Corp パタ−ン認識方式
US4827519A (en) * 1985-09-19 1989-05-02 Ricoh Company, Ltd. Voice recognition system using voice power patterns

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3673331A (en) * 1970-01-19 1972-06-27 Texas Instruments Inc Identity verification by voice signals in the frequency domain
GB2033637A (en) * 1978-10-10 1980-05-21 Philips Nv Method of verifying a speaker
EP0273615A1 (fr) * 1986-12-17 1988-07-06 BRITISH TELECOMMUNICATIONS public limited company Identification du locuteur

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
AT & T/TECHNICAL JOURNAL, vol. 66, no. 2, March-April 1987, pages 14-26, Short Hills, NJ, US; F.K. SOONG et al.: "A vector quantization approach to speaker recognition" *
EUROPEAN CONFERENCE ON SPEECH TECHNOLOGY, Edinburgh, September 1987, vol. 2, pages 426-427, CEP Consultants, Edinburgh, GB; A. AKTAS et al.: "A selforganizing clustering technique for vector quantization in speech recognition" *
ICASSP'86 (IEEE-IECEJ-ASJ INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, Tokyo, 7th - 11th April 1986), vol. 4, pages 2679-2682, IEEE, New York, US; O. WATANUKI et al.: "Speaker-independent isolated word recognition using label histograms" *
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, vol. ASSP-33, no. 2, April 1985, pages 440-443, New York, US; D.K. BURTON et al.: "Speaker-dependent isolated word recognition using speaker-independent vector quantization codebooks augmented with speaker-specific data" *
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, vol. ASSP-35, no. 2, February 1987, pages 133-143, New York, US; D.K. BURTON: "Text-dependent speaker verification using vector quantization source coding" *
See also references of WO9102347A1 *
SYSTEMS & COMPUTERS IN JAPAN, vol. 19, no. 6, June 1988, pages 63-71, Silver Spring, MD, US; K. SHIRAI et al.: "Speaker identification based on frequency distribution of vector-quantized spectra" *

Also Published As

Publication number Publication date
WO1991002347A1 (fr) 1991-02-21
EP0484455A4 (en) 1993-03-10
CA2063723A1 (fr) 1991-01-29

Similar Documents

Publication Publication Date Title
US5189727A (en) Method and apparatus for language and speaker recognition
US4100370A (en) Voice verification system based on word pronunciation
EP0601778B1 (fr) Classification de mots-clés/de mots non clés dans la reconnaissance du langage par mots isolés
AU682380B2 (en) Multi-language speech recognition system
US5862519A (en) Blind clustering of data with application to speech processing systems
US5457770A (en) Speaker independent speech recognition system and method using neural network and/or DP matching technique
US3989896A (en) Method and apparatus for speech identification
CA2304747C (fr) Reconnaissance de formes au moyen de modeles de reference multiples
CN108831456A (zh) 一种通过语音识别对视频标记的方法、装置及系统
Kekre et al. Performance comparison of speaker recognition using vector quantization by LBG and KFCG
US5864807A (en) Method and apparatus for training a speaker recognition system
EP0484455A1 (fr) Procede et appareil d'identification d'une langue et de l'interlocuteur
Rudresh et al. Performance analysis of speech digit recognition using cepstrum and vector quantization
US5721807A (en) Method and neural network for speech recognition using a correlogram as input
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Brunet et al. Speaker recognition for mobile user authentication: An android solution
Biem et al. A discriminative filter bank model for speech recognition.
Mousa MareText independent speaker identification based on K-mean algorithm
Faraoun et al. Artificial Immune Systems for text-dependent speaker recognition
Indumathi et al. Speaker identification using bagging techniques
Jokic et al. Towards enabling measurement of similarity of acoustical environments using mobile devices
Strube et al. Word and Speaker Recognition Based on Entire Words without Framewise Analysis¹
Kinoshita et al. Forensic voice comparison using sub-band cepstral distances as features: A first attempt with vowels from 306 Japanese speakers under channel mismatch conditions
Saha et al. An F-Ratio based optimization on noisy data for speaker recognition application
Pol et al. USE OF MEL FREQUENCY CEPSTRAL COEFFICIENTS FOR THE IMPLEMENTATION OF A SPEAKER RECOGNITION SYSTEM

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 19920203

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB IT LI LU NL SE

A4 Supplementary search report drawn up and despatched

Effective date: 19930122

AK Designated contracting states

Kind code of ref document: A4

Designated state(s): AT BE CH DE DK ES FR GB IT LI LU NL SE

17Q First examination report despatched

Effective date: 19951030

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 19960510