WO2002103677A1

WO2002103677A1 - Method of employing prefetch instructions in speech recognition

Info

Publication number: WO2002103677A1
Application number: PCT/CN2001/001028
Authority: WO
Inventors: Chunrong Lai; Qingwei Zhao; Jielin Pan
Original assignee: Intel China Ltd; Intel Corp
Current assignee: Intel China Ltd; Intel Corp
Priority date: 2001-06-19
Filing date: 2001-06-19
Publication date: 2002-12-27
Anticipated expiration: 2003-12-19
Also published as: CN1545696A; CN1223986C

Abstract

In general, the new prefetching method pursuant to one embodiment of the invention, employed by a computer system engaged in human speech recognition provides an efficient method of computing and searching speech features based on a Gaussian Distribution of acoustic Hidden Markov Model states. The new method transfers the speech data to be processed while the processor is engaged in acoustically processing a speech data in process. Accordingly, the prefetching method, pursuant to one embodiment of the invention, employed by a computer system engaged in human speech recognition, reduces or eliminates memory latency caused by the processor waiting idle while the memory transfers speech data to be processed to the processor.

Description

METHOD OF EMPLOYING PREFETCH INSTRUCTIONS IN SPEECH RECOGNITION

FIELD

[0001] The invention relates to speech recognition. More specifically, the invention relates to a new apparatus and method of employing prefetch instructions to transfer speech data to be acoustically processed from main memory to cache memory while the system is acoustically processing speech data in process during the acoustic recognition phase of the speech recognition process.

BACKGROUND

[0002] In the past few years, the art and science of human speech recognition by a machine has made significant advancements. Today, there are a number of applications for Automatic Speech Recognition (ASR) and Large Vocabulary Continuous Speech Recognition (LVCSR). To accomplish speech recognition, a computer system may be employed as a speech engine to handle a multitude of computations and searches to analyze and recognize acoustic signals which carry the human speech features. Accordingly, the efficiency of a computer system in performing these operations has an impact on the performance of the speech engine.

[0003] Generally, a speech recognition system performs several operations on a human speech signal to determine what was said. For example, when a person utters the words, "my name is John", a speech capturing device, such as a microphone, captures the utterance as an analog acoustic signal. The analog signal is then converted into a digital signal in order to be processed by a digital computer. The resulting signal which carries the speech features may be quantized and represented as a plurality of feature vectors using a mathematical model. For example, Mel Frequency Cepstral Coefficients (MFCC) may be used to represent the speech features.

[0004] The computed features are then processed acoustically by a computer system. During acoustic processing, the features are compared with known phonetic units included in an acoustic model. An example of an acoustic model is the Hidden Markov Model (HMM). The comparison of the speech feature with known phonetic units contained in the model may result in one or more matches. The matched phonetic units are then processed linguistically using, for example, a dictionary and a grammar lexicon in order to form a recognized word sequence.

[0005] To perform acoustic processing, the speech engine, uses a large number of probability distribution, for example, the mixture of M Gaussian distribution function of the N-dimension space which is the space of the feature vectors of the speech signal. The mean and variance of each feature vector is calculated and stored in a memory of the computer system. Later, each parameter is fetched from memory in order for the speech engine to complete the computation of the Gaussian functions.

[0006] Figure 1 , is an illustration of memory and execution cycles of a prior art computer system engaged in human speech recognition. The figure illustrates a time-based comparison of the execution unit and the memory bus during acoustic processing of a speech signal. While the memory bus is transferring speech data to be processed from memory, the execution unit remains idle until the data to be processed becomes available to the processor. This memory latency, i.e., the time wasted while the memory delivers the data to be processed adds up quickly because of the sheer number of computations necessary in acoustic analysis. This problem is particularly acute when an LVCSR is continuously receiving speech signal. Many operations need be completed every second, and this shortcoming severely limits the speed and efficiency of the system.

BRIEF DESCRIPTION OF THE DRAWINGS [0007] Figure 1 is an illustration of memory and execution cycles of a computer system engaged in acoustic processing according to prior art. [0008] Figure 2 is a block diagram of an exemplary speech recognition system according to the method pursuant to one embodiment of the invention. [0009] Figure 3 is a flow diagram of an exemplary speech recognition system according to the method pursuant to one embodiment of the invention. [0010] Figure 4 is an exemplary method of speech feature computations during acoustic processing of a speech signal.

[0011] Figure 5 is an exemplary computer code in C language which employs the new prefetching technique according to the methods pursuant to the invention.

[0012] Figure 6 is an exemplary computer code in assembly language which employs the new prefetching technique according to the methods pursuant to one embodiment of the invention.

[0013] Figure 7, is an illustration of memory and execution cycles of an exemplary computer system engaged in acoustic processing according to the method pursuant to one embodiment of the invention.

DETAILED DESCRIPTION OF THE INVENTION [0014] In the following detailed description of the embodiments of the invention, numerous specific details are set forth. However, it will be obvious to one skilled in the art that the methodology pursuant to the embodiments of the invention may be practiced without these specific details. In other instances, well known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments of the invention.

[0015] The methods pursuant to the invention include various functional steps, which will be described below. The functional steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose processor programmed with the instructions to perform the functional steps. Alternatively, the functional steps may be performed by a combination of hardware and software. [0016] The embodiments of the invention reveal a new prefetching technique to be implemented during the acoustic processing phase of human speech recognition. The new prefetching technique can be used to reduce or eliminate the memory latency resulting from the execution unit waiting idle while the data to be processed is being delivered to it from main memory during acoustic processing. In a preferred embodiment, for example, while the execution unit is busy computing speech features, the application executes in parallel a prefetch instruction for the data to be processed. Accordingly, while the execution unit is busy computing, the memory bus is busy prefetching the data that the execution unit will need for its next computation.

[0017] Referring now to Figure 2, a block diagram of an exemplary speech recognition system 200 is shown. The system comprises a speech capturing device 210, an analog-to-digital converter 212, a computer system 250, and a series of I/O devices, such as controller device 240, display device 242, network interface card 244, and printing device 246. The computer system 250, in turn, comprises a processor 252, memory 280, cache memory 260, cache controller 262, memory bus 272, and I/O bus 270. Optionally the computer system may further comprise a direct memory access 242.

[0018] The system operates as follows: A person speaks into the microphone 210 resulting in an analog speech signal. This signal is then passed through the analog-to-digital converter 212 in order to form a digitalized representation of the analog speech signal. The digitalized representation is then input to the computer system 250. The processor 252 then begins identifying speech features associated with the speech signal and stores these features in the memory 280 of the computer system 250. A cache memory 260 is utilized to store prefetched data needed in the computation of speech features. A cache controller 262 which is coupled to both processor 252 and cache memory 260 coordinates data transfers between the processor 252 and cache memory 260. [0019] Also stored in memory 280 is a plurality of known phonetic units, known as an acoustic model. The acoustic model employed by this embodiment may be a speaker-dependent (SD) model or it may be a speaker-independent (SI) model. An SD model is trained by a specific person's voice and the recognition system is expected to be used by the same person. For example, a mobile phone or a Personal Digital Assistant typically employs an SD models since the same person, the owner of the unit, is expected to use it. An SI model, on the other hand, is used when the person using the system changes. For example, an Automatic Teller Machine typically uses an SI model. [0020] After the processor 252 has completed computation of features of the speech signal and stored them in the memory 280, it may then search the acoustic model, which is also stored in memory 280, for one or more matches. The specific search method used does not affect the method pursuant to the embodiment. For example, a single-best or an N-best hypothesis may be used. Alternatively, a word graph or a phonetic word graph may be used to represent the matches made during the search of the acoustic model. [0021] Regardless, the matches are processed linguistically in order to determine a recognized word sequence. Alternatively, the processor 252 may utilize the network interface card 242 to send the matched results to another computer such as a server device (not shown) which may then perform the language processing. If the processor 252 is programmed to also perform language processing on the matched results, it may utilize the printing device 246 to print out the associated recognized word sequence. Alternatively, the recognized word sequence may be displayed on display device 242, or be sent to controller device 240, for example, in order to issue control signals to another system resulting in controlling a device.

[0022] Referring now to Figure 3, a flow diagram of an exemplary speech recognition system according to the method pursuant to one embodiment is shown. In step 306, an exemplary human speech signal is captured in an analog form. The captured speech signal carries speech features associated with what the speaker has said. The specific speech features chosen does not affect the method pursuant to this embodiment. For example, the chosen speech features may be the energy concentration of the speech signal measured at frequent intervals. The features change as the person speaks, and the features may be represented by a plurality of feature vectors, each feature vector having a direction and a magnitude. The speech signal may then be represented mathematically as the sum of the feature vectors measured at different time intervals. The shorter the time intervals, or the sampling frequency, the more accurate the representation of the speech signal. In order to compute these features, the signal is first converted into a digital form so that it can be processed by a digital computer as shown in step 308. In step 310, the features of the digitalized speech signal are computed and stored in the memory unit of the system. For example, a mathematical model typically used to represent the speech features is Mel Frequency Cepstral Coefficients (MFCC). [0023] Also stored in the memory unit of the system, is an acoustic model 330 and a language model 332. Step 340 represents acoustic and language processing. During this step, a search is conducted based on a search algorithm such as, a search (decoding) algorithm based on token propagation. During this "search process" or "matching process", the execution unit looks for matches between the computed features in step 310 (for example, the MFCC of the speech signal), and the known phonetic features included in the acoustic model. ln this stage, best candidate, for example, a phonetic unit list is obtained by selecting candidates with the highest matching probability. [0024] The search space varies according to the particular recognition application that the system has been programmed to perform. For example, for dictation tasks, the search space may be organized as a lexicon tree; while for command and control tasks, the search space may be organized as a word graph. Any one of known searches such as a single-best or an N-best hypothesis may be performed. Regardless, after the search, a word graph may be generated by the execution unit. This word graph which includes word alternatives of the matches made by utilizing the acoustic model, may then be processed linguistically, and in step 350 produce a recognized word sequence. The method pursuant to the different embodiments of the invention may be used during the matching operation of feature vectors with known features included in the acoustic model, i.e., the acoustic model matching and formation processes. [0025] During language processing, a language model may be used to form a single best sentence. The language model may employ a dictionary lexicon and a grammar lexicon to prune unlikely or disallowed words from matched candidates. The resulting best sentence may be used as a control signal or it may be simply stored, for example, in a dictation application. [0026] Referring now to Figure 4, an exemplary method of handling acoustic processing of a speech signal is shown. Typically, a speech signal is presented as a mathematical model based on , for example, MFCC. This model is calculated in accordance with a Gaussian distribution function representing the states associated with a plurality of feature vectors. One example of such a mathematical model is formed using a Gaussian distribution probability function according to formula 410. Here x=(x1 , x2, ...xN) are feature vectors 1 through N of the speech signal and the Mean 412 and Variance 413 are the i-th dimension vector, m-th mixture of Gaussian distribution of acoustic HMM state. Typically, the logarithm computation is used in order to accelerate the computation of the feature vectors. For example, if logarithm 408 is needed to be computed, generally the following formula is used to accelerate the above computation because log (Wmfm(x)) can be computed as follows: [0027] Log(y1 +y2)= Log(y1 )+Log(1 +y2/y1 )=Logy1 +log(1 +e POWER logy2- logyl ).

[0028] In order for the processor to perform this calculation a counted loop may be utilized. In this loop block, arithmetic instructions are correlated with the previous data transfer functions. Before the computation takes place, the data such as the values associated with Mean Vector 412 and Variance Vector 413 of each feature vector need be present at the processor. A Prefetch instruction can be used to transfer the Mean and Variance values of each feature vector. In a preferred embodiment , the prefetch instruction is executed while the execution unit is busy computing the current data. The prefetch instruction can be executed during any period in which the execution unit is busy with a current computation. The two events need not be at exactly the same time, however, in a preferred embodiment, the prefetch instruction is executed concurrently with the execution unit's current computation cycle.

[0029] The Gaussian computation may be used to compute the Gaussian probability from the feature vectors, Mean vectors, Variance vectors many times until the speech signal is completed. Typically, a loop is utilized to perform this computation. While the execution unit is busy with one set of Mean and Variance vectors used in the computation, the software may include, for example, a prefetch instruction to prefetch the next several Mean and Variance Vectors such that when the execution unit has completed its computation and is ready for the next set of Mean and Variance Vectors, the values are already present at the cache memory. Having the values prefetched at the cache means that the execution unit need not sit idle and wait for the data. The data to be processed is already available and the execution unit can simply perform its next computation after it has completed its current computation.

[0030] Figure 5 is an exemplary computer code in C language which employs the prefetch instructions according to one embodiment of the invention. In line 514, a prefetch instruction has been placed in order to prefetch data needed in the computation of the function ippsLogGauss1_32f_D2 which is shown on line 518. Function _mm_prefetch() is an exemplary prefetch instruction in the C library. Any other prefetch instruction in any other computer language will work so long as the instruction cause the memory to transfer the data located at the prefetch address to be transferred to the cache memory. In this embodiment, any computer language may be used.

[0031] When the prefetch instruction is executed, typically a cache line is prefetched. In a system having a cache line equal to 32 bytes, the _mm_prefetch loads 8 floating numbers to the cache memory since each floating number comprises 4 bytes. Accordingly, the prefetch address may be calculated by adding an increment to the next prefetch address. This increment will ensure that when data prefetching is complete, the prefetched data is needed shortly thereafter. Otherwise, the operation may result in cache pollution resulting in lower efficiency of the overall system. Similarly, if the increment is too small, the prefetch will not effectively hide the latency of the fetch prior to the beginning of the next computation cycle of the execution unit. If the increment is too large, the start-up cost for data which is not prefetched for initial iterations diminishes the benefits of prefetching the data, and the prefetched data may wrap around and dislodge previously prefetched data prior to its actual use. For large loops the increment may be set to 32 bytes or 8 floating numbers. [0032] Generally, the value of the increment depends on the proportion between the computing cost and memory loading cost of the loop. The ideal value of the increment may be achieved through experimentation and design parameters. For large loops, the value of the increment may be set to 16. This will result in prefetching the third cache line during the computation. By using an increment value of 16, the cache misses may be reduced by about one-half. [0033] The increment may also vary according to the computer language used. For example, experiment has shown that in C language, the best result was achieved when the third cache line was prefetched. In assembly language, however, the best result was achieved when the fourth cache line was prefetched. The reason for the differences lie in the particular compiler that is utilized by the language chosen. In C language, because of the compiler the prefetch instructions are issued more randomly. With out-of-order core processors, the difference in performance is minor and negligible. However, the best result was achieved with a code written in assembly language. [0034] Prefetch instructions can also be added inside main loop of ippsLogGauss1_32f_D2, as shown on line 528 and 529. This shows prefetching after memory-loading explicitly, which can get similar effort. [0035] Figure 6 shows the modified codes of main loop shown in line 529 of Figure 5. The exemplary computer code in assembly language employs prefetch instructions according to one embodiment of the invention. The loop is unrolled to make it handle 32 bytes, and the data in the fourth cache line is prefetched. This method can reduce the decoding cost of speech recognition. For example, experiment on a speech recognition system with a Mandarin Chinese (51 K) language model has shown a 9% improvement.

[0036] Figure 7 is a time-versus-activity illustration of execution unit and memory cycles of an exemplary computer system engaged in human speech recognition according to one embodiment of the invention. The method pursuant to this embodiment takes advantage of long computation cycles of Gaussian probability distribution functions by prefetching the next mean and variance values of the corresponding feature vectors. As Figure 7 illustrates, while the execution unit is engaged in computation of vertex (n-1 ), the memory bus is prefetching the data for the vertex (n) . Similarly, during the next cycle, when the execution unit is busy computing vertex (n), the memory bus is busy prefetching vertex (n+1 ). This way, the execution unit is not waiting idle for the memory bus to load the data it needs to complete the computation. The result is the elimination of the latency inherent in prior art handling of acoustic recognition.

Claims

CLAIMS What is claimed is:

1. A method comprising: receiving a human speech signal; acoustically processing a first set of speech data associated with said human speech signal; transferring a second set of speech data to be acoustically processed from a first memory to a second memory while said first set of speech data is being acoustically processed; linguistically processing said acoustically processed first and second sets of speech data; and forming a recognized word sequence associated with said human speech signal.

2. The method of claim 1 wherein said first memory comprises a main memory.

3. The method of claim 1 , wherein said second memory comprises a cache memory.

4. The method of claim 1 , wherein said first and second sets of speech data comprises a Mean Vector and a Variance Vector based on a Gaussian Distribution of acoustic Hidden Markov Model states.

5. The method of claim 4, wherein said Mean Vector and said Variance Vector are used in computing a feature vector which is then used to search an acoustic model.

6. The method of claim 1 , wherein said recognized word sequence is used to control a device.

7. A method comprising: acoustic processing of a first set of speech data; and transferring a second set of speech data to be acoustically processed from a first memory to a second memory while said first set of speech data is being acoustically processed.

8. The method of claim 7, wherein said first and said second sets of speech data comprises a Mean Vector and a Variance Vector based on a Gaussian Distribution of acoustic Hidden Markov Model states.

9. The method of claim 7, wherein said first memory is slower than said second memory.

10. The method of claim 7, further comprising: linguistic processing of said acoustically processed first and second sets of speech data; and recognizing at least one word corresponding to said speech data.

11. A system comprising: a client device including, a processor to acoustically process first and second sets of speech data, a main memory coupled to said processor, the main memory to store said first and second sets of speech data, a cache memory coupled to said processor and said main memory, and said processor acoustically processes said first set of speech data simultaneously with a transfer of said second set of speech data from said main memory to said cache memory; and, a sender module coupled to said processor of the client device, the sending module to send said acoustically processed first and second sets of speech data to a server.

12. The system of claim 11 , further comprising: a human speech capturing module for capturing a human speech signal; an analog-to-digital converter module for converting said human speech signal to a digital speech signal; and, a speech feature identifier module for identifying features of said digital speech signal.

13. The system of claim 11 , wherein said client device is selected from a group consisting of a mobile phone, a Personal Digital Assistant, and a portable computer system.

14. The system of claim 12, wherein said speech feature identifier module also performs end-point detection, pre-emphasizing filtration and quantization on said human speech signal.

15. The system of claim 11 , wherein said speech data comprises a Mean Vector and a Variance Vector based on a Gaussian Distribution of acoustic Hidden Markov Model states.

16. The system of claim 11 , wherein said acoustically processed speech data is a word graph.

17. The system of claim 16, wherein said sender module forms a binary representation of said word graph and places said binary representation along with a source address and a destination address onto a data packet before transmitting said word graph.

18. An apparatus comprising: a main memory to store a first and second sets of speech data; a cache memory; and a processor to acoustically process said first set of speech data simultaneously with a transfer of said second set of speech data from said main memory to said cache memory.

19. The apparatus of claim 18, wherein said speech data is a Mean and a Variance Vector of a feature vector associated with a human speech signal.

20. The apparatus of claim 18, wherein said apparatus is selected from a group consisting of a wireless device, a Personal Digital Assistant, and a mobile device.

21. The apparatus of claim 18, further comprising: a direct memory access module coupled to said main memory for sending an acoustically processed speech data over a network for language processing.

22. The apparatus claim 21 , wherein said network is the internet.

23. A computer-readable medium including a program executable by a processor, comprising: a first subroutine for receiving a human speech signal; a second subroutine for acoustically processing a first set of speech data associated with said human speech signal; a third subroutine for transferring a second set of speech data to be acoustically processed from a first memory to a second memory while said first set of speech data is being acoustically processed; a fourth subroutine for linguistically processing said acoustically processed first set of speech data; and a fifth subroutine for forming a recognized word sequence associated with said human speech signal.

24. The computer-readable medium of claim 23, wherein said first and said second set of speech data comprises a Mean Vector and a Variance Vector based on a Gaussian Distribution of acoustic Hidden Markov Model states.

25. The computer-readable medium of claim 24, wherein said acoustically processed speech data comprises a word graph.

26. The computer-readable medium of claim 25, further comprising a sixth subroutine for packetizing said word graph into a data packet; and a seventh subroutine for transmitting said data packet over a network.

27. The computer-readable medium of claim 26, wherein said network nternet.