CN1545696A - Method of employing prefetch instructions in speech recognition

Method of employing prefetch instructions in speech recognition

Info

Publication number
CN1545696A
CN1545696A (application CNA018235549A / CN01823554A)
Authority
CN
China
Prior art keywords
speech data
acoustic processing
memory
group
speech
Prior art date
Legal status
Granted
Application number
CNA018235549A
Other languages
Chinese (zh)
Other versions
CN1223986C (en)
Inventor
赖春荣
赵庆伟
潘杰林
Current Assignee
Intel China Ltd
Intel Corp
Original Assignee
Intel China Ltd
Intel Corp
Priority date
Filing date
Publication date
Application filed by Intel China Ltd, Intel Corp
Publication of CN1545696A
Application granted
Publication of CN1223986C
Anticipated expiration
Expired - Fee Related

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/285 Memory allocation or algorithm optimisation to reduce hardware requirements

Abstract

In general, the prefetching method according to one embodiment of the invention, employed by a computer system performing human speech recognition, provides an efficient way of computing and searching speech features based on a Gaussian distribution of acoustic hidden Markov model states. The method transfers the next speech data to be processed while the processor is acoustically processing the current speech data. Accordingly, the prefetching method reduces or eliminates the memory latency that would otherwise be incurred while the processor idles waiting for the memory to transfer the speech data to be processed.

Description

Method of employing prefetch instructions in speech recognition
Technical field
The present invention relates to speech recognition. More particularly, the present invention relates to a new apparatus and method that employ prefetch instructions to transfer speech data to be acoustically processed from main memory to a cache while the system is acoustically processing other speech data during the speech recognition stage of speech recognition processing.
Background technology
In the past few years, the science and technology of machine recognition of human speech have developed considerably. Today there are many application programs for large-vocabulary continuous speech recognition (LVCSR) in automatic speech recognition (ASR). To perform speech recognition, a computer system can serve as a speech engine that handles a large amount of computation and searching in order to analyze and recognize a speech signal carrying a person's speech features. The efficiency of the computer system in carrying out these operations therefore affects the performance of the speech engine.
Typically, a speech recognition system performs several operations on a person's speech signal to determine what was said. For example, when a person speaks the sentence "my name is John", a speech capture device such as a microphone captures the utterance as an analog audio signal. The analog signal is then converted into a digital signal so that it can be handled by a digital computer. The captured signal carrying speech features can be quantized with a mathematical model and represented as a plurality of feature vectors. For example, Mel-frequency cepstral coefficients (MFCC) can be used to represent speech features.
The computed features are then acoustically processed by a computer system. During acoustic processing, the features are compared with known phonetic units contained in an acoustic model. An example of an acoustic model is the hidden Markov model (HMM). The comparison of the speech features with the known phonetic units in the model may yield one or more matches. The matched phonetic units are then linguistically processed, for example using a dictionary or a grammar lexicon, to form a recognized word string.
To perform acoustic processing, the speech engine uses a large number of probability distributions, for example a mixture of M Gaussian distribution functions over the N-dimensional space of the feature vectors of the speech signal. The mean and variance associated with each feature vector are computed and stored in the memory of the computer system. Later, these parameters are fetched from memory so that the speech engine can complete the Gaussian computation.
Fig. 1 is a schematic diagram of the memory and execution cycles of an existing computer system involved in human speech recognition. It shows a timing comparison between the execution unit and the memory bus during acoustic processing of a speech signal. While the memory bus is transferring the speech data to be processed from memory, the execution unit stays idle until the data become available to the processor. Because of the total amount of computation required in speech analysis, this memory latency, i.e. the time wasted while the memory transfers the data to be processed, grows quickly. The problem is especially severe in LVCSR, where the speech signal is received continuously. Many operations must be completed per second, and this shortcoming severely limits the speed and efficiency of the system.
Description of drawings
Fig. 1 is a schematic diagram of the memory and execution cycles of a computer system used for acoustic processing according to the prior art.
Fig. 2 is a block diagram of an illustrative speech recognition system according to an embodiment of the method of the invention.
Fig. 3 is a flow chart illustrating a speech recognition system according to an embodiment of the invention.
Fig. 4 illustrates a method of computing speech features during acoustic processing of a speech signal.
Fig. 5 is illustrative C-language computer code employing the new prefetching technique according to the method of the invention.
Fig. 6 is illustrative assembly-language computer code employing the new prefetching technique according to an embodiment of the invention.
Fig. 7 is a schematic diagram illustrating the memory and execution cycles of a computer system according to an embodiment of the invention.
Embodiment
In the following detailed description of embodiments of the invention, numerous specific details are provided. It will be apparent, however, to one of ordinary skill in the art that the method according to an embodiment of the invention can be practiced without these details. In other instances, well-known methods, procedures, components and circuits are not described in detail so as not to obscure the embodiments of the invention.
The method according to the invention comprises various functional steps described below. These functional steps may be implemented by hardware components, or may be embodied in machine-executable instructions that can be used to cause a general-purpose processor programmed with the instructions to perform the functional steps. Alternatively, the functional steps may be performed by a combination of hardware and software.
Embodiments of the invention disclose a new prefetching technique to be applied during the acoustic processing stage of human speech recognition. The technique can be used to reduce or eliminate the memory latency caused by the execution unit waiting idle while data to be processed are transferred from main memory during acoustic processing. In a preferred embodiment, for example, while the execution unit is busy computing speech features, the application concurrently executes prefetch instructions for the data that will be processed next. Accordingly, while the execution unit is busy computing, the memory bus is busy prefetching the data the execution unit will need for its next computation.
Referring now to Fig. 2, a block diagram of an illustrative speech recognition system 200 is shown. The system comprises a speech capture device 210, an analog-to-digital converter 212, a computer system 250, and a set of I/O devices such as a controller device 240, a display device 242, a network interface card 244 and a printing device 246. The computer system 250 in turn comprises a processor 252, a memory 280, a cache 260, a cache controller 262, a memory bus 272 and an I/O bus 270. Preferably, the computer system may further include a direct memory access (DMA) unit 274.
The system works as follows: a person speaks into the microphone 210 and an analog speech signal is obtained. The signal then passes through the analog-to-digital converter 212 to form a digitized representation of the analog speech signal. The digitized representation is then input to the computer system 250. The processor 252 then begins to identify the speech features associated with the speech signal, and these features are stored in the memory 280 of the computer system 250. A cache 260 is used to store the prefetched data needed in computing the speech features. A cache controller 262, connected to the processor 252 and the cache 260, assists the data transfers between the processor 252 and the cache 260.
Also stored in the memory 280 is a plurality of known phonetic units, referred to as an acoustic model. The acoustic model employed by the present embodiment can be a speaker-dependent (SD) model or a speaker-independent (SI) model. The SD model is tailored to a specific person's voice, and the recognition system is expected to be used by that same person. For example, a mobile phone or a personal digital assistant usually adopts the SD model because it is expected to be used by the same person (the owner of the device). The SI model, on the other hand, is used when the person using the system changes. For example, an automatic teller machine (ATM) generally uses the SI model.
After the processor 252 has finished computing the features of the speech signal and has stored them in the memory 280, it can search for matches in the acoustic model, which is also stored in the memory 280. The particular search method used does not affect the method of this embodiment. For example, a single-best or N-best hypothesis can be used. In addition, a word graph or a phonetic word graph can be used to represent the matches obtained during the search of the acoustic model.
In either case, the matches are linguistically processed to determine the recognized word string. In addition to utilizing the display device 242, the processor 252 can send the matching results to another computer, for example a server device (not shown) capable of performing the language processing. If the processor 252 is also programmed to perform language processing on the matching results, it can print the corresponding recognized word string using the printing device 246. The recognized word string may also be displayed on the display device 242, or sent, for example, to the controller device 240, which sends a control signal to another system to control a device.
Referring now to Fig. 3, a flow chart of the use of a speech recognition system according to an embodiment is shown. In step 306, a person's speech signal is captured in analog form. The captured speech signal carries the speech features associated with what the speaker said. The particular speech features selected do not affect the method of this embodiment. For example, the selected speech features can be the energy of the speech signal measured over frequency intervals. As the person speaks, these features change, and the features can be represented by a plurality of feature vectors, each having a direction and a magnitude. The speech signal can then be expressed mathematically as a sum of feature vectors measured at different time intervals. The shorter the time interval, i.e. the higher the sampling frequency, the more accurate the representation of the speech signal. To compute the features, the signal is first converted into digital form so that it can be processed by a digital computer, as shown in step 308. In step 310, the features of the digitized speech signal are computed and stored in a storage unit of the system. For example, a mathematical model commonly used to represent speech features is the Mel-frequency cepstral coefficient (MFCC).
Also stored in the storage unit of the system are an acoustic model 330 and a language model 332. Step 340 represents acoustic and language processing. During this step, a search is carried out according to a search algorithm, for example a token-passing-based search (decoding) algorithm. During this "search processing" or "matching processing", the execution unit searches for matches between the features computed in step 310 (for example, the MFCCs of the speech signal) and the known phonetic features contained in the acoustic model. At this stage, the best candidates, for example a list of phonetic units, are obtained by selecting the candidates with the highest matching probabilities.
The search space varies according to the specific recognition application the system is programmed to perform. For example, for a dictation task the search space can be organized as a word tree, while for a command-and-control task the search space can be organized as a word graph. Any known search method can be used, for example single-best or N-best hypotheses. In any case, after the search a word graph can be produced by the execution unit. The word graph of alternative word candidates matched against the acoustic model is then linguistically processed, and a recognized word string is produced in step 350. The matching of the feature vectors against the known features contained in the acoustic model, that is, the acoustic model matching process, can use the method according to various embodiments of the invention.
During language processing, a language model can be used to form a single best sentence. The language model can employ a dictionary and a grammar lexicon to eliminate unlikely or disallowed words from the matched candidates. The resulting best sentence can be used as a control signal, or it can simply be stored in a dictation application.
Referring now to Fig. 4, an illustrative method of acoustically processing a speech signal is shown. In general, the speech signal is represented by a mathematical model based on, for example, MFCC. The model is computed from Gaussian distribution functions representing the states associated with a plurality of feature vectors. An example of such a mathematical model is formed using a Gaussian distribution probability function according to formula 410, where x = (x1, x2, ..., xN) is the feature vector with components 1 to N of the speech signal, and the mean 412 and variance 413 are the N-dimensional vectors of the m-th mixture component of the Gaussian distribution of an acoustic HMM state. In general, logarithmic computation is used to speed up the feature-vector computation. For example, when computing expression 408, the following formula is commonly used to accelerate the calculation, because log(Wm * fm(x)) can be computed as follows:
log(y1 + y2) = log(y1) + log(1 + y2/y1) = log(y1) + log(1 + e^(log(y2) - log(y1)))
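Reading formula 410 as a Gaussian mixture, log(Wm * fm(x)) is a log-domain term, and the identity above lets such terms be accumulated without leaving the log domain. A minimal C sketch of that log-add step follows; the function name logadd and the use of log1p/exp are illustrative assumptions, not code taken from the patent's figures:

#include <math.h>

/* Computes log(y1 + y2) from a = log(y1) and b = log(y2) using
 * log(y1 + y2) = log(y1) + log(1 + e^(log(y2) - log(y1))).
 * Ordering the arguments so that a >= b keeps the exponent <= 0,
 * which avoids overflow in exp(). */
static double logadd(double a, double b)
{
    if (a < b) { double t = a; a = b; b = t; }
    return a + log1p(exp(b - a));
}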
To have the processor carry out this computation, a counted loop can be used. Within the loop block, the arithmetic instructions depend on the preceding data transfers: before the computation is performed, data such as the values of the mean vector 412 and the variance vector 413 associated with each feature vector must be made available to the processor. A prefetch instruction can be used to transfer the mean and variance values for each feature vector. In a preferred embodiment, the prefetch instruction is executed while the execution unit is busy computing on the current data. The prefetch instruction can be executed at any point in the cycle during which the execution unit is busy with the current computation. The two events need not be exactly simultaneous, but in a preferred embodiment the prefetch instruction is executed concurrently with the current computation cycle of the execution unit.
This Gaussian computation may be performed many times to compute Gaussian probabilities from the feature vectors, mean vectors and variance vectors until the speech signal has been fully processed. In general, a loop is used to carry out the computation. While the execution unit is busy with one group of mean and variance vectors, the software can, for example, include a prefetch instruction that prefetches the next several mean and variance vectors, so that by the time the execution unit has finished its computation and is ready for the next group of mean and variance vectors, the values are already present in the cache. Having the values prefetched into the cache means that the execution unit does not need to sit idle waiting for data. The data to be processed are available, and after finishing the current computation the execution unit can simply proceed to its next computation.
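As a concrete illustration of the loop just described, the following C sketch evaluates one HMM state's Gaussian mixture while prefetching the next mixture component's mean and variance vectors. The data layout, the dimension DIM, the best-component approximation, and all names are assumptions for illustration only, not the code of Fig. 5 or Fig. 6:

#include <math.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

#define DIM 39            /* assumed feature-vector dimension */

/* Scores feature vector x against M diagonal-covariance Gaussian components.
 * mean[m*DIM + i] and var[m*DIM + i] hold component m's parameters,
 * logw[m] its log mixture weight.  Returns the best component's log score. */
float score_state(const float *x, const float *mean, const float *var,
                  const float *logw, int M)
{
    float best = -1.0e30f;
    for (int m = 0; m < M; m++) {
        /* While component m is being evaluated, ask the memory system to
         * start moving component m+1's mean and variance into the cache. */
        if (m + 1 < M) {
            _mm_prefetch((const char *)&mean[(m + 1) * DIM], _MM_HINT_T0);
            _mm_prefetch((const char *)&var[(m + 1) * DIM], _MM_HINT_T0);
        }
        float s = logw[m];
        for (int i = 0; i < DIM; i++) {
            float d = x[i] - mean[m * DIM + i];
            s -= 0.5f * d * d / var[m * DIM + i];   /* log-Gaussian, constants omitted */
        }
        if (s > best)
            best = s;
    }
    return best;
}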
Fig. 5 shows illustrative C-language computer code employing a prefetch instruction according to an embodiment of the invention. In line 514, a prefetch instruction is set up to prefetch the data required by the computation of the function ippsLogGauss1_32f_D2 shown in line 518. The function _mm_prefetch() is an illustrative prefetch instruction available in a C language library. Any other prefetch instruction in any other machine language can be used, as long as the instruction causes the memory to send the data located at the prefetch address to the cache. Any computer language can be used in this embodiment.
When the prefetch instruction is executed, a whole cache line is generally prefetched. In a system with a cache line equal to 32 bytes, _mm_prefetch loads 8 floating-point numbers into the cache, since each floating-point number occupies 4 bytes. Accordingly, the prefetch address can be computed by adding an increment to the current address. The increment should ensure that the prefetched data will be needed soon after the prefetch completes; otherwise the operation may cause cache pollution and degrade the efficiency of the overall system. If the increment is too small, the prefetch will not effectively hide the prefetch latency before the next computation cycle of the execution unit begins. If the increment is too large, the startup cost of the initial iterations, whose data have not been prefetched, reduces the benefit of prefetching, and the prefetched data may replace earlier prefetched data before the earlier data are actually used. For large loops, the increment can be set to 32 bytes, i.e. 8 floating-point numbers.
In general, the value of the increment depends on the ratio between the computation cost and the memory fill cost of the loop. The desired value of the increment can be obtained from experience and design parameters. For large loops, the value of the increment can be set to 16. This causes the third cache line ahead to be prefetched during the computation. By using an increment value of 16, cache misses can be reduced by half.
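A hedged C sketch of how the prefetch address can be formed by adding a fixed increment to the address currently being consumed, using the increment of 16 floats (two 32-byte lines ahead, i.e. the third cache line counting the current one) discussed above; the loop body and all names are illustrative assumptions:

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

#define PREFETCH_AHEAD 16   /* floats: two 32-byte lines ahead of the current one */

/* Accumulates dst[i] += w * src[i] while prefetching src two cache lines ahead.
 * _mm_prefetch is only a hint, so addresses slightly past the end of src are harmless. */
void weighted_accumulate(float *dst, const float *src, float w, int n)
{
    for (int i = 0; i < n; i++) {
        if ((i & 7) == 0)   /* one prefetch per 32-byte (8-float) cache line */
            _mm_prefetch((const char *)&src[i + PREFETCH_AHEAD], _MM_HINT_T0);
        dst[i] += w * src[i];
    }
}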
The increment can also vary with the computer language used. For example, experience shows that in the C language the best results are obtained when the third cache line is prefetched, whereas in assembly language the best results are obtained when the fourth cache line is prefetched. The reason for this difference is the specific compiler used for the selected language; in C, the compiler causes the prefetch instructions to be issued somewhat more randomly. With an out-of-order core processor the performance difference is small and can be ignored, but the best performance is obtained with code written in assembly language.
Prefetch instructions can also be added to the main loop of ippsLogGauss1_32f_D2, as shown in lines 528 and 529. This illustrates, in particular, a prefetch placed after the memory load, which can achieve a similar effect.
Fig. 6 shows a modified version of the main loop shown at line 529 of Fig. 5. This illustrative assembly-language computer code employs prefetch instructions according to an embodiment of the invention. The loop is unrolled so that it processes 32 bytes per iteration, and the data in the fourth cache line ahead are prefetched. This method can reduce the decoding cost of speech recognition. For example, an experiment on a speech recognition system with a Chinese (51K) language model showed a 9% improvement.
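The assembly listing of Fig. 6 is not reproduced in this text; the following C sketch only approximates its structure, with a loop unrolled to consume one 32-byte cache line (8 floats) per iteration while prefetching three lines (24 floats) further on, i.e. the fourth cache line counting the current one. All names and the arithmetic in the body are assumptions, not the patent's code:

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

/* Sums (x[i]-mean[i])^2 / var[i] over n floats, unrolled by 8 so each
 * iteration consumes exactly one 32-byte cache line of mean and var.
 * For brevity, n is assumed to be a multiple of 8. */
float weighted_sq_distance(const float *x, const float *mean,
                           const float *var, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i += 8) {
        /* Prefetch the fourth cache line (counting the current one as the first). */
        _mm_prefetch((const char *)&mean[i + 24], _MM_HINT_T0);
        _mm_prefetch((const char *)&var[i + 24], _MM_HINT_T0);
        for (int j = 0; j < 8; j++) {
            float d = x[i + j] - mean[i + j];
            s += d * d / var[i + j];
        }
    }
    return s;
}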
Fig. 7 is a schematic time-action diagram of the execution unit and memory cycles of an illustrative computer system involved in human speech recognition according to an embodiment of the invention. The method of this embodiment takes advantage of the long computation cycle of the Gaussian probability distribution function by prefetching the next mean and variance values of the corresponding feature vectors. As shown in Fig. 7, while the execution unit is performing the computation for vertex (n-1), the memory bus prefetches the data for vertex (n). Similarly, during the next cycle, while the execution unit is busy computing vertex (n), the memory bus is busy prefetching vertex (n+1). In this manner, the execution unit does not idle while waiting for the memory bus to load the data it needs to complete the computation. As a result, the latency inherent in prior-art speech recognition processing is eliminated.

Claims (27)

1. A method, comprising:
receiving a person's speech signal;
acoustically processing a first group of speech data associated with said person's speech signal;
while said first group of speech data is being acoustically processed, transferring a second group of speech data to be acoustically processed from a first memory to a second memory;
linguistically processing said acoustically processed first and second groups of speech data; and
forming a recognized word string associated with said person's speech signal.
2. The method according to claim 1, wherein said first memory comprises a main memory.
3. The method according to claim 1, wherein said second memory comprises a cache.
4. The method according to claim 1, wherein said first and second groups of speech data comprise mean vectors and variance vectors based on a Gaussian distribution of acoustic hidden Markov model states.
5. The method according to claim 4, wherein said mean vectors and said variance vectors are used to compute a feature vector, which is then used to search an acoustic model.
6. The method according to claim 1, wherein said recognized word string is used to control a device.
7. A method, comprising:
acoustically processing a first group of speech data; and
while said first group of speech data is being acoustically processed, transferring a second group of speech data to be acoustically processed from a first memory to a second memory.
8. The method according to claim 7, wherein said first and second groups of speech data comprise mean vectors and variance vectors based on a Gaussian distribution of acoustic hidden Markov model states.
9. The method according to claim 7, wherein said first memory is slower than said second memory.
10. The method according to claim 7, further comprising:
linguistically processing said acoustically processed first and second groups of speech data; and
recognizing at least one word corresponding to said speech data.
11. A system, comprising:
a client device, comprising:
a processor to acoustically process first and second groups of speech data,
a main memory coupled to said processor, the main memory storing said first and second groups of speech data,
a cache coupled to said processor and said main memory, wherein said processor acoustically processes said first group of speech data while said second group of speech data is transferred from said main memory to said cache, and
a transmitter module coupled to said processor of the client device, the transmitter module to send said acoustically processed first and second groups of speech data to a server.
12. The system according to claim 11, further comprising:
a human speech capture module to capture a person's speech signal;
an analog-to-digital converter module to convert said person's speech signal into a digital speech signal; and
a speech feature identifier module to identify features of said digital speech signal.
13. The system according to claim 11, wherein said client device is selected from a mobile phone, a personal digital assistant and a portable computer system.
14. The system according to claim 12, wherein said speech feature identifier module also performs end-point detection, pre-emphasis filtering and quantization on said person's speech signal.
15. The system according to claim 11, wherein said speech data comprise mean vectors and variance vectors based on a Gaussian distribution of acoustic hidden Markov model states.
16. The system according to claim 11, wherein said acoustically processed speech data is a word graph.
17. The system according to claim 16, wherein said transmitter module forms a binary representation of said word graph and, before sending said word graph, places said binary representation together with a source address and a destination address into a data packet.
18. An apparatus, comprising:
a main memory storing first and second groups of speech data;
a cache; and
a processor to acoustically process said first group of speech data while said second group of speech data is transferred from said main memory to said cache.
19. The apparatus according to claim 18, wherein said speech data are the mean and variance vectors of feature vectors associated with a person's speech signal.
20. The apparatus according to claim 18, wherein said apparatus is selected from a wireless device, a personal digital assistant and a mobile device.
21. The apparatus according to claim 18, further comprising:
a direct memory access module coupled to said main memory to send acoustically processed speech data over a network for language processing.
22. The apparatus according to claim 21, wherein said network is the Internet.
23. A computer-readable medium comprising a program executable by a processor, the program comprising:
a first subroutine to receive a person's speech signal;
a second subroutine to acoustically process a first group of speech data associated with said person's speech signal;
a third subroutine to transfer a second group of speech data to be acoustically processed from a first memory to a second memory while said first group of speech data is being acoustically processed;
a fourth subroutine to linguistically process said acoustically processed first group of speech data; and
a fifth subroutine to form a recognized word string associated with said person's speech signal.
24. The computer-readable medium according to claim 23, wherein said first and second groups of speech data comprise mean vectors and variance vectors based on a Gaussian distribution of acoustic hidden Markov model states.
25. The computer-readable medium according to claim 24, wherein said acoustically processed speech data comprise a word graph.
26. The computer-readable medium according to claim 25, further comprising:
a sixth subroutine to package said word graph into a data packet; and
a seventh subroutine to send said data packet over a network.
27. The computer-readable medium according to claim 26, wherein said network is the Internet.
CN01823554.9A 2001-06-19 2001-06-19 Method of employing prefetch instructions in speech recognition Expired - Fee Related CN1223986C (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2001/001028 WO2002103677A1 (en) 2001-06-19 2001-06-19 Method of employing prefetch instructions in speech recognition

Publications (2)

Publication Number Publication Date
CN1545696A true CN1545696A (en) 2004-11-10
CN1223986C CN1223986C (en) 2005-10-19

Family

ID=4574815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN01823554.9A Expired - Fee Related CN1223986C (en) 2001-06-19 2001-06-19 Method of employing prefetch instructions in speech recognition

Country Status (2)

Country Link
CN (1) CN1223986C (en)
WO (1) WO2002103677A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104035743A (en) * 2013-03-07 2014-09-10 亚德诺半导体技术公司 System and method for processor wake-up based on sensor data
CN113068410A (en) * 2019-10-15 2021-07-02 谷歌有限责任公司 Efficient and low latency automated assistant control for smart devices

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5998387A (en) * 1982-11-26 1984-06-06 Nec Corp Memory circuit
IT1229782B (en) * 1989-05-22 1991-09-11 Face Standard Ind METHOD AND APPARATUS TO RECOGNIZE UNKNOWN VERBAL WORDS BY EXTRACTION OF PARAMETERS AND COMPARISON WITH REFERENCE WORDS

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104035743A (en) * 2013-03-07 2014-09-10 亚德诺半导体技术公司 System and method for processor wake-up based on sensor data
CN104035743B (en) * 2013-03-07 2017-08-15 亚德诺半导体集团 System for carrying out processor wake-up based on sensing data
CN113068410A (en) * 2019-10-15 2021-07-02 谷歌有限责任公司 Efficient and low latency automated assistant control for smart devices

Also Published As

Publication number Publication date
WO2002103677A1 (en) 2002-12-27
CN1223986C (en) 2005-10-19

Similar Documents

Publication Publication Date Title
US10403266B2 (en) Detecting keywords in audio using a spiking neural network
EP3477633A1 (en) Systems and methods for robust speech recognition using generative adversarial networks
US6374212B2 (en) System and apparatus for recognizing speech
US5384892A (en) Dynamic language model for speech recognition
WO2017076222A1 (en) Speech recognition method and apparatus
KR101970041B1 (en) Methods for Hybrid GPU/CPU Data Processing
US6178401B1 (en) Method for reducing search complexity in a speech recognition system
US11107461B2 (en) Low-power automatic speech recognition device
US20040162729A1 (en) Assigning meanings to utterances in a speech recognition system
US10013974B1 (en) Compact HCLG FST
CN108735201A (en) Continuous speech recognition method, apparatus, equipment and storage medium
CN105118501A (en) Speech recognition method and system
CN112071310A (en) Speech recognition method and apparatus, electronic device, and storage medium
CN1221937C (en) Voice identification system of voice speed adaption
EP0938076B1 (en) A speech recognition system
CN108847251B (en) Voice duplicate removal method, device, server and storage medium
CN1223986C (en) Method of employing prefetch instructions in speech recognition
US9269355B1 (en) Load balancing for automatic speech recognition
US20220147570A1 (en) Information processing apparatus and information processing method
Liu et al. Speech recognition systems on the Cell Broadband Engine processor
Dixon et al. Recent development of wfst-based speech recognition decoder
JP2004053745A (en) Method, apparatus, and program for language model generation
KR100464420B1 (en) Apparatus for calculating an Observation Probability for a search of hidden markov model
US20230386458A1 (en) Pre-wakeword speech processing
Lin et al. In silico vox: Towards speech recognition in silicon

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20051019

Termination date: 20130619