CN100489765C - Coprocessor - Google Patents


Publication number
CN100489765C
Authority
CN
China
Prior art keywords
links
module
data bus
address
bus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2007101184303A
Other languages
Chinese (zh)
Other versions
CN101082859A (en
Inventor
董明
梁维谦
李鹏
智强
刘润生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CNB2007101184303A priority Critical patent/CN100489765C/en
Publication of CN101082859A publication Critical patent/CN101082859A/en
Application granted granted Critical
Publication of CN100489765C publication Critical patent/CN100489765C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a coprocessor in the field of integrated circuit design, comprising a memory module, an address generation module, a register group module, a control circuit module, and a computation module. The coprocessor performs the Mahalanobis distance calculation and the multiply-accumulate operation used by HMM-based speech recognition algorithms, which can improve the performance of embedded speech recognition systems. It offers broad versatility, low cost, and low power consumption.

Description

Coprocessor
Technical field
The invention belongs to the field of very-large-scale integration (VLSI) system-on-chip (SoC) design within information technology. It relates in particular to a new generation of embedded speech recognition applications, and especially to a coprocessor for the Mahalanobis distance calculation and the vector multiply-accumulate operation of recognition based on HMMs (Hidden Markov Models).
Background art
Research on embedded speech recognition systems has long been an important direction in the application of speech recognition technology. Such systems give portable, miniaturized products a good voice interface between human and machine, for example third-generation intelligent language learning machines, voice dialing on telephones, voice retrieval in entertainment products such as MP3 players, in-vehicle voice control systems, intelligent toys, and voice remote control of household appliances. In recent years, thanks to the improved processing performance of embedded MCUs (Micro Controller Units) and DSPs (Digital Signal Processors) and to improvements in speech recognition algorithms, embedded speech processing SoCs have been implemented both at home and abroad and have begun to be used in volume. However, high-performance speech recognition algorithms are fairly complex, and the recognition performance, computation time, power consumption, and cost of existing chips do not meet application requirements well. For example, a medium-vocabulary recognition task based on whole-word continuous hidden Markov models (HMMs) needs a clock frequency of about 100 MHz on a 16-bit fixed-point DSP (such as the counter 2054 series of TI) and about 200 MHz on a 32-bit MCU (such as ARM9). These chips consume considerable power and are costly, so they are hard to use widely in portable devices.
By contrast, a hardware implementation of the speech recognition algorithm is very fast. Referring to Fig. 1, the key algorithms of speech recognition are implemented as a key-operation module in a large-scale integrated circuit. This module serves as a coprocessor that either works alongside the main CPU or is embedded in it, so that the main CPU can complete speech recognition processing at a lower clock frequency, which in turn reduces power consumption and cost.
Referring to Fig. 2, an HMM-based embedded speech recognition algorithm comprises three basic steps:
Step 11: perform feature extraction on the raw digital speech to obtain speech feature vectors, such as MFCC (Mel-Frequency Cepstral Coefficients) feature vectors;
Step 12: calculate the output probabilities from the speech feature vectors and the acoustic HMM models;
Step 13: use the resulting output probability matrix to search the recognition network and obtain the final recognition result.
Calculating the output probabilities accounts for more than 70% of the operations of the whole system. That calculation is dominated by the Mahalanobis distance computation, which still accounts for 55% of all operations even after the algorithm is optimized.
To achieve real-time speech recognition, the prior art can only use a DSP for the Mahalanobis distance calculation. Because a DSP is optimized for general-purpose computation, it cannot efficiently handle the basic multiply-add operations in the Mahalanobis distance. Implementing a Mahalanobis distance computing chip as a large-scale integrated circuit is therefore essential to improving the performance of embedded speech recognition systems.
Summary of the invention
To improve computational efficiency and reduce computational cost, the invention provides a coprocessor. The technical solution is as follows:
A coprocessor, comprising: a memory module, an address generation module, a register group module, a control circuit module, and a computation module;
The memory module is connected to the address generation module through an internal address bus, to an external processor through an external address bus, an external data bus, and an external control bus, to the control circuit module through an internal control bus, and to the computation module through an internal data bus, and is used to store the feature vectors, the model state vectors, and the calculation results of the computation process;
The address generation module is connected to the register group module through the internal data bus and to the control circuit module through the internal control bus, and is used, under the control of the control circuit module, to generate the addresses of the feature vectors and calculation results stored in the memory module;
The register group module shares a unified address space with the memory module, is connected to the computation module through the internal data bus, and is used to store the start addresses of each feature vector and of the calculation result;
The input of the control circuit module is connected through the internal data bus to the output of the register group module; by reading the contents of the registers in the register group module, the control circuit module performs counter initialization control, vector multiply-accumulate control, and vector multiply-multiply-accumulate control;
The computation module is connected to the internal data bus and the internal control bus, and is used to perform the absolute value operation and the multiply-multiply-add operation on the feature vectors in the memory module and to output the result to the memory module through the internal data bus.
The memory module comprises: a first selector, an input data port, a second selector, an address port, a third selector, a control port, an output data port, and a storage unit;
The first selector is connected to the external data bus or the internal data bus;
The input data port is connected through the first selector to the external data bus or the internal data bus, and is used to write the received data into the storage unit;
The second selector is connected to the external address bus or the internal address bus;
The address port is connected through the second selector to the external address bus or the internal address bus, and supplies the addresses for the data in the storage unit;
The third selector is connected to the external control bus or the internal control bus;
The control port is connected through the third selector to the external control bus or the internal control bus, and controls the operating state of the storage unit;
The input of the output data port is connected to the storage unit, and its output is connected to both the external data bus and the internal data bus; it is used to deliver data to the external processor through the external data bus or to the computation module through the internal data bus;
The storage unit is connected to the input data port, the address port, the control port, and the output data port, and is used to store the feature vectors and the calculation results.
The address generation module comprises: a plurality of counters and a selector;
The inputs of the counters are connected to the internal data bus and their outputs are connected to the selector; the counters are respectively used to generate the current data addresses of the feature vectors in the control circuit module and the address of the calculation result of the computation module;
The output of the selector is connected to the internal address bus, and the selector is used to output the addresses generated by the counters onto the internal address bus.
The register group module comprises: a plurality of registers and a decoder;
The outputs of the registers are connected to the internal data bus and their inputs are connected to the external data bus; the registers are respectively used to hold the start addresses of the different vectors to be calculated, the overflow protection setting, and the calculation mode setting;
The outputs of the decoder are connected to the enable inputs of the registers, and the decoder is used to control the writing of the register values.
The control circuit module comprises: a counter, a switch, a counter initialization control unit, a vector multiply-accumulate operation control unit, and a Mahalanobis distance operation control unit;
The counter is a down counter connected to the internal data bus, and is used to control the number of loop iterations according to the setting that the register group module outputs over the internal data bus;
The switch has two inputs, connected to the counter and to the internal data bus respectively, and is used to turn the control circuit module on or off according to the output value of the counter or the state value of the register group module received over the internal data bus;
The input of the counter initialization control unit is connected to the output of the switch and to the internal data bus; when the switch is on, the unit performs initialization control of the counter and selects either the vector multiply-accumulate operation control unit or the Mahalanobis distance operation control unit;
The input of the vector multiply-accumulate operation control unit is connected to the output of the counter initialization control unit; the unit controls the computation module to perform the multiply-accumulate operation and to output the final result;
The input of the Mahalanobis distance operation control unit is connected to the output of the counter initialization control unit; the unit controls the computation module to perform the Mahalanobis distance operation and to output the final result.
The computation module comprises: an absolute value operation unit, a multiply-multiply-add unit, and an output port;
The absolute value operation unit is connected to the internal data bus, and is used to perform the absolute value operation on two input feature vectors and to output the result to the multiply-multiply-add unit;
The multiply-multiply-add unit specifically comprises two selectors, a shifter, a multiplier, and an accumulator. The two selectors are connected to the absolute value operation unit and to the internal data bus respectively, and receive the result of the absolute value operation unit and another feature vector;
The input of the shifter is connected to the internal data bus and its output is connected to the two selectors; the shifter shifts the result of the multiplier;
The inputs of the multiplier are connected to the two selectors, and the multiplier multiplies the feature vectors output by the two selectors;
The accumulator is connected to the shifter and the multiplier, accumulates the squares of the shifted results, and passes the accumulated result to the output port;
The output port is connected to the output of the multiply-multiply-add unit, and outputs the result of that unit to the internal data bus, which carries it to the memory module.
The memory module is a standard static random access memory (SRAM).
The beneficial effect of technical scheme provided by the invention is:
The coprocessor provided by the invention performs the Mahalanobis distance calculation and the multiply-accumulate operation of HMM-based speech recognition algorithms, which can improve the performance of embedded speech recognition systems. It can also perform the vector multiply-accumulate operation found in general-purpose DSPs, which extends its versatility. It therefore offers higher performance, lower cost, and lower power consumption.
Description of drawings
Fig. 1 is a simplified structural diagram of a prior-art hardware implementation of a speech recognition algorithm;
Fig. 2 is a flowchart of a prior-art HMM-based embedded speech recognition algorithm;
Fig. 3 is a structural diagram of the coprocessor provided by an embodiment of the invention.
Embodiment
To make the purpose, technical solution, and advantages of the invention clearer, embodiments of the invention are described in further detail below with reference to the drawings.
This embodiment provides a coprocessor that uses a standard SRAM (Static Random Access Memory) interface and can be used with a variety of microcontrollers (MCUs). Under register programming control, the coprocessor performs either the Mahalanobis distance operation or the vector multiply-accumulate operation.
Referring to Fig. 3, the coprocessor consists of the following parts: a memory module 100, an address generation module 200, a register group module 300, a control circuit module 400, and a computation module 500. The figure also shows the internal address bus A1, the external address bus A2, the internal data bus B1, the external data bus B2, the internal control bus C1, and the external control bus C2.
The memory module 100 is connected to the address generation module 200 through the internal address bus A1, to the external processor through the external address bus A2, the external data bus B2, and the external control bus C2, to the control circuit module 400 through the internal control bus C1, and to the computation module 500 through the internal data bus B1. It stores the feature vectors, the model state vectors, and the calculation results of the computation process;
The address generation module 200 is connected to the register group module 300 through the internal data bus B1 and to the control circuit module 400 through the internal control bus C1. Under the control of the control circuit module 400, it generates the addresses of the feature vectors and calculation results stored in the memory module 100;
The register group module 300 shares a unified address space with the memory module 100 and is connected to the computation module 500 through the internal data bus B1. It stores the start addresses of each feature vector and of the calculation result;
The input of the control circuit module 400 is connected through the internal data bus B1 to the output of the register group module 300. By reading the contents of the registers in the register group module 300, it performs counter initialization control, vector multiply-accumulate control, and vector multiply-multiply-accumulate control;
The computation module 500 is connected to the internal data bus B1 and the internal control bus C1. It performs the absolute value operation and the multiply-multiply-add operation on the feature vectors in the memory module 100 and outputs the result to the memory module 100 through the internal data bus B1.
Further, the memory module 100 consists of one read/write single-port SRAM and specifically comprises: a first selector 101, an input data port 102, a second selector 103, an address port 104, a third selector 105, a control port 106, an output data port 107, and a storage unit 108, where:
The first selector 101 is connected to the external data bus B2 or the internal data bus B1;
The input data port 102 is connected through the first selector 101 to the external data bus B2 or the internal data bus B1. Before the module starts working, the first selector 101 always selects the external data bus B2, which ensures that external data can be written. After the module starts working, under the control of the control circuit module 400, the first selector 101 selects the internal data bus B1, which ensures that the results of the computation module 500 are written into the storage unit 108 of the memory module 100.
The second selector 103 is connected to the external address bus A2 or the internal address bus A1;
The address port 104 is connected through the second selector 103 to the external address bus A2 or the internal address bus A1. Before the module starts working, the second selector 103 always selects the external address bus A2. After the module starts working, under the control of the control circuit module 400, the second selector 103 selects the internal address bus A1.
The third selector 105 is connected to the external control bus C2 or the internal control bus C1;
The control port 106 is connected through the third selector 105 to the external control bus C2 or the internal control bus C1. Before the module starts working, the control port 106 is controlled by the external processor; after the module starts working, it is controlled by the internal circuitry.
The input of the output data port 107 is connected to the storage unit 108, and its output is connected to both the external data bus B2 and the internal data bus B1. It delivers data to the external processor through the external data bus B2 and to the computation module 500 through the internal data bus B1;
The storage unit 108 is connected to the input data port 102, the address port 104, the control port 106, and the output data port 107, and stores the feature vectors and the calculation results.
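The bus switching described for the three selectors can be pictured with a small behavioral model. This is only a sketch under stated assumptions: the class name `MemoryModule`, the `started` flag, and the tuple-based bus representation are hypothetical and are not taken from the patent; only the rule "external buses before start, internal buses after start" comes from the description.

```python
class MemoryModule:
    """Behavioral sketch of memory module 100 (names are illustrative assumptions)."""

    def __init__(self, size):
        self.cells = [0] * size   # stands in for storage unit 108
        self.started = False      # False: external buses selected; True: internal buses

    def write(self, external, internal):
        """Each argument is the (address, data) pair currently presented on that side.

        Selectors 101 and 103 pick the external pair before the module starts
        working and the internal pair afterwards.
        """
        address, data = internal if self.started else external
        self.cells[address] = data


mem = MemoryModule(16)
mem.write(external=(1, 10), internal=(2, 20))   # before start: external side wins
mem.started = True
mem.write(external=(3, 30), internal=(4, 40))   # after start: internal side wins
```

In this model the external processor can load vectors only while `started` is false, which mirrors the description's statement that the external channel is selected until the module begins working.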
In this embodiment the address generation module 200 specifically comprises five counters, all of them up counters: a first counter 201, a second counter 202, a third counter 203, a fourth counter 204, and a fifth counter 205. The first counter 201, the second counter 202, and the third counter 203 each generate the current data address of one feature vector in the control circuit module 400, and each vector is stored in one contiguous block of address space. The fifth counter 205 generates the address of the calculation result, and the fourth counter 204 is a spare. The outputs of the five counters are connected to the inputs of a fourth selector 206. The number and functions of the counters in this module can be changed as needed.
The fourth selector 206 outputs the address generated by the counters onto the internal address bus A1; its output is connected to the internal address bus A1.
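The counter-plus-selector arrangement can be sketched as follows. This is a minimal behavioral model, not the circuit itself: the class and method names are hypothetical, the start addresses 0x100, 0x200, and 0x300 are made-up example values, and the round-robin selection order is simply one plausible way of exercising the selector.

```python
class UpCounter:
    """Up counter that is loaded with a start address and then increments."""

    def __init__(self):
        self.value = 0

    def load(self, start):
        self.value = start

    def read_and_increment(self):
        address = self.value
        self.value += 1
        return address


class AddressGenerator:
    """Sketch of address generation module 200: five counters behind one selector."""

    def __init__(self, num_counters=5):
        self.counters = [UpCounter() for _ in range(num_counters)]

    def next_address(self, select):
        # The selector (206 in the description) forwards one counter's output
        # to the internal address bus; that counter then increments.
        return self.counters[select].read_and_increment()


gen = AddressGenerator()
gen.counters[0].load(0x100)   # hypothetical start address of vector a
gen.counters[1].load(0x200)   # hypothetical start address of vector b
gen.counters[2].load(0x300)   # hypothetical start address of vector c
sequence = [gen.next_address(i % 3) for i in range(6)]
```

Because each vector occupies a contiguous address block, a plain increment is enough to walk through it, which is why the module needs nothing more than loadable up counters.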
In this embodiment the register group module 300 comprises eight registers whose data inputs are connected to the external data bus B2: a first register 301, a second register 302, a third register 303, a fourth register 304, a fifth register 305, a sixth register 306, a seventh register 307, and an eighth register 308. Their functions and addresses are listed in Table 1.
The outputs of the first register 301, the second register 302, the third register 303, the fourth register 304, the fifth register 305, and the sixth register 306 are connected to the inputs of the sixth counter 401, the first counter 201, the second counter 202, the third counter 203, the fourth counter 204, and the fifth counter 205 respectively.
The seventh register 307 is the shift register; its output is connected to the shifter 505 of the computation module 500.
The eighth register 308 is the status register; its output is connected to the control circuit module 400.
Table 1
Register | Address | Function
First register 301 | 0xff0 | Initialization register for the sixth counter 401; controls the number of loop iterations
Second register 302 | 0xff1 | Initialization register for the first counter 201; holds the start address of a vector to be calculated
Third register 303 | 0xff2 | Initialization register for the second counter 202; holds the start address of a vector to be calculated
Fourth register 304 | 0xff3 | Initialization register for the third counter 203; holds the start address of a vector to be calculated
Fifth register 305 | 0xff4 | Initialization register for the fourth counter 204; holds the start address of a vector to be calculated
Sixth register 306 | 0xff5 | Initialization register for the fifth counter 205; holds the start address at which the calculation result is saved
Seventh register 307 | 0xff6 | Shift amount register; sets the hardware overflow protection
Eighth register 308 | 0xff7 | Status register; see Table 2 for its function
The number and functions of the registers in this module can be changed as needed. The eighth register 308 is a status register whose function is described in Table 2.
Table 2
Bit | Value 0 | Value 1
Bit 0 | Do not initialize the sixth counter 401; it counts from its current internal value | Initialize the sixth counter 401 by loading the value of the first register 301
Bit 1 | Do not initialize the first counter 201; it counts from its current internal value | Initialize the first counter 201 by loading the value of the second register 302
Bit 2 | Do not initialize the second counter 202; it counts from its current internal value | Initialize the second counter 202 by loading the value of the third register 303
Bit 3 | Do not initialize the third counter 203; it counts from its current internal value | Initialize the third counter 203 by loading the value of the fourth register 304
Bit 4 | Do not initialize the fourth counter 204; it counts from its current internal value | Initialize the fourth counter 204 by loading the value of the fifth register 305
Bit 5 | Do not initialize the fifth counter 205; it counts from its current internal value | Initialize the fifth counter 205 by loading the value of the sixth register 306
Bit 6 | Perform the Mahalanobis distance calculation | Perform the vector multiply-accumulate operation
Bit 7 | Stop working; this bit is cleared to 0 automatically when the internal calculation finishes | Enable operation
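The bit assignments of Table 2 can be made concrete with a small decoding helper. This is an illustrative sketch only: the function name `decode_status` and the dictionary keys are hypothetical, while the bit positions follow Table 2 and the example value 0x00BF is the one the description later gives for starting a Mahalanobis distance operation.

```python
def decode_status(value):
    """Decode the low 8 bits of the status register according to Table 2."""
    low = value & 0xFF
    return {
        # Bits 0..5: reload counters 401, 201, 202, 203, 204, 205 in that order.
        "init_counters": [bool((low >> bit) & 1) for bit in range(6)],
        # Bit 6: 0 selects Mahalanobis distance, 1 selects vector multiply-accumulate.
        "mode": "multiply_accumulate" if (low >> 6) & 1 else "mahalanobis",
        # Bit 7: 1 enables operation; hardware clears it when the calculation ends.
        "enabled": bool((low >> 7) & 1),
    }


status = decode_status(0x00BF)
```

For 0x00BF (binary 1011 1111) the helper reports all six counters reloaded, Mahalanobis mode, and operation enabled.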
The register group module 300 also comprises a decoder 309. The decoder 309 is connected to the external address bus A2, and its outputs are connected to the enable inputs of the eight registers.
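One way to picture the decoder is as a function from an external address to a set of one-hot register enables. The sketch below assumes the address map of Table 1 (0xff0 to 0xff7); the names `REGISTER_BASE`, `NUM_REGISTERS`, and `decode_enable` are hypothetical, and the behavior for addresses outside the register range (no register enabled) is an assumption rather than something the description states.

```python
REGISTER_BASE = 0xFF0   # address of the first register 301 in Table 1
NUM_REGISTERS = 8       # registers 301 through 308


def decode_enable(address):
    """Return one enable line per register; at most one line is 1."""
    offset = address - REGISTER_BASE
    return [1 if offset == index else 0 for index in range(NUM_REGISTERS)]


enable_third = decode_enable(0xFF2)     # selects the third register 303
enable_memory = decode_enable(0x0100)   # ordinary memory address: no register enabled
```

Since the registers and the memory share one address space, a decoder of this kind is what lets the external processor reach the registers through the same SRAM-style interface it uses for data.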
The control circuit module 400 specifically comprises: a sixth counter 401, a switch 402, a counter initialization control unit 403, a vector multiply-accumulate operation control unit 404, and a Mahalanobis distance operation control unit 405, where:
The sixth counter 401 is a down counter connected to the internal data bus B1. It controls the number of loop iterations according to the setting that the register group module 300 outputs over the internal data bus B1;
The switch 402 has two inputs, connected to the sixth counter 401 and to the internal data bus B1 respectively. It turns the control circuit module 400 on or off according to the output value of the sixth counter 401 or the state value of the register group module 300 received over the internal data bus B1;
The input of the counter initialization control unit 403 is connected to the output of the switch 402 and to the internal data bus B1. When the switch 402 is on, the unit reads the low 8 bits of the eighth register 308 and performs counter initialization control according to the low 6 bits, as specified in Table 2. According to the value of bit 6, it decides whether to perform the multiply-accumulate operation or the Mahalanobis distance operation, and selects the vector multiply-accumulate operation control unit 404 or the Mahalanobis distance operation control unit 405 accordingly;
The input of the vector multiply-accumulate operation control unit 404 is connected to the output of the counter initialization control unit 403. The unit controls the computation module 500 to perform the multiply-accumulate operation and to output the final result;
The input of the Mahalanobis distance operation control unit 405 is connected to the output of the counter initialization control unit 403. The unit controls the computation module 500 to perform the Mahalanobis distance operation and to output the final result.
The computation module 500 specifically comprises: an absolute value operation unit 501, a multiply-multiply-add unit 502, and an output port 508, where:
The absolute value operation unit 501 is connected to the internal data bus B1 and performs the absolute value operation. Its inputs are vector a and vector b, whose start addresses in the memory module 100 are held in the second register 302 and the third register 303. The absolute value operation unit 501 outputs its result to the multiply-multiply-add unit 502.
The multiply-multiply-add unit 502 is connected to the internal data bus B1 and to the absolute value operation unit 501, and specifically comprises: a fifth selector 503, a sixth selector 504, a shifter 505, a multiplier 506, and an accumulator 507. Depending on the setting of the eighth (status) register 308, the unit performs either the Mahalanobis distance operation or the multiply-accumulate operation.
In Mahalanobis distance mode, the multiply-multiply-add unit 502 completes one Mahalanobis distance operation in 2 clock cycles, as follows:
In the first clock cycle, the fifth selector 503 and the sixth selector 504 select the output of the absolute value operation unit 501 and vector c respectively; the start address of vector c in the memory module 100 is held in the fourth register 304.
In the second clock cycle, the fifth selector 503 and the sixth selector 504 both select the output of the shifter 505, whose input is the result produced by the multiplier 506 in the first clock cycle. At the same time, owing to the pipelined design, the accumulator 507 accumulates the squared result of the previous dimension during this second clock cycle.
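The two-cycle datapath can be summarized with a behavioral model in which the multiplier is used twice per dimension: once for the product with vector c and once for the square. This is a sketch, not the circuit: the function name is hypothetical, the shifter is modeled as a right shift (an assumption, since the description only says the product is shifted to protect against overflow), and the model counts multiplier passes rather than reproducing the overlap between accumulation and the next dimension.

```python
def mahalanobis_two_cycle(a, b, c, shift=0):
    """Return (accumulated result, multiplier cycles) for integer vectors a, b, c."""
    acc = 0
    cycles = 0
    for x, y, z in zip(a, b, c):
        diff = abs(x - y)             # absolute value operation unit 501
        product = diff * z            # cycle 1: multiplier 506 with selectors on |a-b| and c
        cycles += 1
        shifted = product >> shift    # shifter 505 (right shift is an assumption)
        square = shifted * shifted    # cycle 2: multiplier 506 with both selectors on the shifter
        cycles += 1
        acc += square                 # accumulator 507; overlapped with later cycles in hardware
    return acc, cycles


result, used_cycles = mahalanobis_two_cycle([3, 1], [1, 4], [2, 1])
shifted_result, _ = mahalanobis_two_cycle([10, 4], [4, 10], [3, 5], shift=1)
```

Reusing one multiplier for both multiplications is what makes the operation cost two cycles per dimension, in contrast to the single cycle per element of the multiply-accumulate mode.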
In multiply-accumulate mode, the unit completes one multiply-accumulate operation per clock cycle. The fifth selector 503 and the sixth selector 504 select vector a and vector b respectively. Within each clock cycle, owing to the pipelined design, the multiplier 506 multiplies the current elements of vector a and vector b while the accumulator 507 accumulates the product from the previous clock cycle. The result generation unit mainly applies overflow protection to the calculation result.
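The overlap between multiplication and accumulation can be illustrated with a two-stage pipeline model. It is a simplified sketch under stated assumptions: the function name and the `product_reg` pipeline register are hypothetical, the overflow protection mentioned above is not modeled, and the extra drain cycle at the end is a property of this model rather than a figure given in the description.

```python
def pipelined_mac(a, b):
    """Return (dot product, cycles) using a multiply stage and an accumulate stage."""
    acc = 0
    product_reg = None   # product computed in the previous cycle, if any
    cycles = 0
    for x, y in zip(a, b):
        if product_reg is not None:
            acc += product_reg       # accumulator adds last cycle's product
        product_reg = x * y          # multiplier handles the current elements
        cycles += 1
    if product_reg is not None:
        acc += product_reg           # drain the final product from the pipeline
        cycles += 1
    return acc, cycles


dot, mac_cycles = pipelined_mac([1, 2, 3], [4, 5, 6])
```

With three elements the model takes four cycles in total, so the steady-state throughput is one multiply-accumulate per cycle, as the description states.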
The output port 508 outputs the result of the multiply-multiply-add unit 502 to the first selector 101 of the memory module 100.
The Mahalanobis distance operation is carried out with the above coprocessor as follows:
In formula (1), a_i, b_i, and c_i are the elements of three vectors, and D is the vector dimension.
$$\sum_{i=1}^{D} \left( |a_i - b_i| \cdot c_i \right)^2 \tag{1}$$
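As a small worked check of formula (1), take the made-up values D = 2, a = (3, 1), b = (1, 4), and c = (2, 1), which are chosen only for illustration and do not come from the patent:

$$\sum_{i=1}^{2} \left( |a_i - b_i| \cdot c_i \right)^2 = (|3-1| \cdot 2)^2 + (|1-4| \cdot 1)^2 = 4^2 + 3^2 = 25$$

Each dimension therefore costs one subtraction with absolute value, one multiplication by c_i, and one squaring, followed by an accumulation.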
Before the Mahalanobis distance operation, the values of the vectors a_i, b_i, and c_i must be written into the memory module 100 in advance, and each vector must occupy one contiguous block of address space. The dimension D is written to the first register 301, the start address of a_i to the second register 302, the start address of b_i to the third register 303, the start address of c_i to the fourth register 304, and the address in the storage unit 108 of the memory module 100 at which the result is to be stored to the sixth register 306 (address 0xff5 in Table 1). The value |a_i - b_i|·c_i usually has to be shifted before the squaring in formula (1). The shift amount can be preset in the seventh register 307; if the seventh register 307 is not set, a 16-bit shift is applied by default. Finally, the eighth (status) register 308 is set so that the circuit works as intended. Setting the eighth register 308 to 0x00BF starts the Mahalanobis distance operation and reloads the sixth counter 401, the first counter 201, the second counter 202, the third counter 203, the fourth counter 204, and the fifth counter 205 with the values of the first register 301 through the sixth register 306 respectively.
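From the external processor's point of view, the setup amounts to a short sequence of register writes. The sketch below follows the address map of Table 1, in which the result start address lives at 0xff5. Everything else is an assumption for illustration: the `write(address, value)` callback stands in for whatever bus access the host actually uses, the constant and function names are hypothetical, and the example dimension and memory addresses are made up.

```python
REG_LOOP_COUNT = 0xFF0    # first register 301: number of loop iterations (dimension D)
REG_ADDR_A = 0xFF1        # second register 302: start address of vector a
REG_ADDR_B = 0xFF2        # third register 303: start address of vector b
REG_ADDR_C = 0xFF3        # fourth register 304: start address of vector c
REG_ADDR_RESULT = 0xFF5   # sixth register 306: start address for the result (Table 1)
REG_SHIFT = 0xFF6         # seventh register 307: shift amount
REG_STATUS = 0xFF7        # eighth register 308: status register
START_MAHALANOBIS = 0x00BF


def program_mahalanobis(write, dim, addr_a, addr_b, addr_c, addr_result, shift=None):
    """Issue the register writes that set up and start one Mahalanobis run."""
    write(REG_LOOP_COUNT, dim)
    write(REG_ADDR_A, addr_a)
    write(REG_ADDR_B, addr_b)
    write(REG_ADDR_C, addr_c)
    write(REG_ADDR_RESULT, addr_result)
    if shift is not None:          # left unset, the hardware default of 16 bits applies
        write(REG_SHIFT, shift)
    write(REG_STATUS, START_MAHALANOBIS)   # written last: this starts the circuit


log = []
program_mahalanobis(lambda address, value: log.append((address, value)),
                    dim=39, addr_a=0x000, addr_b=0x040, addr_c=0x080,
                    addr_result=0x0C0, shift=16)
```

The status register is written last because setting its enable bit hands the memory buses over to the internal circuitry, after which the host can no longer write the remaining parameters.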
After the circuit starts working, address generating module 200 first selects the output of first counter 201, outputs the address of the first element of vector a on internal address bus A1, and then increments first counter 201 by 1; the value of the first element of vector a is output to computing module 500 over internal data bus B1.
In the second clock cycle, address generating module 200 selects the output of second counter 202, outputs the address of the first element of vector b on internal address bus A1, and then increments second counter 202 by 1; the value of the first element of vector b is output to computing module 500 over internal data bus B1, and absolute value calculation unit 501 begins its operation immediately.
In the third clock cycle, address generating module 200 selects the output of third counter 203, outputs the address of the first element of vector c on internal address bus A1, and then increments third counter 203 by 1; the value of the first element of vector c is output to computing module 500 over internal data bus B1. By this time the absolute value calculation is complete; fifth selector 503 and sixth selector 504 in the computing module select the output of absolute value calculation unit 501 and vector c respectively, and the circuit begins the multiplication.
In the fourth clock cycle, address generating module 200 selects the output of first counter 201, outputs the address of the second element of vector a on internal address bus A1, and then increments first counter 201 by 1; the value of the second element of vector a is output to computing module 500 over internal data bus B1. By this time the multiplication from the previous clock cycle is complete and its result is input to shifter 505; fifth selector 503 and sixth selector 504 in computing module 500 select the output of the shifter in preparation for the squaring operation.
In the fifth clock cycle, address generating module 200 selects the output of second counter 202, outputs the address of the second element of vector b on internal address bus A1, and then increments second counter 202 by 1; the value of the second element of vector b is output to computing module 500 over internal data bus B1, and the absolute value calculation begins immediately. Meanwhile, the squaring operation from the previous cycle is also complete, and its result is input to accumulator 507 in preparation for accumulation.
In the sixth clock cycle, address generating module 200 selects the output of third counter 203, outputs the address of the second element of vector c on internal address bus A1, and then increments third counter 203 by 1; the value of the second element of vector c is output to computing module 500 over internal data bus B1. By this time the absolute value calculation and the accumulation from the previous cycle are both complete; the circuit decrements sixth counter 401 by 1 and performs the multiplication on the absolute value result.
In the subsequent clock cycles, the circuit repeats the work of the fourth through sixth clock cycles described above. When the output of sixth counter 401 is 0, the calculation is finished and computing module 500 stops working. Output port 508 first writes the high 16 bits of the calculation result to internal data bus B1; at the same time, address generating module 200 selects the output of fifth counter 205, outputs the storage address of the high 16 bits of the result on internal address bus A1, and then increments fifth counter 205 by 1, and control circuit module 400 uses internal control signals to make memory module 100 write the value to the corresponding location. In the next clock cycle the low 16 bits are written in the same way, which is not repeated here. After the result has been written, control circuit module 400 clears bit 7 of register r7 to 0 and waits for the next call.
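The fetch order and the two-step write-back described above can be sketched in Python. This is a behavioural illustration only: `fetch_schedule` and `split_result` are hypothetical names, pipeline latencies inside the computing module are not modelled, and the result is assumed to be a 32-bit word because the text writes a high and a low 16-bit half.

```python
from typing import Iterator, Tuple


def fetch_schedule(dim: int, vectors: str = "abc") -> Iterator[Tuple[int, str, int]]:
    """Yield (clock cycle, vector name, element index) for each bus read.

    One element of one vector crosses internal data bus B1 per cycle, in
    the order a, b, c for each index, as in the walkthrough above.
    """
    cycle = 1
    for index in range(dim):
        for name in vectors:
            yield cycle, name, index
            cycle += 1


def split_result(value: int) -> Tuple[int, int]:
    """Split a result into (high 16 bits, low 16 bits) in write-back order.

    ASSUMPTION: the accumulated result is a 32-bit word.
    """
    value &= 0xFFFFFFFF
    return (value >> 16) & 0xFFFF, value & 0xFFFF
```

For example, `list(fetch_schedule(2))` lists six reads, with the second element of vector a fetched in the fourth cycle, matching the description of the fourth clock cycle.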
The multiply-accumulate operation is carried out with the above coprocessor as follows:
As shown in formula (2), a_i and b_i are the elements of two vectors, and D is the vector dimension.
$$\sum_{i=1}^{D} \left( a_i \cdot b_i \right) \tag{2}$$
Before the multiply-accumulate operation is carried out, the values of vectors a_i and b_i must be written into memory module 100 in advance, and the elements of each vector must be stored in a contiguous address range. The dimension D is written into first register 301, the start address of a_i into second register 302, the start address of b_i into third register 303, and the on-chip address at which the calculation result is to be stored into sixth register 306. Finally, eighth register 308 is set so that the circuit works as intended. Setting eighth register 308 to 0x00E7 indicates that the circuit is to begin the multiply-accumulate operation; sixth counter 401, first counter 201, second counter 202 and fifth counter 205 are then reloaded with the values in first register 301, second register 302, third register 303 and sixth register 306, respectively.
After the circuit starts working, address generating module 200 first selects the output of first counter 201, outputs the address of the first element of vector a on internal address bus A1, and then increments first counter 201 by 1; the value of the first element of vector a is output to computing module 500 over internal data bus B1;
In the second clock cycle, address generating module 200 selects the output of second counter 202, outputs the address of the first element of vector b on internal address bus A1, and then increments second counter 202 by 1; the value of the first element of vector b is output to computing module 500 over internal data bus B1; fifth selector 503 and sixth selector 504 in computing module 500 select vector a and vector b respectively, and the circuit begins the multiplication;
In the third clock cycle, address generating module 200 selects the output of first counter 201, outputs the address of the second element of vector a on internal address bus A1, and then increments first counter 201 by 1; the value of the second element of vector a is output to computing module 500 over internal data bus B1; by this time the multiplication from the previous cycle is complete, and its result is input to accumulator 507 in preparation for accumulation;
In the fourth clock cycle, address generating module 200 selects the output of second counter 202, outputs the address of the second element of vector b on internal address bus A1, and then increments second counter 202 by 1; the value of the second element of vector b is output to computing module 500 over internal data bus B1; fifth selector 503 and sixth selector 504 of the multiplier in computing module 500 select vector a and vector b respectively, and the circuit begins the multiplication; by this time the accumulation from the previous clock cycle is complete, and sixth counter 401 is decremented by 1;
In the subsequent clock cycles, the circuit repeats the work of the third and fourth clock cycles described above. When the output of sixth counter 401 is 0, the calculation is finished and computing module 500 stops working. The calculation result is written into memory module 100 in the same way as in the Mahalanobis distance mode, which is not repeated here. After the result has been written, control circuit module 400 clears bit 7 of eighth register 308 to 0 and waits for the next call.
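The multiply-accumulate mode can be approximated by a short behavioural model in Python. This is a sketch under stated assumptions rather than a cycle-accurate description: `run_mac` and its parameters are hypothetical names, a Python list stands in for the storage unit, and the result is assumed to be a 32-bit word written high half first, as in the Mahalanobis mode.

```python
def run_mac(memory, dim, addr_a, addr_b, addr_result):
    """Behavioural sketch of the multiply-accumulate mode.

    memory      : list of ints standing in for the storage unit
    dim         : vector dimension D (value of first register 301)
    addr_a      : start address of vector a (second register 302)
    addr_b      : start address of vector b (third register 303)
    addr_result : result address (sixth register 306)
    Returns (result, number of bus reads).
    """
    loop_counter = dim         # sixth counter 401, counts down to 0
    counter_a = addr_a         # first counter 201
    counter_b = addr_b         # second counter 202
    counter_out = addr_result  # fifth counter 205
    acc = 0
    bus_reads = 0
    while loop_counter != 0:
        a_i = memory[counter_a]  # fetch an element of a, then advance
        counter_a += 1
        b_i = memory[counter_b]  # fetch an element of b, then advance
        counter_b += 1
        bus_reads += 2
        acc += a_i * b_i         # multiply, then accumulate
        loop_counter -= 1
    acc &= 0xFFFFFFFF            # ASSUMPTION: 32-bit result word
    memory[counter_out] = (acc >> 16) & 0xFFFF  # high 16 bits first
    counter_out += 1
    memory[counter_out] = acc & 0xFFFF          # then low 16 bits
    return acc, bus_reads
```

Because two operands are fetched per element over a single data bus, the model performs 2·D bus reads, which is consistent with the two-cycle repeat pattern in the text.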
The coprocessor provided by the embodiment of the present invention implements in hardware logic the Mahalanobis distance calculation, which accounts for 50% of the computation in an embedded speech recognition system. The coprocessor greatly improves the performance of embedded speech recognition systems and further promotes the wide use of speech recognition on portable devices. In addition, the coprocessor is highly extensible and can be used on its own in many real-time speech recognition applications, such as language learning machines, in-vehicle hands-free systems and voice-controlled MP3 players. The technical specifications of the coprocessor are shown in Table 3.
Table 3
Target library: HeJian 0.18 µm CMOS process library
Scale: die area 0.9 mm × 0.8 mm; logic size 12,000 equivalent gates (standard 2-input NAND gates); SRAM size 64 Kbit
Function: performs the Mahalanobis-distance-based vector multiply-multiply-add operation and the vector multiply-accumulate operation
Maximum operating frequency: 150 MHz
Recognition scale: the number of speech frames processed is adjustable; the maximum processing capacity is 128 frames of speech data at 8 kHz sampling
Performance: at an operating frequency of 50 MHz with 358 model states, recognizing one 128-frame segment of 8 kHz-sampled speech data takes 0.27 seconds
The coprocessor provided by the present invention performs the Mahalanobis distance calculation and the multiply-accumulate operation of the HMM-based speech recognition algorithm. For a Mahalanobis distance calculation of the same dimension, the present invention needs only 20% of the time that a DSP needs. Using this coprocessor can therefore greatly improve the performance of an embedded speech recognition system and eliminate the need for an expensive DSP device. It can also perform the vector multiply-accumulate operation of a general-purpose DSP, which broadens its versatility, thereby achieving the goals of higher performance, lower cost and lower power consumption.
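The figures quoted in Table 3 and in the paragraph above can be turned into a rough cycle budget. The Python sketch below is plain arithmetic on the published numbers; the variable names are illustrative, and the per-state figure is only an average over the whole recognition run, since the source does not say how the 0.27 seconds divides between distance calculation and other work.

```python
CLOCK_HZ = 50_000_000     # operating frequency used for the Table 3 measurement
RECOGNITION_MS = 270      # 0.27 s to recognize one 128-frame utterance
FRAMES = 128              # frames per utterance
MODEL_STATES = 358        # number of model states
DSP_TIME_FRACTION = 0.20  # coprocessor time as a fraction of DSP time

# Total clock cycles available during the measured recognition run.
total_cycles = CLOCK_HZ * RECOGNITION_MS // 1000

# Average budget per frame, and per (frame, model state) pair.
cycles_per_frame = total_cycles / FRAMES
cycles_per_frame_state = cycles_per_frame / MODEL_STATES

# Needing 20% of the DSP's time corresponds to a fivefold speed-up.
speedup_vs_dsp = 1 / DSP_TIME_FRACTION
```

Under these numbers the run spans 13.5 million cycles, or a little under 300 cycles per frame and model state on average.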
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (7)

1. A coprocessor, characterized in that the coprocessor comprises: a memory module, an address generating module, a register group module, a control circuit module and a computing module;
the memory module is connected to the address generating module via an internal address bus, to an external processor via an external address bus, an external data bus and an external control bus, to the control circuit module via an internal control bus, and to the computing module via an internal data bus, and is used to store the feature vectors, model state vectors and calculation results of the computation process;
the address generating module is connected to the register group module via the internal data bus and to the control circuit module via the internal control bus, and is used to generate, under the control of the control circuit module, the addresses of the feature vectors and calculation results stored in the memory module;
the register group module is addressed uniformly with the memory module, is connected to the computing module via the internal data bus, and is used to store the start address of each feature vector and of the calculation result;
the input of the control circuit module is connected to the output of the register group module via the internal data bus, and the control circuit module is used to perform counter initialization control, vector multiply-accumulate control and vector multiply-multiply-accumulate control by reading the contents of each register in the register group module;
the computing module is connected to the internal data bus and the internal control bus, and is used to perform absolute value operations and multiply-multiply-add operations on the feature vectors in the memory module and to output the calculation result to the memory module via the internal data bus.
2. The coprocessor according to claim 1, characterized in that the memory module comprises: a first selector, an input data port, a second selector, an address port, a third selector, a control port, an output data port and a storage unit;
the first selector is connected to the external data bus or the internal data bus;
the input data port is connected to the external data bus or the internal data bus via the first selector, and is used to write received data into the storage unit;
the second selector is connected to the external address bus or the internal address bus;
the address port is connected to the external address bus or the internal address bus via the second selector, and provides the corresponding addresses for the data in the storage unit;
the third selector is connected to the external control bus or the internal control bus;
the control port is connected to the external control bus or the internal control bus via the third selector, and controls the working state of the storage unit;
the input of the output data port is connected to the storage unit and its output is connected to the external data bus and the internal data bus respectively, and the output data port is used to supply data to the external processor via the external data bus or to the computing module via the internal data bus;
the storage unit is connected to the input data port, the address port, the control port and the output data port, and is used to store the feature vectors and calculation results.
3. The coprocessor according to claim 1, characterized in that the address generating module comprises: a plurality of counters and a selector;
the inputs of the plurality of counters are connected to the internal data bus and their outputs are connected to the selector, and the counters are respectively used to generate the current data addresses of the feature vectors in the control circuit module and the address of the calculation result of the computing module;
the output of the selector is connected to the internal address bus, and the selector is used to output the addresses generated by the plurality of counters to the internal address bus.
4. The coprocessor according to claim 1, characterized in that the register group module comprises: a plurality of registers and a decoder;
the outputs of the plurality of registers are connected to the internal data bus and their inputs are connected to the external data bus, and the registers are respectively used to store the start addresses of the different vectors to be calculated, the overflow protection setting and the calculation mode setting;
the output of the decoder is connected to the enable inputs of the plurality of registers, and the decoder is used to control the writing of values into the plurality of registers.
5. The coprocessor according to claim 1, characterized in that the control circuit module comprises: a counter, a switch, a counter initialization control module, a vector multiply-accumulate operation control unit and a Mahalanobis distance operation control unit;
the counter is a down counter connected to the internal data bus, and is used to control the number of loop iterations according to the state setting output by the register group module via the internal data bus;
the switch has two inputs, which are connected to the counter and the internal data bus respectively, and the switch is used to turn the control circuit module on or off according to the output value of the counter or the state value of the register group module input via the internal data bus;
the inputs of the counter initialization control module are connected to the output of the switch and to the internal data bus, and the counter initialization control module is used, when the switch is on, to perform initialization control of the counter and to select the vector multiply-accumulate operation control unit or the Mahalanobis distance operation control unit;
the input of the vector multiply-accumulate operation control unit is connected to the output of the counter initialization control module, and the unit is used to control the computing module to perform the multiply-accumulate operation and to output the final calculation result;
the input of the Mahalanobis distance operation control unit is connected to the output of the counter initialization control module, and the unit is used to control the computing module to perform the Mahalanobis distance operation and to output the final calculation result.
6. The coprocessor according to claim 1, characterized in that the computing module comprises: an absolute value operation unit, a multiply-multiply-add operation unit and an output port;
the absolute value operation unit is connected to the internal data bus, and is used to perform the absolute value operation on two input feature vectors and to output the calculation result to the multiply-multiply-add operation unit;
the multiply-multiply-add operation unit specifically comprises: two selectors, a shifter, a multiplier and an accumulator, wherein the two selectors are connected to the absolute value operation unit and the internal data bus respectively, and are used to receive the calculation result of the absolute value operation unit and the other feature vector;
the input of the shifter is connected to the internal data bus and its output is connected to the two selectors, and the shifter is used to shift the operation result of the multiplier;
the input of the multiplier is connected to the two selectors, and the multiplier is used to multiply the feature vectors output by the two selectors;
the accumulator is connected to the shifter and the multiplier, and is used to perform the square-and-accumulate operation on the shifted result and to pass the accumulated result to the output port;
the output port is connected to the output of the multiply-multiply-add operation unit, and is used to output the calculation result of the multiply-multiply-add operation unit to the internal data bus, over which the result is transferred to the memory module.
7. The coprocessor according to claim 1, characterized in that the memory module is a standard static random access memory (SRAM).
CNB2007101184303A 2007-07-05 2007-07-05 Coprocessor Expired - Fee Related CN100489765C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007101184303A CN100489765C (en) 2007-07-05 2007-07-05 Coprocessor

Publications (2)

Publication Number Publication Date
CN101082859A CN101082859A (en) 2007-12-05
CN100489765C true CN100489765C (en) 2009-05-20

Family

ID=38912445

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007101184303A Expired - Fee Related CN100489765C (en) 2007-07-05 2007-07-05 Coprocessor

Country Status (1)

Country Link
CN (1) CN100489765C (en)

Also Published As

Publication number Publication date
CN101082859A (en) 2007-12-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090520

Termination date: 20180705