CN106816147A - Speech recognition system based on binary neural network acoustic model

Info

Publication number: CN106816147A
Application number: CN201710055681.5A
Authority: CN
Inventors: 俞凯; 钱彦旻; 项煦
Original assignee: Shanghai Jiaotong University; Suzhou Speech Information Technology Co Ltd
Current assignee: AI Speech Ltd
Priority date: 2017-01-25
Filing date: 2017-01-25
Publication date: 2017-06-09

Abstract

A speech recognition system based on a binary neural network acoustic model. A binary neural network is used to model the observation probability distribution of a hidden Markov model (HMM) and is trained on extracted speech features, thereby obtaining the acoustic model. The present invention uses binary values in place of the conventional 32-bit floating-point numbers, so that the storage and memory footprint of the model drop substantially; the binary neural network can make full use of hardware instructions to accelerate computation, so that models that previously could only be computed on a server with multiple GPUs can now run on the CPU of a mobile device; and because model training also benefits from the acceleration of the binary neural network, the training time is shortened considerably as well.

Description

Speech recognition system based on binary neural network acoustic model

Technical field

The present invention relates to a technology in the field of information processing, specifically a speech recognition system based on a binary neural network acoustic model.

Background art

Existing neural networks used for acoustic modeling (including but not limited to DNN, CNN and RNN) have millions to hundreds of millions of network weights, and each weight is stored as a 32-bit floating-point number, so a large amount of storage and memory is needed to run them. For an ordinary neural network acoustic model, a large parameter count also means a large amount of computation, which places high demands on the computing power of the device. Such a model must also be trained on a large amount of speech data, and even with substantial computing resources the training time is long.

When 32-bit floating-point numbers are used as the data type of the network weights, deep neural network models are not fast enough for large-scale data processing on existing hardware (such as the CPU, Central Processing Unit, and the GPU, Graphics Processing Unit) and software (MKL, Intel's acceleration library for matrix and other mathematical operations; cuDNN, NVIDIA's GPU acceleration library for deep learning).

Summary of the invention

To address defects of the prior art such as slow model training and the inability to use bit operations on a CPU or GPU for acceleration, the present invention proposes a speech recognition system based on a binary neural network acoustic model. Binary values are used in place of the conventional 32-bit floating-point numbers, so that the storage and memory footprint of the model drop substantially; the binary neural network can make full use of hardware instructions to accelerate computation, so that models that previously could only be computed on a server with multiple GPUs can now run on the CPU of a mobile device; and because model training also benefits from the acceleration of the binary neural network, the training time is shortened considerably as well.

The present invention is achieved by the following technical solutions：

The present invention relates to a method for implementing a binary neural network acoustic model for speech recognition: a binary neural network is used to model the observation probability distribution of a hidden Markov model (HMM), and training is performed on the extracted speech features, thereby obtaining the acoustic model.

The binary neural network is a recurrent neural network, a convolutional neural network or a deep feedforward neural network, and specifically comprises an input layer, at least two hidden layers and an output layer connected in sequence, wherein each hidden layer applies nonlinear processing to its input vector and outputs the result. The hidden layers in the present invention binarize the input vector x and the network weights W.

The binarization means: the output is -1 when the input is not greater than zero, and 1 otherwise.

The nonlinear processing means: the output is -1 when the input is less than -1, the output is 1 when the input is greater than 1, and the output equals the input in all other cases.
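The two element-wise operations defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names binarize, hard_tanh and apply_elementwise are hypothetical, and plain Python lists stand in for vectors.

```python
def binarize(x):
    """Binarization: -1 when the input is not greater than zero, 1 otherwise."""
    return -1 if x <= 0 else 1


def hard_tanh(x):
    """Nonlinear processing: clamp the input to the interval [-1, 1]."""
    if x < -1:
        return -1
    if x > 1:
        return 1
    return x


def apply_elementwise(fn, vector):
    """Apply a scalar function to every element of a vector (hypothetical helper)."""
    return [fn(v) for v in vector]


print(apply_elementwise(binarize, [-0.7, 0.0, 0.3, 2.5]))     # [-1, -1, 1, 1]
print(apply_elementwise(hard_tanh, [-3.0, -0.5, 0.25, 4.0]))  # [-1, -0.5, 0.25, 1]
```

Note that an input of exactly zero is mapped to -1, following the "not greater than zero" wording of the definition.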

To speed up deep neural network processing, multiple input vectors are preferably merged into one matrix and processed together as the input, in which case the output of the binary neural network is also in matrix form. In general, most of the network's computation time is spent on matrix multiplication.

The computation is implemented using, but is not limited to, hardware instructions such as popcnt() on Intel CPUs and __popcll() on NVIDIA GPUs.

The features are obtained as follows: the audio (usually a wav file) is split into frames, that is, cut into multiple short segments in which adjacent segments overlap; a mathematical transform (such as the Fourier transform) is then applied to these segments, so that each segment of speech becomes a feature; the extracted features serve as the input of the speech recognition system.
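The framing and transform steps can be sketched as follows. This is an illustrative sketch only: the function names and the frame_len and hop parameters are hypothetical, a naive discrete Fourier transform from the standard library is used for self-containment, and practical front ends usually add further steps that the text above does not specify.

```python
import cmath


def split_into_frames(samples, frame_len, hop):
    """Cut samples into frames of frame_len samples whose starts are hop apart.

    Adjacent frames overlap whenever hop < frame_len.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


def dft_magnitudes(frame):
    """Magnitudes of the discrete Fourier transform of one frame (naive O(n^2))."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def extract_features(samples, frame_len=8, hop=4):
    """One feature vector per overlapping frame."""
    return [dft_magnitudes(f) for f in split_into_frames(samples, frame_len, hop)]


samples = [float(i % 4) for i in range(20)]  # made-up signal standing in for audio
features = extract_features(samples)
print(len(features), len(features[0]))  # 4 5
```

With 20 samples, a frame length of 8 and a hop of 4, each frame shares half of its samples with the next one.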

The HMM comprises multiple states, a state transition probability distribution (state transition matrix) and an observation probability distribution (modeled using a GMM, i.e. a Gaussian mixture model), wherein each speech phoneme corresponds to one HMM, and an HMM generally contains multiple states.

A phoneme is the smallest unit of speech, divided according to the natural properties of speech. For example, according to the phoneme specification of The CMU Pronouncing Dictionary, the phoneme sequence corresponding to the English word cat is K AE T (source: http://www.speech.cs.cmu.edu/cgi-bin/cmudict).

Training means: using the features extracted from the audio in advance, computing the parameters of the hidden Markov model that match the speech feature data and the text, namely the state transition probability distribution and the observation probability distribution.

The present invention also relates to a speech recognition system using the binary neural network acoustic model obtained by the above method, comprising an acquisition module, a feature extraction module, a training module and a recognition module, wherein: the acquisition module outputs training audio and the corresponding text to the feature extraction module in the offline process, and outputs raw audio to the feature extraction module in the online testing process; the feature extraction module outputs the feature files and corresponding labels of the training audio to the training module, and outputs the feature files of the raw audio to the recognition module; the training module trains the binary neural network acoustic model using the extracted features and labels and outputs the trained model to the recognition module; and the recognition module uses this model to recognize the feature files of the raw audio.

Recognition means: the recognition module performs computation on the feature files of the raw audio using the hidden Markov model and obtains the hidden state sequence with the highest probability, from which the phoneme sequence and then the text corresponding to the audio are obtained.
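Finding the most probable hidden state sequence of an HMM is commonly done with the Viterbi algorithm. The text does not name the algorithm, so its use here is an assumption, and the sketch below is a toy illustration: the three-state left-to-right model over the phonemes of "cat", every probability value, and the discrete observation symbols are made up. In the system described above the observation probabilities would come from the acoustic model rather than from a lookup table.

```python
import math


def safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s]: best log-probability of any state path ending in s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy left-to-right HMM (all numbers are invented for illustration).
STATES = ["K", "AE", "T"]
log_start = {"K": safe_log(1.0), "AE": safe_log(0.0), "T": safe_log(0.0)}
trans = {
    "K": {"K": 0.5, "AE": 0.5, "T": 0.0},
    "AE": {"K": 0.0, "AE": 0.5, "T": 0.5},
    "T": {"K": 0.0, "AE": 0.0, "T": 1.0},
}
emit = {
    "K": {"k": 0.8, "a": 0.1, "t": 0.1},
    "AE": {"k": 0.1, "a": 0.8, "t": 0.1},
    "T": {"k": 0.1, "a": 0.1, "t": 0.8},
}
log_trans = {p: {s: safe_log(v) for s, v in row.items()} for p, row in trans.items()}
log_emit = {s: {o: safe_log(v) for o, v in row.items()} for s, row in emit.items()}

path = viterbi(["k", "k", "a", "a", "t"], STATES, log_start, log_trans, log_emit)
print(path)  # ['K', 'K', 'AE', 'AE', 'T']
```

Collapsing repeated states in the returned path gives the phoneme sequence K AE T, which a pronunciation dictionary would then map to the word.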

Technical effect

Other existing neural network acceleration schemes take a neural network trained in 32-bit floating-point format and approximate each network weight using 2 or 4 bits, which reduces the model size and the amount of computation. In the present invention, the binary neural network is much faster than a conventional neural network both when training and when testing the model, which can save researchers a significant amount of time.

Compared with other neural network acceleration schemes, the binary neural network is smaller, so the hardware cache (the cache of the CPU or GPU cores) is used more efficiently and data is reloaded from memory less often, which saves power.

Compared with other neural network acceleration schemes, the binary neural network is small and fast, which makes it particularly suitable for mobile devices; it can be used to develop on-device mobile apps (application programs for mobile devices) related to speech recognition, and therefore has a wider range of application.

Brief description of the drawings

Fig. 1 is a schematic diagram of a neural network;

Fig. 2 is a schematic diagram of each hidden layer of the binary neural network of the present invention;

In the figure: Binarize denotes binarization, and HardTanh denotes the nonlinear transformation;

Fig. 3 is a structural diagram of the speech recognition system of the present invention.

Detailed description of the embodiments

As shown in Fig. 1, the binary neural network used in this embodiment comprises an input layer, hidden layers and an output layer connected in sequence, wherein each hidden layer applies nonlinear processing to its input vector and outputs the result.

As shown in Fig. 2, the nonlinear transformation HardTanh that may be used in the binary neural network structure described in the present invention can be replaced with other nonlinear functions such as Sign (the sign function). Different types of nonlinear transformation affect the neural network model differently, but in some cases they achieve similar results, so such a replacement is possible.

Because matrix multiplication takes up most of the computation time of a neural network, typically more than 80%, accelerating matrix multiplication is particularly important for accelerating the neural network computation as a whole.

When the argument of the function is a vector or a matrix, the transformation is applied to each element of the vector or matrix separately. With this structure, both the inputs and the outputs of the hidden layers of the binary neural network are binarized, so the matrix operation Wx (or W'x) can in effect be converted into bit operations, which makes it possible to take full advantage of the popcnt() instruction of Intel CPUs or the __popcll() instruction of NVIDIA GPUs to accelerate the computation.

To accelerate binarized matrix operations, the multiplication of one row by one column of a matrix can be replaced by the following efficient hardware instructions: a*b = popcnt(xnor(a, b)), where popcnt is a bit operation that counts the number of 1s in the binary representation of an integer. The xnor truth table is as follows: xnor(0, 0) = 1; xnor(0, 1) = 0; xnor(1, 0) = 0; xnor(1, 1) = 1.

It should be noted that binarization is not limited to converting the weights into the {-1, 1} representation; for convenience of implementation on a computer, the {0, 1} representation may also be used. When one row is multiplied by one column in a matrix operation, the results under the {-1, 1} and {0, 1} representations differ only by a constant, so the two representations are used interchangeably below.
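The relationship between the two representations can be written out explicitly. The identity below is a standard one and is not taken from the source; the hatted symbols are notation introduced here for the {0, 1} encoding of a {-1, 1} vector of length n.

```latex
% Assumed notation: a, b \in \{-1, 1\}^n, with bit encodings
% \hat{a}_i = (a_i + 1)/2 and \hat{b}_i = (b_i + 1)/2 in \{0, 1\}.
\begin{aligned}
a_i b_i &= 2\,\mathrm{xnor}(\hat{a}_i, \hat{b}_i) - 1, \\
a \cdot b \;=\; \sum_{i=1}^{n} a_i b_i &= 2\,\mathrm{popcnt}\bigl(\mathrm{xnor}(\hat{a}, \hat{b})\bigr) - n .
\end{aligned}
```

More precisely, then, the {-1, 1} inner product is recovered from the popcount by a fixed scaling and offset that depend only on the vector length n, so no per-element multiplication is needed.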

Taking a=(1,1,0,1) and b=(1,0,1,1) as an example, the conventional calculation is a*b = 1*1+1*0+0*1+1*1 = 2, which requires 4 multiplications and 3 additions, so at least 7 instructions are needed.

With a binarized network model, a and b can instead be regarded as the binary representations of two integers, so that bit operations can be applied to these two integers: c = xnor(a, b) = (1,0,0,1) and popcnt(c) = 2, which requires only 2 hardware instructions.
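The worked example can be reproduced in a few lines of Python. This is a sketch under stated assumptions: Python integers stand in for machine words, bin(...).count("1") stands in for the hardware popcnt instruction, and the helper names pack_bits and xnor_popcount are hypothetical.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into an integer, first element as the most significant bit."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def xnor_popcount(a_bits, b_bits):
    """Count the positions where two equal-length bit vectors agree."""
    n = len(a_bits)
    mask = (1 << n) - 1          # keep only the n meaningful bits after negation
    a, b = pack_bits(a_bits), pack_bits(b_bits)
    c = ~(a ^ b) & mask          # xnor = not xor, restricted to n bits
    return bin(c).count("1")     # software stand-in for popcnt


a = (1, 1, 0, 1)
b = (1, 0, 1, 1)
print(xnor_popcount(a, b))                 # 2
print(sum(x * y for x, y in zip(a, b)))    # 2, the conventional calculation
```

The mask matters because Python integers are unbounded: without it, the bitwise negation would yield a negative number rather than a 4-bit pattern.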

Considering that in a real model all weights are binary, 64 weights can be represented by a single 64-bit integer, so the speedup is even larger than in the a*b example.

Suppose there are the following 64 network weights:

1,0,1,0,1,0,1,1,0,1,0,1,0,1,0,0,1,0,1,0,1,0,0,1,1,0,0,0,1,1,0,0,1,1,1,0,1,0,1,1,0,0,0,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0. These can be regarded as the binary representation of one 64-bit integer; read with the first weight as the most significant bit, this integer is 12345678901234567168.
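Packing the 64 weights listed above into one integer can be sketched as follows. The bit order (first weight as the most significant bit) and the helper names pack_weights and unpack_weights are assumptions made for illustration; the text does not specify a bit order.

```python
# The 64 weights as listed above, eight per row.
WEIGHTS = [
    1, 0, 1, 0, 1, 0, 1, 1,
    0, 1, 0, 1, 0, 1, 0, 0,
    1, 0, 1, 0, 1, 0, 0, 1,
    1, 0, 0, 0, 1, 1, 0, 0,
    1, 1, 1, 0, 1, 0, 1, 1,
    0, 0, 0, 1, 1, 1, 1, 1,
    0, 0, 0, 0, 1, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
]


def pack_weights(bits):
    """Pack 0/1 weights into one integer, first weight as the most significant bit."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def unpack_weights(value, n=64):
    """Inverse of pack_weights: recover the n weights from the packed integer."""
    return [(value >> (n - 1 - i)) & 1 for i in range(n)]


packed = pack_weights(WEIGHTS)
print(hex(packed), packed)
print(unpack_weights(packed) == WEIGHTS)  # True
```

Stored as 32-bit floating-point numbers, these 64 weights would occupy 256 bytes; packed into one 64-bit integer they occupy 8 bytes, a 32-fold reduction.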

Matrix multiplication experiments on a GPU show that, when multiplying two matrices of size (8192, 8192), using the binary representation together with the GPU hardware instruction __popcll() is more than 3 times faster than using the 32-bit floating-point representation with NVIDIA's cuDNN acceleration library, and more than 20 times faster than the unoptimized matrix multiplication baseline.

Those skilled in the art may make local adjustments to the above specific implementation in different ways without departing from the principle and purpose of the present invention. The scope of protection of the present invention is defined by the claims and is not limited by the above specific implementation; every implementation within that scope is bound by the present invention.

Claims

1. A method for implementing a binary neural network acoustic model for speech recognition, characterized in that a binary neural network is used to model the observation probability distribution of a hidden Markov model, and training is performed using extracted speech features, thereby obtaining the acoustic model.

2. The method according to claim 1, characterized in that the binary neural network is a recurrent neural network, a convolutional neural network or a deep feedforward neural network, and specifically comprises an input layer, at least two hidden layers and an output layer connected in sequence, wherein each hidden layer applies nonlinear processing to its input vector and outputs the result; and the hidden layers binarize the input vector x and the network weights W.

3. The method according to claim 2, characterized in that the binarization means: the output is -1 when the input is not greater than zero, and 1 otherwise.

4. The method according to claim 2, characterized in that the nonlinear processing means: the output is -1 when the input is less than -1, the output is 1 when the input is greater than 1, and the output equals the input in all other cases.

5. The method according to claim 1, characterized in that multiple input vectors are merged into one matrix and processed together as the input, in which case the output of the binary neural network is in matrix form.

6. The method according to claim 1, characterized in that the features are obtained as follows: the audio is split into frames, that is, cut into multiple short segments in which adjacent segments overlap; a mathematical transform is then applied to these segments, so that each segment of speech becomes a feature; and the extracted features serve as the input of the speech recognition system.

7. The method according to claim 1, characterized in that the hidden Markov model comprises multiple states, a state transition probability distribution and an observation probability distribution modeled on the basis of a Gaussian mixture model, wherein each speech phoneme corresponds to one HMM, and an HMM comprises multiple states.

8. The method according to claim 1, characterized in that the training means: using the features extracted from the audio in advance, computing the parameters of the hidden Markov model that match the speech feature data and the text, namely the state transition probability distribution and the observation probability distribution.

9. A speech recognition system using the binary neural network acoustic model obtained by the method according to any one of the preceding claims, characterized by comprising an acquisition module, a feature extraction module, a training module and a recognition module, wherein: the acquisition module outputs training audio and the corresponding text to the feature extraction module in the offline process, and outputs raw audio to the feature extraction module in the online testing process; the feature extraction module outputs the feature files and corresponding labels of the training audio to the training module, and outputs the feature files of the raw audio to the recognition module; the training module trains the binary neural network acoustic model using the extracted features and labels and outputs the trained binary neural network acoustic model to the recognition module; and the recognition module uses this model to recognize the feature files of the raw audio.

10. The system according to claim 9, characterized in that the recognition means: the recognition module performs computation on the feature files of the raw audio using the hidden Markov model and obtains the hidden state sequence with the highest probability, from which the phoneme sequence and then the text corresponding to the audio are obtained.