CN107293290A - Method and apparatus for building a speech acoustic model - Google Patents
Method and apparatus for building a speech acoustic model
- Publication number
- CN107293290A (application CN201710640480.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- result
- audio signal
- spectrogram
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 27
- 230000005236 sound signal Effects 0.000 claims abstract description 40
- 238000000605 extraction Methods 0.000 claims abstract description 12
- 239000011159 matrix material Substances 0.000 claims description 24
- 238000001228 spectrum Methods 0.000 claims description 21
- 238000012545 processing Methods 0.000 claims description 15
- 238000013527 convolutional neural network Methods 0.000 description 13
- 238000010801 machine learning Methods 0.000 description 7
- 238000004458 analytical method Methods 0.000 description 6
- 238000012549 training Methods 0.000 description 6
- 230000008901 benefit Effects 0.000 description 5
- 238000005516 engineering process Methods 0.000 description 5
- 230000008859 change Effects 0.000 description 3
- 230000000694 effects Effects 0.000 description 3
- 238000012216 screening Methods 0.000 description 3
- 238000007405 data analysis Methods 0.000 description 2
- 230000009467 reduction Effects 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000013519 translation Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 1
- 210000004556 brain Anatomy 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000012937 correction Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000004880 explosion Methods 0.000 description 1
- 239000002360 explosive Substances 0.000 description 1
- 230000036039 immunity Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000010365 information processing Effects 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 210000002569 neuron Anatomy 0.000 description 1
- 238000011176 pooling Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method and apparatus for building a speech acoustic model. The method includes: obtaining an audio signal of speech data; performing feature extraction on the audio signal to obtain a spectrogram of the audio signal; performing image recognition on the spectrogram to obtain a recognition result; and building a speech acoustic model according to the recognition result and the actual sound data of the speech data.
Description
Technical field
The present invention relates to the field of information processing, and in particular to a method and apparatus for building a speech acoustic model.
Background technology
Machine learning has become one of the most popular data analysis methods in the information industry. It automates the building of analytical models: algorithms iterate over existing data, continuously optimizing themselves until an optimal model is formed, so that computers acquire a "brain" and can, without explicit programming, see into data hidden deep below the surface. Although a wide variety of machine learning algorithms have existed for a long time, the transition from the information-scarce past to today's era of data explosion means that data volume and data scale in every field are growing exponentially. This explosive growth in data scale brings enormous opportunity and potential for change: the completeness of such data can help every industry make better decisions, and it sets a good example for the turn toward data-driven scientific research. The combination of machine learning and big data has therefore become particularly important as we pursue ever faster computation and ever more accurate models.
Machine learning on big data greatly increases the sample size, so that the classification of many problems is supported by abundant samples; this is the advantage of big data. But the huge data volume also brings difficulties to machine learning: issues such as the relationships among data and the screening of valid data can greatly affect the accuracy of model training and the training time. Mining the rules hidden in data of huge volume and varied structure, together with the information we need, so that the data delivers its maximum value, is thus a core objective of big data technology.
It has been predicted that, over the next few years, searching for information on the Internet will increasingly rely on voice input rather than keyboard input. This marks the rise of machine learning for building speech acoustic models: it is precisely the introduction of deep learning and the help of big data that keep improving the accuracy and intelligence of speech acoustic models. How to build a speech acoustic model of high accuracy is an urgent problem to be solved.
Summary of the invention
To solve the above technical problem, the present invention provides a method for building a speech acoustic model, capable of producing a speech acoustic model of high accuracy.
To achieve the object of the invention, the present invention provides a method for building a speech acoustic model, including:
obtaining an audio signal of speech data;
performing feature extraction on the audio signal to obtain a spectrogram of the audio signal;
performing image recognition on the spectrogram to obtain a recognition result;
building a speech acoustic model according to the recognition result and the actual sound data of the speech data.
Wherein, performing image recognition on the spectrogram to obtain a recognition result includes:
processing the spectrogram successively with multiple convolutional layers of a deep convolutional network to obtain the recognition result.
Wherein, performing image recognition on the spectrogram to obtain a recognition result further includes:
after the convolutional-layer processing, processing the output of the convolutional layers with a pooling layer of the deep convolutional network to obtain the recognition result.
Wherein, before image recognition is performed on the spectrogram to obtain a recognition result, the method further includes: obtaining a weight matrix of the audio signal, where the weight matrix is determined according to the time at which the audio data of the audio signal occurs in the speech and its importance in the speech; and processing the spectrum data with the weight matrix.
Wherein, the method further includes: marking valid data in the audio data of the acoustic model.
The present invention also provides a device for building a speech acoustic model, including:
a signal acquisition module, configured to obtain an audio signal of speech data;
an extraction module, configured to perform feature extraction on the audio signal to obtain a spectrogram of the audio signal;
an identification module, configured to perform image recognition on the spectrogram to obtain a recognition result;
a determining module, configured to build a speech acoustic model according to the recognition result and the actual sound data of the speech data.
Wherein, the identification module is specifically configured to:
process the spectrogram successively with multiple convolutional layers of a deep convolutional network to obtain the recognition result.
Wherein, the identification module is further configured to:
after the convolutional-layer processing, process the output of the convolutional layers with a pooling layer of the deep convolutional network to obtain the recognition result.
Wherein, the device further includes:
a matrix acquisition module, configured to obtain the weight matrix of the audio signal before the convolutional-layer processing, where the weight matrix is determined according to the time at which the audio data of the audio signal occurs in the speech and its importance in the speech;
a processing module, configured to process the spectrum data with the weight matrix.
Wherein, the device further includes:
a marking module, configured to mark valid data in the audio data of the acoustic model.
In the embodiments provided by the present invention, the spectrum information of an audio signal is obtained and image recognition is performed on the spectrogram, so that the audio signal is processed as image data. This characterizes the acoustic information of the sound more accurately and improves the accuracy of the speech acoustic model.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Brief description of the drawings
The accompanying drawings provide a further understanding of the technical solution of the present invention and constitute a part of the specification. Together with the embodiments of the application, they serve to explain the technical solution of the present invention and do not limit it.
Fig. 1 is a flow chart of the method for building a speech acoustic model provided by the present invention;
Fig. 2 is a schematic flow diagram of building a speech acoustic model provided by the present invention;
Fig. 3 is a schematic flow diagram of the deep convolutional neural network provided by the present invention processing an audio spectrogram image;
Fig. 4 is a structural diagram of the device for building a speech acoustic model provided by the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with one another.
The steps illustrated in the flow charts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one given here.
Fig. 1 is a flow chart of the method for building a speech acoustic model provided by the present invention. The method shown in Fig. 1 includes:
Step 101: obtain an audio signal of speech data;
Step 102: perform feature extraction on the audio signal to obtain a spectrogram of the audio signal;
Step 103: perform image recognition on the spectrogram to obtain a recognition result;
Step 104: build a speech acoustic model according to the recognition result and the actual sound data of the speech data.
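For illustration only, the four steps can be sketched end to end as follows. This is a minimal sketch under assumed conditions (16-bit PCM input, a log-magnitude STFT spectrogram, and placeholder stand-ins for the recognition and model-building steps); none of the function names, parameter values, or placeholder computations are specified by the patent.

```python
import numpy as np

def acquire_audio_signal(raw: bytes) -> np.ndarray:
    # Step 101: decode 16-bit PCM speech data into a normalized waveform.
    return np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0

def extract_spectrogram(signal: np.ndarray, n_fft: int = 512, hop: int = 160) -> np.ndarray:
    # Step 102: windowed framing + FFT -> log-magnitude spectrogram.
    frames = np.stack([signal[i:i + n_fft] * np.hanning(n_fft)
                       for i in range(0, len(signal) - n_fft, hop)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-6).T  # (freq, time)

def recognize_spectrogram(spec: np.ndarray) -> np.ndarray:
    # Step 103: image recognition on the spectrogram; this placeholder
    # stands in for the deep CNN sketched later in this description.
    return spec.mean(axis=0)

def build_acoustic_model(result: np.ndarray, actual_sound: np.ndarray) -> float:
    # Step 104: compare the recognition result against the actual sound data;
    # a real implementation would update model parameters here.
    n = min(len(result), len(actual_sound))
    return float(np.mean((result[:n] - actual_sound[:n]) ** 2))
```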
In the method embodiment provided by the present invention, the spectrum information of an audio signal is obtained and image recognition is performed on the spectrogram, so that the audio signal is processed as image data. This characterizes the acoustic information of the sound more accurately and improves the accuracy of the speech acoustic model.
The method embodiment provided by the present invention is described further below.
The present invention processes the spectrogram successively with multiple convolutional layers of a deep convolutional network (Deep Convolutional Neural Network, deep CNN) to obtain the recognition result.
By applying the deep convolutional neural network algorithm to the building of a speech acoustic model and treating the spectrum of the speech signal as an image, the invariance of convolution overcomes the inherent diversity of speech signals, and the accuracy of the speech acoustic model can be substantially improved.
Wherein, performing image recognition on the spectrogram to obtain a recognition result further includes:
after the convolutional-layer processing, processing the output of the convolutional layers with a pooling layer of the deep convolutional network to obtain the recognition result.
Using a pooling layer after the convolutional layers reduces the convolution kernel size, so that a deeper convolutional neural network model with better performance can be trained, which in turn raises recognition accuracy.
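A minimal sketch of such a conv-plus-pooling stack is shown below, assuming PyTorch and arbitrary layer sizes; the patent specifies neither a framework nor a concrete architecture, so every dimension here is an illustrative assumption.

```python
import torch
import torch.nn as nn

class DeepCNNAcousticNet(nn.Module):
    """Stacked small-kernel conv layers with pooling, applied to a
    spectrogram treated as a one-channel image."""
    def __init__(self, n_classes: int = 100):
        super().__init__()
        self.features = nn.Sequential(
            # Multiple convolutional layers process the spectrogram in turn.
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            # Pooling after the conv block shrinks the feature map, so small
            # kernels suffice and a deeper network becomes trainable.
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, freq_bins, time_frames)
        x = self.features(spec)
        x = x.mean(dim=(2, 3))      # global average pooling over freq and time
        return self.classifier(x)   # per-utterance recognition scores
```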
In practical applications, points at different times and frequencies may differ in importance; for example, the frame at the current time may be somewhat more important than the frames before and after it. A weight matrix is therefore introduced: before each layer performs its convolution operation, its input is first multiplied element-wise by this matrix, which amounts to weighting by importance. The initialization value of each weight is 1.
Specifically, before image recognition is performed on the spectrogram to obtain a recognition result, the method further includes: obtaining the weight matrix of the audio signal, where the weight matrix is determined according to the time at which the audio data of the audio signal occurs in the speech and its importance in the speech; and processing the spectrum data with the weight matrix.
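A sketch of this element-wise weighting, again assuming PyTorch. Treating the matrix as a learnable parameter initialized to 1 is one plausible reading of the description, not something the patent mandates.

```python
import torch
import torch.nn as nn

class WeightedConv2d(nn.Module):
    """Element-wise weight matrix applied to the input before convolution.

    Every weight starts at 1 (i.e. no re-weighting) and can then be adjusted
    so that time-frequency points of higher importance - e.g. the current
    frame versus its neighbours - contribute more.
    """
    def __init__(self, in_ch: int, out_ch: int, freq_bins: int, time_frames: int):
        super().__init__()
        # Initialization value of every weight is 1, as in the description.
        self.weight_matrix = nn.Parameter(torch.ones(freq_bins, time_frames))
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq_bins, time_frames); broadcast multiply,
        # then convolve the importance-weighted input.
        return self.conv(x * self.weight_matrix)
```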
At the big data level, valid data screened out by big data analysis (with or without labels) is used for supervised or unsupervised training of the model, calibrating the model and raising its precision, that is, raising speech recognition accuracy.
In the method embodiment provided by the present invention, deep convolutional neural network technology is applied to acoustic modeling for speech recognition, significantly raising recognition accuracy. Drawing on recent achievements in image recognition and exploiting what speech and images have in common for CNN model training, the error rate is reduced by a relative 10% compared with the industry's existing combination of convolutional neural networks and deep neural networks. Because many algorithms based on small data sets may fail on big data, big data technology must also be used to tune the parameters of the model and calibrate it.
The method embodiment provided by the present invention is described further below.
Fig. 2 is a schematic flow diagram of building a speech acoustic model provided by the present invention. The flow shown in Fig. 2 includes:
Signal processing and feature extraction of the input signal: noise reduction and channel-distortion processing are applied to the original audio signal, the signal is transformed from the time domain to the frequency domain, and feature vectors are extracted for the acoustic model that follows.
The core formula in building a speech acoustic model is as follows; the essence is to find the word sequence W that makes both P(W) and P(X | W) large. P(W) represents the language model built within the speech acoustic model, i.e. how plausible the string of words or characters is in itself; P(X | W) represents the acoustic model built within the speech acoustic model, i.e. how likely those words are to produce this segment of sound. Making both values as large as possible is the core task in raising the accuracy of a speech acoustic model: the decoding search combines the acoustic model score and the language model score and takes the word sequence with the highest overall score as the recognition result.
Building a speech acoustic model means modeling the relationship between the speech signal and the text content. In the usual case, the model is built on the speech spectrum obtained from time-frequency analysis, and the speech time-frequency spectrum has structural features. Improving the accuracy of the speech acoustic model therefore requires overcoming the wide diversity of speech signals themselves (regional dialects, different languages, liaison, voice changes, and so on) and the diversity of the environment (such as noise interference).
Fig. 3 is a schematic flow diagram of the deep convolutional neural network provided by the present invention processing an audio spectrogram image. A specific implementation is as follows.
A convolutional neural network is used because its local connectivity (each neuron does not in fact need to perceive the whole image; it only needs to perceive a local region, and the local information is then aggregated at higher layers to yield the global information) and weight sharing give it good translation invariance. When the idea of the convolutional neural network is applied to the acoustic modeling of a speech acoustic model, the invariance of convolution can overcome much of the inherent diversity of speech signals; at the same time, adding a pooling layer after each convolutional layer and reducing the convolution kernel size lets us train deeper CNN models with better performance. From this point of view, the time-frequency spectrum obtained by analyzing the whole speech signal can be treated like an image and recognized with the deep convolutional networks widely used in image recognition. Meanwhile, at the level of model structure, the deep CNN gives the model good translation invariance in the time domain and therefore better noise immunity.
For the spectrum input to the convolutional layers, points at different times and frequencies may differ in importance (the frame at the current time may be somewhat more important than the frames before and after it), so a weight matrix is introduced, with every weight initialized to 1: before each layer performs its convolution operation, its input is first multiplied element-wise by this matrix, which amounts to weighting by importance.
At the same time, big data's annotation and analysis of the corpus, and its calibration and training of the model, help raise the accuracy of the speech acoustic model built with the deep CNN technique.
As can be seen from the above, through the introduction of the deep CNN algorithm and big data analysis of the training corpus, the deep CNN algorithms widely used in image recognition are applied to the building of a speech acoustic model, the spectrum of the speech signal is treated as an image, and the invariance of convolution overcomes the inherent diversity of speech signals, so that the accuracy of the speech acoustic model can be substantially improved.
The accuracy of the speech acoustic model is raised in two ways. The first is algorithmic: the convolutional neural network technology usually applied to image recognition is applied to building a speech acoustic model, and the spectrum obtained by analyzing the whole speech signal is processed like an image, which can greatly improve the accuracy of the model. The second is at the big data level: valid data screened out by big data analysis (with or without labels) is used for supervised or unsupervised training of the model, calibrating the model and raising its precision, that is, raising the accuracy of the speech acoustic model.
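As a sketch of this data-level prong (the screening flag, the optimizer choice, and the shape of the corpus below are all assumptions for illustration; the patent prescribes none of them):

```python
import torch
import torch.nn as nn

def train_on_screened_data(model: nn.Module, corpus, epochs: int = 5):
    """Supervised training on valid data screened from a big-data corpus.

    `corpus` is assumed to be an iterable of (spectrogram, label, is_valid)
    triples; the `is_valid` flag stands in for whatever big-data analysis
    marks an utterance as valid.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    valid = [(x, y) for x, y, ok in corpus if ok]   # keep screened-valid data
    for _ in range(epochs):
        for spec, label in valid:
            optimizer.zero_grad()
            logits = model(spec.unsqueeze(0))       # spec: (1, freq, time)
            loss = loss_fn(logits, torch.tensor([label]))
            loss.backward()
            optimizer.step()                        # calibrate the model
    return model
```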
Fig. 4 is a structural diagram of the device for building a speech acoustic model provided by the present invention. The device shown in Fig. 4 includes:
a signal acquisition module 401, configured to obtain an audio signal of speech data;
an extraction module 402, configured to perform feature extraction on the audio signal to obtain a spectrogram of the audio signal;
an identification module 403, configured to perform image recognition on the spectrogram to obtain a recognition result;
a building module 404, configured to build a speech acoustic model according to the recognition result and the actual sound data of the speech data.
Wherein, the identification module 403 is specifically configured to:
process the spectrogram successively with multiple convolutional layers of a deep convolutional network to obtain the recognition result.
Wherein, the identification module 403 is further configured to, after the convolutional-layer processing, process the output of the convolutional layers with a pooling layer of the deep convolutional network to obtain the recognition result.
Optionally, the device further includes:
a matrix acquisition module, configured to obtain the weight matrix of the audio signal before the convolutional-layer processing, where the weight matrix is determined according to the time at which the audio data of the audio signal occurs in the speech and its importance in the speech;
a processing module, configured to process the spectrum data with the weight matrix.
Optionally, the device further includes:
a marking module, configured to mark valid data in the audio data of the acoustic model.
In the device embodiment provided by the present invention, the spectrum information of an audio signal is obtained and image recognition is performed on the spectrogram, so that the audio signal is processed as image data. This characterizes the acoustic information of the sound more accurately and improves the accuracy of the speech acoustic model.
Although embodiments are disclosed above, the content described is only an implementation adopted to aid understanding of the present invention and is not intended to limit it. Any person skilled in the art to which the present invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the present invention; however, the scope of patent protection of the present invention shall still be subject to the scope defined by the appended claims.
Claims (10)
1. A method for building a speech acoustic model, characterized by including:
obtaining an audio signal of speech data;
performing feature extraction on the audio signal to obtain a spectrogram of the audio signal;
performing image recognition on the spectrogram to obtain a recognition result;
building a speech acoustic model according to the recognition result and the actual sound data of the speech data.
2. The method according to claim 1, characterized in that performing image recognition on the spectrogram to obtain a recognition result includes:
processing the spectrogram successively with multiple convolutional layers of a deep convolutional network to obtain the recognition result.
3. The method according to claim 2, characterized in that performing image recognition on the spectrogram to obtain a recognition result further includes:
after the convolutional-layer processing, processing the output of the convolutional layers with a pooling layer of the deep convolutional network to obtain the recognition result.
4. The method according to claim 2 or 3, characterized in that, before image recognition is performed on the spectrogram to obtain a recognition result, the method further includes:
obtaining a weight matrix of the audio signal, where the weight matrix is determined according to the time at which the audio data of the audio signal occurs in the speech and its importance in the speech;
processing the spectrum data with the weight matrix.
5. The method according to claim 4, characterized in that the method further includes:
marking valid data in the audio data of the acoustic model.
6. A device for building a speech acoustic model, characterized by including:
a signal acquisition module, configured to obtain an audio signal of speech data;
an extraction module, configured to perform feature extraction on the audio signal to obtain a spectrogram of the audio signal;
an identification module, configured to perform image recognition on the spectrogram to obtain a recognition result;
a determining module, configured to build a speech acoustic model according to the recognition result and the actual sound data of the speech data.
7. The device according to claim 6, characterized in that the identification module is specifically configured to:
process the spectrogram successively with multiple convolutional layers of a deep convolutional network to obtain the recognition result.
8. The device according to claim 7, characterized in that the identification module is further configured to:
after the convolutional-layer processing, process the output of the convolutional layers with a pooling layer of the deep convolutional network to obtain the recognition result.
9. The device according to claim 7 or 8, characterized in that the device further includes:
a matrix acquisition module, configured to obtain the weight matrix of the audio signal before the convolutional-layer processing, where the weight matrix is determined according to the time at which the audio data of the audio signal occurs in the speech and its importance in the speech;
a processing module, configured to process the spectrum data with the weight matrix.
10. The device according to claim 9, characterized in that the device further includes:
a marking module, configured to mark valid data in the audio data of the acoustic model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710640480.1A CN107293290A (en) | 2017-07-31 | 2017-07-31 | Method and apparatus for building a speech acoustic model
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710640480.1A CN107293290A (en) | 2017-07-31 | 2017-07-31 | Method and apparatus for building a speech acoustic model
Publications (1)
Publication Number | Publication Date |
---|---|
CN107293290A true CN107293290A (en) | 2017-10-24 |
Family
ID=60103935
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710640480.1A Pending CN107293290A (en) | 2017-07-31 | 2017-07-31 | The method and apparatus for setting up Speech acoustics model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107293290A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108877783A (en) * | 2018-07-05 | 2018-11-23 | 腾讯音乐娱乐科技(深圳)有限公司 | The method and apparatus for determining the audio types of audio data |
CN111048071A (en) * | 2019-11-11 | 2020-04-21 | 北京海益同展信息科技有限公司 | Voice data processing method and device, computer equipment and storage medium |
CN111768799A (en) * | 2019-03-14 | 2020-10-13 | 富泰华工业(深圳)有限公司 | Voice recognition method, voice recognition apparatus, computer apparatus, and storage medium |
CN112116926A (en) * | 2019-06-19 | 2020-12-22 | 北京猎户星空科技有限公司 | Audio data processing method and device and model training method and device |
CN112363114A (en) * | 2021-01-14 | 2021-02-12 | 杭州兆华电子有限公司 | Public place acoustic event positioning method and system based on distributed noise sensor |
CN113112969A (en) * | 2021-03-23 | 2021-07-13 | 平安科技(深圳)有限公司 | Buddhism music score recording method, device, equipment and medium based on neural network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140288928A1 (en) * | 2013-03-25 | 2014-09-25 | Gerald Bradley PENN | System and method for applying a convolutional neural network to speech recognition |
CN104616664A (en) * | 2015-02-02 | 2015-05-13 | 合肥工业大学 | Method for recognizing audio based on spectrogram significance test |
CN106128465A (en) * | 2016-06-23 | 2016-11-16 | 成都启英泰伦科技有限公司 | A kind of Voiceprint Recognition System and method |
CN106782501A (en) * | 2016-12-28 | 2017-05-31 | 百度在线网络技术(北京)有限公司 | Speech feature extraction method and device based on artificial intelligence |
CN106782602A (en) * | 2016-12-01 | 2017-05-31 | 南京邮电大学 | Speech emotion recognition method based on long short-term memory networks and convolutional neural networks |
- 2017-07-31: application CN201710640480.1A filed in CN; published as CN107293290A (en); status: pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140288928A1 (en) * | 2013-03-25 | 2014-09-25 | Gerald Bradley PENN | System and method for applying a convolutional neural network to speech recognition |
CN104616664A (en) * | 2015-02-02 | 2015-05-13 | 合肥工业大学 | Method for recognizing audio based on spectrogram significance test |
CN106128465A (en) * | 2016-06-23 | 2016-11-16 | 成都启英泰伦科技有限公司 | A kind of Voiceprint Recognition System and method |
CN106782602A (en) * | 2016-12-01 | 2017-05-31 | 南京邮电大学 | Speech emotion recognition method based on long short-term memory networks and convolutional neural networks |
CN106782501A (en) * | 2016-12-28 | 2017-05-31 | 百度在线网络技术(北京)有限公司 | Speech feature extraction method and device based on artificial intelligence |
Non-Patent Citations (1)
Title |
---|
XUMCAS: "神经网络—CNN结构和语音识别应用" (Neural networks: CNN structure and application to speech recognition), https://blog.csdn.net/xmdxcsj/article/details/54695995 *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108877783A (en) * | 2018-07-05 | 2018-11-23 | 腾讯音乐娱乐科技(深圳)有限公司 | The method and apparatus for determining the audio types of audio data |
CN111768799A (en) * | 2019-03-14 | 2020-10-13 | 富泰华工业(深圳)有限公司 | Voice recognition method, voice recognition apparatus, computer apparatus, and storage medium |
CN112116926A (en) * | 2019-06-19 | 2020-12-22 | 北京猎户星空科技有限公司 | Audio data processing method and device and model training method and device |
CN111048071A (en) * | 2019-11-11 | 2020-04-21 | 北京海益同展信息科技有限公司 | Voice data processing method and device, computer equipment and storage medium |
CN112363114A (en) * | 2021-01-14 | 2021-02-12 | 杭州兆华电子有限公司 | Public place acoustic event positioning method and system based on distributed noise sensor |
CN113112969A (en) * | 2021-03-23 | 2021-07-13 | 平安科技(深圳)有限公司 | Buddhism music score recording method, device, equipment and medium based on neural network |
CN113112969B (en) * | 2021-03-23 | 2024-04-05 | 平安科技(深圳)有限公司 | Buddhism music notation method, device, equipment and medium based on neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107293290A (en) | Method and apparatus for building a speech acoustic model | |
CN104756182B (en) | Auditory attention clue is combined to detect for phone/vowel/syllable boundaries with phoneme posteriority score | |
CN105741832B (en) | Spoken language evaluation method and system based on deep learning | |
CN112466326B (en) | Voice emotion feature extraction method based on transducer model encoder | |
CN109065032B (en) | External corpus speech recognition method based on deep convolutional neural network | |
CN108986798B (en) | Processing method, device and the equipment of voice data | |
CN107818164A (en) | A kind of intelligent answer method and its system | |
CN107633842A (en) | Audio recognition method, device, computer equipment and storage medium | |
CN112487949B (en) | Learner behavior recognition method based on multi-mode data fusion | |
US12067989B2 (en) | Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments | |
CN112990296A (en) | Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation | |
CN110853680A (en) | double-BiLSTM structure with multi-input multi-fusion strategy for speech emotion recognition | |
CN106935239A (en) | The construction method and device of a kind of pronunciation dictionary | |
CN106297792A (en) | The recognition methods of a kind of voice mouth shape cartoon and device | |
CN106328123B (en) | Method for recognizing middle ear voice in normal voice stream under condition of small database | |
CN110825850B (en) | Natural language theme classification method and device | |
CN108962229A (en) | A kind of target speaker's voice extraction method based on single channel, unsupervised formula | |
Mao et al. | Unsupervised discovery of an extended phoneme set in l2 english speech for mispronunciation detection and diagnosis | |
Sunny et al. | Recognition of speech signals: an experimental comparison of linear predictive coding and discrete wavelet transforms | |
CN116244474A (en) | Learner learning state acquisition method based on multi-mode emotion feature fusion | |
CN114898779A (en) | Multi-mode fused speech emotion recognition method and system | |
CN110348482A (en) | A kind of speech emotion recognition system based on depth model integrated architecture | |
CN116860943A (en) | Multi-round dialogue method and system for dialogue style perception and theme guidance | |
Zhao et al. | Enhancing audio perception in augmented reality: a dynamic vocal information processing framework | |
Anindya et al. | Development of Indonesian speech recognition with deep neural network for robotic command |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20171024 |