CN106653007B

CN106653007B - Speech recognition system

Info

Publication number: CN106653007B
Application number: CN201611101551.2A
Authority: CN
Inventors: 沈小正; 张光宇; 朱孟旭; 代大明; 肖佳林
Original assignee: Suzhou Qdreamer Network Science And Technology Co Ltd
Current assignee: Suzhou Qdreamer Network Science And Technology Co Ltd
Priority date: 2016-12-05
Filing date: 2016-12-05
Publication date: 2019-07-16
Anticipated expiration: 2036-12-05
Also published as: CN106653007A

Abstract

The present invention relates to a speech recognition system composed of a base recognizer built on an acoustic-model-to-pinyin mapping network, any number of domain-specific recognizers built on pinyin-to-word mapping networks for different application domains, and a comprehensive decision unit. Speech is first mapped by the base recognizer to a network organized from multiple candidate pinyin sequences. This pinyin network is then combined with the domain-specific recognizer for a particular application target, and a search for the optimal path on the combined network yields the final recognition result. Under this framework, the pinyin network can be combined with the separate pinyin-to-word domain-specific recognizers of multiple application domains, and the best recognition result is finally selected according to the acoustic and language model scores and other application-related super-rules.

Description

A speech recognition system

Technical field

The present invention relates to the technical field of speech recognition, and more particularly to a speech recognition system capable of online domain extension.

Background art

Chinese is not an alphabetic language: without contextual information, it is difficult to infer the corresponding Chinese characters directly from the sound. Traditional speech recognition decodes with a pre-generated static decoding network, which usually maps phonemes directly to words. This decoding network incorporates the probability distribution of the words in the audio content to be recognized. As a result, when the recognizer is switched from one domain to another, its performance drops sharply, and the terms and new words of the other domain may never be recognized correctly. To support recognition in multiple domains, the word probability distributions of those domains are usually modeled together in a single model. This makes the model's probability distribution relatively flat (which means that recognition performance is generally only average as well) and makes the model larger. To support the recognition of new words or terms, the model must be retrained and the recognizer rebuilt, which consumes a great deal of time and resources.

In view of the above shortcomings, the designers have actively pursued research and innovation in order to create a speech recognition system capable of online domain extension, giving it greater value for industrial use.

Summary of the invention

To solve the above technical problems, the object of the present invention is to provide a speech recognition system that can be extended to new domains online, so that recognition performance in a specific domain can be improved quickly.

The speech recognition system of the present invention comprises:

a base recognizer based on an acoustic-model-to-pinyin mapping network, for mapping speech to a network organized from multiple candidate pinyin sequences;

multiple parallel domain-specific recognizers, each based on a pinyin-to-word mapping network for a different application domain, each for being combined with the network organized from multiple candidate pinyin sequences, so as to obtain multiple best word sequences and confidence levels;

a comprehensive decision unit, for receiving the multiple best word sequences and confidence levels, making a decision according to the confidence levels together with previously given prior knowledge, rules, and additional knowledge, and selecting the optimal word sequence for output.

Further, the recognition content of an existing domain is updated by adjusting the pinyin-to-word mapping network, that is, by adding new recognition content to the domain-specific recognizer based on the pinyin-to-word mapping network of that existing domain; the recognition content of a new application domain is created by constructing the corresponding domain-specific recognizer based on a pinyin-to-word mapping network offline and then adding the extension content online to the domain-specific recognizers based on pinyin-to-word mapping networks.

Further, the base recognizer based on the acoustic-model-to-pinyin mapping network dynamically computes acoustic scores from the input audio features, stores the language model scores of pinyin sequences on its network, and uses a dynamic programming algorithm to combine the acoustic scores and language model scores, searching for and outputting the several highest-scoring pinyin sequences.

Further, the language model of the pinyin sequences is modeled with a recurrent neural network based on long short-term memory units.

Further, the comprehensive decision unit selects the best candidate word sequence by fusing the recognition confidence, the prior knowledge, the preset rules, and the additional information.

Further, the prior knowledge includes at least domain identification information input from outside the speech recognition system, or domain indication information obtained from the history of recognition results.

Further, the domain indication information is a discrete 0/1 setting or a continuous probability value.

Further, the preset rules include at least a word-count range estimated from the audio length.

Further, the additional information includes a measure, obtained from a super language model, of the degree to which the recognized word string conforms to grammatical norms.

Further, the comprehensive decision unit uses the additional information and the preset rules, applied through hierarchical computation, together with the confidence score as the decision criteria for selecting a candidate word sequence to output as the final recognition result.

With the above scheme, the present invention can dynamically add domain-specific recognizers based on pinyin-to-word mapping networks for different domains to the recognition system online, quickly improving recognition performance in a specific domain; it allows domains to be extended, hot words and new words to be added, and domain recognition content to be customized quickly; and it supports recognition in multiple domains simultaneously while ensuring that recognition performance does not degrade.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.

Brief description of the drawings

Fig. 1 is a block diagram of the speech recognition system of the present invention.

Detailed description of the embodiments

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples illustrate the invention and are not intended to limit its scope.

Referring to Fig. 1, a speech recognition system according to a preferred embodiment of the present invention is composed of a base recognizer based on an acoustic-model-to-pinyin mapping network, any number of domain-specific recognizers based on pinyin-to-word mapping networks for different application domains, and a comprehensive decision unit. The base recognizer maps speech to a network organized from multiple candidate pinyin sequences. Each domain-specific recognizer is combined with that network of candidate pinyin sequences to obtain a best word sequence and a confidence level, so that multiple best word sequences and confidence levels are produced. The comprehensive decision unit receives the multiple best word sequences and confidence levels, makes a decision according to the confidence levels together with previously given prior knowledge, rules, and additional knowledge, and selects the optimal word sequence for output.

The domain-specific recognizers based on pinyin-to-word mapping networks for different domains can be added to the recognition system dynamically and online, so that recognition performance in a specific domain can be improved quickly. In the present invention, the domain-specific recognizers are arranged in parallel and can be extended quickly. Specifically, the recognition content of an existing domain is updated by adjusting the pinyin-to-word mapping network, adding new recognition content to the domain-specific recognizer of that existing domain; the recognition content of a new application domain is created by constructing the corresponding domain-specific recognizer offline and then adding the extension content online to the domain-specific recognizers based on pinyin-to-word mapping networks. In practice, updating the recognition content of an existing domain, for example adding new words or hot words, only requires adjusting the pinyin-to-word mapping network and involves no change to the acoustic model or the base recognizer. Adding recognition content for a new application domain, such as home control or in-vehicle navigation, only requires constructing the corresponding pinyin-to-word mapping network offline; it can then be added to the recognition system online without affecting the recognition process of the existing domains.
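The parallel arrangement described above can be pictured as a registry of independent recognizers. The following is a minimal sketch only, not the patented implementation: the names `RecognizerRegistry`, `add_domain`, `recognize_all`, and `make_recognizer` are hypothetical, the pinyin network is simplified to a plain list of scored candidate sequences, and each domain's pinyin-to-word mapping is reduced to a whole-sequence dictionary lookup.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: the pinyin network is simplified to a list of
# (pinyin_sequence, score) candidates, with higher scores being better.
PinyinCandidates = List[Tuple[Tuple[str, ...], float]]
# A domain recognizer maps candidates to (best word sequence, confidence).
DomainRecognizer = Callable[[PinyinCandidates], Tuple[List[str], float]]


def make_recognizer(lexicon: Dict[Tuple[str, ...], List[str]]) -> DomainRecognizer:
    """Build a toy recognizer from a hypothetical whole-sequence lexicon."""

    def recognize(candidates: PinyinCandidates) -> Tuple[List[str], float]:
        best_words: List[str] = []
        best_conf = float("-inf")
        for pinyin, score in candidates:
            # Keep the highest-scoring candidate that this domain can map.
            if pinyin in lexicon and score > best_conf:
                best_words, best_conf = lexicon[pinyin], score
        return best_words, best_conf

    return recognize


class RecognizerRegistry:
    """Hypothetical container for parallel domain-specific recognizers."""

    def __init__(self) -> None:
        self._recognizers: Dict[str, DomainRecognizer] = {}

    def add_domain(self, name: str, recognizer: DomainRecognizer) -> None:
        # Adding or replacing one domain leaves every other domain untouched.
        self._recognizers[name] = recognizer

    def recognize_all(
        self, candidates: PinyinCandidates
    ) -> Dict[str, Tuple[List[str], float]]:
        # Every registered domain sees the same pinyin candidates.
        return {name: rec(candidates) for name, rec in self._recognizers.items()}
```

In this sketch, registering a "car" domain after a "home" domain changes nothing in the "home" recognizer or in the candidates it receives, which mirrors the claim that new domains do not affect existing ones.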

In the present invention, the base recognizer based on the acoustic-model-to-pinyin mapping network dynamically computes acoustic scores from the input audio features, stores the language model scores of pinyin sequences on its network, and uses a dynamic programming algorithm to combine the acoustic scores and language model scores, searching for and outputting the several highest-scoring pinyin sequences. The language model of the pinyin sequences is modeled with a recurrent neural network based on long short-term memory units.

Each of the above networks is embodied in the system as a weighted finite-state transducer (WFST). Such a transducer maps an input sequence to another sequence. In the base recognizer based on the acoustic-model-to-pinyin mapping network, the language model scores of pinyin sequences are stored on the network. During decoding, acoustic scores are computed dynamically from the input audio features, a dynamic programming algorithm combines the acoustic scores and language model scores on this WFST network, and the several highest-scoring pinyin sequences are output as the multi-candidate result.
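To make the score combination concrete, here is a simplified beam search that stands in for the WFST dynamic-programming search described above. It is a sketch under strong assumptions, none of which come from the patent: the audio is taken to be already split into segments with one syllable per segment, `acoustic` holds a log-score per candidate syllable for each segment, `lm` is a toy bigram table of log-probabilities, and `lm_weight`, `beam`, and `unseen` are hypothetical parameters.

```python
import heapq
from typing import Dict, List, Tuple


def nbest_pinyin(
    acoustic: List[Dict[str, float]],
    lm: Dict[Tuple[str, str], float],
    n: int = 2,
    beam: int = 8,
    lm_weight: float = 1.0,
    unseen: float = -10.0,
) -> List[Tuple[Tuple[str, ...], float]]:
    """Return the n highest-scoring pinyin sequences with their scores."""
    # Each hypothesis is (accumulated log-score, syllables so far).
    hyps: List[Tuple[float, Tuple[str, ...]]] = [(0.0, ("<s>",))]
    for segment in acoustic:
        expanded = []
        for score, seq in hyps:
            for syllable, acoustic_score in segment.items():
                # Language model score of the new syllable given the previous one.
                lm_score = lm.get((seq[-1], syllable), unseen)
                total = score + acoustic_score + lm_weight * lm_score
                expanded.append((total, seq + (syllable,)))
        # Prune to the best `beam` partial hypotheses before the next segment.
        hyps = heapq.nlargest(beam, expanded)
    # Drop the start symbol and keep the n best complete sequences.
    return [(seq[1:], score) for score, seq in heapq.nlargest(n, hyps)]
```

With a beam at least as large as the number of partial hypotheses, the search is exhaustive; a smaller beam trades exactness for speed. A real decoder would operate on frame-level acoustic scores over a WFST rather than on pre-segmented syllables.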

In a specific implementation, the pinyin language model can be modeled with a recurrent neural network (RNN) based on long short-term memory (LSTM) units. This strengthens the association between pinyin and its context and improves the accuracy of the multi-candidate pinyin recognition results.

In the present invention, the input of a domain-specific recognizer based on a pinyin-to-word mapping network is the network representing the multi-candidate pinyin sequences together with the pinyin-to-word mapping network, and its output is the best word sequence and its confidence indicator. The network of multi-candidate pinyin sequences can be represented as a WFST that maps pinyin to pinyin, and the pinyin-to-word mapping network is likewise represented as a WFST whose path weights are the cost of mapping a pinyin sequence to a word sequence. Recognition first composes the two WFSTs into a new WFST, then searches that WFST for the highest-scoring sequence and outputs its word sequence and score.
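The compose-then-search step can be approximated without a WFST library. The sketch below is not WFST composition itself: it assumes the pinyin network has been flattened to a list of candidate sequences with costs (lower is better), and that the pinyin-to-word mapping is a hypothetical dictionary `Lexicon` from a pinyin tuple to words with mapping costs. For each candidate it finds the cheapest segmentation into lexicon words by dynamic programming, then picks the candidate with the lowest combined cost, which plays the role of the best path in the composed network.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lexicon: pinyin syllables of one word -> [(word, mapping cost)].
Lexicon = Dict[Tuple[str, ...], List[Tuple[str, float]]]


def best_words_for_sequence(
    pinyin: Tuple[str, ...], lexicon: Lexicon, max_len: int = 4
) -> Optional[Tuple[float, List[str]]]:
    """Cheapest segmentation of one pinyin sequence into lexicon words."""
    n = len(pinyin)
    # best[i] holds (cost, words) for the prefix pinyin[:i], or None if unreachable.
    best: List[Optional[Tuple[float, List[str]]]] = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prefix = best[start]
            if prefix is None:
                continue
            for word, cost in lexicon.get(tuple(pinyin[start:end]), []):
                total = prefix[0] + cost
                current = best[end]
                if current is None or total < current[0]:
                    best[end] = (total, prefix[1] + [word])
    return best[n]


def decode(
    candidates: List[Tuple[Tuple[str, ...], float]], lexicon: Lexicon
) -> Optional[Tuple[List[str], float]]:
    """Pick the word sequence with the lowest pinyin cost plus mapping cost."""
    best: Optional[Tuple[List[str], float]] = None
    for pinyin, pinyin_cost in candidates:
        result = best_words_for_sequence(pinyin, lexicon)
        if result is None:
            continue  # this candidate cannot be mapped in this domain
        total = pinyin_cost + result[0]
        if best is None or total < best[1]:
            best = (result[1], total)
    return best
```

A production system would more likely compose real transducers with a WFST toolkit and run a shortest-path search; the dictionary form here only illustrates how pinyin cost and mapping cost add along a path.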

In the present invention, the comprehensive decision unit receives the outputs of the multiple domain-specific recognizers based on pinyin-to-word mapping networks, namely the word sequences and their confidence levels, then makes a decision according to the confidence levels together with previously given prior knowledge, rules, and additional knowledge, and selects the optimal word sequence for output. Specifically, the prior knowledge includes at least domain identification information input from outside the recognition system, or domain indication information obtained from the history of recognition results. The domain indication information can be a discrete 0/1 setting or a continuous probability value. The rules include at least a word-count range estimated from the audio length; using this range, recognition results that are too long or too short can be excluded. The additional information may include a measure, obtained from a super language model, of the degree to which the recognized word string conforms to grammatical norms. The above information and rules, applied through hierarchical computation, are used together with the confidence score as the decision criteria for selecting a candidate word string to output as the final recognition result.
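One possible reading of this layered decision is sketched below. The patent does not give a fusion formula, so everything numeric here is an assumption: the first layer filters hypotheses by a word-count range derived from hypothetical speaking-rate bounds `min_rate` and `max_rate` (words per second), and the second layer scores the survivors as `prior * (confidence + grammar_weight * grammar_score)`, where `grammar_score` stands in for the output of the super language model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Hypothesis:
    domain: str
    words: List[str]
    confidence: float      # assumed to lie in [0, 1]
    grammar_score: float   # assumed to lie in [0, 1]; hypothetical grammar measure


def decide(
    hyps: List[Hypothesis],
    audio_seconds: float,
    domain_prior: Dict[str, float],
    min_rate: float = 1.0,
    max_rate: float = 8.0,
    grammar_weight: float = 0.5,
) -> Optional[Hypothesis]:
    """Select one hypothesis using rules first, then fused scores."""
    # Layer 1: preset rule. Drop results that are too long or too short
    # for the audio length.
    low, high = min_rate * audio_seconds, max_rate * audio_seconds
    kept = [h for h in hyps if low <= len(h.words) <= high]
    if not kept:
        return None

    # Layer 2: fuse confidence, domain prior, and the grammar measure.
    def score(h: Hypothesis) -> float:
        # A prior of 0 or 1 acts as a discrete switch; values in between
        # act as a continuous domain probability.
        prior = domain_prior.get(h.domain, 0.0)
        return prior * (h.confidence + grammar_weight * h.grammar_score)

    return max(kept, key=score)
```

Filtering before scoring means an implausibly long result can never win on confidence alone, which is one way to realize the exclusion of overlong or overly short results mentioned above.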

The above is only a preferred embodiment of the present invention and is not intended to limit the invention. It should be noted that those of ordinary skill in the art can make further improvements and modifications without departing from the technical principles of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the invention.

Claims

1. A speech recognition system, characterized by comprising:

a base recognizer based on an acoustic-model-to-pinyin mapping network, for mapping speech to a network organized from multiple candidate pinyin sequences;

multiple parallel domain-specific recognizers, each based on a pinyin-to-word mapping network for a different application domain, each for being combined with the network organized from multiple candidate pinyin sequences, so as to obtain multiple best word sequences and confidence levels;

a comprehensive decision unit, for receiving the multiple best word sequences and confidence levels, making a decision according to the confidence levels together with previously given prior knowledge, preset rules, and additional information, and selecting the optimal word sequence for output;

wherein the recognition content of an existing domain is updated by adjusting the pinyin-to-word mapping network, adding new recognition content to the domain-specific recognizer based on the pinyin-to-word mapping network of that existing domain; and the recognition content of a new application domain is created by constructing the corresponding domain-specific recognizer based on a pinyin-to-word mapping network offline and then adding the extension content online to the domain-specific recognizers based on pinyin-to-word mapping networks;

the base recognizer based on the acoustic-model-to-pinyin mapping network dynamically computes acoustic scores from the input audio features, stores the language model scores of pinyin sequences on its network, and uses a dynamic programming algorithm to combine the acoustic scores and language model scores, searching for and outputting the several highest-scoring pinyin sequences;

the language model of the pinyin sequences is modeled with a recurrent neural network based on long short-term memory units;

the comprehensive decision unit selects the best candidate word sequence by fusing the recognition confidence, the prior knowledge, the preset rules, and the additional information;

the prior knowledge includes at least domain identification information input from outside the speech recognition system, or domain indication information obtained from the history of recognition results;

the domain indication information is a discrete 0/1 setting or a continuous probability value;

the preset rules include at least a word-count range estimated from the audio length;

the additional information includes a measure, obtained from a super language model, of the degree to which the recognized word string conforms to grammatical norms;

the comprehensive decision unit uses the additional information and the preset rules, applied through hierarchical computation, together with the confidence score as the decision criteria for selecting a candidate word sequence to output as the final recognition result.