CN108831439A - Audio recognition method, device, equipment and system - Google Patents
- Publication number: CN108831439A
- Application number: CN201810677565.1A
- Authority: CN (China)
- Prior art keywords: voice signal, paths, speech recognition, module, WFST
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
All G10L entries below fall under G — Physics; G10 — Musical instruments; acoustics; G10L — Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding; G10L15/00 — Speech recognition.
- G10L15/06 — Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L15/08 — Speech classification or search
- G10L15/26 — Speech-to-text systems
- Y02T10/40 — Engine management systems (under Y — General tagging of new technological developments; Y02T — Climate change mitigation technologies related to transportation)
Abstract
The present invention discloses a speech recognition method comprising the steps of: obtaining a voice signal; decoding the voice signal to obtain multiple optimal paths; evaluating the multiple optimal paths according to a pre-trained user model; and, according to the evaluation result, extracting from the multiple optimal paths the optimal path that matches the user model as the target optimal path, and determining the speech recognition result of the voice signal from the target optimal path. A speech recognition apparatus, a speech recognition device and a speech recognition system are also disclosed. Decoding the voice signal into multiple optimal paths and invoking a user model to evaluate them before producing the final recognition result solves the problem of low recognition accuracy in traditional speech recognition technology and greatly improves the accuracy of the recognition result. Besides its higher recognition accuracy, the system can also effectively improve the security of users' personal information.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speech recognition method, apparatus, device and system.
Background art
With the rapid development of intelligent interaction technology and the continuous expansion of market demand, speech recognition technology has advanced tremendously in recent years and is now widely applied in many fields. As its name suggests, speech recognition identifies an input voice signal and converts it into text information that a computer can process. Speech recognition enables intelligent voice interaction in numerous application scenarios, such as voice assistants and voice-based intelligent control.
In a traditional speech recognition scheme, the system performs feature extraction after receiving the voice signal, classifies the signal based on the extracted features, and then decodes with a weighted finite-state transducer (WFST) to output the recognition result. However, the recognition accuracy of traditional speech recognition technology is still not high.
Summary of the invention
Based on this, the present invention provides a speech recognition method, a speech recognition apparatus, a speech recognition device and a speech recognition system.
To achieve the above object, in one aspect, an embodiment of the present invention provides a speech recognition method comprising the steps of:
obtaining a voice signal;
decoding the voice signal to obtain multiple optimal paths;
evaluating the multiple optimal paths according to a pre-trained user model;
according to the evaluation result, extracting from the multiple optimal paths the optimal path that matches the user model as the target optimal path, and determining the speech recognition result of the voice signal from the target optimal path.
In one embodiment, decoding the voice signal to obtain multiple optimal paths includes the following steps:
performing feature extraction on the voice signal to obtain corresponding acoustic feature information;
classifying the voice signal into classes, and determining the corresponding class probabilities, via a pre-built acoustic model according to the acoustic feature information;
performing a forward search based on pre-built WFST modules according to the classified voice signal and the corresponding class probabilities, to obtain the multiple optimal paths.
In one embodiment, the step of performing a forward search based on pre-built WFST modules according to the classified voice signal and the corresponding class probabilities includes:
performing an independent forward search with each of multiple pre-built WFST modules, to obtain multiple optimal paths corresponding to the multiple WFST modules.
In one embodiment, the step of performing a forward search based on pre-built WFST modules according to the classified voice signal and the corresponding class probabilities further includes:
performing a synchronized forward search based on the multiple pre-built WFST modules and their corresponding weights, to obtain multiple optimal paths corresponding to the multiple WFST modules. This keeps recognition accuracy high while greatly increasing recognition speed.
In one embodiment, after the step of extracting, according to the evaluation result, the optimal path that matches the user model from the multiple optimal paths as the target optimal path and determining the speech recognition result of the voice signal, the method further includes:
if the speech recognition result is detected to contain new contact information, a new self-created phrase and/or new characteristic language information, updating the user model according to that new contact information, new self-created phrase and/or new characteristic language information.
In one embodiment, the multiple WFST modules include a customized WFST module, obtained by the following steps:
collecting specified words, phrases and syntactic information;
performing word segmentation on the specified words and phrases with a dictionary;
performing statistical training on the syntactic information to obtain a corresponding language model;
compiling the customized WFST module from the segmentation result and the language model. Incorporating a customized WFST module in this way can further improve recognition accuracy.
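As an illustrative sketch (not part of the patent), the statistical-training step above could amount to estimating an n-gram language model from the segmented phrases; a real customized WFST would then be compiled from such a model with a finite-state toolkit. All names and data here are hypothetical.

```python
from collections import Counter

def train_bigram_lm(segmented_sentences):
    """Estimate a toy bigram language model from pre-segmented sentences.

    Stands in for the 'statistical training' step; compiling the result
    into a WFST (e.g. with OpenFst) is omitted.
    """
    unigrams, bigrams = Counter(), Counter()
    for words in segmented_sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))

    def prob(word, prev):
        # Maximum-likelihood bigram probability with add-one smoothing.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

    return prob

prob = train_bigram_lm([["turn", "on", "the", "tv"],
                        ["turn", "off", "the", "tv"]])
# "on" and "off" are equally likely after "turn" in this toy corpus
assert abs(prob("on", "turn") - prob("off", "turn")) < 1e-9
```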
On the other hand, an embodiment of the present invention also provides a speech recognition method comprising the steps of:
sending a voice signal to a server;
obtaining the multiple optimal paths fed back by the server after it decodes the voice signal;
evaluating the multiple optimal paths according to a pre-trained user model;
according to the evaluation result, extracting from the multiple optimal paths the optimal path that matches the user model as the target optimal path, and determining the speech recognition result of the voice signal from the target optimal path.
In another aspect, an embodiment of the present invention provides a speech recognition apparatus, including:
a voice obtaining module for obtaining a voice signal;
a decoding module for decoding the voice signal to obtain multiple optimal paths;
a first evaluation module for evaluating the multiple optimal paths according to a pre-trained user model;
a first result obtaining module for extracting, according to the evaluation result, the optimal path that matches the user model from the multiple optimal paths as the target optimal path, and determining the speech recognition result of the voice signal from the target optimal path.
In another aspect, an embodiment of the present invention also provides a speech recognition apparatus, including:
a voice sending module for sending a voice signal to a server;
a word-sequence obtaining module for obtaining the multiple optimal paths fed back by the server after it decodes the voice signal;
a second evaluation module for evaluating the multiple optimal paths according to a pre-trained user model;
a second result obtaining module for extracting, according to the evaluation result, the optimal path that matches the user model from the multiple optimal paths as the target optimal path, and determining the speech recognition result of the voice signal from the target optimal path.
In another aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of any of the above speech recognition methods.
In another aspect, an embodiment of the present invention provides a speech recognition device including a memory and a processor, the memory storing a computer program which, when executed by the processor, implements any of the above speech recognition methods.
In another aspect, an embodiment of the present invention also provides a speech recognition system including a server and a terminal:
the terminal is configured to send a voice signal to the server;
the server is configured to decode the voice signal to obtain multiple optimal paths;
the terminal is further configured to evaluate the multiple optimal paths according to a pre-trained user model, and, according to the evaluation result, to extract from the multiple optimal paths the optimal path that matches the user model as the target optimal path and determine the speech recognition result of the voice signal from the target optimal path.
In one embodiment, the terminal is further configured to: if the speech recognition result is detected to contain new contact information, a new self-created phrase and/or new characteristic language information, update the user model according to that new contact information, new self-created phrase and/or new characteristic language information.
The above technical solutions have the following advantages and beneficial effects:
a pre-trained user model is invoked to evaluate the multiple optimal paths output by the WFST modules, and according to the evaluation result the optimal path that matches the user model is extracted from the multiple optimal paths as the target optimal path, from which the speech recognition result of the voice signal is determined. The resulting speech recognition covers as many voice-interaction application scenarios and fields as possible and effectively incorporates the user's speech characteristics, so the recognition result is closer to the user's actual application scene and recognition accuracy is greatly improved.
Brief description of the drawings
Fig. 1 is a flow diagram of the speech recognition method of one embodiment;
Fig. 2 is a flow diagram of optimal-path acquisition in one embodiment;
Fig. 3 is a brief flow diagram of customized-decoder construction in one embodiment;
Fig. 4 is a first schematic diagram of a speech recognition process of one embodiment;
Fig. 5 is a second schematic diagram of a speech recognition process of one embodiment;
Fig. 6 is a flow diagram of another speech recognition method of one embodiment;
Fig. 7 is a block diagram of a first speech recognition apparatus of one embodiment;
Fig. 8 is a structural diagram of the decoding module of one embodiment;
Fig. 9 is a block diagram of a second speech recognition apparatus of one embodiment;
Fig. 10 is a structural diagram of the speech recognition system of one embodiment;
Fig. 11 is a first timing diagram of the speech recognition process of one embodiment;
Fig. 12 is a second timing diagram of the speech recognition process of one embodiment.
Detailed description of the embodiments
The contents of the present invention are described in further detail below with reference to preferred embodiments and the accompanying drawings. Obviously, the embodiments described below only explain the invention and do not limit it. All other embodiments obtained by those of ordinary skill in the art, based on the embodiments of the present invention and without creative work, fall within the scope of protection of the present invention.
Speech recognition technology, also called automatic speech recognition (ASR), has the task of converting the vocabulary content of human speech into computer-readable text. It is a comprehensive technology involving multiple disciplines, such as speech production and hearing mechanisms, signal processing, probability and information theory, pattern recognition, and artificial intelligence. Currently, mainstream large-vocabulary speech recognition systems generally adopt recognition techniques based on statistical models. Speech recognition technology is usually carried by a speech recognition system, whose main body typically comprises a server and a terminal: the voice signal is generally input through the terminal and sent to the server, and the server performs speech recognition on it and returns the corresponding result. The terminal may, for example, be a smartphone: the user speaks a passage into the phone, the phone sends the input voice to the server for speech recognition and receives the recognition result back, so the user finally sees on the phone a passage of text corresponding to the input voice, or the phone executes a corresponding control operation after displaying the text, such as opening a corresponding application. Besides this, the terminal may also be any of various smart devices, such as a smart TV, a tablet, or various other smart appliances, smart office equipment, and so on.
However, in the course of realizing the technical solutions of the embodiments of the present invention, the inventors found that, as application requirements keep increasing, the recognition methods of traditional speech recognition technology still suffer from low recognition accuracy. To this end, referring to Fig. 1, a speech recognition method is provided, comprising the following steps.
S10: obtain a voice signal.
The voice signal may be a user-input voice signal that the server obtains from a terminal. The terminal may be, but is not limited to, a smartphone, tablet computer, smart TV, intelligent robot, interactive whiteboard, smart wearable device or smart medical device, and may also be another kind of smart appliance, an automobile, and so on.
S12: decode the voice signal to obtain multiple optimal paths.
Decoding may be performed on the voice signal by a pre-built search module. An optimal path may be a search path, among those output by decoding, that meets a requirement, for example the search path corresponding to the highest-weight decoding result.
In some embodiments, the pre-built search module may be a WFST module, i.e. the search component of a decoder, where a decoder is the software program (e.g. a mobile application or server program) or hardware device (e.g. a standalone voice translator) that decodes an input audio signal into corresponding text. The multiple optimal paths can usually be obtained directly from multiple WFST modules, or from the word lattice information output by decoding: word lattice information, i.e. a word lattice, is a representation of the decoding result that contains multiple optimal paths.
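As a purely illustrative sketch of the word-lattice idea (not part of the patent), a lattice can be viewed as a weighted DAG of word hypotheses from which the best few paths are extracted. Real decoders do this far more efficiently; the exhaustive enumeration and all names below are hypothetical.

```python
import heapq

def n_best_paths(lattice, start, end, n=3):
    """Enumerate the n highest-scoring paths through a toy word lattice.

    The lattice is a DAG: {node: [(next_node, word, log_prob), ...]}.
    """
    paths = []

    def walk(node, words, score):
        if node == end:
            paths.append((score, words))
            return
        for nxt, word, logp in lattice.get(node, []):
            walk(nxt, words + [word], score + logp)

    walk(start, [], 0.0)
    return heapq.nlargest(n, paths)  # highest total log-probability first

lattice = {
    0: [(1, "recognize", -0.2), (1, "wreck a nice", -1.0)],
    1: [(2, "speech", -0.1), (2, "beach", -0.9)],
}
best = n_best_paths(lattice, start=0, end=2, n=2)
assert best[0][1] == ["recognize", "speech"]
```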
The pre-built WFST modules in this embodiment may be individual WFST modules, each built from the acoustic model, pronunciation dictionary and language model corresponding to a predetermined field, a predetermined scene or a set language pattern, or may be a general WFST module combined from those individual modules. A predetermined field may be a discipline, a commodity category or another specific field; each predetermined field usually has distinctive vocabulary such as the field's common words and professional terms, and its corresponding pronunciation habits or stresses will differ. A predetermined scene may, for example, be one of the various living or working scenes a user is often in, each likewise having its characteristic speech. A set language pattern may be the user's own speech or pronunciation habits, i.e. a language pattern representing the user's personal features, such as the user's accent and idioms.
Specifically, the server may invoke the individually pre-built WFST modules, or the general WFST module combined from them, to decode the voice signal and output the multiple optimal paths. At this point, the server has completed the WFST search and obtained multiple preliminary speech recognition results with different probabilities. The individual WFST modules or the general WFST module may be built or combined with methods common in this field, which this specification does not limit.
In other embodiments, the optimal paths may be obtained with other kinds of existing search modules, which are not detailed here.
S14: evaluate the multiple optimal paths according to a pre-trained user model.
S16: according to the evaluation result, extract from the multiple optimal paths the optimal path that matches the user model as the target optimal path, and determine the speech recognition result of the voice signal from the target optimal path.
The user model may be a statistical model reflecting a user's personal features, generally obtained by training on user data collected in advance. The user model may be pre-trained on the required user data with any of the common techniques of this field; this specification does not limit the training method.
It will be appreciated that the server or the terminal may invoke the pre-trained user model to evaluate the multiple optimal paths obtained above, so that after evaluation each optimal path can be assigned a corresponding evaluation index, for example a score for its closeness to the user's personal features, or a combined score of that closeness and the path's decoding weight. The server or terminal may then, but is not limited to, extract from the multiple optimal paths the one with the highest user-model matching score as the target optimal path, and determine the speech recognition result of the voice signal from the target optimal path.
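The combined-score idea in S14-S16 can be sketched as follows (an illustrative toy, not the patent's actual scoring; the interpolation weight `alpha` and the phrase-based user model are assumptions).

```python
def rescore(paths, user_model_score, alpha=0.7):
    """Pick the target optimal path by combining each path's decoder
    weight with a user-model score.

    `paths` is a list of (words, decoder_log_weight);
    `user_model_score(words)` returns a log-score for closeness to the
    user's personal features.
    """
    def combined(item):
        words, log_w = item
        return alpha * log_w + (1 - alpha) * user_model_score(words)

    return max(paths, key=combined)

# Toy user model: reward paths containing the user's self-created phrase.
user_phrases = {"my bae"}

def user_model_score(words):
    return 0.0 if user_phrases & set(words) else -2.0

paths = [(["call", "my", "bay"], -0.4), (["call", "my bae"], -0.6)]
best_words, _ = rescore(paths, user_model_score)
assert best_words == ["call", "my bae"]  # user model outweighs decoder weight
```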
With the speech recognition method of the above embodiment, the pre-trained user model is invoked to evaluate the multiple optimal paths, and the speech recognition result with the highest user-model matching degree, i.e. the one that best fits the user's actual situation, is obtained from the multiple optimal paths.
Moreover, combining the construction of WFST modules with evaluation by the pre-trained user model adapts effectively to complex and changing speech-communication scenes, takes into account the various fields and speaking habits covered by the content of the user's speech, and comes closer to the user's actual application scene; recognition accuracy is greatly improved, effectively avoiding the low recognition accuracy of traditional speech recognition technology.
In one embodiment, the speech recognition result may be a word sequence, or a control instruction corresponding to a word sequence. A word sequence may be a character string, with corresponding probability and network structure, that corresponds to the target optimal path, specifically the text information obtained after the voice information is decoded and searched. After receiving the speech recognition result, the terminal may display the text, or may execute a corresponding control operation. For example, when the terminal is a mobile phone, the user can speak a passage into the phone, and the server in the background quickly and accurately converts what the user said into text and displays it. Or, when the terminal is for example a television, the user can speak a voice command at it; the background server quickly and accurately recognizes the command, obtains the corresponding control instruction and returns it to the television, causing the television to execute the corresponding control operation, such as switching programs.
In one embodiment, the user model of the above embodiments may be trained on the user's associated contact information, self-created phrases and/or characteristic language information. The associated contact information may be uploaded from the user's terminal in advance, or obtained when the terminal automatically synchronizes its contacts to the server. Self-created phrases may be phrases the user creates in various ways during routine use of the terminal, for example phrases created by typing text, or phrases extracted from voice information input into the terminal; a self-created phrase generally does not exist in any existing dictionary but is first created by the user. Characteristic language information may include information characterizing the user's speech habits and voice usage habits, for example the user's pronunciation, average speaking rate, pet phrases, or other information characterizing the user's voice features. By periodically or continuously collecting the user's speech-characteristic information for user-model training, a user model matching the user's true situation as closely as possible is obtained, thereby ensuring the accuracy improvement of the speech recognition results.
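The three data sources named above (contacts, self-created phrases, characteristic language information) could feed a user model as simply as a word-frequency profile; this toy sketch, with entirely hypothetical names and data, also shows the incremental update described later in the text.

```python
from collections import Counter

def train_user_model(contacts, self_created_phrases, transcripts):
    """Build a toy user model as a word-frequency profile from the three
    data sources named in the text (all inputs are illustrative)."""
    profile = Counter()
    for source in (contacts, self_created_phrases, transcripts):
        for entry in source:
            profile.update(entry.lower().split())
    return profile

def update_user_model(profile, new_entries):
    """Incremental update when new contacts or phrases are detected."""
    for entry in new_entries:
        profile.update(entry.lower().split())
    return profile

model = train_user_model(["Alice Zhang"], ["holibobs"], ["call alice please"])
model = update_user_model(model, ["Bob Li"])
assert model["alice"] == 2 and model["bob"] == 1
```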
It should be noted that the steps of the speech recognition method in this specification may be executed partly on the terminal and partly on the server, or entirely on the terminal, for example for offline speech recognition; the description of steps as executed by the server is therefore an exemplary execution mode, not the only one.
Referring to Fig. 2, in one embodiment step S12 may specifically include the following steps.
S122: perform feature extraction on the voice signal to obtain corresponding acoustic feature information.
It will be appreciated that, after obtaining the voice signal, the server performs feature extraction on it to obtain its acoustic feature information. The server may use conventional techniques of this field for feature extraction; the embodiments of this specification do not limit the method used for the acoustic-feature extraction process. For example, any one of linear prediction cepstrum coefficients (LPCC: Linear Prediction Cepstrum Coefficient), Mel-frequency cepstrum coefficients (MFCC: Mel Frequency Cepstrum Coefficient), perceptual linear prediction (PLP: Perceptual Linear Predictive) and Mel-scale filter banks (FBANK: Mel-scale Filter Bank) may be used.
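The front end shared by the MFCC/FBANK methods listed above (framing, windowing, per-frame spectrum) can be sketched as follows; mel filtering and the DCT are omitted, and the 25 ms / 10 ms parameters are conventional assumptions, not taken from the patent.

```python
import numpy as np

def frame_power_spectra(signal, frame_len=400, hop=160):
    """Slice a waveform into overlapping frames and compute each frame's
    power spectrum (25 ms frames / 10 ms hop at 16 kHz)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft -> magnitude squared gives the per-frame power spectrum
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)       # 1 s of fake 16 kHz audio
spectra = frame_power_spectra(speech)
assert spectra.shape == (98, 201)         # 98 frames x 201 frequency bins
```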
S124: classify the voice signal into classes, and determine the corresponding class probabilities, via the pre-built acoustic model according to the acoustic feature information.
The acoustic model may be built in advance with conventional methods of this field; this specification does not limit the method of building the acoustic model. For example, the acoustic model may be built with any of convolutional neural networks, recurrent neural networks, deep neural networks, Gaussian mixture models and long short-term memory networks.
It will be appreciated that, using the pre-built acoustic model, the server can perform classification on the voice signal according to the previously obtained acoustic feature information, dividing the voice signal into a fixed number of classes in combination with indices such as the set class count, and give the corresponding class probability of each class. Generally, each classification search path in the acoustic model carries a corresponding weight (probability), and by merging the respective weights of the classification paths, the class probability of a class is obtained together with the classification output. For example, a certain frame of the voice signal may be assigned to class A with probability 0.8 and to class B with probability 0.4. The fixed number of classes may be, for example, 3000 to 10000, determined according to the various sub-classes of the common scenes where the speech recognition technology is applied; for example, class A may be mobile phones, class B televisions and class C electronic thermometers.
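For a neural acoustic model of the kinds listed above, the per-frame class probabilities are typically produced by a softmax output layer; a minimal sketch (illustrative only):

```python
import numpy as np

def frame_class_probs(logits):
    """Turn per-frame acoustic-model scores into class probabilities via
    softmax. `logits` has shape (n_frames, n_classes)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1]])       # one frame, three classes
probs = frame_class_probs(logits)
assert probs.argmax() == 0                 # class 0 is most probable
assert abs(probs.sum() - 1.0) < 1e-9      # probabilities sum to one
```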
S126: perform a forward search based on the pre-built WFST modules according to the classified voice signal and the corresponding class probabilities, to obtain the multiple optimal paths.
Specifically, the server may perform the forward search based on the multiple pre-built WFST modules, or on one general WFST module, to obtain the multiple optimal paths corresponding to the predetermined fields, predetermined scenes and set language patterns. Through this decoding step, multiple optimal-path outputs that effectively cover as many voice-interaction application scenarios and fields as possible are obtained quickly, giving strong applicability.
In one embodiment, step S126 may specifically include the following step:
performing an independent forward search with each of the multiple pre-built WFST modules, to obtain multiple optimal paths corresponding to the respective WFST modules.
It will be appreciated that, during the decoding search, the server may have each WFST module of each field, each scene and/or each set language pattern perform an independent forward search according to the classified voice signal and the corresponding class probabilities, obtaining the optimal path output by each of the multiple WFST modules. One WFST module may correspond to one optimal path, and each optimal path generally carries its own weight. By performing independent forward searches over the multiple WFST modules to obtain multiple optimal paths, accurate recognition results can be ensured in each field, each scene and/or each set language pattern.
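The independent-search variant can be sketched with a greedy toy decoder run over several domain-specific "modules" (the `state -> {class_id: (next_state, word)}` structure is an assumption for illustration, not the real WFST format):

```python
def best_path(wfst, frame_probs):
    """Greedy stand-in for one module's independent forward search: at
    each frame, follow the arc whose label has the highest class
    probability. Returns the word sequence and its path score."""
    state, words, score = 0, [], 1.0
    for probs in frame_probs:
        arcs = wfst[state]
        cls = max(arcs, key=lambda c: probs[c])
        state, word = arcs[cls]
        words.append(word)
        score *= probs[cls]
    return words, score

# Two domain-specific modules searched independently over the same frames.
phone_wfst = {0: {0: (0, "dial"), 1: (0, "hang-up")}}
tv_wfst = {0: {0: (0, "volume"), 1: (0, "channel")}}
frames = [{0: 0.9, 1: 0.1}]
results = [best_path(m, frames) for m in (phone_wfst, tv_wfst)]
assert [w for w, _ in results] == [["dial"], ["volume"]]
```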
In one of the embodiments, step S126 may alternatively be: based on the multiple pre-built WFST modules and their corresponding weights, perform a synchronized forward search to obtain multiple best paths corresponding to the multiple WFST modules.
It will be appreciated that the server can feed the classified voice signal and the corresponding classification probabilities into the multiple WFST modules simultaneously, bringing each WFST module's own weight into the search process in combination with the Viterbi algorithm. For example, using the Viterbi algorithm together with each module's weight, the WFST modules search forward in step, and the paths produced during the search undergo unified threshold pruning: paths whose probability falls below a set threshold are pruned away, and only a limited number of better paths are retained to continue the forward search, until multiple best paths are finally output. Each WFST module can obtain its own weight when it is generated, for example the weight of the voice signal in the field that the module covers. In this way, during the synchronized forward search each WFST module can output a best path carrying its own weight according to that weight's magnitude, effectively reducing the time consumed by the search. In the subsequent user-model evaluation, the server or terminal can fold these weights into an overall evaluation, so that recognition accuracy is improved at the same time as recognition speed.
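The unified threshold pruning described for the synchronized search might look, in a minimal sketch, like the following; the threshold, beam size and hypothesis tuples are invented for illustration.

```python
def beam_prune(hypotheses, threshold, beam_size):
    """Unified threshold pruning: drop partial paths whose probability
    falls below `threshold`, then keep only the `beam_size` best
    survivors to continue the forward search."""
    kept = [h for h in hypotheses if h[1] >= threshold]
    kept.sort(key=lambda h: h[1], reverse=True)
    return kept[:beam_size]

# Four partial paths with their probabilities at the current frame.
hyps = [("path-a", 0.9), ("path-b", 0.05), ("path-c", 0.5), ("path-d", 0.4)]
survivors = beam_prune(hyps, threshold=0.1, beam_size=2)
```

Here `path-b` is removed by the probability threshold, and of the remainder only the two best continue, which is what limits the time cost of the search.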
In one of the embodiments, after step S16 the method may further include the step: if the speech recognition result is detected to contain newly added contact information, newly added user-coined phrases and/or newly added characteristic language information, updating the user model according to that newly added contact information, newly added user-coined phrases and/or newly added characteristic language information.
Here, newly added contact information can be a contact newly added to the user's contact list, or the updated part of a contact entry changed by the user, such as a new name, new number or new address. A newly added user-coined phrase refers to a phrase the user invents during routine use of the terminal, for example a self-coined phrase that appears when the user corrects a recognition result. Newly added characteristic language information is speech-habit information the user newly forms in daily use of the terminal, for example a new accent or new wording habits formed after living in a different language environment for a long time; wording habits such as pet phrases and frequently used words can likewise be obtained from the corrections the user makes to recognition results.
It will be appreciated that when the server or terminal detects that a speech recognition result contains newly added contact information, newly added user-coined phrases and/or newly added characteristic language information, it automatically collects them and trains and updates the user model in time, ensuring that the user model stays consistent with the user's characteristics during daily use and accurately reflects the user's actual situation. Such training updates of the user model ensure the accuracy of the evaluation results produced with it.
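A minimal sketch of this update, assuming the user model is a simple dictionary of contacts and phrases (the patent does not specify its representation; the names below are invented):

```python
def update_user_model(user_model, recognition_result, known_contacts):
    """If the recognition result contains a contact name not yet in the
    user model, add it, so later evaluations reflect the user's current
    contact list."""
    for word in recognition_result.split():
        if word in known_contacts and word not in user_model["contacts"]:
            user_model["contacts"].append(word)
    return user_model

model = {"contacts": ["Alice"], "phrases": []}
model = update_user_model(model, "call Bob now", known_contacts={"Alice", "Bob"})
```

A real system would apply the same pattern to user-coined phrases and characteristic language information, and retrain the model's scores rather than just appending entries.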
Referring to Fig. 3, in one of the embodiments the multiple WFST modules of the embodiments above may comprise at least two classes of WFST modules. One class consists of conventional WFST modules (conventional relative to the customized decoder below), each built from the acoustic model, pronunciation dictionary and language model of one predetermined field, predetermined scene or set language mode, and corresponding to that field, scene or language mode. The other class consists of customized WFST modules built from special grammar that is rarely used in routine speech, from uncommon words and phrases, and from the newest emerging words and network hotspot words; new words and hotspot words can for example be the new coinages or hot words popular on the network each year, such as "I will beat, I will see, I will listen, I will buy, OMG (Oh My God)". When building a customized WFST module, the required words and phrases can be obtained by crawling related corpora from the network; the specific crawling method is not limited here, and any method common in the art can be used.
The main steps of building a customized WFST module can be the following S20 to S26:
S20, collect the set words, phrases and grammar information;
S22, perform word segmentation on the set words and phrases using a dictionary;
S24, perform statistical training on the grammar information to obtain a corresponding language model;
S26, compile the word-segmentation result and the language model into the customized WFST module.
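Steps S20 to S26 above can be sketched as follows; the greedy segmenter and bigram counter are simplified stand-ins for the dictionary-based segmentation and the statistical language-model training, not the patent's actual tooling, and the final WFST compilation (S26) is omitted.

```python
from collections import Counter

def segment(phrase, dictionary):
    """S22: greedy longest-match segmentation against a dictionary,
    falling back to single characters for unknown material."""
    words, i = [], 0
    while i < len(phrase):
        for j in range(len(phrase), i, -1):
            if phrase[i:j] in dictionary or j == i + 1:
                words.append(phrase[i:j])
                i = j
                break
    return words

def bigram_counts(sentences):
    """S24: bare-bones n-gram statistics (bigram counts) over the
    collected grammar information; a real system would smooth these."""
    counts = Counter()
    for s in sentences:
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
    return counts

# S20: hypothetical crawled hot words and example sentences.
tokens = segment("hotnewword", {"hot", "newword"})
model = bigram_counts([["I", "will", "watch"], ["I", "will", "listen"]])
```

The segmentation result (`tokens`) and the trained counts (`model`) correspond to the two inputs that S26 compiles into the customized WFST module.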
Here, the dictionary above can be the traditional pronunciation dictionary used when generating a conventional WFST module, and the statistical training of the language model can likewise use methods conventional in the art, such as an N-gram language model.
It will be appreciated that while generating the WFST modules of each field with traditional WFST generation methods, the server can collect the set words, phrases and grammar information, perform word segmentation and language-model training on them respectively, and then, from the segmentation result and the trained language model, compile a customized WFST module through a traditional decoder construction method. A customized WFST module can for example cover a sub-field such as spoken language, written language, chemistry or mathematics. By running forward searches through the conventional WFST modules and the customized WFST modules respectively, a speech recognition result of high accuracy can still be output even when the acquired voice signal contains uncommon words and phrases, popular new network words, hotspot words and phrases, or their associated grammar.
In one of the embodiments, the terminal referred to above is the terminal from which the voice signal originates, such as a mobile phone, tablet device, PDA or intelligent interactive device; it may also be another device that the voice signal is meant to control, such as a television, smart tablet or other intelligent interactive device. After the server processes the voice signal into a corresponding speech recognition result (for example, the text corresponding to the word sequence), the server determines, from the command information contained in the speech recognition result, the terminal the voice signal is directed at. In other words, after obtaining the user's voice signal, performing speech recognition and obtaining the corresponding result, the server can send the speech recognition result to the terminal corresponding to the voice signal, completing the whole speech recognition response and letting the corresponding terminal promptly perform the matching display, interaction or operation control; the server's level of integration is thereby higher.
Please refer to Figs. 4 and 5, which give simplified diagrams of the speech recognition process so that the steps in the embodiments above are easier to understand. It should be noted that, for simplicity of description, the method embodiments above are all expressed as series of combined actions; those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention certain steps can adopt other orders.
Referring to Fig. 6, another audio recognition method is also provided, including the following steps S11 to S17:
S11, send a voice signal to a server;
S13, obtain the multiple best paths the server feeds back after decoding the voice signal;
S15, evaluate the multiple best paths according to a pre-trained user model;
S17, according to the evaluation result, extract from the multiple best paths the best path matching the user model as the target best path, and determine the speech recognition result of the voice signal according to the target best path.
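Steps S11 to S17 above can be sketched from the terminal's side as follows; the fake server and scoring function are invented stand-ins for the real decoding service and the trained user model.

```python
def recognize_on_terminal(send_to_server, user_model_score, signal):
    """S11/S13: send the signal and receive candidate best paths;
    S15: score each candidate with the local user model;
    S17: pick the best-scoring path and form the recognition result."""
    candidate_paths = send_to_server(signal)
    scored = [(user_model_score(p), p) for p in candidate_paths]
    _, target = max(scored)
    return " ".join(target)

# Toy stand-ins: the "server" returns two candidates, and the user
# model prefers paths containing a known contact name ("Paul").
fake_server = lambda sig: [["call", "Paul"], ["tall", "ball"]]
score = lambda path: 1.0 if "Paul" in path else 0.1
result = recognize_on_terminal(fake_server, score, signal=b"...")
```

Because the scoring runs locally, the user model never leaves the terminal, which is the privacy benefit the text describes.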
It will be appreciated that the decoding and evaluation used in the steps above can follow the corresponding decoding and evaluation methods in the foregoing embodiments, which are not repeated here.
Specifically, after receiving the voice signal input by the user, the terminal can send it to the server responsible for decoding. After receiving the voice signal, the server decodes it, obtains multiple best paths and feeds them back to the terminal. Having received the multiple best paths returned by the server, the terminal evaluates them according to the pre-trained user model, then, according to the evaluation result, extracts from them the best path matching the user model as the target best path and determines the speech recognition result of the voice signal from it. Because the terminal itself evaluates the multiple best paths with the user model to obtain the final speech recognition result, leakage of the user's personal information contained in the user model is prevented and the security of that personal information is improved.
Referring to Fig. 7, a speech recognition apparatus 100 is provided, including a voice acquisition module 12, a decoding processing module 14, a first evaluation module 16 and a first result acquisition module 18. The voice acquisition module 12 is used for obtaining a voice signal. The decoding processing module 14 is used for decoding the voice signal to obtain multiple best paths. The first evaluation module 16 is used for evaluating the multiple best paths according to a pre-trained user model. The first result acquisition module 18 is used for extracting, according to the evaluation result, the best path matching the user model from the multiple best paths as the target best path, and determining the speech recognition result of the voice signal according to the target best path.
In this way, the technical solution of the embodiment above uses its modules, in combination with the pre-trained user model, to evaluate the multiple best paths obtained by decoding and to pick the target best path from the evaluation result as the final speech recognition result. It can effectively adapt to complex and changeable speech communication scenarios, taking into account both the various fields covered by the content of the user's speech and the user's speaking habits; being closer to the user's practical application scenario, it greatly improves recognition accuracy and effectively avoids the lower accuracy of traditional speech recognition technology.
Referring to Fig. 8, in one of the embodiments the decoding processing module 14 may include a feature extraction module 142, a classification computation module 144 and a decoding search module 146. The feature extraction module 142 is used for extracting features from the voice signal to obtain corresponding acoustic feature information. The classification computation module 144 is used for classifying the voice signal into classes through a pre-built acoustic model according to the acoustic feature information and determining the corresponding classification probabilities. The decoding search module 146 is used for performing a forward search, based on the pre-built WFST modules, on the classified voice signal and the corresponding classification probabilities to obtain multiple best paths. For the feature extraction, classification and forward search of this embodiment, refer to the corresponding methods in the audio recognition method embodiments above, which are not repeated here.
In one of the embodiments, the decoding search module 146 may include a first search module used for performing independent forward searches based on the multiple pre-built WFST modules, obtaining multiple best paths corresponding to the multiple WFST modules.
In one of the embodiments, the decoding search module 146 may include a second search module used for performing a synchronized forward search based on the multiple pre-built WFST modules and their corresponding weights, obtaining multiple best paths corresponding to the multiple WFST modules.
In one of the embodiments, the speech recognition apparatus 100 may further include a user model update module. If the user model update module detects that the speech recognition result contains newly added contact information, newly added user-coined phrases and/or newly added characteristic language information, it updates the user model according to that newly added contact information, newly added user-coined phrases and/or newly added characteristic language information.
In one of the embodiments, the speech recognition apparatus 100 above may further include a preset information collection module, a word segmentation module and a customized decoder construction module. The preset information collection module is used for collecting the set words, phrases and grammar information. The word segmentation module is used for performing word segmentation on the set words and phrases using a dictionary and performing statistical training on the grammar information to obtain a corresponding language model. The customized decoder construction module is used for compiling the segmentation result and the obtained language model into a customized WFST module. By running forward searches through the conventional WFST modules and the customized WFST modules respectively, a speech recognition result of high accuracy can still be output even when the acquired voice signal contains uncommon words and phrases, popular new network words, hotspot words and phrases, or their associated grammar.
Referring to Fig. 9, in one of the embodiments a speech recognition apparatus 200 is also provided. The speech recognition apparatus 200 includes a voice sending module 22, a path acquisition module 24, a second evaluation module 26 and a second result acquisition module 28. The voice sending module 22 is used for sending a voice signal to a server. The path acquisition module 24 is used for obtaining the multiple best paths the server feeds back after decoding the voice signal. The second evaluation module 26 is used for evaluating the multiple best paths according to a pre-trained user model. The second result acquisition module 28 is used for extracting, according to the evaluation result, a best path matching the user model from the multiple best paths as the target best path, and determining the speech recognition result of the voice signal according to the target best path.
In this way, the technical solution of the embodiment above uses its modules, in combination with the pre-trained user model, to evaluate the multiple best paths returned by the server and to pick the target best path from the evaluation result as the final speech recognition result. It can effectively adapt to complex and changeable speech communication scenarios, taking into account both the various fields covered by the content of the user's speech and the user's speaking habits; being closer to the user's practical application scenario, it greatly improves recognition accuracy, effectively avoids the lower accuracy of traditional speech recognition technology, and additionally improves the security of the user's personal information.
The first evaluation module 16 in the speech recognition apparatus 100 above and the second evaluation module 26 in the speech recognition apparatus 200 can be understood as modules with the same function; they differ in name only because they belong to different apparatuses, not in essence. The same holds for the relationship between the first result acquisition module 18 in the speech recognition apparatus 100 and the second result acquisition module 28 in the speech recognition apparatus 200.
The modules in the speech recognition apparatuses 100 and 200 above can be realized fully or partially through software, hardware or a combination of both. Each module can be embedded, in hardware form, in or independently of the processor of a computer device, or stored in software form in the memory of the computer device, so that the processor can invoke and execute the operation corresponding to each module.
In one of the embodiments, a speech recognition device is provided; the device can be a computer device, for example an ordinary computer or a server. The speech recognition device includes a memory and a processor, and a computer program runnable on the processor is stored on the memory. The processor of the speech recognition device provides computation and control capability. The memory of the speech recognition device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and the computer program, and the internal memory provides the environment in which the operating system and the computer program in the non-volatile storage medium run. The speech recognition device may include a network interface for communicating with an external interactive terminal through a network connection. When the processor executes the computer program on the memory, the following steps can be executed: obtain a voice signal; decode the voice signal to obtain multiple best paths; evaluate the multiple best paths according to a pre-trained user model; according to the evaluation result, extract from the multiple best paths the best path matching the user model as the target best path, and determine the speech recognition result of the voice signal according to the target best path.
In one of the embodiments, another speech recognition device is also provided; the device can be an intelligent terminal device, for example a mobile terminal or any of various intelligent interactive devices such as a smart television or smart tablet. The speech recognition device includes a memory and a processor, and a computer program runnable on the processor is stored on the memory. The processor of the speech recognition device provides computation and control capability. The memory of the speech recognition device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and the computer program, and the internal memory provides the environment in which they run. The speech recognition device may include a network interface for communicating with other external interactive terminals through a network connection. When the processor executes the computer program on the memory, the following steps can be executed: send a voice signal to a server; obtain the multiple best paths the server feeds back after decoding the voice signal; evaluate the multiple best paths according to a pre-trained user model; according to the evaluation result, extract from the multiple best paths the best path matching the user model as the target best path, and determine the speech recognition result of the voice signal according to the target best path.
In one of the embodiments, when the processor of the speech recognition device of any embodiment above executes the computer program on its memory, it can also realize the corresponding parts of the embodiments of the audio recognition method of the present invention.
A program is commonly stored in a storage medium and is executed either by reading it directly out of the storage medium or by installing or copying it into the storage device (such as a hard disk and/or memory) of a data processing device. Such a storage medium therefore also constitutes the present invention. The storage medium can use any kind of recording mode, such as a paper storage medium (e.g. paper tape), a magnetic storage medium (e.g. floppy disk, hard disk, flash memory), an optical storage medium (e.g. CD-ROM) or a magneto-optical storage medium (e.g. MO). The invention therefore also discloses a computer-readable storage medium storing a computer program which, when run, executes the following steps: obtain a voice signal; decode the voice signal to obtain multiple best paths; evaluate the multiple best paths according to a pre-trained user model; according to the evaluation result, extract from the multiple best paths the best path matching the user model as the target best path, and determine the speech recognition result of the voice signal according to the target best path.
In one of the embodiments, the invention also discloses another computer-readable storage medium storing a computer program which, when run, executes the following steps: send a voice signal to a server; obtain the multiple best paths the server feeds back after decoding the voice signal; evaluate the multiple best paths according to a pre-trained user model; according to the evaluation result, extract from the multiple best paths the best path matching the user model as the target best path, and determine the speech recognition result of the voice signal according to the target best path.
In one of the embodiments, the computer program on the computer-readable storage medium of any of the foregoing embodiments, when run, is also used for executing the corresponding embodiments of the audio recognition method of the present invention.
According to the audio recognition method of the embodiments of the present invention above, and referring to Fig. 10, an embodiment of the present invention also provides a speech recognition system 300, described in detail below with reference to the timing shown in Fig. 11, Fig. 12 and alternative embodiments.
The speech recognition system 300 may include a server 32 and a terminal 34. The terminal 34 can be used for sending a voice signal to the server 32. The server 32 can be used for decoding the voice signal to obtain multiple best paths. The terminal 34 can also be used for evaluating the multiple best paths according to a pre-trained user model and, according to the evaluation result, extracting from the multiple best paths the best path matching the user model as the target best path and determining the speech recognition result of the voice signal according to the target best path.
Here, the server 32 can be the background processing device for the voice signal, for example a dedicated server, a cloud computing server, or a voice-signal recognition processing platform composed of a dedicated server and a cloud computing server. The terminal 34 can be any of various smart devices, such as a smartphone, smart television or tablet computer, or other smart appliances, smart office equipment and smart vehicles.
Specifically, after obtaining a voice signal input orally by the user, or input indirectly through other equipment, the terminal 34 above sends the obtained voice signal to the server 32. The server 32 can thus decode the received voice signal and, after outputting multiple best paths, return them to the terminal 34. At this point the terminal 34 can invoke the pre-trained user model, evaluate the returned best paths and, according to the evaluation result, extract from them the best path matching the user model as the target best path and determine from it the speech recognition result of the voice signal input by the user. It will be appreciated that the decoding performed by the server 32 can be understood from the decoding in the embodiments of the audio recognition method above, and that the terminal 34's evaluation of the multiple best paths according to the user model can likewise refer to the user-model evaluation in those embodiments; neither is repeated in this embodiment.
In this way, the server 32 decodes the voice signal using the individual WFST modules or one general WFST module and returns multiple best paths to the terminal 34, and the terminal 34 then evaluates them according to the pre-trained user model to finally determine the speech recognition result of the input voice signal. In sum, the speech recognition system 300 above can effectively cover as many speech application scenarios and fields as possible while taking the user's habits into account; being closer to the user's practical application scenario, it greatly improves recognition accuracy. It also avoids the leakage of the personal information contained in the user model that sharing it to the public environment of the server 32 would cause, so the security of the user's personal information is high and the user experience is considerably improved.
In one of the embodiments, there may be one server 32 or several, for example several interconnected servers 32. Each server 32 can store the WFST modules of one or more fields, scenes or preset language modes, and the distributed server decoding network formed by the several servers 32 works in concert, so that the voice signal can be quickly decoded and searched across different fields, scenes or preset language modes. The speech decoding of the voice signal above can thereby be completed faster and more accurately, and a greater number of terminals 34 sending voice signals to be recognized in the same time period can be accommodated simultaneously, so the processing efficiency is higher.
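The cooperating distributed decoding network might be sketched as below, with each "server" reduced to a function holding one field's WFST modules; the thread pool stands in for the real linkage between the interconnected servers 32, and all names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def distributed_decode(field_servers, signal):
    """Query every field-specific server with the same signal and let
    the collecting side gather all returned best paths."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda decode: decode(signal), field_servers.values())
    return [path for paths in results for path in paths]

# Two hypothetical servers, each covering one field.
field_servers = {
    "navigation": lambda s: [("go home", 0.7)],
    "music":      lambda s: [("play jazz", 0.6)],
}
all_paths = distributed_decode(field_servers, signal=b"...")
```

The master control server described next would play the role of the collecting side here, pairing the gathered paths back to the terminal that sent the signal.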
The several servers 32 above can be configured with one master control server 32 to handle docking with each terminal 34 and addressing pairing when results are returned, improving the speed at which the multiple best paths, or word-lattice information containing them, are returned to each terminal 34. In this way, the distributed network of servers 32 can cooperate with the terminals 34 to complete the decoding of the voice signals input by users, improving the speech recognition processing efficiency and capacity of the whole speech recognition system 300.
In one of the embodiments, the terminal 34 can also be used for: if the speech recognition result is detected to contain newly added contact information, newly added user-coined phrases and/or newly added characteristic language information, updating the user model according to that newly added contact information, newly added user-coined phrases and/or newly added characteristic language information. In this way, the terminal 34 can, through periodic detection, collect the user's contact information, newly coined phrases and/or characteristic language information described above for training updates of the user model, obtaining a user model that matches the user's true situation as closely as possible and thereby ensuring that the accuracy of the speech recognition results is effectively improved at different times.
In one of the embodiments, the WFST modules used by the server 32 of the embodiments of the speech recognition system 300 above during decoding, or the WFST modules composing the universal decoder, include a customized WFST module. The customized WFST module can be obtained by the server 32 collecting the set words, phrases and grammar information, performing word segmentation on the set words and phrases using a dictionary, performing statistical training on the grammar information to obtain a corresponding language model, and then compiling the segmentation result and the obtained language model. Combining the conventional WFST modules with the customized WFST module in this way, multiple best paths of high accuracy can still be output when the acquired voice signal contains uncommon words and phrases, popular new network words, hotspot words and phrases, or their associated grammar, so that the terminal 34 finally obtains a speech recognition result of high accuracy.
In one of the embodiments, a client can be installed in the terminal 34 of the embodiments above. The client can be used for carrying out the communication between the terminal 34 and the server 32 and for executing the speech recognition steps of the terminal 34 above.
In one of the embodiments, after obtaining a voice signal input, the terminal 34 or the server 32 performs timbre matching on the voice signal against a prestored timbre feature. If the timbre matching result is consistent, the subsequent speech recognition steps continue to be executed on the voice information; otherwise the voice signal is intercepted and an alarm is raised, or the voice signal is deleted, terminating its subsequent recognition steps. Here, the prestored timbre feature can be the spectral feature of the voice recorded by the first user of the terminal 34 (for example, its owner), and timbre matching is the process of performing matching analysis between the prestored spectral feature and the spectral feature of the input voice signal. By identifying the voice signal early in this way, the problem of the terminal 34 being used by a thief can be avoided and the security of speech recognition improved.
The technical features of the embodiments above can be combined arbitrarily. For brevity of description, not every possible combination of the technical features of the embodiments above has been described; as long as a combination of these technical features involves no contradiction, however, it should be considered within the scope of this specification.
The embodiments above express only several implementations of the present invention, and their description is relatively specific and detailed, but they must not therefore be construed as limiting the scope of the patent. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the inventive concept, and these belong to the scope of protection of the invention. The scope of protection of this patent shall therefore be subject to the appended claims.
Claims (13)
1. a kind of audio recognition method, which is characterized in that including step:
Obtain voice signal;
Processing is decoded to the voice signal, obtains multiple optimal paths;
According to user model trained in advance, multiple optimal paths are evaluated;
According to evaluation result, extracted and the matched optimal path of the user model from multiple optimal paths
As target best paths, and determine according to the target best paths speech recognition result of the voice signal.
2. The speech recognition method according to claim 1, wherein decoding the voice signal to obtain the plurality of best paths comprises the steps of:
performing feature extraction on the voice signal to obtain corresponding acoustic feature information;
according to the acoustic feature information, classifying the voice signal into classes by a pre-built acoustic model and determining a corresponding classification probability for each class; and
performing a forward search based on a pre-built WFST module according to the classified voice signal and the corresponding classification probabilities, to obtain the plurality of best paths.
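The classification step of claim 2 (acoustic features in, per-class probabilities out) can be illustrated with a toy stand-in; the feature values, the two-class model, and all names below are assumptions for illustration only:

```python
import math

# Sketch of claim 2's middle step: map per-frame acoustic features to
# class posteriors (the "classification probability") that later drive
# the WFST forward search. The acoustic model here is a toy stand-in.

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_frames(frames, acoustic_model):
    """Return, per frame, a posterior distribution over the classes."""
    return [softmax(acoustic_model(f)) for f in frames]

# Toy two-class acoustic model: a linear score per class.
def toy_model(frame):
    return [frame * w for w in (1.0, -1.0)]

posteriors = classify_frames([0.5, -0.3, 1.2], toy_model)
```

Each frame yields a probability distribution; in a real decoder these posteriors weight the arcs explored during the forward search.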
3. The speech recognition method according to claim 2, wherein the step of performing the forward search based on the pre-built WFST module according to the classified voice signal and the corresponding classification probabilities to obtain the plurality of best paths comprises:
performing independent forward searches based on a plurality of pre-built WFST modules, to obtain a plurality of best paths corresponding to the plurality of WFST modules.
4. The speech recognition method according to claim 2, wherein the step of performing the forward search based on the pre-built WFST module according to the classified voice signal and the corresponding classification probabilities to obtain the plurality of best paths further comprises:
performing a synchronized forward search based on a plurality of pre-built WFST modules and their corresponding weights, to obtain a plurality of best paths corresponding to the plurality of WFST modules.
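Claims 3 and 4 distinguish searching several WFST modules independently versus synchronously with per-module weights. The weighted combination can be sketched by reducing each WFST to a path-scoring function; everything below (module functions, weights, path data) is a hypothetical illustration:

```python
# Sketch of claim 4's weighted synchronized search: candidate paths are
# scored by a log-linear combination of several modules' scores.
# The "WFST modules" are reduced to plain scoring functions here.

def combined_score(path, modules, weights):
    """Log-linear combination of per-module path scores."""
    return sum(w * m(path) for m, w in zip(modules, weights))

def search(paths, modules, weights, n=2):
    """Return the n best paths under the combined score."""
    return sorted(paths, key=lambda p: combined_score(p, modules, weights),
                  reverse=True)[:n]

# Toy modules: a general model and a user-specific contact-list model.
general = lambda p: -len(p)                      # prefers shorter hypotheses
contact = lambda p: 2.0 if "john" in p else 0.0  # rewards known contacts

best = search([["call", "jon"], ["call", "john"]],
              [general, contact], weights=[1.0, 0.5])
```

Adjusting the weights shifts how strongly the customized module influences the ranking, which is the role the per-module weights play in the claim.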
5. The speech recognition method according to any one of claims 1 to 4, wherein after the step of extracting, according to the evaluation result, the best path matching the user model from the plurality of best paths as the target best path and determining the speech recognition result of the voice signal according to the target best path, the method further comprises:
if it is detected that the speech recognition result contains new contact information, a new self-coined phrase and/or new characteristic language information, updating the user model according to the new contact information, the new self-coined phrase and/or the new characteristic language information.
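The update step of claim 5 can be sketched as folding out-of-vocabulary items from a recognition result back into the user model. The detection rule and the model structure below are assumptions, not the patent's:

```python
# Sketch of claim 5: when the recognition result contains words the user
# model does not yet know (e.g. a new contact name or self-coined
# phrase), add them to the model so later recognitions can match them.

def update_user_model(user_model, result_words, known_vocab):
    """Add any out-of-vocabulary words from the result to the model."""
    new_words = [w for w in result_words if w not in known_vocab]
    user_model.setdefault("custom_words", set()).update(new_words)
    return new_words

model = {"custom_words": set()}
added = update_user_model(model, ["call", "xiaoming"], {"call"})
```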
6. The speech recognition method according to claim 3 or 4, wherein the plurality of WFST modules include a customized WFST module, the customized WFST module being obtained through the following steps:
collecting configured words, phrases and syntactic information;
performing word segmentation on the configured words and phrases using a dictionary;
performing statistical training on the syntactic information to obtain a corresponding language model; and
compiling the customized WFST module according to the word segmentation result and the language model.
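The first two build steps of claim 6 — dictionary-based word segmentation and statistical language-model training — can be sketched with a greedy longest-match segmenter and raw bigram counts. The segmentation strategy, the bigram order, and all names are illustrative assumptions; the final WFST compilation step is not shown:

```python
from collections import Counter

# Sketch of claim 6's build pipeline for a customized WFST module:
# segment configured phrases with a dictionary, then gather n-gram
# statistics as raw material for a language model.

def segment(text, dictionary):
    """Greedy longest-match segmentation (toy stand-in for a real
    segmenter); unmatched single characters pass through as-is."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def train_bigrams(sentences):
    """Count bigram statistics, including sentence boundary markers."""
    counts = Counter()
    for s in sentences:
        for a, b in zip(["<s>"] + s, s + ["</s>"]):
            counts[(a, b)] += 1
    return counts

words = segment("abcd", {"ab", "cd"})
counts = train_bigrams([words])
```

In a full build, the normalized counts would become a language model that is then compiled, together with the lexicon, into the customized WFST module.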
7. A speech recognition method, comprising the steps of:
sending a voice signal to a server;
obtaining a plurality of best paths fed back by the server after decoding the voice signal;
evaluating the plurality of best paths according to a pre-trained user model; and
according to the evaluation result, extracting from the plurality of best paths a best path that matches the user model as a target best path, and determining a speech recognition result of the voice signal according to the target best path.
8. A speech recognition apparatus, comprising:
a voice obtaining module, configured to obtain a voice signal;
a decoding module, configured to decode the voice signal to obtain a plurality of best paths;
a first evaluation module, configured to evaluate the plurality of best paths according to a pre-trained user model; and
a first result module, configured to extract, according to the evaluation result, a best path matching the user model from the plurality of best paths as a target best path, and to determine a speech recognition result of the voice signal according to the target best path.
9. A speech recognition apparatus, comprising:
a voice sending module, configured to send a voice signal to a server;
a word-sequence obtaining module, configured to obtain a plurality of best paths fed back by the server after decoding the voice signal;
a second evaluation module, configured to evaluate the plurality of best paths according to a pre-trained user model; and
a second result module, configured to extract, according to the evaluation result, a best path matching the user model from the plurality of best paths as a target best path, and to determine a speech recognition result of the voice signal according to the target best path.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 7.
11. A speech recognition device, comprising a memory and a processor, the memory storing a computer program, wherein the computer program, when executed by the processor, implements the steps of the speech recognition method according to any one of claims 1 to 7.
12. A speech recognition system, comprising a server and a terminal, wherein:
the terminal is configured to send a voice signal to the server;
the server is configured to decode the voice signal to obtain a plurality of best paths; and
the terminal is further configured to evaluate the plurality of best paths according to a pre-trained user model, and, according to the evaluation result, to extract from the plurality of best paths a best path matching the user model as a target best path and determine a speech recognition result of the voice signal according to the target best path.
13. The speech recognition system according to claim 12, wherein the terminal is further configured to: if it is detected that the speech recognition result contains new contact information, a new self-coined phrase and/or new characteristic language information, update the user model according to the new contact information, the new self-coined phrase and/or the new characteristic language information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810677565.1A CN108831439B (en) | 2018-06-27 | 2018-06-27 | Voice recognition method, device, equipment and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810677565.1A CN108831439B (en) | 2018-06-27 | 2018-06-27 | Voice recognition method, device, equipment and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108831439A true CN108831439A (en) | 2018-11-16 |
CN108831439B CN108831439B (en) | 2023-04-18 |
Family
ID=64139035
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810677565.1A Active CN108831439B (en) | 2018-06-27 | 2018-06-27 | Voice recognition method, device, equipment and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108831439B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101447183A (en) * | 2007-11-28 | 2009-06-03 | 中国科学院声学研究所 | Processing method of high-performance confidence level applied to speech recognition system |
CN101510222A (en) * | 2009-02-20 | 2009-08-19 | 北京大学 | Multilayer index voice document searching method and system thereof |
CN104217717A (en) * | 2013-05-29 | 2014-12-17 | 腾讯科技(深圳)有限公司 | Language model constructing method and device |
CN107195296A (en) * | 2016-03-15 | 2017-09-22 | 阿里巴巴集团控股有限公司 | A kind of audio recognition method, device, terminal and system |
US20180068653A1 (en) * | 2016-09-08 | 2018-03-08 | Intel IP Corporation | Method and system of automatic speech recognition using posterior confidence scores |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109524017A (en) * | 2018-11-27 | 2019-03-26 | 北京分音塔科技有限公司 | A kind of the speech recognition Enhancement Method and device of user's custom words |
CN111326147B (en) * | 2018-12-12 | 2023-11-17 | 北京嘀嘀无限科技发展有限公司 | Speech recognition method, device, electronic equipment and storage medium |
CN111326147A (en) * | 2018-12-12 | 2020-06-23 | 北京嘀嘀无限科技发展有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN109785858A (en) * | 2018-12-14 | 2019-05-21 | 平安普惠企业管理有限公司 | A kind of contact person's adding method, device, readable storage medium storing program for executing and terminal device |
CN109785858B (en) * | 2018-12-14 | 2024-02-23 | 深圳市兴海物联科技有限公司 | Contact person adding method and device, readable storage medium and terminal equipment |
CN111415653A (en) * | 2018-12-18 | 2020-07-14 | 百度在线网络技术(北京)有限公司 | Method and apparatus for recognizing speech |
CN111415653B (en) * | 2018-12-18 | 2023-08-01 | 百度在线网络技术(北京)有限公司 | Method and device for recognizing speech |
CN112151020A (en) * | 2019-06-28 | 2020-12-29 | 北京声智科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN110349569A (en) * | 2019-07-02 | 2019-10-18 | 苏州思必驰信息科技有限公司 | The training and recognition methods of customized product language model and device |
CN110688468A (en) * | 2019-08-28 | 2020-01-14 | 北京三快在线科技有限公司 | Method and device for outputting response message, electronic equipment and readable storage medium |
CN110688855A (en) * | 2019-09-29 | 2020-01-14 | 山东师范大学 | Chinese medical entity identification method and system based on machine learning |
CN110992932A (en) * | 2019-12-18 | 2020-04-10 | 睿住科技有限公司 | Self-learning voice control method, system and storage medium |
CN110992932B (en) * | 2019-12-18 | 2022-07-26 | 广东睿住智能科技有限公司 | Self-learning voice control method, system and storage medium |
CN110728133A (en) * | 2019-12-19 | 2020-01-24 | 北京海天瑞声科技股份有限公司 | Individual corpus acquisition method and individual corpus acquisition device |
CN111128183A (en) * | 2019-12-19 | 2020-05-08 | 北京搜狗科技发展有限公司 | Speech recognition method, apparatus and medium |
CN110728133B (en) * | 2019-12-19 | 2020-05-05 | 北京海天瑞声科技股份有限公司 | Individual corpus acquisition method and individual corpus acquisition device |
WO2021120690A1 (en) * | 2019-12-19 | 2021-06-24 | 北京搜狗科技发展有限公司 | Speech recognition method and apparatus, and medium |
CN111145756A (en) * | 2019-12-26 | 2020-05-12 | 北京搜狗科技发展有限公司 | Voice recognition method and device for voice recognition |
CN111145756B (en) * | 2019-12-26 | 2022-06-14 | 北京搜狗科技发展有限公司 | Voice recognition method and device for voice recognition |
CN111081262A (en) * | 2019-12-30 | 2020-04-28 | 杭州中科先进技术研究院有限公司 | Lightweight speech recognition system and method based on customized model |
CN111508501B (en) * | 2020-07-02 | 2020-09-29 | 成都晓多科技有限公司 | Voice recognition method and system with accent for telephone robot |
CN111508501A (en) * | 2020-07-02 | 2020-08-07 | 成都晓多科技有限公司 | Voice recognition method and system with accent for telephone robot |
CN111968648B (en) * | 2020-08-27 | 2021-12-24 | 北京字节跳动网络技术有限公司 | Voice recognition method and device, readable medium and electronic equipment |
CN111968648A (en) * | 2020-08-27 | 2020-11-20 | 北京字节跳动网络技术有限公司 | Voice recognition method and device, readable medium and electronic equipment |
CN113247730B (en) * | 2021-06-10 | 2022-11-08 | 浙江新再灵科技股份有限公司 | Elevator passenger screaming detection method and system based on multi-dimensional features |
CN113247730A (en) * | 2021-06-10 | 2021-08-13 | 浙江新再灵科技股份有限公司 | Elevator passenger screaming detection method and system based on multi-dimensional features |
CN113436614A (en) * | 2021-07-02 | 2021-09-24 | 科大讯飞股份有限公司 | Speech recognition method, apparatus, device, system and storage medium |
CN113436614B (en) * | 2021-07-02 | 2024-02-13 | 中国科学技术大学 | Speech recognition method, device, equipment, system and storage medium |
CN114242046A (en) * | 2021-12-01 | 2022-03-25 | 广州小鹏汽车科技有限公司 | Voice interaction method and device, server and storage medium |
CN114242046B (en) * | 2021-12-01 | 2022-08-16 | 广州小鹏汽车科技有限公司 | Voice interaction method and device, server and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108831439B (en) | 2023-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108831439A (en) | Audio recognition method, device, equipment and system | |
CN108899013A (en) | Voice search method, device and speech recognition system | |
US9753914B2 (en) | Natural expression processing method, processing and response method, device, and system | |
US11645547B2 (en) | Human-machine interactive method and device based on artificial intelligence | |
US7904297B2 (en) | Dialogue management using scripts and combined confidence scores | |
CN101010934B (en) | Method for machine learning | |
CN109151218A (en) | Call voice quality detecting method, device, computer equipment and storage medium | |
CN109977207A (en) | Talk with generation method, dialogue generating means, electronic equipment and storage medium | |
US8165887B2 (en) | Data-driven voice user interface | |
CN108829757A (en) | A kind of intelligent Service method, server and the storage medium of chat robots | |
CN107704482A (en) | Method, apparatus and program | |
CN106935239A (en) | The construction method and device of a kind of pronunciation dictionary | |
CN110381221B (en) | Call processing method, device, system, equipment and computer storage medium | |
CN110517664A (en) | Multi-party speech recognition methods, device, equipment and readable storage medium storing program for executing | |
CN106230689A (en) | Method, device and the server that a kind of voice messaging is mutual | |
CN105845133A (en) | Voice signal processing method and apparatus | |
CN112767910A (en) | Audio information synthesis method and device, computer readable medium and electronic equipment | |
US20230127787A1 (en) | Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium | |
CN107910004A (en) | Voiced translation processing method and processing device | |
CN111128175B (en) | Spoken language dialogue management method and system | |
CN114818649A (en) | Service consultation processing method and device based on intelligent voice interaction technology | |
Davies et al. | The IBM conversational telephony system for financial applications. | |
CN108491379A (en) | Shortcut key recognition methods, device, equipment and computer readable storage medium | |
CN111862970A (en) | False propaganda treatment application method and device based on intelligent voice robot | |
CN110781329A (en) | Image searching method and device, terminal equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||