CN108831439A - Audio recognition method, device, equipment and system - Google Patents
- Publication number: CN108831439A
- Application number: CN201810677565.1A
- Authority: CN (China)
- Prior art keywords: voice signal, paths, speech recognition, module, WFST
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
All G10L entries below fall under G — Physics; G10 — Musical instruments; acoustics; G10L — Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding; G10L15/00 — Speech recognition.
- G10L15/06 — Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L15/08 — Speech classification or search
- G10L15/26 — Speech-to-text systems
- Y02T10/40 — Engine management systems (under Y — General tagging of new technological developments; Y02T — Climate change mitigation technologies related to transportation)
Abstract
The present invention discloses a speech recognition method comprising the steps of: obtaining a voice signal; decoding the voice signal to obtain multiple optimal paths; evaluating the multiple optimal paths according to a pre-trained user model; and, according to the evaluation result, extracting from the multiple optimal paths the optimal path that matches the user model as the target optimal path, and determining the speech recognition result of the voice signal from the target optimal path. A speech recognition apparatus, a speech recognition device and a speech recognition system are also disclosed. Decoding the voice signal into multiple optimal paths and invoking a user model to evaluate them before producing the final recognition result solves the problem of low recognition accuracy in traditional speech recognition technology and greatly improves the accuracy of the recognition result. Besides its higher recognition accuracy, the system can also effectively improve the security of users' personal information.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speech recognition method, apparatus, device and system.
Background art
With the rapid development of intelligent interaction technology and the continuous expansion of market demand, speech recognition technology has advanced tremendously in recent years and is now widely applied in many fields. As its name suggests, speech recognition identifies an input voice signal and converts it into text information that a computer can process. Speech recognition enables intelligent voice interaction in numerous application scenarios, such as voice assistants and voice-based intelligent control.
In a traditional speech recognition scheme, the system performs feature extraction after receiving the voice signal, classifies the signal based on the extracted features, and then decodes with a weighted finite-state transducer (WFST) to output the recognition result. However, the recognition accuracy of traditional speech recognition technology is still not high.
Summary of the invention
Based on this, the present invention provides a speech recognition method, a speech recognition apparatus, a speech recognition device and a speech recognition system.
To achieve the above object, in one aspect, an embodiment of the present invention provides a speech recognition method comprising the steps of:
obtaining a voice signal;
decoding the voice signal to obtain multiple optimal paths;
evaluating the multiple optimal paths according to a pre-trained user model;
according to the evaluation result, extracting from the multiple optimal paths the optimal path that matches the user model as the target optimal path, and determining the speech recognition result of the voice signal from the target optimal path.
In one embodiment, decoding the voice signal to obtain multiple optimal paths includes the following steps:
performing feature extraction on the voice signal to obtain corresponding acoustic feature information;
classifying the voice signal into classes, and determining the corresponding class probabilities, via a pre-built acoustic model according to the acoustic feature information;
performing a forward search based on pre-built WFST modules according to the classified voice signal and the corresponding class probabilities, to obtain the multiple optimal paths.
In one embodiment, the step of performing a forward search based on pre-built WFST modules according to the classified voice signal and the corresponding class probabilities includes:
performing an independent forward search with each of multiple pre-built WFST modules, to obtain multiple optimal paths corresponding to the multiple WFST modules.
In one embodiment, the step of performing a forward search based on pre-built WFST modules according to the classified voice signal and the corresponding class probabilities further includes:
performing a synchronized forward search based on the multiple pre-built WFST modules and their corresponding weights, to obtain multiple optimal paths corresponding to the multiple WFST modules. This keeps recognition accuracy high while greatly increasing recognition speed.
In one embodiment, after the step of extracting, according to the evaluation result, the optimal path that matches the user model from the multiple optimal paths as the target optimal path and determining the speech recognition result of the voice signal, the method further includes:
if the speech recognition result is detected to contain new contact information, a new self-created phrase and/or new characteristic language information, updating the user model according to that new contact information, new self-created phrase and/or new characteristic language information.
In one embodiment, the multiple WFST modules include a customized WFST module, obtained by the following steps:
collecting specified words, phrases and syntactic information;
performing word segmentation on the specified words and phrases with a dictionary;
performing statistical training on the syntactic information to obtain a corresponding language model;
compiling the customized WFST module from the segmentation result and the language model. Incorporating a customized WFST module in this way can further improve recognition accuracy.
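As an illustrative sketch (not part of the patent), the statistical-training step above could amount to estimating an n-gram language model from the segmented phrases; a real customized WFST would then be compiled from such a model with a finite-state toolkit. All names and data here are hypothetical.

```python
from collections import Counter

def train_bigram_lm(segmented_sentences):
    """Estimate a toy bigram language model from pre-segmented sentences.

    Stands in for the 'statistical training' step; compiling the result
    into a WFST (e.g. with OpenFst) is omitted.
    """
    unigrams, bigrams = Counter(), Counter()
    for words in segmented_sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))

    def prob(word, prev):
        # Maximum-likelihood bigram probability with add-one smoothing.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

    return prob

prob = train_bigram_lm([["turn", "on", "the", "tv"],
                        ["turn", "off", "the", "tv"]])
# "on" and "off" are equally likely after "turn" in this toy corpus
assert abs(prob("on", "turn") - prob("off", "turn")) < 1e-9
```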
On the other hand, an embodiment of the present invention also provides a speech recognition method comprising the steps of:
sending a voice signal to a server;
obtaining the multiple optimal paths fed back by the server after it decodes the voice signal;
evaluating the multiple optimal paths according to a pre-trained user model;
according to the evaluation result, extracting from the multiple optimal paths the optimal path that matches the user model as the target optimal path, and determining the speech recognition result of the voice signal from the target optimal path.
In another aspect, an embodiment of the present invention provides a speech recognition apparatus, including:
a voice obtaining module for obtaining a voice signal;
a decoding module for decoding the voice signal to obtain multiple optimal paths;
a first evaluation module for evaluating the multiple optimal paths according to a pre-trained user model;
a first result obtaining module for extracting, according to the evaluation result, the optimal path that matches the user model from the multiple optimal paths as the target optimal path, and determining the speech recognition result of the voice signal from the target optimal path.
In another aspect, an embodiment of the present invention also provides a speech recognition apparatus, including:
a voice sending module for sending a voice signal to a server;
a word-sequence obtaining module for obtaining the multiple optimal paths fed back by the server after it decodes the voice signal;
a second evaluation module for evaluating the multiple optimal paths according to a pre-trained user model;
a second result obtaining module for extracting, according to the evaluation result, the optimal path that matches the user model from the multiple optimal paths as the target optimal path, and determining the speech recognition result of the voice signal from the target optimal path.
In another aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of any of the above speech recognition methods.
In another aspect, an embodiment of the present invention provides a speech recognition device including a memory and a processor, the memory storing a computer program which, when executed by the processor, implements any of the above speech recognition methods.
In another aspect, an embodiment of the present invention also provides a speech recognition system including a server and a terminal:
the terminal is configured to send a voice signal to the server;
the server is configured to decode the voice signal to obtain multiple optimal paths;
the terminal is further configured to evaluate the multiple optimal paths according to a pre-trained user model, and, according to the evaluation result, to extract from the multiple optimal paths the optimal path that matches the user model as the target optimal path and determine the speech recognition result of the voice signal from the target optimal path.
In one embodiment, the terminal is further configured to: if the speech recognition result is detected to contain new contact information, a new self-created phrase and/or new characteristic language information, update the user model according to that new contact information, new self-created phrase and/or new characteristic language information.
The above technical solutions have the following advantages and beneficial effects:
a pre-trained user model is invoked to evaluate the multiple optimal paths output by the WFST modules, and according to the evaluation result the optimal path that matches the user model is extracted from the multiple optimal paths as the target optimal path, from which the speech recognition result of the voice signal is determined. The resulting speech recognition covers as many voice-interaction application scenarios and fields as possible and effectively incorporates the user's speech characteristics, so the recognition result is closer to the user's actual application scene and recognition accuracy is greatly improved.
Brief description of the drawings
Fig. 1 is a flow diagram of the speech recognition method of one embodiment;
Fig. 2 is a flow diagram of optimal-path acquisition in one embodiment;
Fig. 3 is a brief flow diagram of customized-decoder construction in one embodiment;
Fig. 4 is a first schematic diagram of a speech recognition process of one embodiment;
Fig. 5 is a second schematic diagram of a speech recognition process of one embodiment;
Fig. 6 is a flow diagram of another speech recognition method of one embodiment;
Fig. 7 is a block diagram of a first speech recognition apparatus of one embodiment;
Fig. 8 is a structural diagram of the decoding module of one embodiment;
Fig. 9 is a block diagram of a second speech recognition apparatus of one embodiment;
Fig. 10 is a structural diagram of the speech recognition system of one embodiment;
Fig. 11 is a first timing diagram of the speech recognition process of one embodiment;
Fig. 12 is a second timing diagram of the speech recognition process of one embodiment.
Detailed description of the embodiments
The contents of the present invention are described in further detail below with reference to preferred embodiments and the accompanying drawings. Obviously, the embodiments described below only explain the invention and do not limit it. All other embodiments obtained by those of ordinary skill in the art, based on the embodiments of the present invention and without creative work, fall within the scope of protection of the present invention.
Speech recognition technology, also called automatic speech recognition (ASR), has the task of converting the vocabulary content of human speech into computer-readable text. It is a comprehensive technology involving multiple disciplines, such as speech production and hearing mechanisms, signal processing, probability and information theory, pattern recognition, and artificial intelligence. Currently, mainstream large-vocabulary speech recognition systems generally adopt recognition techniques based on statistical models. Speech recognition technology is usually carried by a speech recognition system, whose main body typically comprises a server and a terminal: the voice signal is generally input through the terminal and sent to the server, and the server performs speech recognition on it and returns the corresponding result. The terminal may, for example, be a smartphone: the user speaks a passage into the phone, the phone sends the input voice to the server for speech recognition and receives the recognition result back, so the user finally sees on the phone a passage of text corresponding to the input voice, or the phone executes a corresponding control operation after displaying the text, such as opening a corresponding application. Besides this, the terminal may also be any of various smart devices, such as a smart TV, a tablet, or various other smart appliances, smart office equipment, and so on.
However, in the course of realizing the technical solutions of the embodiments of the present invention, the inventors found that, as application requirements keep increasing, the recognition methods of traditional speech recognition technology still suffer from low recognition accuracy. To this end, referring to Fig. 1, a speech recognition method is provided, comprising the following steps.
S10: obtain a voice signal.
The voice signal may be a user-input voice signal that the server obtains from a terminal. The terminal may be, but is not limited to, a smartphone, tablet computer, smart TV, intelligent robot, interactive whiteboard, smart wearable device or smart medical device, and may also be another kind of smart appliance, an automobile, and so on.
S12: decode the voice signal to obtain multiple optimal paths.
Decoding may be performed on the voice signal by a pre-built search module. An optimal path may be a search path, among those output by decoding, that meets a requirement, for example the search path corresponding to the highest-weight decoding result.
In some embodiments, the pre-built search module may be a WFST module, i.e. the search component of a decoder, where a decoder is the software program (e.g. a mobile application or server program) or hardware device (e.g. a standalone voice translator) that decodes an input audio signal into corresponding text. The multiple optimal paths can usually be obtained directly from multiple WFST modules, or from the word lattice information output by decoding: word lattice information, i.e. a word lattice, is a representation of the decoding result that contains multiple optimal paths.
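As a purely illustrative sketch of the word-lattice idea (not part of the patent), a lattice can be viewed as a weighted DAG of word hypotheses from which the best few paths are extracted. Real decoders do this far more efficiently; the exhaustive enumeration and all names below are hypothetical.

```python
import heapq

def n_best_paths(lattice, start, end, n=3):
    """Enumerate the n highest-scoring paths through a toy word lattice.

    The lattice is a DAG: {node: [(next_node, word, log_prob), ...]}.
    """
    paths = []

    def walk(node, words, score):
        if node == end:
            paths.append((score, words))
            return
        for nxt, word, logp in lattice.get(node, []):
            walk(nxt, words + [word], score + logp)

    walk(start, [], 0.0)
    return heapq.nlargest(n, paths)  # highest total log-probability first

lattice = {
    0: [(1, "recognize", -0.2), (1, "wreck a nice", -1.0)],
    1: [(2, "speech", -0.1), (2, "beach", -0.9)],
}
best = n_best_paths(lattice, start=0, end=2, n=2)
assert best[0][1] == ["recognize", "speech"]
```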
The pre-built WFST modules in this embodiment may be individual WFST modules, each built from the acoustic model, pronunciation dictionary and language model corresponding to a predetermined field, a predetermined scene or a set language pattern, or may be a general WFST module combined from those individual modules. A predetermined field may be a discipline, a commodity category or another specific field; each predetermined field usually has distinctive vocabulary such as the field's common words and professional terms, and its corresponding pronunciation habits or stresses will differ. A predetermined scene may, for example, be one of the various living or working scenes a user is often in, each likewise having its characteristic speech. A set language pattern may be the user's own speech or pronunciation habits, i.e. a language pattern representing the user's personal features, such as the user's accent and idioms.
Specifically, the server may invoke the individually pre-built WFST modules, or the general WFST module combined from them, to decode the voice signal and output the multiple optimal paths. At this point, the server has completed the WFST search and obtained multiple preliminary speech recognition results with different probabilities. The individual WFST modules or the general WFST module may be built or combined with methods common in this field, which this specification does not limit.
In other embodiments, the optimal paths may be obtained with other kinds of existing search modules, which are not detailed here.
S14: evaluate the multiple optimal paths according to a pre-trained user model.
S16: according to the evaluation result, extract from the multiple optimal paths the optimal path that matches the user model as the target optimal path, and determine the speech recognition result of the voice signal from the target optimal path.
The user model may be a statistical model reflecting a user's personal features, generally obtained by training on user data collected in advance. The user model may be pre-trained on the required user data with any of the common techniques of this field; this specification does not limit the training method.
It will be appreciated that the server or the terminal may invoke the pre-trained user model to evaluate the multiple optimal paths obtained above, so that after evaluation each optimal path can be assigned a corresponding evaluation index, for example a score for its closeness to the user's personal features, or a combined score of that closeness and the path's decoding weight. The server or terminal may then, but is not limited to, extract from the multiple optimal paths the one with the highest user-model matching score as the target optimal path, and determine the speech recognition result of the voice signal from the target optimal path.
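The combined-score idea in S14-S16 can be sketched as follows (an illustrative toy, not the patent's actual scoring; the interpolation weight `alpha` and the phrase-based user model are assumptions).

```python
def rescore(paths, user_model_score, alpha=0.7):
    """Pick the target optimal path by combining each path's decoder
    weight with a user-model score.

    `paths` is a list of (words, decoder_log_weight);
    `user_model_score(words)` returns a log-score for closeness to the
    user's personal features.
    """
    def combined(item):
        words, log_w = item
        return alpha * log_w + (1 - alpha) * user_model_score(words)

    return max(paths, key=combined)

# Toy user model: reward paths containing the user's self-created phrase.
user_phrases = {"my bae"}

def user_model_score(words):
    return 0.0 if user_phrases & set(words) else -2.0

paths = [(["call", "my", "bay"], -0.4), (["call", "my bae"], -0.6)]
best_words, _ = rescore(paths, user_model_score)
assert best_words == ["call", "my bae"]  # user model outweighs decoder weight
```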
With the speech recognition method of the above embodiment, the pre-trained user model is invoked to evaluate the multiple optimal paths, and the speech recognition result with the highest user-model matching degree, i.e. the one that best fits the user's actual situation, is obtained from the multiple optimal paths.
Moreover, combining the construction of WFST modules with evaluation by the pre-trained user model adapts effectively to complex and changing speech-communication scenes, takes into account the various fields and speaking habits covered by the content of the user's speech, and comes closer to the user's actual application scene; recognition accuracy is greatly improved, effectively avoiding the low recognition accuracy of traditional speech recognition technology.
In one embodiment, the speech recognition result may be a word sequence, or a control instruction corresponding to a word sequence. A word sequence may be a character string, with corresponding probability and network structure, that corresponds to the target optimal path, specifically the text information obtained after the voice information is decoded and searched. After receiving the speech recognition result, the terminal may display the text, or may execute a corresponding control operation. For example, when the terminal is a mobile phone, the user can speak a passage into the phone, and the server in the background quickly and accurately converts what the user said into text and displays it. Or, when the terminal is for example a television, the user can speak a voice command at it; the background server quickly and accurately recognizes the command, obtains the corresponding control instruction and returns it to the television, causing the television to execute the corresponding control operation, such as switching programs.
In one embodiment, the user model of the above embodiments may be trained on the user's associated contact information, self-created phrases and/or characteristic language information. The associated contact information may be uploaded from the user's terminal in advance, or obtained when the terminal automatically synchronizes its contacts to the server. Self-created phrases may be phrases the user creates in various ways during routine use of the terminal, for example phrases created by typing text, or phrases extracted from voice information input into the terminal; a self-created phrase generally does not exist in any existing dictionary but is first created by the user. Characteristic language information may include information characterizing the user's speech habits and voice usage habits, for example the user's pronunciation, average speaking rate, pet phrases, or other information characterizing the user's voice features. By periodically or continuously collecting the user's speech-characteristic information for user-model training, a user model matching the user's true situation as closely as possible is obtained, thereby ensuring the accuracy improvement of the speech recognition results.
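The three data sources named above (contacts, self-created phrases, characteristic language information) could feed a user model as simply as a word-frequency profile; this toy sketch, with entirely hypothetical names and data, also shows the incremental update described later in the text.

```python
from collections import Counter

def train_user_model(contacts, self_created_phrases, transcripts):
    """Build a toy user model as a word-frequency profile from the three
    data sources named in the text (all inputs are illustrative)."""
    profile = Counter()
    for source in (contacts, self_created_phrases, transcripts):
        for entry in source:
            profile.update(entry.lower().split())
    return profile

def update_user_model(profile, new_entries):
    """Incremental update when new contacts or phrases are detected."""
    for entry in new_entries:
        profile.update(entry.lower().split())
    return profile

model = train_user_model(["Alice Zhang"], ["holibobs"], ["call alice please"])
model = update_user_model(model, ["Bob Li"])
assert model["alice"] == 2 and model["bob"] == 1
```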
It should be noted that the steps of the speech recognition method in this specification may be executed partly on the terminal and partly on the server, or entirely on the terminal, for example for offline speech recognition; the description of steps as executed by the server is therefore an exemplary execution mode, not the only one.
Referring to Fig. 2, in one embodiment step S12 may specifically include the following steps.
S122: perform feature extraction on the voice signal to obtain corresponding acoustic feature information.
It will be appreciated that, after obtaining the voice signal, the server performs feature extraction on it to obtain its acoustic feature information. The server may use conventional techniques of this field for feature extraction; the embodiments of this specification do not limit the method used for the acoustic-feature extraction process. For example, any one of linear prediction cepstrum coefficients (LPCC: Linear Prediction Cepstrum Coefficient), Mel-frequency cepstrum coefficients (MFCC: Mel Frequency Cepstrum Coefficient), perceptual linear prediction (PLP: Perceptual Linear Predictive) and Mel-scale filter banks (FBANK: Mel-scale Filter Bank) may be used.
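The front end shared by the MFCC/FBANK methods listed above (framing, windowing, per-frame spectrum) can be sketched as follows; mel filtering and the DCT are omitted, and the 25 ms / 10 ms parameters are conventional assumptions, not taken from the patent.

```python
import numpy as np

def frame_power_spectra(signal, frame_len=400, hop=160):
    """Slice a waveform into overlapping frames and compute each frame's
    power spectrum (25 ms frames / 10 ms hop at 16 kHz)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft -> magnitude squared gives the per-frame power spectrum
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)       # 1 s of fake 16 kHz audio
spectra = frame_power_spectra(speech)
assert spectra.shape == (98, 201)         # 98 frames x 201 frequency bins
```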
S124: classify the voice signal into classes, and determine the corresponding class probabilities, via the pre-built acoustic model according to the acoustic feature information.
The acoustic model may be built in advance with conventional methods of this field; this specification does not limit the method of building the acoustic model. For example, the acoustic model may be built with any of convolutional neural networks, recurrent neural networks, deep neural networks, Gaussian mixture models and long short-term memory networks.
It will be appreciated that, using the pre-built acoustic model, the server can perform classification on the voice signal according to the previously obtained acoustic feature information, dividing the voice signal into a fixed number of classes in combination with indices such as the set class count, and give the corresponding class probability of each class. Generally, each classification search path in the acoustic model carries a corresponding weight (probability), and by merging the respective weights of the classification paths, the class probability of a class is obtained together with the classification output. For example, a certain frame of the voice signal may be assigned to class A with probability 0.8 and to class B with probability 0.4. The fixed number of classes may be, for example, 3000 to 10000, determined according to the various sub-classes of the common scenes where the speech recognition technology is applied; for example, class A may be mobile phones, class B televisions and class C electronic thermometers.
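For a neural acoustic model of the kinds listed above, the per-frame class probabilities are typically produced by a softmax output layer; a minimal sketch (illustrative only):

```python
import numpy as np

def frame_class_probs(logits):
    """Turn per-frame acoustic-model scores into class probabilities via
    softmax. `logits` has shape (n_frames, n_classes)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1]])       # one frame, three classes
probs = frame_class_probs(logits)
assert probs.argmax() == 0                 # class 0 is most probable
assert abs(probs.sum() - 1.0) < 1e-9      # probabilities sum to one
```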
S126: perform a forward search based on the pre-built WFST modules according to the classified voice signal and the corresponding class probabilities, to obtain the multiple optimal paths.
Specifically, the server may perform the forward search based on the multiple pre-built WFST modules, or on one general WFST module, to obtain the multiple optimal paths corresponding to the predetermined fields, predetermined scenes and set language patterns. Through this decoding step, multiple optimal-path outputs that effectively cover as many voice-interaction application scenarios and fields as possible are obtained quickly, giving strong applicability.
In one embodiment, step S126 may specifically include the following step:
performing an independent forward search with each of the multiple pre-built WFST modules, to obtain multiple optimal paths corresponding to the respective WFST modules.
It will be appreciated that, during the decoding search, the server may have each WFST module of each field, each scene and/or each set language pattern perform an independent forward search according to the classified voice signal and the corresponding class probabilities, obtaining the optimal path output by each of the multiple WFST modules. One WFST module may correspond to one optimal path, and each optimal path generally carries its own weight. By performing independent forward searches over the multiple WFST modules to obtain multiple optimal paths, accurate recognition results can be ensured in each field, each scene and/or each set language pattern.
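The independent-search variant can be sketched with a greedy toy decoder run over several domain-specific "modules" (the `state -> {class_id: (next_state, word)}` structure is an assumption for illustration, not the real WFST format):

```python
def best_path(wfst, frame_probs):
    """Greedy stand-in for one module's independent forward search: at
    each frame, follow the arc whose label has the highest class
    probability. Returns the word sequence and its path score."""
    state, words, score = 0, [], 1.0
    for probs in frame_probs:
        arcs = wfst[state]
        cls = max(arcs, key=lambda c: probs[c])
        state, word = arcs[cls]
        words.append(word)
        score *= probs[cls]
    return words, score

# Two domain-specific modules searched independently over the same frames.
phone_wfst = {0: {0: (0, "dial"), 1: (0, "hang-up")}}
tv_wfst = {0: {0: (0, "volume"), 1: (0, "channel")}}
frames = [{0: 0.9, 1: 0.1}]
results = [best_path(m, frames) for m in (phone_wfst, tv_wfst)]
assert [w for w, _ in results] == [["dial"], ["volume"]]
```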
In one of the embodiments, step S126 may alternatively be: based on the multiple pre-built WFST modules and their corresponding weights, perform a synchronized forward search to obtain multiple best paths corresponding to the multiple WFST modules.
It will be appreciated that the server can feed the classified voice signal and the corresponding classification probabilities into the multiple WFST modules simultaneously, bringing each WFST module's own weight into the search process in combination with the Viterbi algorithm. For example, using the Viterbi algorithm together with each module's weight, the WFST modules search forward in step, and the paths produced during the search undergo unified threshold pruning: paths whose probability falls below a set threshold are pruned away, and only a limited number of better paths are retained to continue the forward search, until multiple best paths are finally output. Each WFST module can obtain its own weight when it is generated, for example the weight of the voice signal in the field that the module covers. In this way, during the synchronized forward search each WFST module can output a best path carrying its own weight according to that weight's magnitude, effectively reducing the time consumed by the search. In the subsequent user-model evaluation, the server or terminal can fold these weights into an overall evaluation, so that recognition accuracy is improved at the same time as recognition speed.
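The unified threshold pruning described for the synchronized search might look, in a minimal sketch, like the following; the threshold, beam size and hypothesis tuples are invented for illustration.

```python
def beam_prune(hypotheses, threshold, beam_size):
    """Unified threshold pruning: drop partial paths whose probability
    falls below `threshold`, then keep only the `beam_size` best
    survivors to continue the forward search."""
    kept = [h for h in hypotheses if h[1] >= threshold]
    kept.sort(key=lambda h: h[1], reverse=True)
    return kept[:beam_size]

# Four partial paths with their probabilities at the current frame.
hyps = [("path-a", 0.9), ("path-b", 0.05), ("path-c", 0.5), ("path-d", 0.4)]
survivors = beam_prune(hyps, threshold=0.1, beam_size=2)
```

Here `path-b` is removed by the probability threshold, and of the remainder only the two best continue, which is what limits the time cost of the search.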
In one of the embodiments, after step S16 the method may further include the step: if the speech recognition result is detected to contain newly added contact information, newly added user-coined phrases and/or newly added characteristic language information, updating the user model according to that newly added contact information, newly added user-coined phrases and/or newly added characteristic language information.
Here, newly added contact information can be a contact newly added to the user's contact list, or the updated part of a contact entry changed by the user, such as a new name, new number or new address. A newly added user-coined phrase refers to a phrase the user invents during routine use of the terminal, for example a self-coined phrase that appears when the user corrects a recognition result. Newly added characteristic language information is speech-habit information the user newly forms in daily use of the terminal, for example a new accent or new wording habits formed after living in a different language environment for a long time; wording habits such as pet phrases and frequently used words can likewise be obtained from the corrections the user makes to recognition results.
It will be appreciated that when the server or terminal detects that a speech recognition result contains newly added contact information, newly added user-coined phrases and/or newly added characteristic language information, it automatically collects them and trains and updates the user model in time, ensuring that the user model stays consistent with the user's characteristics during daily use and accurately reflects the user's actual situation. Such training updates of the user model ensure the accuracy of the evaluation results produced with it.
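A minimal sketch of this update, assuming the user model is a simple dictionary of contacts and phrases (the patent does not specify its representation; the names below are invented):

```python
def update_user_model(user_model, recognition_result, known_contacts):
    """If the recognition result contains a contact name not yet in the
    user model, add it, so later evaluations reflect the user's current
    contact list."""
    for word in recognition_result.split():
        if word in known_contacts and word not in user_model["contacts"]:
            user_model["contacts"].append(word)
    return user_model

model = {"contacts": ["Alice"], "phrases": []}
model = update_user_model(model, "call Bob now", known_contacts={"Alice", "Bob"})
```

A real system would apply the same pattern to user-coined phrases and characteristic language information, and retrain the model's scores rather than just appending entries.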
Referring to Fig. 3, in one of the embodiments the multiple WFST modules of the embodiments above may comprise at least two classes of WFST modules. One class consists of conventional WFST modules (conventional relative to the customized decoder below), each built from the acoustic model, pronunciation dictionary and language model of one predetermined field, predetermined scene or set language mode, and corresponding to that field, scene or language mode. The other class consists of customized WFST modules built from special grammar that is rarely used in routine speech, from uncommon words and phrases, and from the newest emerging words and network hotspot words; new words and hotspot words can for example be the new coinages or hot words popular on the network each year, such as "I will beat, I will see, I will listen, I will buy, OMG (Oh My God)". When building a customized WFST module, the required words and phrases can be obtained by crawling related corpora from the network; the specific crawling method is not limited here, and any method common in the art can be used.
The main steps of building a customized WFST module can be the following S20 to S26:
S20, collect the set words, phrases and grammar information;
S22, perform word segmentation on the set words and phrases using a dictionary;
S24, perform statistical training on the grammar information to obtain a corresponding language model;
S26, compile the word-segmentation result and the language model into the customized WFST module.
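Steps S20 to S26 above can be sketched as follows; the greedy segmenter and bigram counter are simplified stand-ins for the dictionary-based segmentation and the statistical language-model training, not the patent's actual tooling, and the final WFST compilation (S26) is omitted.

```python
from collections import Counter

def segment(phrase, dictionary):
    """S22: greedy longest-match segmentation against a dictionary,
    falling back to single characters for unknown material."""
    words, i = [], 0
    while i < len(phrase):
        for j in range(len(phrase), i, -1):
            if phrase[i:j] in dictionary or j == i + 1:
                words.append(phrase[i:j])
                i = j
                break
    return words

def bigram_counts(sentences):
    """S24: bare-bones n-gram statistics (bigram counts) over the
    collected grammar information; a real system would smooth these."""
    counts = Counter()
    for s in sentences:
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
    return counts

# S20: hypothetical crawled hot words and example sentences.
tokens = segment("hotnewword", {"hot", "newword"})
model = bigram_counts([["I", "will", "watch"], ["I", "will", "listen"]])
```

The segmentation result (`tokens`) and the trained counts (`model`) correspond to the two inputs that S26 compiles into the customized WFST module.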
Here, the dictionary above can be the traditional pronunciation dictionary used when generating a conventional WFST module, and the statistical training of the language model can likewise use methods conventional in the art, such as an N-gram language model.
It will be appreciated that while generating the WFST modules of each field with traditional WFST generation methods, the server can collect the set words, phrases and grammar information, perform word segmentation and language-model training on them respectively, and then, from the segmentation result and the trained language model, compile a customized WFST module through a traditional decoder construction method. A customized WFST module can for example cover a sub-field such as spoken language, written language, chemistry or mathematics. By running forward searches through the conventional WFST modules and the customized WFST modules respectively, a speech recognition result of high accuracy can still be output even when the acquired voice signal contains uncommon words and phrases, popular new network words, hotspot words and phrases, or their associated grammar.
In one of the embodiments, the terminal referred to above is the terminal from which the voice signal originates, such as a mobile phone, tablet device, PDA or intelligent interactive device; it may also be another device that the voice signal is meant to control, such as a television, smart tablet or other intelligent interactive device. After the server processes the voice signal into a corresponding speech recognition result (for example, the text corresponding to the word sequence), the server determines, from the command information contained in the speech recognition result, the terminal the voice signal is directed at. In other words, after obtaining the user's voice signal, performing speech recognition and obtaining the corresponding result, the server can send the speech recognition result to the terminal corresponding to the voice signal, completing the whole speech recognition response and letting the corresponding terminal promptly perform the matching display, interaction or operation control; the server's level of integration is thereby higher.
Please refer to Figs. 4 and 5, which give simplified diagrams of the speech recognition process so that the steps in the embodiments above are easier to understand. It should be noted that, for simplicity of description, the method embodiments above are all expressed as series of combined actions; those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention certain steps can adopt other orders.
Referring to Fig. 6, another audio recognition method is also provided, including the following steps S11 to S17:
S11, send a voice signal to a server;
S13, obtain the multiple best paths the server feeds back after decoding the voice signal;
S15, evaluate the multiple best paths according to a pre-trained user model;
S17, according to the evaluation result, extract from the multiple best paths the best path matching the user model as the target best path, and determine the speech recognition result of the voice signal according to the target best path.
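Steps S11 to S17 above can be sketched from the terminal's side as follows; the fake server and scoring function are invented stand-ins for the real decoding service and the trained user model.

```python
def recognize_on_terminal(send_to_server, user_model_score, signal):
    """S11/S13: send the signal and receive candidate best paths;
    S15: score each candidate with the local user model;
    S17: pick the best-scoring path and form the recognition result."""
    candidate_paths = send_to_server(signal)
    scored = [(user_model_score(p), p) for p in candidate_paths]
    _, target = max(scored)
    return " ".join(target)

# Toy stand-ins: the "server" returns two candidates, and the user
# model prefers paths containing a known contact name ("Paul").
fake_server = lambda sig: [["call", "Paul"], ["tall", "ball"]]
score = lambda path: 1.0 if "Paul" in path else 0.1
result = recognize_on_terminal(fake_server, score, signal=b"...")
```

Because the scoring runs locally, the user model never leaves the terminal, which is the privacy benefit the text describes.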
It will be appreciated that the decoding and evaluation used in the steps above can follow the corresponding decoding and evaluation methods in the foregoing embodiments, which are not repeated here.
Specifically, after receiving the voice signal input by the user, the terminal can send it to the server responsible for decoding. After receiving the voice signal, the server decodes it, obtains multiple best paths and feeds them back to the terminal. Having received the multiple best paths returned by the server, the terminal evaluates them according to the pre-trained user model, then, according to the evaluation result, extracts from them the best path matching the user model as the target best path and determines the speech recognition result of the voice signal from it. Because the terminal itself evaluates the multiple best paths with the user model to obtain the final speech recognition result, leakage of the user's personal information contained in the user model is prevented and the security of that personal information is improved.
Referring to Fig. 7, a speech recognition apparatus 100 is provided, including a voice acquisition module 12, a decoding processing module 14, a first evaluation module 16 and a first result acquisition module 18. The voice acquisition module 12 is used for obtaining a voice signal. The decoding processing module 14 is used for decoding the voice signal to obtain multiple best paths. The first evaluation module 16 is used for evaluating the multiple best paths according to a pre-trained user model. The first result acquisition module 18 is used for extracting, according to the evaluation result, the best path matching the user model from the multiple best paths as the target best path, and determining the speech recognition result of the voice signal according to the target best path.
In this way, the technical solution of the embodiment above uses its modules, in combination with the pre-trained user model, to evaluate the multiple best paths obtained by decoding and to pick the target best path from the evaluation result as the final speech recognition result. It can effectively adapt to complex and changeable speech communication scenarios, taking into account both the various fields covered by the content of the user's speech and the user's speaking habits; being closer to the user's practical application scenario, it greatly improves recognition accuracy and effectively avoids the lower accuracy of traditional speech recognition technology.
Referring to Fig. 8, in one of the embodiments the decoding processing module 14 may include a feature extraction module 142, a classification computation module 144 and a decoding search module 146. The feature extraction module 142 is used for extracting features from the voice signal to obtain corresponding acoustic feature information. The classification computation module 144 is used for classifying the voice signal into classes through a pre-built acoustic model according to the acoustic feature information and determining the corresponding classification probabilities. The decoding search module 146 is used for performing a forward search, based on the pre-built WFST modules, on the classified voice signal and the corresponding classification probabilities to obtain multiple best paths. For the feature extraction, classification and forward search of this embodiment, refer to the corresponding methods in the audio recognition method embodiments above, which are not repeated here.
In one of the embodiments, the decoding search module 146 may include a first search module used for performing independent forward searches based on the multiple pre-built WFST modules, obtaining multiple best paths corresponding to the multiple WFST modules.
In one of the embodiments, the decoding search module 146 may include a second search module used for performing a synchronized forward search based on the multiple pre-built WFST modules and their corresponding weights, obtaining multiple best paths corresponding to the multiple WFST modules.
In one of the embodiments, the speech recognition apparatus 100 may further include a user model update module. If the user model update module detects that the speech recognition result contains newly added contact information, newly added user-coined phrases and/or newly added characteristic language information, it updates the user model according to that newly added contact information, newly added user-coined phrases and/or newly added characteristic language information.
In one of the embodiments, the speech recognition apparatus 100 above may further include a preset information collection module, a word segmentation module and a customized decoder construction module. The preset information collection module is used for collecting the set words, phrases and grammar information. The word segmentation module is used for performing word segmentation on the set words and phrases using a dictionary and performing statistical training on the grammar information to obtain a corresponding language model. The customized decoder construction module is used for compiling the segmentation result and the obtained language model into a customized WFST module. By running forward searches through the conventional WFST modules and the customized WFST modules respectively, a speech recognition result of high accuracy can still be output even when the acquired voice signal contains uncommon words and phrases, popular new network words, hotspot words and phrases, or their associated grammar.
Referring to Fig. 9, in one of the embodiments a speech recognition apparatus 200 is also provided. The speech recognition apparatus 200 includes a voice sending module 22, a path acquisition module 24, a second evaluation module 26 and a second result acquisition module 28. The voice sending module 22 is used for sending a voice signal to a server. The path acquisition module 24 is used for obtaining the multiple best paths the server feeds back after decoding the voice signal. The second evaluation module 26 is used for evaluating the multiple best paths according to a pre-trained user model. The second result acquisition module 28 is used for extracting, according to the evaluation result, a best path matching the user model from the multiple best paths as the target best path, and determining the speech recognition result of the voice signal according to the target best path.
In this way, the technical solution of the embodiment above uses its modules, in combination with the pre-trained user model, to evaluate the multiple best paths returned by the server and to pick the target best path from the evaluation result as the final speech recognition result. It can effectively adapt to complex and changeable speech communication scenarios, taking into account both the various fields covered by the content of the user's speech and the user's speaking habits; being closer to the user's practical application scenario, it greatly improves recognition accuracy, effectively avoids the lower accuracy of traditional speech recognition technology, and additionally improves the security of the user's personal information.
The first evaluation module 16 in the speech recognition apparatus 100 above and the second evaluation module 26 in the speech recognition apparatus 200 can be understood as modules with the same function; they differ in name only because they belong to different apparatuses, not in essence. The same holds for the relationship between the first result acquisition module 18 in the speech recognition apparatus 100 and the second result acquisition module 28 in the speech recognition apparatus 200.
The modules in the speech recognition apparatuses 100 and 200 above can be realized fully or partially through software, hardware or a combination of both. Each module can be embedded, in hardware form, in or independently of the processor of a computer device, or stored in software form in the memory of the computer device, so that the processor can invoke and execute the operation corresponding to each module.
In one of the embodiments, a speech recognition device is provided; the device can be a computer device, for example an ordinary computer or a server. The speech recognition device includes a memory and a processor, and a computer program runnable on the processor is stored on the memory. The processor of the speech recognition device provides computation and control capability. The memory of the speech recognition device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and the computer program, and the internal memory provides the environment in which the operating system and the computer program in the non-volatile storage medium run. The speech recognition device may include a network interface for communicating with an external interactive terminal through a network connection. When the processor executes the computer program on the memory, the following steps can be executed: obtain a voice signal; decode the voice signal to obtain multiple best paths; evaluate the multiple best paths according to a pre-trained user model; according to the evaluation result, extract from the multiple best paths the best path matching the user model as the target best path, and determine the speech recognition result of the voice signal according to the target best path.
In one of the embodiments, another speech recognition device is also provided; the device can be an intelligent terminal device, for example a mobile terminal or any of various intelligent interactive devices such as a smart television or smart tablet. The speech recognition device includes a memory and a processor, and a computer program runnable on the processor is stored on the memory. The processor of the speech recognition device provides computation and control capability. The memory of the speech recognition device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and the computer program, and the internal memory provides the environment in which they run. The speech recognition device may include a network interface for communicating with other external interactive terminals through a network connection. When the processor executes the computer program on the memory, the following steps can be executed: send a voice signal to a server; obtain the multiple best paths the server feeds back after decoding the voice signal; evaluate the multiple best paths according to a pre-trained user model; according to the evaluation result, extract from the multiple best paths the best path matching the user model as the target best path, and determine the speech recognition result of the voice signal according to the target best path.
In one of the embodiments, when the processor of the speech recognition device of any embodiment above executes the computer program on its memory, it can also realize the corresponding parts of the embodiments of the audio recognition method of the present invention.
A program is commonly stored in a storage medium and is executed either by reading it directly out of the storage medium or by installing or copying it into the storage device (such as a hard disk and/or memory) of a data processing device. Such a storage medium therefore also constitutes the present invention. The storage medium can use any kind of recording mode, such as a paper storage medium (e.g. paper tape), a magnetic storage medium (e.g. floppy disk, hard disk, flash memory), an optical storage medium (e.g. CD-ROM) or a magneto-optical storage medium (e.g. MO). The invention therefore also discloses a computer-readable storage medium storing a computer program which, when run, executes the following steps: obtain a voice signal; decode the voice signal to obtain multiple best paths; evaluate the multiple best paths according to a pre-trained user model; according to the evaluation result, extract from the multiple best paths the best path matching the user model as the target best path, and determine the speech recognition result of the voice signal according to the target best path.
In one of the embodiments, the invention also discloses another computer-readable storage medium storing a computer program which, when run, executes the following steps: send a voice signal to a server; obtain the multiple best paths the server feeds back after decoding the voice signal; evaluate the multiple best paths according to a pre-trained user model; according to the evaluation result, extract from the multiple best paths the best path matching the user model as the target best path, and determine the speech recognition result of the voice signal according to the target best path.
In one of the embodiments, the computer program on the computer-readable storage medium of any of the foregoing embodiments, when run, is also used for executing the corresponding embodiments of the audio recognition method of the present invention.
According to the audio recognition method of the embodiments of the present invention above, and referring to Fig. 10, an embodiment of the present invention also provides a speech recognition system 300, described in detail below with reference to the timing shown in Fig. 11, Fig. 12 and alternative embodiments.
The speech recognition system 300 may include a server 32 and a terminal 34. The terminal 34 can be used for sending a voice signal to the server 32. The server 32 can be used for decoding the voice signal to obtain multiple best paths. The terminal 34 can also be used for evaluating the multiple best paths according to a pre-trained user model and, according to the evaluation result, extracting from the multiple best paths the best path matching the user model as the target best path and determining the speech recognition result of the voice signal according to the target best path.
Here, the server 32 can be the background processing device for the voice signal, for example a dedicated server, a cloud computing server, or a voice-signal recognition processing platform composed of a dedicated server and a cloud computing server. The terminal 34 can be any of various smart devices, such as a smartphone, smart television or tablet computer, or other smart appliances, smart office equipment and smart vehicles.
Specifically, after obtaining a voice signal input orally by the user, or input indirectly through other equipment, the terminal 34 above sends the obtained voice signal to the server 32. The server 32 can thus decode the received voice signal and, after outputting multiple best paths, return them to the terminal 34. At this point the terminal 34 can invoke the pre-trained user model, evaluate the returned best paths and, according to the evaluation result, extract from them the best path matching the user model as the target best path and determine from it the speech recognition result of the voice signal input by the user. It will be appreciated that the decoding performed by the server 32 can be understood from the decoding in the embodiments of the audio recognition method above, and that the terminal 34's evaluation of the multiple best paths according to the user model can likewise refer to the user-model evaluation in those embodiments; neither is repeated in this embodiment.
In this way, the server 32 decodes the voice signal using the individual WFST modules or one general WFST module and returns multiple best paths to the terminal 34, and the terminal 34 then evaluates them according to the pre-trained user model to finally determine the speech recognition result of the input voice signal. In sum, the speech recognition system 300 above can effectively cover as many speech application scenarios and fields as possible while taking the user's habits into account; being closer to the user's practical application scenario, it greatly improves recognition accuracy. It also avoids the leakage of the personal information contained in the user model that sharing it to the public environment of the server 32 would cause, so the security of the user's personal information is high and the user experience is considerably improved.
In one of the embodiments, there may be one server 32 or several, for example several interconnected servers 32. Each server 32 can store the WFST modules of one or more fields, scenes or preset language modes, and the distributed server decoding network formed by the several servers 32 works in concert, so that the voice signal can be quickly decoded and searched across different fields, scenes or preset language modes. The speech decoding of the voice signal above can thereby be completed faster and more accurately, and a greater number of terminals 34 sending voice signals to be recognized in the same time period can be accommodated simultaneously, so the processing efficiency is higher.
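The cooperating distributed decoding network might be sketched as below, with each "server" reduced to a function holding one field's WFST modules; the thread pool stands in for the real linkage between the interconnected servers 32, and all names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def distributed_decode(field_servers, signal):
    """Query every field-specific server with the same signal and let
    the collecting side gather all returned best paths."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda decode: decode(signal), field_servers.values())
    return [path for paths in results for path in paths]

# Two hypothetical servers, each covering one field.
field_servers = {
    "navigation": lambda s: [("go home", 0.7)],
    "music":      lambda s: [("play jazz", 0.6)],
}
all_paths = distributed_decode(field_servers, signal=b"...")
```

The master control server described next would play the role of the collecting side here, pairing the gathered paths back to the terminal that sent the signal.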
The several servers 32 above can be configured with one master control server 32 to handle docking with each terminal 34 and addressing pairing when results are returned, improving the speed at which the multiple best paths, or word-lattice information containing them, are returned to each terminal 34. In this way, the distributed network of servers 32 can cooperate with the terminals 34 to complete the decoding of the voice signals input by users, improving the speech recognition processing efficiency and capacity of the whole speech recognition system 300.
In one of the embodiments, the terminal 34 can also be used for: if the speech recognition result is detected to contain newly added contact information, newly added user-coined phrases and/or newly added characteristic language information, updating the user model according to that newly added contact information, newly added user-coined phrases and/or newly added characteristic language information. In this way, the terminal 34 can, through periodic detection, collect the user's contact information, newly coined phrases and/or characteristic language information described above for training updates of the user model, obtaining a user model that matches the user's true situation as closely as possible and thereby ensuring that the accuracy of the speech recognition results is effectively improved at different times.
In one of the embodiments, the WFST modules used by the server 32 of the embodiments of the speech recognition system 300 above during decoding, or the WFST modules composing the universal decoder, include a customized WFST module. The customized WFST module can be obtained by the server 32 collecting the set words, phrases and grammar information, performing word segmentation on the set words and phrases using a dictionary, performing statistical training on the grammar information to obtain a corresponding language model, and then compiling the segmentation result and the obtained language model. Combining the conventional WFST modules with the customized WFST module in this way, multiple best paths of high accuracy can still be output when the acquired voice signal contains uncommon words and phrases, popular new network words, hotspot words and phrases, or their associated grammar, so that the terminal 34 finally obtains a speech recognition result of high accuracy.
In one of the embodiments, a client can be installed in the terminal 34 of the embodiments above. The client can be used for carrying out the communication between the terminal 34 and the server 32 and for executing the speech recognition steps of the terminal 34 above.
In one of the embodiments, after obtaining a voice signal input, the terminal 34 or the server 32 performs timbre matching on the voice signal against a prestored timbre feature. If the timbre matching result is consistent, the subsequent speech recognition steps continue to be executed on the voice information; otherwise the voice signal is intercepted and an alarm is raised, or the voice signal is deleted, terminating its subsequent recognition steps. Here, the prestored timbre feature can be the spectral feature of the voice recorded by the first user of the terminal 34 (for example, its owner), and timbre matching is the process of performing matching analysis between the prestored spectral feature and the spectral feature of the input voice signal. By identifying the voice signal early in this way, the problem of the terminal 34 being used by a thief can be avoided and the security of speech recognition improved.
The technical features of the embodiments above can be combined arbitrarily. For brevity of description, not every possible combination of the technical features of the embodiments above has been described; as long as a combination of these technical features involves no contradiction, however, it should be considered within the scope of this specification.
The embodiments above express only several implementations of the present invention, and their description is relatively specific and detailed, but they must not therefore be construed as limiting the scope of the patent. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the inventive concept, and these belong to the scope of protection of the invention. The scope of protection of this patent shall therefore be subject to the appended claims.
Claims (13)
1. a kind of audio recognition method, which is characterized in that including step:
Obtain voice signal;
Processing is decoded to the voice signal, obtains multiple optimal paths;
According to user model trained in advance, multiple optimal paths are evaluated;
According to evaluation result, extracted and the matched optimal path of the user model from multiple optimal paths
As target best paths, and determine according to the target best paths speech recognition result of the voice signal.
2. The speech recognition method according to claim 1, wherein decoding the voice signal to obtain the plurality of best paths comprises the steps of:
performing feature extraction on the voice signal to obtain corresponding acoustic feature information;
according to the acoustic feature information, classifying the voice signal into classes by a pre-built acoustic model and determining a corresponding classification probability for each class; and
performing a forward search based on a pre-built WFST module according to the classified voice signal and the corresponding classification probabilities, to obtain the plurality of best paths.
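The classification step of claim 2 (acoustic features in, per-class probabilities out) can be illustrated with a toy stand-in; the feature values, the two-class model, and all names below are assumptions for illustration only:

```python
import math

# Sketch of claim 2's middle step: map per-frame acoustic features to
# class posteriors (the "classification probability") that later drive
# the WFST forward search. The acoustic model here is a toy stand-in.

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_frames(frames, acoustic_model):
    """Return, per frame, a posterior distribution over the classes."""
    return [softmax(acoustic_model(f)) for f in frames]

# Toy two-class acoustic model: a linear score per class.
def toy_model(frame):
    return [frame * w for w in (1.0, -1.0)]

posteriors = classify_frames([0.5, -0.3, 1.2], toy_model)
```

Each frame yields a probability distribution; in a real decoder these posteriors weight the arcs explored during the forward search.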
3. The speech recognition method according to claim 2, wherein the step of performing the forward search based on the pre-built WFST module according to the classified voice signal and the corresponding classification probabilities to obtain the plurality of best paths comprises:
performing independent forward searches based on a plurality of pre-built WFST modules, to obtain a plurality of best paths corresponding to the plurality of WFST modules.
4. The speech recognition method according to claim 2, wherein the step of performing the forward search based on the pre-built WFST module according to the classified voice signal and the corresponding classification probabilities to obtain the plurality of best paths further comprises:
performing a synchronized forward search based on a plurality of pre-built WFST modules and their corresponding weights, to obtain a plurality of best paths corresponding to the plurality of WFST modules.
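Claims 3 and 4 distinguish searching several WFST modules independently versus synchronously with per-module weights. The weighted combination can be sketched by reducing each WFST to a path-scoring function; everything below (module functions, weights, path data) is a hypothetical illustration:

```python
# Sketch of claim 4's weighted synchronized search: candidate paths are
# scored by a log-linear combination of several modules' scores.
# The "WFST modules" are reduced to plain scoring functions here.

def combined_score(path, modules, weights):
    """Log-linear combination of per-module path scores."""
    return sum(w * m(path) for m, w in zip(modules, weights))

def search(paths, modules, weights, n=2):
    """Return the n best paths under the combined score."""
    return sorted(paths, key=lambda p: combined_score(p, modules, weights),
                  reverse=True)[:n]

# Toy modules: a general model and a user-specific contact-list model.
general = lambda p: -len(p)                      # prefers shorter hypotheses
contact = lambda p: 2.0 if "john" in p else 0.0  # rewards known contacts

best = search([["call", "jon"], ["call", "john"]],
              [general, contact], weights=[1.0, 0.5])
```

Adjusting the weights shifts how strongly the customized module influences the ranking, which is the role the per-module weights play in the claim.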
5. The speech recognition method according to any one of claims 1 to 4, wherein after the step of extracting, according to the evaluation result, the best path matching the user model from the plurality of best paths as the target best path and determining the speech recognition result of the voice signal according to the target best path, the method further comprises:
if it is detected that the speech recognition result contains new contact information, a new self-coined phrase and/or new characteristic language information, updating the user model according to the new contact information, the new self-coined phrase and/or the new characteristic language information.
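The update step of claim 5 can be sketched as folding out-of-vocabulary items from a recognition result back into the user model. The detection rule and the model structure below are assumptions, not the patent's:

```python
# Sketch of claim 5: when the recognition result contains words the user
# model does not yet know (e.g. a new contact name or self-coined
# phrase), add them to the model so later recognitions can match them.

def update_user_model(user_model, result_words, known_vocab):
    """Add any out-of-vocabulary words from the result to the model."""
    new_words = [w for w in result_words if w not in known_vocab]
    user_model.setdefault("custom_words", set()).update(new_words)
    return new_words

model = {"custom_words": set()}
added = update_user_model(model, ["call", "xiaoming"], {"call"})
```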
6. The speech recognition method according to claim 3 or 4, wherein the plurality of WFST modules include a customized WFST module, the customized WFST module being obtained through the following steps:
collecting configured words, phrases and syntactic information;
performing word segmentation on the configured words and phrases using a dictionary;
performing statistical training on the syntactic information to obtain a corresponding language model; and
compiling the customized WFST module according to the word segmentation result and the language model.
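The first two build steps of claim 6 — dictionary-based word segmentation and statistical language-model training — can be sketched with a greedy longest-match segmenter and raw bigram counts. The segmentation strategy, the bigram order, and all names are illustrative assumptions; the final WFST compilation step is not shown:

```python
from collections import Counter

# Sketch of claim 6's build pipeline for a customized WFST module:
# segment configured phrases with a dictionary, then gather n-gram
# statistics as raw material for a language model.

def segment(text, dictionary):
    """Greedy longest-match segmentation (toy stand-in for a real
    segmenter); unmatched single characters pass through as-is."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def train_bigrams(sentences):
    """Count bigram statistics, including sentence boundary markers."""
    counts = Counter()
    for s in sentences:
        for a, b in zip(["<s>"] + s, s + ["</s>"]):
            counts[(a, b)] += 1
    return counts

words = segment("abcd", {"ab", "cd"})
counts = train_bigrams([words])
```

In a full build, the normalized counts would become a language model that is then compiled, together with the lexicon, into the customized WFST module.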
7. A speech recognition method, comprising the steps of:
sending a voice signal to a server;
obtaining a plurality of best paths fed back by the server after decoding the voice signal;
evaluating the plurality of best paths according to a pre-trained user model; and
according to the evaluation result, extracting from the plurality of best paths a best path that matches the user model as a target best path, and determining a speech recognition result of the voice signal according to the target best path.
8. A speech recognition apparatus, comprising:
a voice obtaining module, configured to obtain a voice signal;
a decoding module, configured to decode the voice signal to obtain a plurality of best paths;
a first evaluation module, configured to evaluate the plurality of best paths according to a pre-trained user model; and
a first result module, configured to extract, according to the evaluation result, a best path matching the user model from the plurality of best paths as a target best path, and to determine a speech recognition result of the voice signal according to the target best path.
9. A speech recognition apparatus, comprising:
a voice sending module, configured to send a voice signal to a server;
a word-sequence obtaining module, configured to obtain a plurality of best paths fed back by the server after decoding the voice signal;
a second evaluation module, configured to evaluate the plurality of best paths according to a pre-trained user model; and
a second result module, configured to extract, according to the evaluation result, a best path matching the user model from the plurality of best paths as a target best path, and to determine a speech recognition result of the voice signal according to the target best path.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 7.
11. A speech recognition device, comprising a memory and a processor, the memory storing a computer program, wherein the computer program, when executed by the processor, implements the steps of the speech recognition method according to any one of claims 1 to 7.
12. A speech recognition system, comprising a server and a terminal, wherein:
the terminal is configured to send a voice signal to the server;
the server is configured to decode the voice signal to obtain a plurality of best paths; and
the terminal is further configured to evaluate the plurality of best paths according to a pre-trained user model, and, according to the evaluation result, to extract from the plurality of best paths a best path matching the user model as a target best path and determine a speech recognition result of the voice signal according to the target best path.
13. The speech recognition system according to claim 12, wherein the terminal is further configured to: if it is detected that the speech recognition result contains new contact information, a new self-coined phrase and/or new characteristic language information, update the user model according to the new contact information, the new self-coined phrase and/or the new characteristic language information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810677565.1A CN108831439B (en) | 2018-06-27 | 2018-06-27 | Voice recognition method, device, equipment and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810677565.1A CN108831439B (en) | 2018-06-27 | 2018-06-27 | Voice recognition method, device, equipment and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108831439A true CN108831439A (en) | 2018-11-16 |
CN108831439B CN108831439B (en) | 2023-04-18 |
Family
ID=64139035
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810677565.1A Active CN108831439B (en) | 2018-06-27 | 2018-06-27 | Voice recognition method, device, equipment and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108831439B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101447183A (en) * | 2007-11-28 | 2009-06-03 | 中国科学院声学研究所 | Processing method of high-performance confidence level applied to speech recognition system |
CN101510222A (en) * | 2009-02-20 | 2009-08-19 | 北京大学 | Multilayer index voice document searching method and system thereof |
CN104217717A (en) * | 2013-05-29 | 2014-12-17 | 腾讯科技(深圳)有限公司 | Language model constructing method and device |
CN107195296A (en) * | 2016-03-15 | 2017-09-22 | 阿里巴巴集团控股有限公司 | A kind of audio recognition method, device, terminal and system |
US20180068653A1 (en) * | 2016-09-08 | 2018-03-08 | Intel IP Corporation | Method and system of automatic speech recognition using posterior confidence scores |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109524017A (en) * | 2018-11-27 | 2019-03-26 | 北京分音塔科技有限公司 | A kind of the speech recognition Enhancement Method and device of user's custom words |
CN111326147B (en) * | 2018-12-12 | 2023-11-17 | 北京嘀嘀无限科技发展有限公司 | Speech recognition method, device, electronic equipment and storage medium |
CN111326147A (en) * | 2018-12-12 | 2020-06-23 | 北京嘀嘀无限科技发展有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN109785858A (en) * | 2018-12-14 | 2019-05-21 | 平安普惠企业管理有限公司 | A kind of contact person's adding method, device, readable storage medium storing program for executing and terminal device |
CN109785858B (en) * | 2018-12-14 | 2024-02-23 | 深圳市兴海物联科技有限公司 | Contact person adding method and device, readable storage medium and terminal equipment |
CN111415653A (en) * | 2018-12-18 | 2020-07-14 | 百度在线网络技术(北京)有限公司 | Method and apparatus for recognizing speech |
CN111415653B (en) * | 2018-12-18 | 2023-08-01 | 百度在线网络技术(北京)有限公司 | Method and device for recognizing speech |
CN112151020A (en) * | 2019-06-28 | 2020-12-29 | 北京声智科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN110349569A (en) * | 2019-07-02 | 2019-10-18 | 苏州思必驰信息科技有限公司 | The training and recognition methods of customized product language model and device |
CN110688468A (en) * | 2019-08-28 | 2020-01-14 | 北京三快在线科技有限公司 | Method and device for outputting response message, electronic equipment and readable storage medium |
CN110688855A (en) * | 2019-09-29 | 2020-01-14 | 山东师范大学 | Chinese medical entity identification method and system based on machine learning |
CN110992932A (en) * | 2019-12-18 | 2020-04-10 | 睿住科技有限公司 | Self-learning voice control method, system and storage medium |
CN110992932B (en) * | 2019-12-18 | 2022-07-26 | 广东睿住智能科技有限公司 | Self-learning voice control method, system and storage medium |
CN110728133A (en) * | 2019-12-19 | 2020-01-24 | 北京海天瑞声科技股份有限公司 | Individual corpus acquisition method and individual corpus acquisition device |
CN111128183A (en) * | 2019-12-19 | 2020-05-08 | 北京搜狗科技发展有限公司 | Speech recognition method, apparatus and medium |
CN110728133B (en) * | 2019-12-19 | 2020-05-05 | 北京海天瑞声科技股份有限公司 | Individual corpus acquisition method and individual corpus acquisition device |
WO2021120690A1 (en) * | 2019-12-19 | 2021-06-24 | 北京搜狗科技发展有限公司 | Speech recognition method and apparatus, and medium |
CN111145756A (en) * | 2019-12-26 | 2020-05-12 | 北京搜狗科技发展有限公司 | Voice recognition method and device for voice recognition |
CN111145756B (en) * | 2019-12-26 | 2022-06-14 | 北京搜狗科技发展有限公司 | Voice recognition method and device for voice recognition |
CN111081262A (en) * | 2019-12-30 | 2020-04-28 | 杭州中科先进技术研究院有限公司 | Lightweight speech recognition system and method based on customized model |
CN111508501B (en) * | 2020-07-02 | 2020-09-29 | 成都晓多科技有限公司 | Voice recognition method and system with accent for telephone robot |
CN111508501A (en) * | 2020-07-02 | 2020-08-07 | 成都晓多科技有限公司 | Voice recognition method and system with accent for telephone robot |
CN111968648B (en) * | 2020-08-27 | 2021-12-24 | 北京字节跳动网络技术有限公司 | Voice recognition method and device, readable medium and electronic equipment |
CN111968648A (en) * | 2020-08-27 | 2020-11-20 | 北京字节跳动网络技术有限公司 | Voice recognition method and device, readable medium and electronic equipment |
CN113247730B (en) * | 2021-06-10 | 2022-11-08 | 浙江新再灵科技股份有限公司 | Elevator passenger screaming detection method and system based on multi-dimensional features |
CN113247730A (en) * | 2021-06-10 | 2021-08-13 | 浙江新再灵科技股份有限公司 | Elevator passenger screaming detection method and system based on multi-dimensional features |
CN113436614A (en) * | 2021-07-02 | 2021-09-24 | 科大讯飞股份有限公司 | Speech recognition method, apparatus, device, system and storage medium |
CN113436614B (en) * | 2021-07-02 | 2024-02-13 | 中国科学技术大学 | Speech recognition method, device, equipment, system and storage medium |
CN114242046A (en) * | 2021-12-01 | 2022-03-25 | 广州小鹏汽车科技有限公司 | Voice interaction method and device, server and storage medium |
CN114242046B (en) * | 2021-12-01 | 2022-08-16 | 广州小鹏汽车科技有限公司 | Voice interaction method and device, server and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108831439B (en) | 2023-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108831439A (en) | Audio recognition method, device, equipment and system | |
CN108899013A (en) | Voice search method, device and speech recognition system | |
US9753914B2 (en) | Natural expression processing method, processing and response method, device, and system | |
US11645547B2 (en) | Human-machine interactive method and device based on artificial intelligence | |
US7904297B2 (en) | Dialogue management using scripts and combined confidence scores | |
CN101010934B (en) | Method for machine learning | |
CN109151218A (en) | Call voice quality detecting method, device, computer equipment and storage medium | |
CN109977207A (en) | Talk with generation method, dialogue generating means, electronic equipment and storage medium | |
US8165887B2 (en) | Data-driven voice user interface | |
CN108829757A (en) | A kind of intelligent Service method, server and the storage medium of chat robots | |
CN107704482A (en) | Method, apparatus and program | |
CN106935239A (en) | The construction method and device of a kind of pronunciation dictionary | |
CN110381221B (en) | Call processing method, device, system, equipment and computer storage medium | |
CN110517664A (en) | Multi-party speech recognition methods, device, equipment and readable storage medium storing program for executing | |
CN106230689A (en) | Method, device and the server that a kind of voice messaging is mutual | |
CN105845133A (en) | Voice signal processing method and apparatus | |
CN112767910A (en) | Audio information synthesis method and device, computer readable medium and electronic equipment | |
US20230127787A1 (en) | Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium | |
CN107910004A (en) | Voiced translation processing method and processing device | |
CN111128175B (en) | Spoken language dialogue management method and system | |
CN114818649A (en) | Service consultation processing method and device based on intelligent voice interaction technology | |
Davies et al. | The IBM conversational telephony system for financial applications. | |
CN108491379A (en) | Shortcut key recognition methods, device, equipment and computer readable storage medium | |
CN111862970A (en) | False propaganda treatment application method and device based on intelligent voice robot | |
CN110781329A (en) | Image searching method and device, terminal equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||