CN110223678A - Speech recognition method and system
- Publication number: CN110223678A
- Application number: CN201910506115.0A
- Authority: CN (China)
- Prior art keywords: frame, keyword, word, voice, label
- Prior art date
- Legal status: Pending
Classifications
- G10L15/063: Training (G: Physics; G10L: Speech analysis or synthesis, speech recognition, speech or voice processing, speech or audio coding or decoding; G10L15/00: Speech recognition; G10L15/06: Creation of reference templates, training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/16: Speech classification or search using artificial neural networks (G10L15/00: Speech recognition; G10L15/08: Speech classification or search)
Abstract
An embodiment of the present invention provides a speech recognition method. The method comprises: inputting the audio features of each frame of an extracted voice file into a deep learning neural network and determining the posterior probability of each frame; smoothing the per-frame posterior probabilities to determine the keywords that make up the dialogue speech; determining the confusable-word set in which a keyword is located; obtaining a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file, and a second label sequence determined by mapping the pronunciation of each candidate word; traversing the similarity between the first label sequence and the second label sequence corresponding to each candidate word; and taking the candidate word with the maximum similarity as the recognized word of the dialogue speech. An embodiment of the present invention also provides a speech recognition system. The factors considered by the embodiments are entirely different from those of existing scoring methods: when a word is determined, via the confusable-word table, to be a confusable word, the similarity between the unequal-length label sequence of each confusable word and the label sequence of the speech is determined, thereby realizing speech recognition.
Description
Technical field
The present invention relates to the field of intelligent speech, and in particular to a speech recognition method and system.
Background technique
Speech recognition typically trains an acoustic model using a Gaussian mixture model plus hidden Markov model (GMM-HMM), then outputs the posterior probability of each Chinese syllable through a deep neural network, computes a score from the posterior probabilities, and compares it with predetermined information to judge whether a keyword occurs in the speech segment.

Speech recognition is usually decoded through a deep neural network model, so the deep neural network must be trained in advance. During training, after a training audio file is received, the file is divided into frames, the audio features of each frame are extracted, the frames are spliced to obtain training data, and each frame is aligned before the deep neural network model is trained. During decoding, the audio file is first divided into frames, features are extracted, and the spliced frames are input into the trained deep neural network model to obtain the posterior probability of each frame; a score is then computed in a certain way and compared with a preset keyword threshold, and when the threshold is reached the keyword is judged to be recognized.
In the course of realizing the present invention, the inventors found at least the following problems in the related art:

With the mood of the speaker or the surrounding environment, the speaker's speaking rate varies: it may be sometimes slow and sometimes fast, or the speaker may suddenly speak quickly, which easily gives listeners the impression of confused words. In multi-keyword detection, confusable words commonly occur, and as the speaking rate varies the confusion may become more severe; existing methods have weak ability to distinguish similar keywords. A deep neural network that is too small may produce inaccurate posterior probabilities, fast speech or the similar pronunciation of confusable words also makes the posterior probabilities inaccurate, and existing scoring methods cannot compensate for these defects.
Summary of the invention
In order at least solve in the prior art since the too small posterior probability that may result in of deep neural network is inaccurate, due to language
Posterior probability is inaccurate caused by the fast fast or similar pronunciation of string word, and existing marking mode can not make up asking for above-mentioned defect
Topic.
In a first aspect, an embodiment of the present invention provides a speech recognition method, comprising:

inputting the audio features of each frame of an extracted voice file into a deep learning neural network, determining the posterior probability of each frame, and smoothing the per-frame posterior probabilities to determine the keywords that make up the dialogue speech;

detecting whether a keyword is in a preset confusable-word list and, if so, determining the confusable-word set in which the keyword is located;

obtaining a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file, and a second label sequence determined by mapping the pronunciation of each candidate word; traversing in turn, by a dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word; and taking the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
In a second aspect, an embodiment of the present invention provides a speech recognition system, comprising:

a keyword determination program module, configured to input the audio features of each frame of an extracted voice file into a deep learning neural network, determine the posterior probability of each frame, and smooth the per-frame posterior probabilities to determine the keywords that make up the dialogue speech;

a confusable-word detection program module, configured to detect whether a keyword is in a preset confusable-word list and, if so, determine the confusable-word set in which the keyword is located;

a recognition program module, configured to obtain a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file and a second label sequence determined by mapping the pronunciation of each candidate word, traverse in turn, by a dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word, and take the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the speech recognition method of any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the speech recognition method of any embodiment of the present invention.
The beneficial effects of the embodiments of the present invention are: when a common scoring method cannot make a confident decision on a word, a judgment in another dimension is made. The factors considered are entirely different from those of existing scoring methods; in effect, when a similar-sounding word is encountered, it is verified against a confusable-word list, and when the word is determined to be a confusable word, the similarity between the unequal-length label sequence of each confusable word and the label sequence of the speech is determined, thereby realizing speech recognition.
Brief description of the drawings

In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

Fig. 1 is a flowchart of a speech recognition method provided by an embodiment of the present invention;

Fig. 2 is a structural schematic diagram of a speech recognition system provided by an embodiment of the present invention.
Specific embodiments

In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are some rather than all of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a speech recognition method provided by an embodiment of the present invention, which includes the following steps:

S11: inputting the audio features of each frame of an extracted voice file into a deep learning neural network, determining the posterior probability of each frame, and smoothing the per-frame posterior probabilities to determine the keywords that make up the dialogue speech;

S12: detecting whether a keyword is in a preset confusable-word list and, if so, determining the confusable-word set in which the keyword is located;

S13: obtaining a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file, and a second label sequence determined by mapping the pronunciation of each candidate word; traversing in turn, by a dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word; and taking the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
In this embodiment, the method can be fitted into the intelligent voice assistant of a smart speaker or a mobile phone, so that when a user carries out voice interaction, the dialogue voice file input by the user is received.
For step S11, after the voice file is received, it is divided into frames; after framing, audio feature extraction is performed, and the extracted features are input into the deep learning neural network, which outputs the posterior probability of each frame. By smoothing the posterior probabilities, the keywords that make up the dialogue speech are determined. For example, the user dictates a review of a certain product; after deep learning neural network processing, the content of the user's input is obtained as: "the taste of that restaurant's big-bowl wide noodles is 'set is eaten'" (the quoted words are literal renderings of similar-sounding Chinese phrases). The keywords are "big-bowl wide noodles", "taste", and "set is eaten".
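The patent does not fix the smoothing scheme for step S11; the following is a minimal Python sketch assuming a simple causal moving average over per-frame keyword posteriors, with `window` and `threshold` as purely illustrative values not taken from the patent:

```python
# Sketch of step S11: smooth per-frame keyword posteriors with a causal
# moving average, then treat the keyword as present when any smoothed
# value crosses an (illustrative) threshold.

def smooth_posteriors(posteriors, window=3):
    """Moving-average smoothing of a list of per-frame posteriors."""
    smoothed = []
    for i in range(len(posteriors)):
        lo = max(0, i - window + 1)  # causal window [lo, i]
        seg = posteriors[lo:i + 1]
        smoothed.append(sum(seg) / len(seg))
    return smoothed

def keyword_present(posteriors, threshold=0.6, window=3):
    """True if the smoothed posterior of the keyword crosses threshold."""
    return max(smooth_posteriors(posteriors, window)) >= threshold

# A single noisy spike is averaged away; sustained activation survives.
noisy = [0.1, 0.9, 0.1, 0.1, 0.1]
sustained = [0.1, 0.7, 0.8, 0.7, 0.1]
print(keyword_present(noisy))      # the spike alone does not fire
print(keyword_present(sustained))  # several high frames in a row do
```

Smoothing before thresholding is what makes a one-frame posterior spike insufficient to trigger a keyword, which matches the motivation given in the text.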
For step S12, whether the keyword is in a preset confusable-word list is detected, where the confusable-word list is set in advance. For example, the keyword "set is eaten" has the similar-sounding counterpart "spy is nice" (in Chinese, "especially delicious"); the two words belong to one class of confusable words, and the confusable-word set is determined as {set is eaten, spy is nice}. As another example, the user sends an instruction through a smart speaker: "turn on the air conditioner, 20 degrees". A word whose pronunciation is similar to "twenty degrees" is "twenty-four degrees"; similar words such as "20 degrees" and "24 degrees" are also in one class in the confusable-word list, and the confusable-word set is {20 degrees, 24 degrees}.
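The confusable-word lookup of step S12 can be sketched as follows; the list-of-sets data structure and the function name `confusable_set_for` are illustrative assumptions, with the two sets taken from the patent's examples:

```python
# Sketch of step S12: look up a detected keyword in the preset list of
# confusable-word sets.

CONFUSABLE_SETS = [
    {"set is eaten", "spy is nice"},  # near-homophones in the review example
    {"20 degrees", "24 degrees"},     # near-homophones in the command example
]

def confusable_set_for(keyword):
    """Return the confusable-word set containing keyword, or None."""
    for group in CONFUSABLE_SETS:
        if keyword in group:
            return group
    return None

print(confusable_set_for("20 degrees") is not None)  # proceeds to step S13
print(confusable_set_for("taste") is None)           # used directly as-is
```

A keyword that maps to `None` bypasses the comparison step, which corresponds to the fallback behavior described later in the embodiment.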
For step S13, because fast speech or the similar pronunciation of confusable words makes the posterior probabilities inaccurate, and existing scoring methods cannot compensate for this defect, a separate scheme is needed in which the label sequences may be of unequal length. Taking "the taste of that restaurant's big-bowl wide noodles is 'set is eaten'" as an example, the first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of this sentence is obtained, along with the second label sequence determined by mapping the pronunciation of each candidate word, and the similarity between the first label sequence and the second label sequence corresponding to each candidate word is traversed in turn by a dynamic time warping algorithm. For this speech, for example, the similarity between the first label sequence of the sentence and "set is eaten" is 78%, while the similarity between the first label sequence and "spy is nice" is 93%; "spy is nice" is therefore taken as the recognized word of the sentence, and the sentence after recognition is: "the taste of that restaurant's big-bowl wide noodles is 'spy is nice'".
Likewise, for the speech "turn on the air conditioner, 20 degrees", the first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the sentence is obtained. Here a label is simply a number; for example, the following pronunciation mapping may be specified: er -> 0, shi -> 1, si -> 3, du -> 4, da -> 5, kai -> 6, kong -> 7, and so on. There are more than 400 toneless Chinese pronunciations in total, which can be enumerated exhaustively to build the confusable-word list.

Likewise, the second label sequence determined by mapping the pronunciation of each candidate word is obtained in the same way, and the similarity between the first label sequence and the second label sequence corresponding to each candidate word is traversed in turn by the dynamic time warping algorithm. For the speech "turn on the air conditioner, twenty degrees", for example, the similarity between the first label sequence of the sentence and "20 degrees" is 85%, while the similarity with "24 degrees" is 95%; "24 degrees" is therefore taken as the recognized word of the sentence, and the sentence after recognition is: "turn on the air conditioner, 24 degrees".
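The label mapping and the unequal-length comparison described above can be sketched as follows. The pronunciation-to-label values are the ones given in the text; the 0/1 substitution cost and the conversion of DTW distance into a similarity are assumptions, since the patent does not specify them:

```python
# Sketch of step S13: map toneless pinyin syllables to integer labels
# (values from the patent's example) and compare unequal-length label
# sequences with dynamic time warping.

PRONUNCIATION_LABELS = {"er": 0, "shi": 1, "si": 3, "du": 4,
                        "da": 5, "kai": 6, "kong": 7}

def to_labels(syllables):
    """Second label sequence: map a candidate word's pinyin to labels."""
    return [PRONUNCIATION_LABELS[s] for s in syllables]

def dtw_distance(a, b):
    """Classic DTW with 0/1 substitution cost; handles unequal lengths."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def similarity(a, b):
    """Map DTW distance into [0, 1]; 1.0 means identical sequences."""
    return 1.0 - dtw_distance(a, b) / max(len(a), len(b))

# "er shi du" (20 degrees) vs "er shi si du" (24 degrees): unequal lengths.
twenty = to_labels(["er", "shi", "du"])
twenty_four = to_labels(["er", "shi", "si", "du"])
heard = [0, 1, 3, 4]  # first label sequence recovered from the frames
best = max([twenty, twenty_four], key=lambda c: similarity(heard, c))
print(best == twenty_four)
```

With this cost, a three-label candidate can still be aligned against a four-label heard sequence, which is exactly the unequal-length property the patent requires of the DTW comparison.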
It can be seen from this embodiment that when a common scoring method cannot make a confident decision on a word, a judgment in another dimension is made, and the factors considered are entirely different from those of existing scoring methods. In effect, when a similar-sounding word is encountered, it is verified against the confusable-word list, and when the word is determined to be a confusable word, the similarity between the unequal-length label sequence of each confusable word and the label sequence of the speech is determined, thereby realizing speech recognition.
As an implementation, in this embodiment, before the audio features of each frame of the extracted voice file are input into the deep learning neural network, the method further comprises:

extracting the audio features of each frame of training data, and performing a label alignment operation on the audio features of each frame to serve as training parameters of a deep neural network;

iteratively training the deep neural network on the label-aligned audio features using a gradient descent algorithm, so as to increase the size of the deep neural network.
In this embodiment, the deep neural network needs to be trained: the audio features of each frame of the training data are extracted, a label alignment operation is performed on the audio features of each frame, and the deep neural network is iteratively trained on the label-aligned audio features using a gradient descent algorithm.
It can be seen from this embodiment that increasing the size of the deep neural network further improves the accuracy of the posterior probabilities and makes dialogue speech recognition more accurate.
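The training procedure above (per-frame features, label alignment, iterative gradient descent) might look like the following toy sketch, in which a one-layer softmax model stands in for the patent's unspecified deep neural network; the learning rate, epoch count, and synthetic data are illustrative assumptions:

```python
import math

# Toy stand-in for the training loop: per-frame feature vectors paired with
# aligned labels, fitted by iterative gradient descent. A real deep network
# would replace the model; the loop structure is the point of the sketch.

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train(frames, labels, n_classes, epochs=200, lr=0.5):
    """frames: list of feature vectors; labels: aligned class per frame."""
    dim = len(frames[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(epochs):
        for x, y in zip(frames, labels):
            p = softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])
            for c in range(n_classes):  # gradient of the cross-entropy loss
                g = p[c] - (1.0 if c == y else 0.0)
                for d in range(dim):
                    W[c][d] -= lr * g * x[d]
    return W

def posterior(W, x):
    """Per-frame posterior probabilities for a feature vector x."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

# Two synthetic "pronunciation" classes with separable frame features.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]  # label alignment: one class index per frame
W = train(frames, labels, n_classes=2)
print(posterior(W, [1.0, 0.0])[0] > 0.9)  # class 0 clearly dominates
```

The label alignment shows up here as the one-to-one pairing of `frames` and `labels`; in practice that pairing is produced by forced alignment rather than written by hand.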
As an implementation, in this embodiment, smoothing the per-frame posterior probabilities to determine the keywords that make up the dialogue speech comprises:

scoring the smoothed per-frame posterior probabilities to determine the score of the dialogue speech recognition result;

when the score of the recognition result reaches a preset recognition threshold, determining the recognition result as a keyword that makes up the dialogue speech.

In this embodiment, the smoothed per-frame posterior probabilities are scored to determine the score of the dialogue speech recognition result, and thereby the keywords that make up the dialogue sentence. It can be seen from this embodiment that this determination helps to improve prediction accuracy.
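One plausible reading of this scoring rule is sketched below, under the assumption that the per-utterance score is the geometric mean of each label track's maximum smoothed posterior; the patent only says the smoothed posteriors are scored in a certain way, so the formula is an assumption:

```python
import math

# Sketch: smooth each label's per-frame posterior track, reduce the
# utterance to one score, and accept the keyword when the score reaches
# the preset recognition threshold.

def smoothed(posts, window=3):
    """Causal moving average over a single posterior track."""
    out = []
    for i in range(len(posts)):
        seg = posts[max(0, i - window + 1):i + 1]
        out.append(sum(seg) / len(seg))
    return out

def keyword_score(per_label_tracks, window=3):
    """One posterior track per label of the keyword -> one utterance score."""
    maxima = [max(smoothed(t, window)) for t in per_label_tracks]
    return math.exp(sum(math.log(m) for m in maxima) / len(maxima))

def is_keyword(per_label_tracks, threshold=0.5):
    """Accept the keyword when the score reaches the recognition threshold."""
    return keyword_score(per_label_tracks) >= threshold

tracks = [[0.1, 0.8, 0.9, 0.2],   # track for the keyword's first label
          [0.1, 0.2, 0.7, 0.8]]   # track for its second label
print(is_keyword(tracks))  # sustained activation pushes the score past 0.5
```

The geometric mean penalizes a keyword whose labels do not all activate, so a single loud label cannot carry the whole score past the threshold.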
In this embodiment, detecting whether the keyword is in the preset confusable-word list further comprises:

when the keyword is not in the preset confusable-word list, taking the keyword as the recognized word of the speech.

In this embodiment, for example, for the dialogue sentence "turn on the TV" (speech), none of the keywords are in the confusable-word list, so "turn on the TV" (text) is directly taken as the recognized word of the speech. It can be seen from this embodiment that when a keyword is not in the preset confusable-word list it is recognized directly, which ensures the stable operation of the program.
Fig. 2 is a structural schematic diagram of a speech recognition system provided by an embodiment of the present invention. The system can execute the speech recognition method described in any of the above embodiments and is configured in a terminal.

The speech recognition system provided by this embodiment comprises: a keyword determination program module 11, a confusable-word detection program module 12, and a recognition program module 13.

The keyword determination program module 11 is configured to input the audio features of each frame of the extracted voice file into the deep learning neural network, determine the posterior probability of each frame, and smooth the per-frame posterior probabilities to determine the keywords that make up the dialogue speech. The confusable-word detection program module 12 is configured to detect whether a keyword is in the preset confusable-word list and, if so, determine the confusable-word set in which the keyword is located. The recognition program module 13 is configured to obtain the first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file and the second label sequence determined by mapping the pronunciation of each candidate word, traverse in turn, by the dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word, and take the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
Further, before the keyword determination program module, the system further comprises a neural network training program module, configured to:

extract the audio features of each frame of training data, and perform a label alignment operation on the audio features of each frame to serve as training parameters of a deep neural network;

iteratively train the deep neural network on the label-aligned audio features using a gradient descent algorithm, so as to increase the size of the deep neural network.

Further, the keyword determination program module is also configured to:

score the smoothed per-frame posterior probabilities to determine the score of the dialogue speech recognition result;

when the score of the recognition result reaches a preset recognition threshold, determine the recognition result as a keyword that makes up the dialogue speech.

Further, the confusable-word detection program module is also configured to:

when the keyword is not in the preset confusable-word list, take the keyword as the recognized word of the speech.
An embodiment of the present invention also provides a non-volatile computer storage medium; the computer storage medium stores computer-executable instructions, and the computer-executable instructions can execute the speech recognition method in any of the above method embodiments.

As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

input the audio features of each frame of an extracted voice file into a deep learning neural network, determine the posterior probability of each frame, and smooth the per-frame posterior probabilities to determine the keywords that make up the dialogue speech;

detect whether a keyword is in a preset confusable-word list and, if so, determine the confusable-word set in which the keyword is located;

obtain a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file, and a second label sequence determined by mapping the pronunciation of each candidate word; traverse in turn, by a dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word; and take the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
As a non-volatile computer-readable storage medium, it can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method of the test software in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the speech recognition method in any of the above method embodiments.

The non-volatile computer-readable storage medium may include a program storage area and a data storage area, where the program storage area may store an operating system and an application required for at least one function, and the data storage area may store data created according to the use of the device of the test software, etc. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memories remotely located relative to the processor, and these remote memories can be connected through a network to the device of the test software. Examples of the above network include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention also provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the speech recognition method of any embodiment of the present invention.
The client of the embodiments of the present application exists in various forms, including but not limited to:

(1) Mobile communication devices: such devices are characterized by mobile communication functions, with voice and data communication as their main goal. This type of terminal includes smart phones, multimedia phones, feature phones, low-end phones, and the like.

(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access characteristics. This type of terminal includes PDA, MID, and UMPC devices, such as tablet computers.

(3) Portable entertainment devices: such devices can display and play multimedia content. This type of device includes audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.

(4) Other electronic devices with speech recognition capability.
Herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise" cover not only the listed elements but also elements not explicitly listed, or elements inherent to such a process, method, article, or device. Unless further limited, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be realized by software plus a necessary general hardware platform, and of course also by hardware. Based on this understanding, the above technical solutions, or the part that contributes to the prior art, can essentially be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in each embodiment or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention rather than limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced, and such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A speech recognition method, comprising:
inputting the audio features of each frame of an extracted voice file into a deep learning neural network, determining the posterior probability of each frame, and smoothing the per-frame posterior probabilities to determine the keywords that make up the dialogue speech;
detecting whether a keyword is in a preset confusable-word list and, if so, determining the confusable-word set in which the keyword is located;
obtaining a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file, and a second label sequence determined by mapping the pronunciation of each candidate word; traversing in turn, by a dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word; and taking the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
2. The method according to claim 1, wherein before the audio features of each frame of the extracted voice file are input into the deep learning neural network, the method further comprises:
extracting the audio features of each frame of training data, and performing a label alignment operation on the audio features of each frame to serve as training parameters of a deep neural network;
iteratively training the deep neural network on the label-aligned audio features using a gradient descent algorithm, so as to increase the size of the deep neural network.
3. The method according to claim 1, wherein smoothing the per-frame posterior probabilities to determine the keywords that make up the dialogue speech comprises:
scoring the smoothed per-frame posterior probabilities to determine the score of the dialogue speech recognition result;
when the score of the recognition result reaches a preset recognition threshold, determining the recognition result as a keyword that makes up the dialogue speech.
4. The method according to claim 1, wherein detecting whether the keyword is in the preset confusable-word list further comprises:
when the keyword is not in the preset confusable-word list, taking the keyword as the recognized word of the speech.
5. A speech recognition system, comprising:
a keyword determination program module, configured to input the audio features of each frame of an extracted voice file into a deep learning neural network, determine the posterior probability of each frame, and smooth the per-frame posterior probabilities to determine the keywords that make up the dialogue speech;
a confusable-word detection program module, configured to detect whether a keyword is in a preset confusable-word list and, if so, determine the confusable-word set in which the keyword is located;
a recognition program module, configured to obtain a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file and a second label sequence determined by mapping the pronunciation of each candidate word, traverse in turn, by a dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word, and take the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
6. The system according to claim 5, further comprising a neural network training program module operating before the keyword determination program module, configured to:
extract audio features of each frame of training data, and perform a label alignment operation on the audio features of each frame, to serve as training parameters of the deep neural network;
iteratively train the deep neural network on the label-aligned audio features by a gradient descent algorithm, so as to optimize the deep neural network.
7. The system according to claim 5, wherein the keyword determination program module is further configured to:
score the smoothed posterior probability of each frame to determine a score of the recognition result of the dialogue speech;
when the score of the recognition result reaches a preset recognition threshold, determine the recognition result as a keyword composing the dialogue speech.
8. The system according to claim 5, wherein the confusable-word detection program module is further configured to:
when the keyword is not in the preset confusable-word dictionary, take the keyword as the recognized word of the speech.
9. An electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor so that the at least one processor is able to perform the steps of the method of any one of claims 1-4.
10. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910506115.0A CN110223678A (en) | 2019-06-12 | 2019-06-12 | Audio recognition method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110223678A true CN110223678A (en) | 2019-09-10 |
Family
ID=67816618
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910506115.0A Pending CN110223678A (en) | 2019-06-12 | 2019-06-12 | Audio recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110223678A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004309928A (en) * | 2003-04-09 | 2004-11-04 | Casio Comput Co Ltd | Speech recognition device, electronic dictionary device, speech recognizing method, retrieving method, and program |
CN101996631A (en) * | 2009-08-28 | 2011-03-30 | 国际商业机器公司 | Method and device for aligning texts |
CN105679316A (en) * | 2015-12-29 | 2016-06-15 | 深圳微服机器人科技有限公司 | Voice keyword identification method and apparatus based on deep neural network |
CN106469554A (en) * | 2015-08-21 | 2017-03-01 | 科大讯飞股份有限公司 | A kind of adaptive recognition methodss and system |
CN107665190A (en) * | 2017-09-29 | 2018-02-06 | 李晓妮 | A kind of method for automatically constructing and device of text proofreading mistake dictionary |
2019-06-12: CN CN201910506115.0A patent/CN110223678A/en (active, Pending)
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110610707A (en) * | 2019-09-20 | 2019-12-24 | 科大讯飞股份有限公司 | Voice keyword recognition method and device, electronic equipment and storage medium |
CN110610707B (en) * | 2019-09-20 | 2022-04-22 | 科大讯飞股份有限公司 | Voice keyword recognition method and device, electronic equipment and storage medium |
CN110910885A (en) * | 2019-12-12 | 2020-03-24 | 苏州思必驰信息科技有限公司 | Voice awakening method and device based on decoding network |
CN113763988A (en) * | 2020-06-01 | 2021-12-07 | 中车株洲电力机车研究所有限公司 | Time synchronization method and system for locomotive cab monitoring information and LKJ monitoring information |
CN112700766A (en) * | 2020-12-23 | 2021-04-23 | 北京猿力未来科技有限公司 | Training method and device of voice recognition model and voice recognition method and device |
CN112700766B (en) * | 2020-12-23 | 2024-03-19 | 北京猿力未来科技有限公司 | Training method and device of voice recognition model, and voice recognition method and device |
CN113888846A (en) * | 2021-09-27 | 2022-01-04 | 深圳市研色科技有限公司 | Method and device for reminding driving in advance |
CN113888846B (en) * | 2021-09-27 | 2023-01-24 | 深圳市研色科技有限公司 | Method and device for reminding driving in advance |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110223678A (en) | Audio recognition method and system | |
CN110648690B (en) | Audio evaluation method and server | |
EP1989701B1 (en) | Speaker authentication | |
WO2016092807A1 (en) | Speaker identification device and method for registering features of registered speech for identifying speaker | |
CN108428446A (en) | Audio recognition method and device | |
CN108711421A (en) | A kind of voice recognition acoustic model method for building up and device and electronic equipment | |
CN111862942B (en) | Method and system for training mixed speech recognition model of Mandarin and Sichuan | |
CN108766445A (en) | Method for recognizing sound-groove and system | |
CN109331470B (en) | Method, device, equipment and medium for processing answering game based on voice recognition | |
CN103594087B (en) | Improve the method and system of oral evaluation performance | |
CN105938716A (en) | Multi-precision-fitting-based automatic detection method for copied sample voice | |
CN110085261A (en) | A kind of pronunciation correction method, apparatus, equipment and computer readable storage medium | |
CN110706692A (en) | Training method and system of child voice recognition model | |
CN110600013B (en) | Training method and device for non-parallel corpus voice conversion data enhancement model | |
CN112017694B (en) | Voice data evaluation method and device, storage medium and electronic device | |
CN109741734B (en) | Voice evaluation method and device and readable medium | |
CN109976702A (en) | A kind of audio recognition method, device and terminal | |
CN107958673A (en) | A kind of spoken language methods of marking and device | |
CN112487139A (en) | Text-based automatic question setting method and device and computer equipment | |
CN108986798A (en) | Processing method, device and the equipment of voice data | |
CN107886968A (en) | Speech evaluating method and system | |
EP1398758B1 (en) | Method and apparatus for generating decision tree questions for speech processing | |
CN111841007A (en) | Game control method, device, equipment and storage medium | |
CN114125506B (en) | Voice auditing method and device | |
CN110349567B (en) | Speech signal recognition method and device, storage medium and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province. Applicant after: Sipic Technology Co.,Ltd. Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province. Applicant before: AI SPEECH Co.,Ltd. |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190910 |