CN110223678A - Speech recognition method and system
- Publication number: CN110223678A
- Application number: CN201910506115.0A
- Authority: CN (China)
- Prior art keywords: frame, keyword, word, voice, label
- Prior art date
- Legal status: Pending
Classifications
- G10L15/063: Training (G: Physics; G10L: Speech analysis or synthesis, speech recognition, speech or voice processing, speech or audio coding or decoding; G10L15/00: Speech recognition; G10L15/06: Creation of reference templates, training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/16: Speech classification or search using artificial neural networks (G10L15/00: Speech recognition; G10L15/08: Speech classification or search)
Abstract
An embodiment of the present invention provides a speech recognition method. The method comprises: inputting the audio features of each frame of an extracted voice file into a deep learning neural network and determining the posterior probability of each frame; smoothing the per-frame posterior probabilities to determine the keywords that make up the dialogue speech; determining the confusable-word set in which a keyword is located; obtaining a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file, and a second label sequence determined by mapping the pronunciation of each candidate word; traversing the similarity between the first label sequence and the second label sequence corresponding to each candidate word; and taking the candidate word with the maximum similarity as the recognized word of the dialogue speech. An embodiment of the present invention also provides a speech recognition system. The factors considered by the embodiments are entirely different from those of existing scoring methods: when a word is determined, via the confusable-word table, to be a confusable word, the similarity between the unequal-length label sequence of each confusable word and the label sequence of the speech is determined, thereby realizing speech recognition.
Description
Technical field
The present invention relates to the field of intelligent speech, and in particular to a speech recognition method and system.
Background technique
Speech recognition typically trains an acoustic model using a Gaussian mixture model plus hidden Markov model (GMM-HMM), then outputs the posterior probability of each Chinese syllable through a deep neural network, computes a score from the posterior probabilities, and compares it with predetermined information to judge whether a keyword occurs in the speech segment.

Speech recognition is usually decoded through a deep neural network model, so the deep neural network must be trained in advance. During training, after a training audio file is received, the file is divided into frames, the audio features of each frame are extracted, the frames are spliced to obtain training data, and each frame is aligned before the deep neural network model is trained. During decoding, the audio file is first divided into frames, features are extracted, and the spliced frames are input into the trained deep neural network model to obtain the posterior probability of each frame; a score is then computed in a certain way and compared with a preset keyword threshold, and when the threshold is reached the keyword is judged to be recognized.
In the course of realizing the present invention, the inventors found at least the following problems in the related art:

With the mood of the speaker or the surrounding environment, the speaker's speaking rate varies: it may be sometimes slow and sometimes fast, or the speaker may suddenly speak quickly, which easily gives listeners the impression of confused words. In multi-keyword detection, confusable words commonly occur, and as the speaking rate varies the confusion may become more severe; existing methods have weak ability to distinguish similar keywords. A deep neural network that is too small may produce inaccurate posterior probabilities, fast speech or the similar pronunciation of confusable words also makes the posterior probabilities inaccurate, and existing scoring methods cannot compensate for these defects.
Summary of the invention
In order at least solve in the prior art since the too small posterior probability that may result in of deep neural network is inaccurate, due to language
Posterior probability is inaccurate caused by the fast fast or similar pronunciation of string word, and existing marking mode can not make up asking for above-mentioned defect
Topic.
In a first aspect, an embodiment of the present invention provides a speech recognition method, comprising:

inputting the audio features of each frame of an extracted voice file into a deep learning neural network, determining the posterior probability of each frame, and smoothing the per-frame posterior probabilities to determine the keywords that make up the dialogue speech;

detecting whether a keyword is in a preset confusable-word list and, if so, determining the confusable-word set in which the keyword is located;

obtaining a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file, and a second label sequence determined by mapping the pronunciation of each candidate word; traversing in turn, by a dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word; and taking the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
In a second aspect, an embodiment of the present invention provides a speech recognition system, comprising:

a keyword determination program module, configured to input the audio features of each frame of an extracted voice file into a deep learning neural network, determine the posterior probability of each frame, and smooth the per-frame posterior probabilities to determine the keywords that make up the dialogue speech;

a confusable-word detection program module, configured to detect whether a keyword is in a preset confusable-word list and, if so, determine the confusable-word set in which the keyword is located;

a recognition program module, configured to obtain a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file and a second label sequence determined by mapping the pronunciation of each candidate word, traverse in turn, by a dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word, and take the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the speech recognition method of any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the speech recognition method of any embodiment of the present invention.
The beneficial effects of the embodiments of the present invention are: when a common scoring method cannot make a confident decision on a word, a judgment in another dimension is made. The factors considered are entirely different from those of existing scoring methods; in effect, when a similar-sounding word is encountered, it is verified against a confusable-word list, and when the word is determined to be a confusable word, the similarity between the unequal-length label sequence of each confusable word and the label sequence of the speech is determined, thereby realizing speech recognition.
Brief description of the drawings

In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

Fig. 1 is a flowchart of a speech recognition method provided by an embodiment of the present invention;

Fig. 2 is a structural schematic diagram of a speech recognition system provided by an embodiment of the present invention.
Specific embodiments

In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are some rather than all of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a speech recognition method provided by an embodiment of the present invention, which includes the following steps:

S11: inputting the audio features of each frame of an extracted voice file into a deep learning neural network, determining the posterior probability of each frame, and smoothing the per-frame posterior probabilities to determine the keywords that make up the dialogue speech;

S12: detecting whether a keyword is in a preset confusable-word list and, if so, determining the confusable-word set in which the keyword is located;

S13: obtaining a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file, and a second label sequence determined by mapping the pronunciation of each candidate word; traversing in turn, by a dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word; and taking the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
In this embodiment, the method can be fitted into the intelligent voice assistant of a smart speaker or a mobile phone, so that when a user carries out voice interaction, the dialogue voice file input by the user is received.
For step S11, after the voice file is received, it is divided into frames; after framing, audio feature extraction is performed, and the extracted features are input into the deep learning neural network, which outputs the posterior probability of each frame. By smoothing the posterior probabilities, the keywords that make up the dialogue speech are determined. For example, the user dictates a review of a certain product; after deep learning neural network processing, the content of the user's input is obtained as: "the taste of that restaurant's big-bowl wide noodles is 'set is eaten'" (the quoted words are literal renderings of similar-sounding Chinese phrases). The keywords are "big-bowl wide noodles", "taste", and "set is eaten".
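The patent does not fix the smoothing scheme for step S11; the following is a minimal Python sketch assuming a simple causal moving average over per-frame keyword posteriors, with `window` and `threshold` as purely illustrative values not taken from the patent:

```python
# Sketch of step S11: smooth per-frame keyword posteriors with a causal
# moving average, then treat the keyword as present when any smoothed
# value crosses an (illustrative) threshold.

def smooth_posteriors(posteriors, window=3):
    """Moving-average smoothing of a list of per-frame posteriors."""
    smoothed = []
    for i in range(len(posteriors)):
        lo = max(0, i - window + 1)  # causal window [lo, i]
        seg = posteriors[lo:i + 1]
        smoothed.append(sum(seg) / len(seg))
    return smoothed

def keyword_present(posteriors, threshold=0.6, window=3):
    """True if the smoothed posterior of the keyword crosses threshold."""
    return max(smooth_posteriors(posteriors, window)) >= threshold

# A single noisy spike is averaged away; sustained activation survives.
noisy = [0.1, 0.9, 0.1, 0.1, 0.1]
sustained = [0.1, 0.7, 0.8, 0.7, 0.1]
print(keyword_present(noisy))      # the spike alone does not fire
print(keyword_present(sustained))  # several high frames in a row do
```

Smoothing before thresholding is what makes a one-frame posterior spike insufficient to trigger a keyword, which matches the motivation given in the text.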
For step S12, whether the keyword is in a preset confusable-word list is detected, where the confusable-word list is set in advance. For example, the keyword "set is eaten" has the similar-sounding counterpart "spy is nice" (in Chinese, "especially delicious"); the two words belong to one class of confusable words, and the confusable-word set is determined as {set is eaten, spy is nice}. As another example, the user sends an instruction through a smart speaker: "turn on the air conditioner, 20 degrees". A word whose pronunciation is similar to "twenty degrees" is "twenty-four degrees"; similar words such as "20 degrees" and "24 degrees" are also in one class in the confusable-word list, and the confusable-word set is {20 degrees, 24 degrees}.
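The confusable-word lookup of step S12 can be sketched as follows; the list-of-sets data structure and the function name `confusable_set_for` are illustrative assumptions, with the two sets taken from the patent's examples:

```python
# Sketch of step S12: look up a detected keyword in the preset list of
# confusable-word sets.

CONFUSABLE_SETS = [
    {"set is eaten", "spy is nice"},  # near-homophones in the review example
    {"20 degrees", "24 degrees"},     # near-homophones in the command example
]

def confusable_set_for(keyword):
    """Return the confusable-word set containing keyword, or None."""
    for group in CONFUSABLE_SETS:
        if keyword in group:
            return group
    return None

print(confusable_set_for("20 degrees") is not None)  # proceeds to step S13
print(confusable_set_for("taste") is None)           # used directly as-is
```

A keyword that maps to `None` bypasses the comparison step, which corresponds to the fallback behavior described later in the embodiment.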
For step S13, because fast speech or the similar pronunciation of confusable words makes the posterior probabilities inaccurate, and existing scoring methods cannot compensate for this defect, a separate scheme is needed in which the label sequences may be of unequal length. Taking "the taste of that restaurant's big-bowl wide noodles is 'set is eaten'" as an example, the first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of this sentence is obtained, along with the second label sequence determined by mapping the pronunciation of each candidate word, and the similarity between the first label sequence and the second label sequence corresponding to each candidate word is traversed in turn by a dynamic time warping algorithm. For this speech, for example, the similarity between the first label sequence of the sentence and "set is eaten" is 78%, while the similarity between the first label sequence and "spy is nice" is 93%; "spy is nice" is therefore taken as the recognized word of the sentence, and the sentence after recognition is: "the taste of that restaurant's big-bowl wide noodles is 'spy is nice'".
Likewise, for the speech "turn on the air conditioner, 20 degrees", the first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the sentence is obtained. Here a label is simply a number; for example, the following pronunciation mapping may be specified: er -> 0, shi -> 1, si -> 3, du -> 4, da -> 5, kai -> 6, kong -> 7, and so on. There are more than 400 toneless Chinese pronunciations in total, which can be enumerated exhaustively to build the confusable-word list.

Likewise, the second label sequence determined by mapping the pronunciation of each candidate word is obtained in the same way, and the similarity between the first label sequence and the second label sequence corresponding to each candidate word is traversed in turn by the dynamic time warping algorithm. For the speech "turn on the air conditioner, twenty degrees", for example, the similarity between the first label sequence of the sentence and "20 degrees" is 85%, while the similarity with "24 degrees" is 95%; "24 degrees" is therefore taken as the recognized word of the sentence, and the sentence after recognition is: "turn on the air conditioner, 24 degrees".
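The label mapping and the unequal-length comparison described above can be sketched as follows. The pronunciation-to-label values are the ones given in the text; the 0/1 substitution cost and the conversion of DTW distance into a similarity are assumptions, since the patent does not specify them:

```python
# Sketch of step S13: map toneless pinyin syllables to integer labels
# (values from the patent's example) and compare unequal-length label
# sequences with dynamic time warping.

PRONUNCIATION_LABELS = {"er": 0, "shi": 1, "si": 3, "du": 4,
                        "da": 5, "kai": 6, "kong": 7}

def to_labels(syllables):
    """Second label sequence: map a candidate word's pinyin to labels."""
    return [PRONUNCIATION_LABELS[s] for s in syllables]

def dtw_distance(a, b):
    """Classic DTW with 0/1 substitution cost; handles unequal lengths."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def similarity(a, b):
    """Map DTW distance into [0, 1]; 1.0 means identical sequences."""
    return 1.0 - dtw_distance(a, b) / max(len(a), len(b))

# "er shi du" (20 degrees) vs "er shi si du" (24 degrees): unequal lengths.
twenty = to_labels(["er", "shi", "du"])
twenty_four = to_labels(["er", "shi", "si", "du"])
heard = [0, 1, 3, 4]  # first label sequence recovered from the frames
best = max([twenty, twenty_four], key=lambda c: similarity(heard, c))
print(best == twenty_four)
```

With this cost, a three-label candidate can still be aligned against a four-label heard sequence, which is exactly the unequal-length property the patent requires of the DTW comparison.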
It can be seen from this embodiment that when a common scoring method cannot make a confident decision on a word, a judgment in another dimension is made, and the factors considered are entirely different from those of existing scoring methods. In effect, when a similar-sounding word is encountered, it is verified against the confusable-word list, and when the word is determined to be a confusable word, the similarity between the unequal-length label sequence of each confusable word and the label sequence of the speech is determined, thereby realizing speech recognition.
As an implementation, in this embodiment, before the audio features of each frame of the extracted voice file are input into the deep learning neural network, the method further comprises:

extracting the audio features of each frame of training data, and performing a label alignment operation on the audio features of each frame to serve as training parameters of a deep neural network;

iteratively training the deep neural network on the label-aligned audio features using a gradient descent algorithm, so as to increase the size of the deep neural network.
In this embodiment, the deep neural network needs to be trained: the audio features of each frame of the training data are extracted, a label alignment operation is performed on the audio features of each frame, and the deep neural network is iteratively trained on the label-aligned audio features using a gradient descent algorithm.
It can be seen from this embodiment that increasing the size of the deep neural network further improves the accuracy of the posterior probabilities and makes dialogue speech recognition more accurate.
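The training procedure above (per-frame features, label alignment, iterative gradient descent) might look like the following toy sketch, in which a one-layer softmax model stands in for the patent's unspecified deep neural network; the learning rate, epoch count, and synthetic data are illustrative assumptions:

```python
import math

# Toy stand-in for the training loop: per-frame feature vectors paired with
# aligned labels, fitted by iterative gradient descent. A real deep network
# would replace the model; the loop structure is the point of the sketch.

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train(frames, labels, n_classes, epochs=200, lr=0.5):
    """frames: list of feature vectors; labels: aligned class per frame."""
    dim = len(frames[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(epochs):
        for x, y in zip(frames, labels):
            p = softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])
            for c in range(n_classes):  # gradient of the cross-entropy loss
                g = p[c] - (1.0 if c == y else 0.0)
                for d in range(dim):
                    W[c][d] -= lr * g * x[d]
    return W

def posterior(W, x):
    """Per-frame posterior probabilities for a feature vector x."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

# Two synthetic "pronunciation" classes with separable frame features.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]  # label alignment: one class index per frame
W = train(frames, labels, n_classes=2)
print(posterior(W, [1.0, 0.0])[0] > 0.9)  # class 0 clearly dominates
```

The label alignment shows up here as the one-to-one pairing of `frames` and `labels`; in practice that pairing is produced by forced alignment rather than written by hand.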
As an implementation, in this embodiment, smoothing the per-frame posterior probabilities to determine the keywords that make up the dialogue speech comprises:

scoring the smoothed per-frame posterior probabilities to determine the score of the dialogue speech recognition result;

when the score of the recognition result reaches a preset recognition threshold, determining the recognition result as a keyword that makes up the dialogue speech.

In this embodiment, the smoothed per-frame posterior probabilities are scored to determine the score of the dialogue speech recognition result, and thereby the keywords that make up the dialogue sentence. It can be seen from this embodiment that this determination helps to improve prediction accuracy.
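One plausible reading of this scoring rule is sketched below, under the assumption that the per-utterance score is the geometric mean of each label track's maximum smoothed posterior; the patent only says the smoothed posteriors are scored in a certain way, so the formula is an assumption:

```python
import math

# Sketch: smooth each label's per-frame posterior track, reduce the
# utterance to one score, and accept the keyword when the score reaches
# the preset recognition threshold.

def smoothed(posts, window=3):
    """Causal moving average over a single posterior track."""
    out = []
    for i in range(len(posts)):
        seg = posts[max(0, i - window + 1):i + 1]
        out.append(sum(seg) / len(seg))
    return out

def keyword_score(per_label_tracks, window=3):
    """One posterior track per label of the keyword -> one utterance score."""
    maxima = [max(smoothed(t, window)) for t in per_label_tracks]
    return math.exp(sum(math.log(m) for m in maxima) / len(maxima))

def is_keyword(per_label_tracks, threshold=0.5):
    """Accept the keyword when the score reaches the recognition threshold."""
    return keyword_score(per_label_tracks) >= threshold

tracks = [[0.1, 0.8, 0.9, 0.2],   # track for the keyword's first label
          [0.1, 0.2, 0.7, 0.8]]   # track for its second label
print(is_keyword(tracks))  # sustained activation pushes the score past 0.5
```

The geometric mean penalizes a keyword whose labels do not all activate, so a single loud label cannot carry the whole score past the threshold.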
In this embodiment, detecting whether the keyword is in the preset confusable-word list further comprises:

when the keyword is not in the preset confusable-word list, taking the keyword as the recognized word of the speech.

In this embodiment, for example, for the dialogue sentence "turn on the TV" (speech), none of the keywords are in the confusable-word list, so "turn on the TV" (text) is directly taken as the recognized word of the speech. It can be seen from this embodiment that when a keyword is not in the preset confusable-word list it is recognized directly, which ensures the stable operation of the program.
Fig. 2 is a structural schematic diagram of a speech recognition system provided by an embodiment of the present invention. The system can execute the speech recognition method described in any of the above embodiments and is configured in a terminal.

The speech recognition system provided by this embodiment comprises: a keyword determination program module 11, a confusable-word detection program module 12, and a recognition program module 13.

The keyword determination program module 11 is configured to input the audio features of each frame of the extracted voice file into the deep learning neural network, determine the posterior probability of each frame, and smooth the per-frame posterior probabilities to determine the keywords that make up the dialogue speech. The confusable-word detection program module 12 is configured to detect whether a keyword is in the preset confusable-word list and, if so, determine the confusable-word set in which the keyword is located. The recognition program module 13 is configured to obtain the first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file and the second label sequence determined by mapping the pronunciation of each candidate word, traverse in turn, by the dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word, and take the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
Further, before the keyword determination program module, the system further comprises a neural network training program module, configured to:

extract the audio features of each frame of training data, and perform a label alignment operation on the audio features of each frame to serve as training parameters of a deep neural network;

iteratively train the deep neural network on the label-aligned audio features using a gradient descent algorithm, so as to increase the size of the deep neural network.

Further, the keyword determination program module is also configured to:

score the smoothed per-frame posterior probabilities to determine the score of the dialogue speech recognition result;

when the score of the recognition result reaches a preset recognition threshold, determine the recognition result as a keyword that makes up the dialogue speech.

Further, the confusable-word detection program module is also configured to:

when the keyword is not in the preset confusable-word list, take the keyword as the recognized word of the speech.
An embodiment of the present invention also provides a non-volatile computer storage medium; the computer storage medium stores computer-executable instructions, and the computer-executable instructions can execute the speech recognition method in any of the above method embodiments.

As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

input the audio features of each frame of an extracted voice file into a deep learning neural network, determine the posterior probability of each frame, and smooth the per-frame posterior probabilities to determine the keywords that make up the dialogue speech;

detect whether a keyword is in a preset confusable-word list and, if so, determine the confusable-word set in which the keyword is located;

obtain a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file, and a second label sequence determined by mapping the pronunciation of each candidate word; traverse in turn, by a dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word; and take the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
As a non-volatile computer-readable storage medium, it can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method of the test software in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the speech recognition method in any of the above method embodiments.

The non-volatile computer-readable storage medium may include a program storage area and a data storage area, where the program storage area may store an operating system and an application required for at least one function, and the data storage area may store data created according to the use of the device of the test software, etc. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memories remotely located relative to the processor, and these remote memories can be connected through a network to the device of the test software. Examples of the above network include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention also provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the speech recognition method of any embodiment of the present invention.
The client of the embodiments of the present application exists in various forms, including but not limited to:

(1) Mobile communication devices: such devices are characterized by mobile communication functions, with voice and data communication as their main goal. This type of terminal includes smart phones, multimedia phones, feature phones, low-end phones, and the like.

(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access characteristics. This type of terminal includes PDA, MID, and UMPC devices, such as tablet computers.

(3) Portable entertainment devices: such devices can display and play multimedia content. This type of device includes audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.

(4) Other electronic devices with speech recognition capability.
Herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise" cover not only the listed elements but also elements not explicitly listed, or elements inherent to such a process, method, article, or device. Unless further limited, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be realized by software plus a necessary general hardware platform, and of course also by hardware. Based on this understanding, the above technical solutions, or the part that contributes to the prior art, can essentially be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in each embodiment or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention rather than limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced, and such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A speech recognition method, comprising:
inputting the audio features of each frame of an extracted voice file into a deep learning neural network, determining the posterior probability of each frame, and smoothing the per-frame posterior probabilities to determine the keywords that make up the dialogue speech;
detecting whether a keyword is in a preset confusable-word list and, if so, determining the confusable-word set in which the keyword is located;
obtaining a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file, and a second label sequence determined by mapping the pronunciation of each candidate word; traversing in turn, by a dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word; and taking the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
2. The method according to claim 1, wherein before the audio features of each frame of the extracted voice file are input into the deep learning neural network, the method further comprises:
extracting the audio features of each frame of training data, and performing a label alignment operation on the audio features of each frame to serve as training parameters of a deep neural network;
iteratively training the deep neural network on the label-aligned audio features using a gradient descent algorithm, so as to increase the size of the deep neural network.
3. The method according to claim 1, wherein smoothing the per-frame posterior probabilities to determine the keywords that make up the dialogue speech comprises:
scoring the smoothed per-frame posterior probabilities to determine the score of the dialogue speech recognition result;
when the score of the recognition result reaches a preset recognition threshold, determining the recognition result as a keyword that makes up the dialogue speech.
4. The method according to claim 1, wherein detecting whether the keyword is in the preset confusable-word list further comprises:
when the keyword is not in the preset confusable-word list, taking the keyword as the recognized word of the speech.
5. A speech recognition system, comprising:
a keyword determination program module, configured to input the audio features of each frame of an extracted voice file into a deep learning neural network, determine the posterior probability of each frame, and smooth the per-frame posterior probabilities to determine the keywords that make up the dialogue speech;
a confusable-word detection program module, configured to detect whether a keyword is in a preset confusable-word list and, if so, determine the confusable-word set in which the keyword is located;
a recognition program module, configured to obtain a first label sequence formed by the labels corresponding to the posterior-probability maximum of every frame of the voice file and a second label sequence determined by mapping the pronunciation of each candidate word, traverse in turn, by a dynamic time warping algorithm, the similarity between the first label sequence and the second label sequence corresponding to each candidate word, and take the candidate word with the maximum similarity as the recognized word of the dialogue speech, wherein the label sequences may be of unequal length.
6. The system according to claim 5, further comprising a neural network training program module operating before the keyword determination program module, configured to:
extract audio features of each frame of training data, and perform a label alignment operation on the audio features of each frame, to serve as training parameters of the deep neural network;
iteratively train the deep neural network on the label-aligned audio features by a gradient descent algorithm, so as to optimize the deep neural network.
7. The system according to claim 5, wherein the keyword determination program module is further configured to:
score the smoothed posterior probability of each frame to determine a score of the recognition result of the dialogue speech;
when the score of the recognition result reaches a preset recognition threshold, determine the recognition result as a keyword composing the dialogue speech.
8. The system according to claim 5, wherein the confusable-word detection program module is further configured to:
when the keyword is not in the preset confusable-word dictionary, take the keyword as the recognized word of the speech.
9. An electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor so that the at least one processor is able to perform the steps of the method of any one of claims 1-4.
10. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910506115.0A CN110223678A (en) | 2019-06-12 | 2019-06-12 | Audio recognition method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110223678A true CN110223678A (en) | 2019-09-10 |
Family
ID=67816618
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910506115.0A Pending CN110223678A (en) | 2019-06-12 | 2019-06-12 | Audio recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110223678A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004309928A (en) * | 2003-04-09 | 2004-11-04 | Casio Comput Co Ltd | Speech recognition device, electronic dictionary device, speech recognizing method, retrieving method, and program |
CN101996631A (en) * | 2009-08-28 | 2011-03-30 | 国际商业机器公司 | Method and device for aligning texts |
CN105679316A (en) * | 2015-12-29 | 2016-06-15 | 深圳微服机器人科技有限公司 | Voice keyword identification method and apparatus based on deep neural network |
CN106469554A (en) * | 2015-08-21 | 2017-03-01 | 科大讯飞股份有限公司 | A kind of adaptive recognition methodss and system |
CN107665190A (en) * | 2017-09-29 | 2018-02-06 | 李晓妮 | A kind of method for automatically constructing and device of text proofreading mistake dictionary |
2019-06-12: CN CN201910506115.0A patent/CN110223678A/en (active, Pending)
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110610707A (en) * | 2019-09-20 | 2019-12-24 | 科大讯飞股份有限公司 | Voice keyword recognition method and device, electronic equipment and storage medium |
CN110610707B (en) * | 2019-09-20 | 2022-04-22 | 科大讯飞股份有限公司 | Voice keyword recognition method and device, electronic equipment and storage medium |
CN110910885A (en) * | 2019-12-12 | 2020-03-24 | 苏州思必驰信息科技有限公司 | Voice awakening method and device based on decoding network |
CN113763988A (en) * | 2020-06-01 | 2021-12-07 | 中车株洲电力机车研究所有限公司 | Time synchronization method and system for locomotive cab monitoring information and LKJ monitoring information |
CN112700766A (en) * | 2020-12-23 | 2021-04-23 | 北京猿力未来科技有限公司 | Training method and device of voice recognition model and voice recognition method and device |
CN112700766B (en) * | 2020-12-23 | 2024-03-19 | 北京猿力未来科技有限公司 | Training method and device of voice recognition model, and voice recognition method and device |
CN113888846A (en) * | 2021-09-27 | 2022-01-04 | 深圳市研色科技有限公司 | Method and device for reminding driving in advance |
CN113888846B (en) * | 2021-09-27 | 2023-01-24 | 深圳市研色科技有限公司 | Method and device for reminding driving in advance |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110223678A (en) | Audio recognition method and system | |
CN110648690B (en) | Audio evaluation method and server | |
EP1989701B1 (en) | Speaker authentication | |
WO2016092807A1 (en) | Speaker identification device and method for registering features of registered speech for identifying speaker | |
CN108428446A (en) | Audio recognition method and device | |
CN108711421A (en) | A kind of voice recognition acoustic model method for building up and device and electronic equipment | |
CN111862942B (en) | Method and system for training mixed speech recognition model of Mandarin and Sichuan | |
CN108766445A (en) | Method for recognizing sound-groove and system | |
CN109331470B (en) | Method, device, equipment and medium for processing answering game based on voice recognition | |
CN103594087B (en) | Improve the method and system of oral evaluation performance | |
CN105938716A (en) | Multi-precision-fitting-based automatic detection method for copied sample voice | |
CN110085261A (en) | A kind of pronunciation correction method, apparatus, equipment and computer readable storage medium | |
CN110706692A (en) | Training method and system of child voice recognition model | |
CN110600013B (en) | Training method and device for non-parallel corpus voice conversion data enhancement model | |
CN112017694B (en) | Voice data evaluation method and device, storage medium and electronic device | |
CN109741734B (en) | Voice evaluation method and device and readable medium | |
CN109976702A (en) | A kind of audio recognition method, device and terminal | |
CN107958673A (en) | A kind of spoken language methods of marking and device | |
CN112487139A (en) | Text-based automatic question setting method and device and computer equipment | |
CN108986798A (en) | Processing method, device and the equipment of voice data | |
CN107886968A (en) | Speech evaluating method and system | |
EP1398758B1 (en) | Method and apparatus for generating decision tree questions for speech processing | |
CN111841007A (en) | Game control method, device, equipment and storage medium | |
CN114125506B (en) | Voice auditing method and device | |
CN110349567B (en) | Speech signal recognition method and device, storage medium and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province. Applicant after: Sipic Technology Co.,Ltd. Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province. Applicant before: AI SPEECH Co.,Ltd. |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190910 |