CN109872714A

CN109872714A - Method, electronic device and storage medium for improving speech recognition accuracy

Info

Publication number: CN109872714A
Application number: CN201910072525.9A
Authority: CN
Inventors: 傅峰峰
Original assignee: Guangzhou Fugang Wanjia Intelligent Technology Co Ltd
Current assignee: Guangzhou Fugang Wanjia Intelligent Technology Co Ltd
Priority date: 2019-01-25
Filing date: 2019-01-25
Publication date: 2019-06-11

Abstract

The invention discloses a method for improving speech recognition accuracy, comprising the following steps: obtaining voice information of the current user through a sound collection device, and obtaining speech recognition information through speech recognition technology, the speech recognition information including a speech recognition probability and a speech recognition result; judging whether the speech recognition probability is greater than a first preset value and, if so, outputting the speech recognition result; obtaining mouth-shape information of the current user through an image capture device, and obtaining mouth-shape recognition information through image recognition technology; judging whether the mouth-shape recognition probability is higher than the speech recognition probability and, if so, outputting the mouth-shape recognition result. The invention also provides an electronic device and a computer-readable storage medium. By comprehensively comparing the speech recognition result with the mouth-shape recognition result, the method obtains the recognition result with the higher accuracy and thereby improves recognition accuracy.

Description

Method, electronic device and storage medium for improving speech recognition accuracy

Technical field

The present invention relates to the field of recognition technology, and in particular to a method, an electronic device and a storage medium for improving speech recognition accuracy.

Background

Speech recognition is a technology that converts digital speech into text that a computer can understand. In recent years speech recognition technology has made remarkable breakthroughs and has gradually entered people's lives, bringing convenience to daily life and work. It is now being applied in industry, household appliances, communications, automotive electronics, medical care, home services, consumer electronics and other fields. The present invention focuses mainly on speech recognition (that is, recognition of recorded audio), for example meeting minutes, recognition and analysis of telephone customer-service speech, and restaurant ordering. Although speech recognition technology has developed at an unprecedented pace and its accuracy is already relatively high, it still cannot be completely accurate. Further improving speech recognition accuracy has therefore become a technical problem that those skilled in the art urgently need to solve.

Summary of the invention

To overcome the deficiencies of the prior art, the first object of the present invention is to provide a method for improving speech recognition accuracy that can further improve recognition accuracy.

The second object of the present invention is to provide an electronic device that can further improve recognition accuracy.

The third object of the present invention is to provide a computer-readable storage medium that can further improve recognition accuracy.

The first object of the present invention is achieved by the following technical solution:

A method for improving speech recognition accuracy, comprising the following steps:

Speech recognition step: obtaining voice information of the current user through a sound collection device, and obtaining speech recognition information through speech recognition technology, the speech recognition information including a speech recognition probability and a speech recognition result;

First judgment step: judging whether the speech recognition probability is greater than a first preset value; if so, outputting the speech recognition result; if not, executing the mouth-shape recognition step;

Mouth-shape recognition step: obtaining mouth-shape information of the current user through an image capture device, and obtaining mouth-shape recognition information through image recognition technology, the mouth-shape recognition information including a mouth-shape recognition probability and a mouth-shape recognition result;

Second judgment step: judging whether the mouth-shape recognition probability is higher than the speech recognition probability; if so, outputting the mouth-shape recognition result.

Further, the second judgment step includes the following sub-steps:

Calculation step: calculating the difference between the mouth-shape recognition probability and the speech recognition probability;

Result step: judging whether the difference is greater than a second preset value, the second preset value being positive; if so, outputting the mouth-shape recognition result.

Further, in the result step, it is judged whether the difference is greater than the second preset value, the second preset value being positive; if so, the mouth-shape recognition result is output; if not, the speech recognition result and the mouth-shape recognition result are both output and marked.

Further, in the result step, it is judged whether the difference is greater than the second preset value; if so, the mouth-shape recognition result is output; if not, natural language processing technology is used to calculate the semantic relevance between the speech recognition result and the context sentences and between the mouth-shape recognition result and the context sentences, and the result with the higher semantic relevance is output as the prediction result.

Further, the speech recognition result, the mouth-shape recognition result and the prediction result are meeting minutes or food-ordering instructions.

Further, the first preset value is 85%.

Further, the second preset value is 5%.
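
The two judgment steps and the two preset values can be summarized in a short sketch. It assumes probabilities are expressed as fractions in [0, 1]; the names `Recognition` and `choose_result` are hypothetical, and for brevity the mouth-shape result is passed in directly even though the method only computes it when the first judgment fails.

```python
from dataclasses import dataclass

FIRST_PRESET = 0.85   # first preset value (85%)
SECOND_PRESET = 0.05  # second preset value (5%), must be positive


@dataclass
class Recognition:
    text: str
    probability: float  # fraction in [0, 1]


def choose_result(speech, mouth):
    """Return (texts_to_output, marked) following the two judgment steps."""
    # First judgment step: a confident speech result is output directly.
    if speech.probability > FIRST_PRESET:
        return [speech.text], False
    # Calculation step: signed difference, not an absolute value.
    difference = mouth.probability - speech.probability
    # Result step: the mouth-shape result wins only by a clear margin.
    if difference > SECOND_PRESET:
        return [mouth.text], False
    # Otherwise both results are output and marked for later checking.
    return [speech.text, mouth.text], True
```

With speech at 80% and mouth shape at 81%, the difference of 1% does not exceed the second preset value, so both results are returned with the marked flag set.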

Further, the sound collection device is a ring microphone, and the image capture device is a ring camera array.

The second object of the present invention is achieved by the following technical solution:

An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for improving speech recognition accuracy according to any one of the schemes of the first object of the invention.

The third object of the present invention is achieved by the following technical solution:

A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for improving speech recognition accuracy according to any one of the schemes of the first object of the invention.

Compared with the prior art, the beneficial effects of the present invention are as follows:

By comprehensively comparing the speech recognition result with the mouth-shape recognition result, the method of the invention obtains the recognition result with the higher accuracy and thereby improves recognition accuracy.

Brief description of the drawings

Fig. 1 is a flow chart of the method for improving speech recognition accuracy according to embodiment one.

Detailed description of the embodiments

The present invention is further described below with reference to the drawing and specific embodiments. It should be noted that, provided they do not conflict, the embodiments and technical features described below may be combined in any way to form new embodiments.

Embodiment one

This embodiment is described mainly with respect to two scenarios, meeting minutes and user food ordering. In practice the method is not limited to these two scenarios and can be applied to other scenarios according to actual needs.

As shown in Fig. 1, this embodiment provides a method for improving speech recognition accuracy, comprising the following steps:

S1: Voice information of the current user is obtained through a sound collection device, and speech recognition information is obtained through speech recognition technology; the speech recognition information includes a speech recognition probability and a speech recognition result. The sound collection device is a ring microphone, which can capture sound from all around a round table more efficiently and accurately; the cleaner the captured sound source, the more accurate the subsequent speech-to-text conversion.

Speech recognition technology mainly involves three aspects: feature extraction, pattern matching criteria and model training. A speech recognition system mainly consists of a speech signal sampling module, a speech signal preprocessing module, a speech feature parameter extraction module, a core recognition module and a recognition post-processing module. Pattern matching is the main process of speech recognition. The speaker's voice is first analyzed, features are extracted, and a dedicated speech model is built, from which the patterns needed for recognition are established. During recognition, the features of the incoming speech signal are compared against the previously established speech patterns using the overall recognition model, and a preset search and matching strategy yields the pattern that best matches the input speech. Finally, the computer outputs the recognition result.

In general there are three approaches to speech recognition: methods based on the vocal tract model and phonetic knowledge, template matching methods, and methods using artificial neural networks. Template matching is relatively mature and has reached the practical stage. It involves four steps: feature extraction, template training, template classification and decision. Three techniques are commonly used: dynamic time warping (DTW), hidden Markov model (HMM) theory, and vector quantization (VQ).
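
As an illustration of template matching with dynamic time warping, the following minimal sketch compares a one-dimensional feature sequence against stored templates and picks the closest one. The function names and the toy templates are hypothetical; a real system would compare sequences of multi-dimensional feature vectors rather than single numbers.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def classify(features, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))
```

Because DTW lets one sequence stretch against the other, an utterance spoken more slowly than its template can still match it with zero distance.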

The above only outlines the techniques used in the field of speech recognition. The recognition process is now described in detail for a single utterance. To recognize an utterance, the first step is to extract its speech features: from the input speech signal (a time-domain signal), a sequence of acoustic observation feature vectors O that can be modeled is extracted. Put simply, feature extraction turns the utterance to be recognized into a set of vectors that characterize it, and all subsequent operations on the speech are based on these vectors.

Given the observed feature vectors O, the task is to find the word sequence W that maximizes the probability P(W | O). This is what a person does on hearing an utterance: look for the known text that best matches it. This formula alone, however, does not solve the speech recognition problem. It must be transformed using Bayes' theorem into a form in which each factor can be modeled separately:

$$\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)}$$

Here P(O) is the prior probability of the acoustic observation. Because the input acoustic feature sequence is fixed during automatic speech recognition, P(O) can be treated as a constant; it plays no part in the maximization and can be ignored. Only P(O | W) and P(W) remain to be considered. The acoustic model provides the method for computing P(O | W), and the language model provides the method for computing P(W).
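
A small numeric sketch shows why P(O) can be dropped from the maximization. The candidate transcriptions and their scores are invented for illustration, and `posterior` additionally assumes that the listed candidates are the only possible word sequences, which a real decoder would not assume.

```python
import math


def best_word_sequence(acoustic, language):
    """Return the W that maximizes P(O|W) * P(W), computed in log space."""
    return max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(language[w]))


def posterior(acoustic, language):
    """P(W|O) for each candidate, assuming the candidates are exhaustive."""
    joint = {w: acoustic[w] * language[w] for w in acoustic}
    evidence = sum(joint.values())  # plays the role of P(O)
    return {w: p / evidence for w, p in joint.items()}


# Hypothetical scores for two candidate transcriptions of the same audio.
acoustic = {"recognize speech": 0.30, "wreck a nice beach": 0.35}    # P(O|W)
language = {"recognize speech": 0.020, "wreck a nice beach": 0.001}  # P(W)
```

Dividing every candidate by the same `evidence` rescales the scores without changing their order, so the candidate chosen by `best_word_sequence` is also the one with the highest posterior.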

Using the acoustic model, the language model and a pronunciation dictionary, a decoding space can be built. A decoder then searches this space with each group of input speech feature vectors to find the optimal word sequence, that is, the path that maximizes P(O | W) P(W). The resulting word sequence is the desired recognition result. Along with the recognition result, a corresponding recognition probability is also obtained, which indicates to some extent how accurate the result is.

S2: It is judged whether the speech recognition probability is greater than the first preset value; if so, the speech recognition result is output; if not, the mouth-shape recognition step is executed. The first preset value is 85%. In conventional meeting transcription, an utterance is recorded regardless of whether its recognition probability is high or low, which creates a serious risk of errors. One way to avoid this is to mark any text whose speech recognition probability is below the first preset value, for example in bold, italics or a different color, so that whoever later tidies up the minutes can check it. In food ordering, if the speech recognition accuracy is low, no order is placed; instead a voice prompt is issued to confirm with the user whether to proceed with the order.
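
The marking approach for meeting minutes can be sketched as follows. The `**...**` wrapper is only a stand-in for bold text, and `render_minutes` is a hypothetical helper that takes already recognized segments with their probabilities.

```python
FIRST_PRESET = 0.85  # first preset value (85%)


def render_minutes(segments, threshold=FIRST_PRESET):
    """Join (text, probability) segments, marking the low-confidence ones."""
    parts = []
    for text, probability in segments:
        if probability < threshold:
            parts.append(f"**{text}**")  # flagged for the person checking the minutes
        else:
            parts.append(text)
    return " ".join(parts)
```

For example, `render_minutes([("Budget approved", 0.93), ("by Friday", 0.61)])` leaves the first segment untouched and wraps the second in markers.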

This application does not take that approach, however. In this embodiment, to make the speech recognition system more automated, a second means of verifying the voice information is provided: recognizing the mouth shape from images. Because the two recognizers use different models and recognition logic, their errors are less likely to coincide, so verifying a sentence with two independent computations avoids some mistakes and further improves speech recognition accuracy. The system can be configured so that both recognizers run simultaneously and their results are then compared, or so that mouth-shape recognition runs first and speech recognition second.

The preferred implementation is as follows: the voice information is examined first, and mouth-shape recognition is then used to check the sentence. Because speech involves a relatively small amount of data compared with images, a judgment step is placed here so that mouth-shape recognition is used only when the speech recognition probability is low. This maintains recognition accuracy while keeping overall processing fast, consumes fewer computing resources, and improves the efficiency of the automated recognition as a whole.
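
The resource saving comes from calling the image-based recognizer lazily. In the sketch below, `speech_model` and `mouth_model` are hypothetical callables that each return a `(text, probability)` pair; falling back to the speech result when the mouth-shape probability is not higher is an assumption made here for completeness.

```python
def recognize(audio, video, speech_model, mouth_model, first_preset=0.85):
    """Return (text, probability, used_mouth_model)."""
    text, probability = speech_model(audio)
    if probability > first_preset:
        # Confident speech result: the costlier image path is never run.
        return text, probability, False
    mouth_text, mouth_probability = mouth_model(video)
    if mouth_probability > probability:
        return mouth_text, mouth_probability, True
    # Assumption: keep the speech result when mouth-shape recognition is not better.
    return text, probability, True
```

The third element of the return value makes it easy to measure how often the image path was actually needed.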

S3: Mouth-shape information of the current user is obtained through an image capture device, and mouth-shape recognition information is obtained through image recognition technology; the mouth-shape recognition information includes a mouth-shape recognition probability and a mouth-shape recognition result. The image capture device is a ring camera array. Preferably, the number of microphones in the sound collection device equals the number of cameras in the ring camera array, so that microphones and cameras correspond one to one and the information they capture can be matched directly, without having to work out the correspondence afterwards.
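
With a one-to-one pairing, a microphone index doubles as a camera index. The sketch below assumes, purely for illustration, that the active speaker is the one at the loudest microphone; the text does not say how the active channel is chosen, and the function name is hypothetical.

```python
def camera_for_active_microphone(mic_levels, camera_ids):
    """Return the camera paired with the loudest microphone."""
    if len(mic_levels) != len(camera_ids):
        raise ValueError("microphones and cameras must be paired one to one")
    # Assumption: the loudest channel belongs to the current speaker.
    loudest = max(range(len(mic_levels)), key=lambda i: mic_levels[i])
    return camera_ids[loudest]
```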

To combine the two, a camera can be mounted on each microphone, so that the unit captures not only the sound signal produced while the user speaks but also the accompanying mouth-shape image signal. The image signal includes at least the lip region of the face; to better recognize changes in mouth shape, it may also include other parts of the face, because mouth-shape changes are sometimes related to changes in facial expression.

Mouth-shape recognition requires a mouth-shape recognition library to be built first and is carried out by template matching. The library is trained on a large number of images or videos so that the model becomes more robust.

The concrete procedure is as follows. The image capture device uses the camera to obtain a video sequence containing only the changes in the user's mouth shape and feeds it to a video decoding unit. The video decoding unit applies key-frame extraction to the input lip-motion video to obtain representative key frames from the video stream, and sends the extracted key-frame sequence (normalized color still images of the lips) to an image preprocessing unit. The image preprocessing unit uses OpenCV library functions to convert the key-frame images to grayscale and apply median filtering, then binarizes the images, and finally scans them to remove noise, producing standardized binary mouth-shape images.
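
The grayscale, median-filter and binarization stages can be sketched without external dependencies. This is a pure-Python stand-in, not the OpenCV-based implementation the text refers to (in OpenCV the corresponding calls would be `cv2.cvtColor`, `cv2.medianBlur` and `cv2.threshold`); the luma weights, the 3x3 window and the threshold of 128 are assumptions, and the final scan-denoising stage is omitted.

```python
from statistics import median_low


def to_gray(rgb):
    """Rows of (r, g, b) tuples -> rows of ints, using common luma weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row] for row in rgb]


def median_filter3(gray):
    """3x3 median filter; border pixels use only the neighbours that exist."""
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                gray[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(median_low(window))  # always an actual pixel value
        out.append(row)
    return out


def binarize(gray, threshold=128):
    """Map each pixel to 1 (at or above threshold) or 0 (below)."""
    return [[1 if value >= threshold else 0 for value in row] for row in gray]
```

Chaining the three functions on a frame with a single dark noise pixel in a bright region removes the noise pixel before binarization.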

A feature extraction unit applies templates to the normalized binary images produced by preprocessing to extract mouth-shape features, yielding feature vectors that represent the mouth shape. A mouth-shape template library, built in advance, is the module that stores the standard mouth-shape feature vectors: it holds the standard mouth-shape templates collected in earlier tests, including lip-motion image samples (single or multiple) for the pronunciation of every Hanyu Pinyin letter and the feature vectors extracted from those mouth-shape images using the templates. A mouth-shape recognition unit recognizes the processed binary images, taking the feature vector of each image in sequence from the feature extraction unit, and finally produces the mouth-shape recognition information. As with speech, the mouth-shape recognition result has a corresponding mouth-shape recognition probability, which is later used to decide whether to output the result.
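
One way to picture the lookup in the template library is a nearest-template search over feature vectors. The sketch below uses cosine similarity and turns the similarities into a probability-like score by normalizing them; both choices, as well as the names and the two-dimensional toy vectors, are assumptions rather than the method specified in the text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_mouth_shape(feature, template_library):
    """Return (label, score) for the template most similar to `feature`."""
    scores = {label: cosine(feature, vec) for label, vec in template_library.items()}
    best = max(scores, key=scores.get)
    # Assumption: normalize non-negative similarities into a probability-like score.
    total = sum(max(s, 0.0) for s in scores.values())
    score = max(scores[best], 0.0) / total if total else 0.0
    return best, score
```

With templates `{"a": [1, 0], "o": [0, 1]}` and a feature `[0.9, 0.1]`, the match is `"a"` with a score of about 0.9.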

For meeting minutes, all mouth-shape features need to be collected and a sufficiently large amount of information stored, so that everything that may occur during a meeting can be predicted and recorded reasonably completely. Food ordering can be handled the same way, but another approach to mouth-shape recognition is also available. Because each merchant's menu is fixed, the sentences users typically say when ordering can be collected, covering the imperative and interrogative sentences that occur when placing an order. The collected speech sounds are then matched with mouth shapes, and the mouth-shape features corresponding to those sounds are extracted for training. This greatly reduces the effort of building the database, and because the database is smaller, matching is also much faster, giving a better overall result. The main purpose of this step is to obtain a recognition result from the mouth shape by image recognition.
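
The benefit of a closed ordering vocabulary can be illustrated with plain text matching. In this sketch, text strings stand in for mouth-shape feature sequences, the menu phrases are invented, and `match_order_phrase` is a hypothetical helper built on the standard library's `difflib.get_close_matches`.

```python
import difflib


def match_order_phrase(recognized, menu_phrases, cutoff=0.6):
    """Return the closest phrase from a fixed menu, or None if nothing is close."""
    matches = difflib.get_close_matches(recognized, menu_phrases, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Because only the merchant's own phrases are candidates, a slightly wrong recognition such as "kung pao chiken" still resolves to "kung pao chicken", while unrelated input returns `None`.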

S4: It is judged whether the mouth-shape recognition probability is higher than the speech recognition probability; if so, the mouth-shape recognition result is output. There are several ways to implement this. The simplest is to output whichever result has the higher recognition probability whenever the mouth-shape recognition probability exceeds the speech recognition probability. This is easy to do but has a problem: even though the mouth-shape probability is the higher of the two, both results may still be too inaccurate to accept by normal speech recognition standards, for example 80% for one and 81% for the other. The mouth-shape recognition probability is then 1% higher than the speech recognition probability, but for meeting minutes such a low recognition rate is still unacceptable. Moreover, because both results are derived probabilistically, they carry some margin of error, and a 1% gap is sometimes negligible.

A more preferred implementation is therefore as follows:

Step S4 specifically includes the following sub-steps:

Calculation step: the difference between the mouth-shape recognition probability and the speech recognition probability is calculated. This difference is signed; it is not an absolute value.

Result step: it is judged whether the difference is greater than the second preset value, the second preset value being positive; if so, the mouth-shape recognition result is output; if not, the speech recognition result and the mouth-shape recognition result are both output and marked. The second preset value is 5%. The negative branch covers the case where both recognition results have low accuracy, for example both recognition rates are around 80%. In a task such as meeting minutes, the two results are marked separately and their correspondence is indicated, so that the person checking the minutes later can locate and verify them precisely. In a task such as food ordering, one of the results is selected and sent to the audio output to be read aloud so that the customer can confirm it.

Although the above approach improves accuracy to some extent, this embodiment also adopts a further approach to increase automation, so that the person checking the minutes has less work and the customer can order the correct dish with fewer confirmations. In the result step, it is judged whether the difference is greater than the second preset value; if so, the mouth-shape recognition result is output; if not, natural language processing technology is used to calculate the semantic relevance between the speech recognition result and the context sentences and between the mouth-shape recognition result and the context sentences, and the result with the higher semantic relevance is output as the prediction result.
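
The selection by semantic relevance can be sketched with a deliberately simple relevance measure. Bag-of-words cosine similarity is used here only as a stand-in for whatever natural language processing model a real system would use; the function names are hypothetical, and preferring the speech result on a tie is an assumption.

```python
import math
from collections import Counter


def relevance(candidate, context):
    """Bag-of-words cosine similarity between a candidate and its context."""
    a, b = Counter(candidate.lower().split()), Counter(context.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def pick_by_context(speech_text, mouth_text, context):
    """Return whichever candidate is more relevant to the context sentences."""
    if relevance(mouth_text, context) > relevance(speech_text, context):
        return mouth_text
    return speech_text  # assumption: ties go to the speech result
```

Given the context "we will review the budget for the next quarter", the candidate "the budget is approved" scores higher than "the bucket is approved" and is returned regardless of which recognizer produced it.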

When the results are analyzed with natural language processing, the first thing to do is to segment both recognition results into words; only after segmentation can the two be compared and analyzed further. The comparison can proceed as follows: first, syntactic analysis identifies content in the two recognition results that does not conform to word usage rules; then semantic analysis is performed on the relationships between the words in the segmentation results.

The task of semantic analysis differs by linguistic unit. At the word level, the basic task is word sense disambiguation (WSD); at the sentence level, it is semantic role labeling (SRL). Supervised word sense disambiguation is a classification task performed using the context and annotated data. Unsupervised word sense disambiguation is usually treated as a clustering task: a clustering algorithm partitions all contexts of the same polysemous word into equivalence classes, and when a word's sense is to be identified, its context is compared with the equivalence class of contexts for each sense, and the sense is determined by the class that the context falls into. Besides supervised and unsupervised disambiguation, there are also dictionary-based disambiguation methods. The result that natural language processing finds to have the higher semantic relevance is output. The speech recognition result, the mouth-shape recognition result and the prediction result are meeting minutes or food-ordering instructions.
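
Dictionary-based disambiguation can be illustrated with a simplified Lesk-style overlap count: the sense whose dictionary gloss shares the most words with the context wins. The glosses in `DISH_GLOSSES` are invented for the example, and the text does not state which dictionary-based method is intended.

```python
# Hypothetical glosses for two senses of the word "dish".
DISH_GLOSSES = {
    "food": "a particular food prepared and served as part of a meal",
    "container": "a shallow vessel used for holding or serving things",
}


def simplified_lesk(word, context_words, glosses):
    """Pick the sense whose gloss overlaps most with the context words."""
    context = {w.lower() for w in context_words} - {word.lower()}
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

For the context "which dish on the menu is the spiciest meal", the only overlap is the word "meal" in the first gloss, so the "food" sense is selected.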

By combining speech recognition with mouth-shape recognition, the scheme of this embodiment further improves speech recognition accuracy. When both recognition results have low accuracy, natural language processing is used as a further correction to select the result with the higher semantic relevance. Verifying and judging the result from several angles greatly improves speech recognition accuracy and makes the system more practical.

Embodiment two

Embodiment two discloses an electronic device comprising a processor, a memory and a program. There may be one or more processors and memories. The program is stored in the memory and configured to be executed by the processor; when the processor executes the program, the method for improving speech recognition accuracy of embodiment one is implemented. The electronic device may be a mobile phone, a computer, a tablet computer or a similar device.

Embodiment three

Embodiment three discloses a computer-readable storage medium for storing a program; when the program is executed by a processor, the method for improving speech recognition accuracy of embodiment one is implemented.

Of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present invention, the computer-executable instructions are not limited to the method operations described above and can also perform related operations in the method provided by any embodiment of the invention.

From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, or of course by hardware alone, although in many cases the former is the better implementation. On this understanding, the part of the technical solution that is essential, or that contributes over the prior art, can be embodied as a software product. The computer software product can be stored in a computer-readable storage medium, such as a floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk or optical disc, and includes instructions that cause an electronic device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.

It is worth noting that, in the above embodiment of the content-update notification device, the units and modules are divided only according to functional logic; the division is not limited to the one described, provided the corresponding functions can be realized. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the scope of protection of the present invention.

The above embodiments are only preferred embodiments of the present invention and do not limit its scope of protection. Any insubstantial change or substitution made by those skilled in the art on the basis of the present invention falls within the scope claimed by the present invention.

Claims

1. A method for improving speech recognition accuracy, characterized by comprising the following steps:

Speech recognition step: obtaining voice information of the current user through a sound collection device, and obtaining speech recognition information through speech recognition technology, the speech recognition information including a speech recognition probability and a speech recognition result;

First judgment step: judging whether the speech recognition probability is greater than a first preset value; if so, outputting the speech recognition result; if not, executing the mouth-shape recognition step;

Mouth-shape recognition step: obtaining mouth-shape information of the current user through an image capture device, and obtaining mouth-shape recognition information through image recognition technology, the mouth-shape recognition information including a mouth-shape recognition probability and a mouth-shape recognition result;

Second judgment step: judging whether the mouth-shape recognition probability is higher than the speech recognition probability; if so, outputting the mouth-shape recognition result.

2. The method for improving speech recognition accuracy according to claim 1, characterized in that the second judgment step includes the following sub-step:

Result step: judging whether the difference between the mouth-shape recognition probability and the speech recognition probability is greater than a second preset value, the second preset value being positive; if so, outputting the mouth-shape recognition result.

3. The method for improving speech recognition accuracy according to claim 2, characterized in that, in the result step, it is judged whether the difference is greater than the second preset value, the second preset value being positive; if so, the mouth-shape recognition result is output; if not, the speech recognition result and the mouth-shape recognition result are both output and marked.

4. The method for improving speech recognition accuracy according to claim 2, characterized in that, in the result step, it is judged whether the difference is greater than the second preset value; if so, the mouth-shape recognition result is output; if not, natural language processing technology is used to calculate the semantic relevance between the speech recognition result and the context sentences and between the mouth-shape recognition result and the context sentences, and the result with the higher semantic relevance is output as the prediction result.

5. The method for improving speech recognition accuracy according to claim 4, characterized in that the speech recognition result, the mouth-shape recognition result and the prediction result are meeting minutes or food-ordering instructions.

6. The method for improving speech recognition accuracy according to any one of claims 1-5, characterized in that the first preset value is 85%.

7. The method for improving speech recognition accuracy according to any one of claims 1-5, characterized in that the second preset value is 5%.

8. The method for improving speech recognition accuracy according to any one of claims 1-5, characterized in that the sound collection device is a ring microphone and the image capture device is a ring camera array.

9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method for improving speech recognition accuracy according to any one of claims 1-8.

10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method for improving speech recognition accuracy according to any one of claims 1-8.