CN109686365A - Speech recognition method and speech recognition system - Google Patents
Speech recognition method and speech recognition system
- Publication number
- CN109686365A (application number CN201811599441.2A / CN201811599441A)
- Authority
- CN
- China
- Prior art keywords
- information
- voice
- error correction
- recognition result
- coding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 24
- 238000012937 correction Methods 0.000 claims abstract description 104
- 230000007613 environmental effect Effects 0.000 claims abstract description 22
- 238000012545 processing Methods 0.000 claims description 6
- 238000005070 sampling Methods 0.000 claims description 4
- 238000000605 extraction Methods 0.000 claims description 2
- 238000001914 filtration Methods 0.000 claims description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000012216 screening Methods 0.000 description 2
- 238000001228 spectrum Methods 0.000 description 2
- 238000012549 training Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000002790 cross-validation Methods 0.000 description 1
- 230000010365 information processing Effects 0.000 description 1
- 230000001737 promoting effect Effects 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/005—Correction of errors induced by the transmission channel, if related to the coding algorithm
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/221—Announcement of recognition results
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
- Telephonic Communication Services (AREA)
Abstract
The present invention provides a speech recognition method and system. The method comprises the following steps: S1, obtaining voice information input by a user that contains an error, together with voice error-correction information that corrects the voice information, and storing the voice information and the voice error-correction information separately; S2, performing preliminary processing on the voice information and the voice error-correction information, and encoding the processed voice information and voice error-correction information; S3, deriving the corresponding text from the voice information code and from the voice error-correction information code respectively, and comparing the text derived from the voice information code with the text derived from the voice error-correction information code, to obtain a first recognition result; S4, obtaining information about the environment in which the user input the voice information, and obtaining a second recognition result according to the environmental information; S5, comparing the second recognition result with dictionary information to obtain a final recognition result. The present invention can recognize voice information quickly and improve working efficiency.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speech recognition method and a speech recognition system.
Background technique
When selecting the recognition unit, a speech recognition system requires that the unit be precisely defined, that sufficient training data be available, and that the unit generalize well. English systems generally use context-dependent phoneme modeling; since coarticulation in Chinese is less severe than in English, syllable-based modeling can be used. The amount of training data a system needs depends on model complexity. If the model is designed to be more complex than the available training data can support, performance declines sharply.
In the prior art, voice information is input through a microphone, and if the input is wrong it can only be deleted and re-entered. This is unfavorable to the quick recognition of voice information and reduces working efficiency.
Summary of the invention
The technical problem to be solved by the present invention is to provide a speech recognition method that, when the input voice information contains an error, can quickly recognize the voice information without deleting the voice information already entered.
In order to solve the above technical problem, the present invention provides a speech recognition method comprising the following steps:
S1, obtaining voice information input by a user that contains an error, together with voice error-correction information that corrects the voice information, and storing the voice information and the voice error-correction information separately;
S2, performing preliminary processing on the voice information and the voice error-correction information, extracting feature information from the processed voice information and voice error-correction information and encoding it, to obtain a voice information code and a voice error-correction information code;
S3, deriving the corresponding text from the voice information code and from the voice error-correction information code respectively, and comparing the text derived from the voice information code with the text derived from the voice error-correction information code, to obtain a first recognition result;
S4, obtaining information about the environment in which the user input the voice information, and obtaining a second recognition result according to the environmental information;
S5, comparing the second recognition result with dictionary information to obtain a final recognition result, and presenting the final recognition result to the user.
Wherein, in step S2, the preliminary processing of the voice information and the voice error-correction information specifically includes:
filtering the voice information and the voice error-correction information respectively, and sampling the filtered voice information and voice error-correction information respectively;
encoding the sampled voice information and the sampled voice error-correction information respectively, to obtain the voice information code and the voice error-correction information code.
Wherein, step S3 specifically includes:
comparing the voice information code with an existing acoustic model and language model, obtaining the codes of the acoustic model and language model that are similar to the voice information code, and deriving first text corresponding to the voice information code from the similar codes;
comparing the voice error-correction information code with the existing acoustic model and language model, obtaining the codes of the acoustic model and language model that are similar to the voice error-correction information code, and deriving second text corresponding to the voice error-correction information code from the similar codes;
comparing the first text with the second text, finding the first text and second text with the highest similarity, and replacing the part of the first text that is similar to the second text with the second text, to form the first recognition result.
Wherein, the acoustic model is a Hidden Markov Model.
Wherein, step S4 specifically includes:
capturing an image of the environment in which the user input the voice information, and recognizing the environmental information in the image;
inferring the user's likely needs from the environmental information, and screening out the second recognition result according to those likely needs.
Wherein, step S5 specifically includes:
comparing the second recognition result with dictionary information, and rejecting any second recognition result that does not conform to the language format, to obtain a third recognition result;
comparing the third recognition result with recognition results stored by the user for similarity, arranging the results in descending order of similarity, and presenting them to the user.
The present invention also provides a speech recognition system, the system comprising:
an acquiring unit, for obtaining voice information input by a user that contains an error, together with voice error-correction information that corrects the voice information, and storing the voice information and the voice error-correction information separately;
a processing unit, for performing preliminary processing on the voice information and the voice error-correction information, extracting feature information from the processed voice information and voice error-correction information and encoding it, to obtain a voice information code and a voice error-correction information code;
a derivation recognition unit, for deriving the corresponding text from the voice information code and from the voice error-correction information code respectively, and comparing the text derived from the voice information code with the text derived from the voice error-correction information code, to obtain a first recognition result;
an environment recognition unit, for obtaining information about the environment in which the user input the voice information, and obtaining a second recognition result according to the environmental information;
a comparison recognition unit, for comparing the second recognition result with recognition results stored by the user, obtaining a final recognition result, and presenting the final recognition result to the user.
Wherein, the derivation recognition unit includes:
a first comparison and derivation unit, for comparing the voice information code with an existing acoustic model and language model, obtaining the codes of the acoustic model and language model that are similar to the voice information code, and deriving first text corresponding to the voice information code from the similar codes;
a second comparison and derivation unit, for comparing the voice error-correction information code with the existing acoustic model and language model, obtaining the codes of the acoustic model and language model that are similar to the voice error-correction information code, and deriving second text corresponding to the voice error-correction information code from the similar codes;
a comparison and replacement unit, for comparing the first text with the second text, finding the first text and second text with the highest similarity, and replacing the part of the first text that is similar to the second text with the second text, to form the first recognition result.
The beneficial effects of the embodiments of the present invention are as follows: the present invention encodes the acquired voice information and voice error-correction information, derives text from the voice information code and from the voice error-correction information code respectively, compares the two derived texts, and replaces the corresponding part of the text derived from the voice information code with the highly similar text derived from the voice error-correction information code, obtaining the first recognition result; it then obtains information about the environment in which the user input the voice information, screens the first recognition result according to the environmental information to obtain the second recognition result, and compares the second recognition result with dictionary information to obtain the final recognition result. With the speech recognition method of the embodiments of the present invention, when the voice input contains an error there is no need to delete it and re-enter it, which is conducive to the quick recognition of voice information and improves working efficiency.
Detailed description of the invention
In order to more clearly explain the embodiment of the invention or the technical proposal in the existing technology, to embodiment or will show below
There is attached drawing needed in technical description to be briefly described, it should be apparent that, the accompanying drawings in the following description is only this
Some embodiments of invention for those of ordinary skill in the art without creative efforts, can be with
It obtains other drawings based on these drawings.
Fig. 1 is a schematic flow diagram of a speech recognition method according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a speech recognition system according to an embodiment of the present invention.
Specific embodiment
The following embodiments are described with reference to the accompanying drawings, which illustrate specific embodiments in which the present invention may be practiced.
Referring to Fig. 1, Embodiment 1 of the present invention provides a speech recognition method comprising the following steps:
S1, obtaining voice information input by a user that contains an error, together with voice error-correction information that corrects the voice information, and storing the voice information and the voice error-correction information separately.
Specifically, voice input is performed through a voice input option, and the entered voice information may contain an error. If a small mistake occurs during input, the user selects the voice error-correction entry option to enter a correction; the error-correction entry only needs the information at the position of the mistake, and this entry is the voice error-correction information. The voice information and the voice error-correction information are stored separately.
For example, suppose the voice information the user intends to enter is "find the nearest gas station", but during entry, for some reason, the entered voice information is "find a close gas station". The user then selects the voice error-correction entry option and enters the voice error-correction information "nearest".
S2, performing preliminary processing on the voice information and the voice error-correction information, extracting feature information from the processed voice information and voice error-correction information and encoding it, to obtain a voice information code and a voice error-correction information code.
Specifically, the voice information and the voice error-correction information are filtered to eliminate noise and echo and improve their quality. The filtered voice information and voice error-correction information are sampled, and the analog signals are converted into digital signals by an A/D converter. Feature information is then extracted from, and encoding is applied to, the digital signal converted from the voice information and the digital signal converted from the voice error-correction information respectively, to obtain the voice information code and the voice error-correction information code.
The feature information is the Mel-frequency cepstral coefficient (MFCC) feature. The MFCC feature is based on a linear transform of the logarithmic energy spectrum on the nonlinear Mel scale of audio frequency: the time-domain signal is first converted to the frequency domain with an FFT; its logarithmic energy spectrum is then convolved with a triangular filter bank distributed according to the Mel scale; finally, a discrete cosine transform is applied to the vector formed by the filter outputs, and the first N coefficients are taken. PLP still uses the Durbin method to compute LPC parameters, but when computing the autocorrelation parameters it likewise applies a DCT to the logarithmic energy spectrum of the auditory excitation. Performing this preliminary processing on the voice information improves the quality of the voice information and the error-correction voice information, which helps improve the quality of subsequent recognition.
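The MFCC pipeline described above (FFT, Mel-scale triangular filter bank, log energy, DCT) can be sketched in NumPy as follows. This is a minimal illustration, not the patent's implementation; the frame length, filter count, and number of coefficients are illustrative choices.

```python
import numpy as np

def hz_to_mel(f):
    # Nonlinear Mel-scale warping of frequency
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    # 1) Time domain -> frequency domain via FFT (power spectrum)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    # 2) Log energy through Mel-spaced triangular filters
    energies = np.log(mel_filterbank(n_filters, len(frame), sample_rate) @ spectrum + 1e-10)
    # 3) DCT-II of the filter outputs; keep the first N coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return dct @ energies

# A 25 ms frame of a synthetic 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(400) / sr
coeffs = mfcc_frame(np.sin(2 * np.pi * 440 * t), sr)
print(coeffs.shape)  # (13,)
```

In practice each utterance is split into overlapping frames and this computation runs per frame, yielding one feature vector per frame.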
S3, deriving the corresponding text from the voice information code and from the voice error-correction information code respectively, and comparing the text derived from the voice information code with the text derived from the voice error-correction information code, to obtain a first recognition result.
Specifically, the voice information code is compared with an existing acoustic model and language model to obtain the codes of the acoustic model and language model that are similar to the voice information code, and the first text corresponding to the voice information code is derived from the similar codes. The voice error-correction information code is compared with the existing acoustic model and language model to obtain the codes of the acoustic model and language model that are similar to the voice error-correction information code, and the second text corresponding to the voice error-correction information code is derived from the similar codes. The first text is compared with the second text, the first text and second text with the highest similarity are found, and the part of the first text that is similar to the second text is replaced with the second text, forming the first recognition result.
The acoustic model is one of the most important parts of a speech recognition system, and most current mainstream systems use Hidden Markov Models for modeling. Conceptually, a hidden Markov model is a discrete time-domain finite-state automaton: the internal states of this Markov model are invisible to the outside world, which can only see the output value at each moment. A language model is a simple, unified, abstract formal system; once linguistic facts are described by a language model they are better suited to automatic processing by a computer, so language models are of great significance for natural language information processing. Through comparison and analysis, matching candidates are assembled: each unit has a corresponding code, which is compared with the stored acoustic model and language model, and all similar codes are selected, completing the preliminary recognition and helping to improve its efficiency and quality. The output values of the acoustic model are usually the acoustic features computed from each frame, and these features constitute the acoustic codes; the language model is an abstract mathematical model built from linguistic facts, and its features constitute the language codes. This makes it convenient to cross-validate against the collected voice codes, compare among the obtained results, select the candidates with the highest similarity, and then derive the text from the codes.
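The hidden-state/visible-output distinction above can be made concrete with the forward algorithm of a discrete HMM, which scores how well an observation sequence matches the model. This is a minimal sketch: the two states, three output symbols, and all probabilities are toy values invented for illustration, not part of the patent.

```python
import numpy as np

# Toy discrete HMM: 2 hidden states, 3 observable symbols.
# A[i, j] = P(state j at t+1 | state i at t)   -- hidden transitions
# B[i, k] = P(symbol k emitted | state i)       -- the visible output values
# pi[i]   = P(initial state i)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])

def forward_likelihood(obs):
    """Forward algorithm: P(observation sequence | model)."""
    alpha = pi * B[:, obs[0]]          # initialize with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate states, then emit
    return alpha.sum()

print(round(forward_likelihood([0, 1, 2]), 6))  # 0.03628
```

In recognition, one such model is trained per unit (e.g. per syllable), and the unit whose model gives the highest likelihood for the observed feature sequence is selected.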
For example, taking "find the nearest gas station": the codes with the highest similarity are obtained through the acoustic model and language model, from which multiple groups of text can be derived. Then, by comparing the voice error-correction information with the voice information, the voice information and voice error-correction information with the highest similarity can be selected. For instance, "close" in "find a close gas station" is most similar to "nearest" in the error-correction voice information, so it can be replaced. Of course, the input may also be recognized as a near-homophonous but incorrect phrase, with the error-correction voice information misrecognized accordingly; after replacement, such a phrase becomes another alternative. It can be seen that the first recognition result is "find the nearest gas station" or one of these near-homophonous alternatives.
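The replacement step in S3, swapping out the span of the first text most similar to the error-correction text, can be sketched with Python's standard-library difflib. The English example strings are stand-ins chosen so that their character similarity mirrors the patent's example; the sliding-window approach is one illustrative way to find the most similar part, not the patent's specified algorithm.

```python
from difflib import SequenceMatcher

def apply_correction(first_text, correction):
    """Replace the span of first_text most similar to the correction.

    Slides a window of the correction's word length across first_text,
    keeps the window with the highest similarity ratio, and replaces it
    with the correction, forming the first recognition result.
    """
    words = first_text.split()
    corr_words = correction.split()
    n = len(corr_words)
    best_i, best_ratio = 0, -1.0
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        ratio = SequenceMatcher(None, window, correction).ratio()
        if ratio > best_ratio:
            best_i, best_ratio = i, ratio
    return " ".join(words[:best_i] + corr_words + words[best_i + n:])

# Misrecognized input plus the user's error-correction entry
print(apply_correction("find the near gas station", "nearest"))
# -> find the nearest gas station
```

A production system would compare at the phonetic or code level rather than on surface text, but the replace-most-similar-span logic is the same.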
S4, obtaining information about the environment in which the user input the voice information, and obtaining a second recognition result according to the environmental information.
Specifically, a camera device photographs the user's surroundings at the time the voice information is input; a high-definition camera device is used so that the environment at the time can be recognized. By recognizing the location where the user is, the user's needs can be roughly judged. For example, the user's location may be an urban district, a highway, or a suburb. Nouns strongly associated with an urban district may be office buildings, residential compounds, or hotels; nouns strongly associated with a highway may be gas stations, parking lots, garages, and the like; and nouns strongly associated with a suburb may be suburban village names and the like. By recognizing the user's location, the nouns strongly associated with that location can be obtained, and according to these strongly associated nouns, the recognition results in the first recognition result that clearly do not fit can be rejected.
For example, again taking "find the nearest gas station": from the photo taken when the user input the voice information, it is known that the user was on a highway at the time, and the nouns most strongly associated with a highway may be gas stations, parking lots, garages, and so on. The near-homophonous alternatives in the first recognition result that are unrelated to a highway can therefore be rejected, so that the second recognition result retains "find a close gas station" and the alternatives that still mention a gas station.
S5, comparing the second recognition result with the stored dictionary information to obtain a final recognition result, and presenting the final recognition result to the user.
By comparing the second recognition result with the stored dictionary information, the recognition results that clearly do not conform to the rules of the language are deleted, yielding the final recognition results. The final recognition results are compared with the past recognition results stored by the user to obtain the similarity of each final recognition result, and the final recognition results are shown to the user in descending order of similarity, making it convenient for the user to look through them and select the expected recognition result, which improves the efficiency and quality of recognition.
After the user has selected the final recognition result, it is played back through a loudspeaker, and the correct recognition result is stored, which also serves as a reminder to other personnel. By confirming the recognition result again and storing it, the stored results can be expanded for the user's next use.
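The two parts of S5, rejecting candidates that fail the dictionary check and then ordering the survivors by similarity to the user's stored past results, can be sketched as follows. The word-set dictionary check and the similarity measure are simplifications chosen for illustration.

```python
from difflib import SequenceMatcher

def rank_final_results(candidates, dictionary_words, history):
    """S5: drop candidates containing out-of-dictionary words, then sort
    the rest by best similarity to any stored past result, descending."""
    def in_dictionary(sentence):
        return all(w in dictionary_words for w in sentence.split())

    def best_similarity(sentence):
        return max((SequenceMatcher(None, sentence, h).ratio() for h in history),
                   default=0.0)

    valid = [c for c in candidates if in_dictionary(c)]
    return sorted(valid, key=best_similarity, reverse=True)

dictionary = {"find", "the", "nearest", "gas", "station", "a", "close"}
history = ["find the nearest gas station"]   # user's stored past results
candidates = [
    "find a close gas station",
    "find the nearest gas station",
    "find the nearest gsa sttaion",          # fails the dictionary check
]
print(rank_final_results(candidates, dictionary, history))
# ['find the nearest gas station', 'find a close gas station']
```

The top-ranked entries are then presented to the user in order, matching the descending-similarity display described above.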
With the speech recognition method of the embodiments of the present invention, the acquired voice information and voice error-correction information are encoded; text is derived from the voice information code and from the voice error-correction information code respectively; the two derived texts are compared, and the corresponding part of the text derived from the voice information code is replaced with the highly similar text derived from the voice error-correction information code, obtaining the first recognition result; the environmental information at the time the user input the voice information is obtained, and the first recognition result is screened according to the environmental information to obtain the second recognition result; the second recognition result is compared with dictionary information to obtain the final recognition result. When the voice input contains an error, there is no need to delete it and re-enter it, which is conducive to the quick recognition of voice information and improves working efficiency.
Based on Embodiment 1, Embodiment 2 of the present invention provides a speech recognition system. As shown in Fig. 2, the system 1 includes:
an acquiring unit 11, for obtaining the voice information input by a user and the voice error-correction information that corrects the input voice information, and storing the voice information and the voice error-correction information separately;
a processing unit 12, for performing preliminary processing on the voice information and the voice error-correction information, extracting feature information from the processed voice information and voice error-correction information and encoding it, to obtain a voice information code and a voice error-correction information code;
a derivation recognition unit 13, for deriving the corresponding text from the voice information code and from the voice error-correction information code respectively, and comparing the text derived from the voice information code with the text derived from the voice error-correction information code, to obtain a first recognition result;
an environment recognition unit 14, for obtaining information about the environment in which the user input the voice information, and rejecting, according to the environmental information, the recognition results in the preliminary recognition result that are unrelated to the environmental information, to obtain a second recognition result;
a comparison recognition unit 15, for comparing the second recognition result with the recognition results stored by the user, obtaining a final recognition result, and presenting the final recognition result to the user.
The back-derivation recognition unit 13 comprises:
a first comparison and back-derivation unit, configured to compare the voice information code with the existing acoustic model and language model, obtain the codes in the acoustic model and language model similar to the voice information code, and derive from the similar codes the first text information corresponding to the voice information code;
a second comparison and back-derivation unit, configured to compare the voice error-correction information code with the existing acoustic model and language model, obtain the codes in the acoustic model and language model similar to the voice error-correction information code, and derive from the similar codes the second text information corresponding to the voice error-correction information code;
a comparison and replacement unit, configured to compare the first text information with the second text information, obtain the first text information and second text information with the highest similarity, and replace the portion of the first text information similar to the second text information with the second text information, forming the first recognition result.
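The patent does not disclose an implementation of the comparison and replacement unit, so the following is purely an illustrative sketch, not the patented method. It assumes word-level matching with Python's standard `difflib`; the function name `merge_with_correction` and the similarity cutoff are inventions for this example.

```python
import difflib

def merge_with_correction(first_text: str, second_text: str) -> str:
    """Replace the words of first_text most similar to the corrected
    second_text (the claim's 'similar part'), yielding the first
    recognition result. Cutoff 0.5 is an arbitrary illustrative choice."""
    words = first_text.split()
    for corrected in second_text.split():
        # Find the recognized word closest to this corrected word, if any.
        close = difflib.get_close_matches(corrected, words, n=1, cutoff=0.5)
        if close:
            words[words.index(close[0])] = corrected
    return " ".join(words)

print(merge_with_correction("turn on the lite", "light"))  # → turn on the light
```

A character-level variant using `SequenceMatcher.get_opcodes` would also fit the claim wording; word-level replacement is simply easier to follow.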
The above disclosure describes only preferred embodiments of the present invention and certainly cannot limit the scope of its claims; equivalent changes made in accordance with the claims of the present invention therefore remain within the scope of the present invention.
Claims (8)
1. A voice recognition method, characterized by comprising the steps of:
S1: obtaining voice information containing error information input by a user and voice error-correction information that corrects the voice information, and storing the voice information and the voice error-correction information respectively;
S2: performing preliminary processing on the voice information and the voice error-correction information, extracting feature information from the processed voice information and voice error-correction information, and encoding it to obtain a voice information code and a voice error-correction information code;
S3: deriving the corresponding text information from the voice information code and from the voice error-correction information code respectively, comparing the text information derived from the voice information code with that derived from the voice error-correction information code, and obtaining a first recognition result;
S4: obtaining the environment information of the setting in which the user inputs the voice information, and obtaining a second recognition result according to the environment information;
S5: comparing the second recognition result with dictionary information to obtain a final recognition result, and presenting the final recognition result to the user.
2. The method according to claim 1, characterized in that the preliminary processing of the voice information and the voice error-correction information in step S2 specifically comprises:
filtering the voice information and the voice error-correction information respectively, and sampling the filtered voice information and voice error-correction information respectively;
encoding the sampled voice information and the sampled voice error-correction information respectively to obtain the voice information code and the voice error-correction information code.
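Claim 2 names three stages (filter, sample, encode) without fixing any algorithm. As an illustration only, a toy pipeline might look like the sketch below; the moving-average filter, decimation factor, and 8-bit quantization are all assumptions, not part of the claim.

```python
def preprocess(signal, factor=2, levels=256):
    """Toy version of the claim-2 pipeline: filter (3-point moving
    average), sample (keep every `factor`-th value), then encode
    (quantize amplitudes in [-1, 1] to `levels` integer codes)."""
    # Filtering: pad the edges, then smooth to suppress noise spikes.
    padded = [signal[0]] + list(signal) + [signal[-1]]
    smoothed = [(padded[i - 1] + padded[i] + padded[i + 1]) / 3
                for i in range(1, len(padded) - 1)]
    # Sampling: simple decimation.
    sampled = smoothed[::factor]
    # Encoding: map amplitude to an integer code, clamped to range.
    return [min(levels - 1, max(0, int((x + 1) / 2 * (levels - 1))))
            for x in sampled]
```

A real system would use a proper anti-aliasing filter and spectral features (e.g. MFCCs) rather than raw amplitude codes; this sketch only mirrors the claim's three-stage structure.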
3. The method according to claim 2, characterized in that step S3 specifically comprises:
comparing the voice information code with the existing acoustic model and language model, obtaining the codes in the acoustic model and language model similar to the voice information code, and deriving from the similar codes the first text information corresponding to the voice information code;
comparing the voice error-correction information code with the existing acoustic model and language model, obtaining the codes in the acoustic model and language model similar to the voice error-correction information code, and deriving from the similar codes the second text information corresponding to the voice error-correction information code;
comparing the first text information with the second text information, obtaining the first text information and second text information with the highest similarity, and replacing the portion of the first text information similar to the second text information with the second text information, forming the first recognition result.
4. The method according to claim 3, characterized in that:
the acoustic model is a hidden Markov model.
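Claim 4 only names a hidden Markov model; it does not specify how the model is queried. A standard way to score an observation sequence against an HMM is the forward algorithm, sketched below for a discrete-output HMM (the toy parameters in the usage note are assumptions for illustration, not values from the patent).

```python
def forward_prob(obs, init, trans, emit):
    """P(obs | model) for a discrete-output HMM via the forward
    algorithm. A decoder would evaluate this likelihood under each
    word/phoneme model and pick the model that maximizes it."""
    n = len(init)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)

# Two-state toy model: start in state 0, states never switch.
p = forward_prob([0, 1], init=[1.0, 0.0],
                 trans=[[1.0, 0.0], [0.0, 1.0]],
                 emit=[[0.5, 0.5], [0.9, 0.1]])  # → 0.25
```

Production recognizers use continuous emission densities (e.g. Gaussian mixtures) and log-space arithmetic to avoid underflow; this discrete version keeps the recurrence visible.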
5. The method according to claim 4, characterized in that step S4 specifically comprises:
capturing an image of the environment in which the user inputs the voice information, and recognizing the environment information in the image;
deriving the user's possible demand from the environment information, and screening out the second recognition result according to the possible demand.
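The claim leaves the screening criterion open. One minimal, purely illustrative reading (keyword overlap between scene labels and candidate transcripts; the function name and fallback behavior are assumptions) is:

```python
def screen_by_environment(candidates, scene_labels):
    """Keep only recognition candidates sharing a word with the labels
    detected in the image of the user's surroundings (the claim's
    'possible demand'); fall back to all candidates if none match."""
    labels = {w.lower() for w in scene_labels}
    kept = [c for c in candidates if labels & {w.lower() for w in c.split()}]
    return kept or list(candidates)

# Scene labels would come from an image classifier in the full system.
print(screen_by_environment(["turn on the oven", "open the garage"],
                            ["kitchen", "oven"]))  # → ['turn on the oven']
```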
6. The method according to claim 5, characterized in that step S5 specifically comprises:
comparing the second recognition result with dictionary information, rejecting second recognition results that do not conform to the language format, and obtaining a third recognition result;
comparing the third recognition result with the recognition results stored by the user for similarity, arranging the results in descending order of similarity, and presenting them to the user.
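The claim does not define the similarity measure. As a sketch only, ranking by edit-based similarity against the user's stored history could look like this (the `difflib` ratio and the best-match aggregation are assumptions for illustration):

```python
import difflib

def rank_results(third_results, stored_results):
    """Order the surviving results by their best similarity to the
    user's previously stored recognition results, descending, as in
    the last step of claim 6."""
    def best_score(result):
        # Highest ratio against any stored result.
        return max(difflib.SequenceMatcher(None, result, s).ratio()
                   for s in stored_results)
    return sorted(third_results, key=best_score, reverse=True)

print(rank_results(["pay tax", "play jazz music"],
                   ["play some jazz music"]))
```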
7. A voice recognition system, characterized in that the system comprises:
an acquiring unit, configured to obtain voice information containing error information input by a user and voice error-correction information that corrects the voice information, and to store the voice information and the voice error-correction information respectively;
a processing unit, configured to perform preliminary processing on the voice information and the voice error-correction information, extract feature information from the processed voice information and voice error-correction information, and encode it to obtain a voice information code and a voice error-correction information code;
a back-derivation recognition unit, configured to derive the corresponding text information from the voice information code and from the voice error-correction information code respectively, compare the text information derived from the voice information code with that derived from the voice error-correction information code, and obtain a first recognition result;
an environment recognition unit, configured to obtain the environment information of the setting in which the user inputs the voice information, and to obtain a second recognition result according to the environment information;
a comparison recognition unit, configured to compare the second recognition result with the recognition results stored by the user, obtain a final recognition result, and present the final recognition result to the user.
8. The system according to claim 7, characterized in that the back-derivation recognition unit comprises:
a first comparison and back-derivation unit, configured to compare the voice information code with the existing acoustic model and language model, obtain the codes in the acoustic model and language model similar to the voice information code, and derive from the similar codes the first text information corresponding to the voice information code;
a second comparison and back-derivation unit, configured to compare the voice error-correction information code with the existing acoustic model and language model, obtain the codes in the acoustic model and language model similar to the voice error-correction information code, and derive from the similar codes the second text information corresponding to the voice error-correction information code;
a comparison and replacement unit, configured to compare the first text information with the second text information, obtain the first text information and second text information with the highest similarity, and replace the portion of the first text information similar to the second text information with the second text information, forming the first recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811599441.2A CN109686365B (en) | 2018-12-26 | 2018-12-26 | Voice recognition method and voice recognition system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109686365A true CN109686365A (en) | 2019-04-26 |
CN109686365B CN109686365B (en) | 2021-07-13 |
Family
ID=66188586
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811599441.2A Active CN109686365B (en) | 2018-12-26 | 2018-12-26 | Voice recognition method and voice recognition system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109686365B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334271A (en) * | 2019-05-21 | 2019-10-15 | 北京奇艺世纪科技有限公司 | A kind of search result optimization method, system, electronic equipment and storage medium |
CN111356022A (en) * | 2020-04-18 | 2020-06-30 | 徐琼琼 | Video file processing method based on voice recognition |
CN111524511A (en) * | 2020-04-01 | 2020-08-11 | 黑龙江省农业科学院农业遥感与信息研究所 | Agricultural technology consultation man-machine conversation method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104951077A (en) * | 2015-06-24 | 2015-09-30 | 百度在线网络技术(北京)有限公司 | Man-machine interaction method and device based on artificial intelligence and terminal equipment |
CN105206260A (en) * | 2015-08-31 | 2015-12-30 | 努比亚技术有限公司 | Terminal voice broadcasting method, device and terminal voice operation method |
CN105374356A (en) * | 2014-08-29 | 2016-03-02 | 株式会社理光 | Speech recognition method, speech assessment method, speech recognition system, and speech assessment system |
CN107818781A (en) * | 2017-09-11 | 2018-03-20 | 远光软件股份有限公司 | Intelligent interactive method, equipment and storage medium |
CN107993653A (en) * | 2017-11-30 | 2018-05-04 | 南京云游智能科技有限公司 | The incorrect pronunciations of speech recognition apparatus correct update method and more new system automatically |
CN108595412A (en) * | 2018-03-19 | 2018-09-28 | 百度在线网络技术(北京)有限公司 | Correction processing method and device, computer equipment and readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||