CN106297795A

CN106297795A - Speech recognition method and device

Info

Publication number: CN106297795A
Application number: CN201510271782.7A
Authority: CN
Inventors: 孙廷玮; 林福辉
Original assignee: Spreadtrum Communications Shanghai Co Ltd
Current assignee: Spreadtrum Communications Shanghai Co Ltd; Spreadtrum Communications Inc
Priority date: 2015-05-25
Filing date: 2015-05-25
Publication date: 2017-01-04
Anticipated expiration: 2035-05-25
Also published as: CN110895930A; CN106297795B; CN110895930B

Abstract

A speech recognition method and device. The speech recognition method includes: dividing acquired sound data into frames to obtain at least two sound frames; selecting, from the at least two sound frames, the sound frames that meet a selection condition; calculating a speech recognition score for the selected sound frames; and performing speech recognition on the acquired sound data when the calculated speech recognition score is greater than a preset score threshold. This scheme saves computing resources and increases the speed of speech recognition.

Description

Speech recognition method and device

Technical field

The invention belongs to the technical field of speech recognition, and in particular relates to a speech recognition method and device.

Background

A mobile terminal is a computing device that can be used while on the move. Broadly, the term covers mobile phones, notebook computers, tablet computers, POS machines, vehicle-mounted computers, and the like. With the rapid development of integrated circuit technology, mobile terminals have acquired powerful processing capability and have evolved from simple calling tools into integrated information processing platforms, which gives them broader room for development.

Using a mobile terminal usually requires a certain amount of the user's attention. Today's mobile terminals are equipped with touch screens, which the user must touch to perform operations. When the user cannot touch the device, for example while driving or while carrying items in their hands, operating the mobile terminal becomes very inconvenient.

Speech recognition methods and always-listening systems (Always Listening System) make it possible to activate and operate a mobile terminal hands-free. When the always-listening system detects a sound signal, the speech recognition system is activated and recognizes the detected sound signal, after which the mobile terminal performs the corresponding operation according to the recognized signal. For example, when the user says "call XX's mobile phone", the mobile terminal recognizes this voice input and, once it has been recognized correctly, retrieves XX's mobile phone number from the mobile terminal and dials it.

However, speech recognition methods in the prior art require a large amount of computation and recognize slowly.

Summary of the invention

The problem solved by embodiments of the present invention is how to save computing resources in speech recognition and increase the speed of speech recognition.

To solve the above problem, an embodiment of the present invention provides a speech recognition method, which includes:

dividing acquired sound data into frames to obtain at least two sound frames;

selecting, from the at least two sound frames, the sound frames that meet a selection condition;

calculating a speech recognition score for the sound frames that meet the selection condition;

performing speech recognition on the acquired sound data when the calculated speech recognition score is greater than a preset score threshold.

Optionally, selecting the sound frames that meet the selection condition from the at least two sound frames includes:

calculating the posterior signal-to-noise ratio (SNR) of the current sound frame;

calculating, from the posterior SNR of the current sound frame, the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame;

calculating the first selection threshold of the current sound frame;

selecting the current sound frame when the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame is greater than the first selection threshold of the current sound frame.

Optionally, the posterior SNR of the current sound frame is calculated using the following formula:

SNR_{post}(t) = \log \frac{E(t)}{E_{noise}(t)}

where SNR_{post}(t) is the posterior SNR of the current sound frame, t is the index of the current sound frame, E(t) is the noisy speech energy of the current sound frame, and E_{noise}(t) is the noise energy of the current sound frame.

Optionally, the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame is calculated using the following formula:

D(t) = |\log E(t) - \log E(t-1)| \times SNR_{post}(t)

where D(t) is the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame, \log E(t) is the logarithmic energy of the current sound frame, and \log E(t-1) is the logarithmic energy of the previous sound frame.

Optionally, the first selection threshold of the current sound frame is calculated using the following formula:

T(t) = D_a(t) \times f(\log E_{noise}(t))

where T(t) is the first selection threshold of the current sound frame, D_a(t) is the average posterior-SNR-weighted energy distance of the consecutive sound frames preceding the current sound frame, and f(\log E_{noise}(t)) is a sigmoid function.

Optionally, selecting, from the obtained sound frames, the sound frames that meet the preset selection condition includes:

calculating the posterior SNR of the current sound frame;

selecting the current sound frame when the calculated posterior SNR is determined to be greater than a preset second selection threshold.

Optionally, the posterior SNR of the current sound frame is calculated using the following formula:

SNR_{post}(t) = \log \frac{E(t)}{E_{noise}(t)}

where SNR_{post}(t) is the posterior SNR of the current sound frame, t is the index of the current sound frame, E(t) is the noisy speech energy of the current sound frame, and E_{noise}(t) is the noise energy of the current sound frame.

Optionally, the speech recognition score of the sound frames that meet the selection condition is calculated using the following formula:

M_n = \frac{1}{n_- + n_+} \sum_{m = n_-}^{n_+} f(\alpha \times (n + m))

where M_n is the calculated speech recognition score, n is the index of the current sound frame, n_- is the index of the starting sound frame among the selected sound frames, n_+ is the index of the ending sound frame among the selected sound frames, \alpha is a preset adjustment parameter, m is a positive integer that varies with the index of the selected sound frames, and f(\alpha \times (n + m)) is the moving-average prediction model.

An embodiment of the present invention further provides a speech recognition device, which includes:

a framing unit, adapted to divide acquired sound data into frames to obtain at least two sound frames;

a selection unit, adapted to select, from the at least two sound frames, the sound frames that meet a selection condition;

a calculation unit, adapted to calculate a speech recognition score for the sound frames that meet the selection condition;

a recognition unit, adapted to perform speech recognition on the acquired sound data when the calculated speech recognition score is greater than a preset score threshold.

Optionally, the selection unit is adapted to: calculate the posterior SNR of the current sound frame; calculate, from the posterior SNR of the current sound frame, the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame; calculate the first selection threshold of the current sound frame; and select the current sound frame when the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame is greater than the first selection threshold of the current sound frame.

Optionally, the selection unit is adapted to: calculate the posterior SNR of the current sound frame; and select the current sound frame when the calculated posterior SNR is determined to be greater than a preset second selection threshold.

Compared with the prior art, the technical solution of the present invention has the following advantages:

By selecting, from the sound data to be recognized, the sound frames that meet a preset condition and performing speech recognition on them, non-speech frames that contain no speech information can be excluded, and only the selected sound frames undergo speech recognition processing. This saves computing resources, increases the speed of speech recognition, and improves the user experience.

Further, the posterior-SNR-weighted energy distance between the current sound frame and the previous sound frame is calculated from the posterior SNR of the current sound frame, and this distance is compared with the calculated first selection threshold of the current sound frame. Compared with a comparison based only on the posterior SNR of the current sound frame, this excludes more non-speech sound frames that contain no speech information, which further saves computing resources and increases the speed of speech recognition.

Further, by comparing only the calculated posterior SNR of the current sound frame with the preset second selection threshold, more sound frames that contain no speech information can be excluded, which saves computing resources and therefore further increases the speed of speech recognition.

Brief description of the drawings

Fig. 1 is a flowchart of a speech recognition method in an embodiment of the present invention;

Fig. 2 is a flowchart of another speech recognition method in an embodiment of the present invention;

Fig. 3 is a flowchart of yet another speech recognition method in an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a speech recognition device in an embodiment of the present invention.

Detailed description

Prior-art speech recognition methods usually divide the sound data to be recognized at a fixed frame rate (Fixed Frame Rate, FFR) and run speech recognition processing on all of the resulting sound frames. Some of those frames contain no speech information. Running speech recognition on these non-speech frames contributes nothing to recognition, wastes computing resources, and reduces recognition speed.

To solve the above problem in the prior art, the technical solution adopted by embodiments of the present invention selects, from the sound data to be recognized, the sound frames that meet a preset condition and performs speech recognition on them, which saves computing resources, increases the speed of speech recognition, and improves the user experience.

To make the above objects, features, and advantages of the present invention clearer and easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Fig. 1 shows a flowchart of a speech recognition method in an embodiment of the present invention. The speech recognition method shown in Fig. 1 may include:

Step S101: divide the acquired sound data into frames to obtain at least two sound frames.

In specific implementations, a microphone can be used to capture the input sound signal in real time. When sound is captured, the input sound signal is converted into corresponding sound data through appropriate processing. The converted sound data can then be divided into frames to obtain at least two sound frames.

Step S102: select, from the at least two sound frames, the sound frames that meet a selection condition.

Existing speech recognition methods usually run speech recognition processing on every one of the at least two sound frames obtained by dividing the sound data. However, not every sound frame contains speech information, and processing frames that contain none wastes resources and reduces the speed of speech recognition. Therefore, in embodiments of the present invention, a subset of the sound frames is first selected from the at least two sound frames obtained by division, excluding some of the frames that do not contain speech data. This saves resources and increases the speed of speech recognition.

Step S103: calculate the speech recognition score of the sound frames that meet the selection condition.

In specific implementations, the selection condition can be set according to actual needs.

Step S104: when the calculated speech recognition score is greater than a preset score threshold, perform speech recognition on the acquired sound data.

In specific implementations, when the speech recognition score calculated from the selected sound frames is greater than the preset score threshold, it can be determined that the acquired sound data contains the user's speech information, and speech recognition can then be performed on the acquired sound data. Otherwise, speech recognition does not need to be performed on it. The preset score threshold can be set according to actual needs.
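The gating decision in step S104 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `maybe_recognize` and the `recognizer` callable are hypothetical names that do not appear in the source, and the recognizer stands in for whatever speech recognition engine the terminal uses.

```python
def maybe_recognize(audio, score, score_threshold, recognizer):
    """Run the (hypothetical) recognizer only when the score exceeds the threshold.

    The comparison is strict ("greater than"), as in step S104.
    Returns the recognizer's result, or None when recognition is skipped.
    """
    if score > score_threshold:
        return recognizer(audio)
    return None
```

With this shape, audio whose score does not exceed the threshold never reaches the recognizer, which is where the saving in computation comes from.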

Fig. 2 shows a flowchart of another speech recognition method in an embodiment of the present invention. The speech recognition method shown in Fig. 2 may include:

Step S201: divide the acquired sound data into frames to obtain at least two sound frames.

Step S202: traverse the at least two sound frames.

Step S203: calculate the posterior SNR of the current sound frame.

In specific implementations, to determine which sound frames to select, the at least two sound frames can be traversed, and the corresponding posterior SNR (post SNR) of each sound frame can be calculated using the following formula:

SNR_{post}(t) = \log \frac{E(t)}{E_{noise}(t)}    (1)

where SNR_{post}(t) is the posterior SNR of the current sound frame, t is the index of the current sound frame, E(t) is the noisy speech energy of the current sound frame, and E_{noise}(t) is the noise energy of the current sound frame.
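Formula (1) can be illustrated with a short Python sketch. Several points are assumptions rather than statements from the source: the frame energy is taken as the sum of squared samples, the logarithm is taken as the natural logarithm (the source does not state the base), and the noise energy is assumed to be supplied by some separate noise estimator. The function names are hypothetical.

```python
import math


def frame_energy(frame):
    """Energy of one frame, assumed here to be the sum of squared samples."""
    return sum(x * x for x in frame)


def posterior_snr(energy, noise_energy):
    """Posterior SNR per formula (1): log(E(t) / E_noise(t)).

    Natural log is an assumption; both energies must be positive.
    """
    if energy <= 0 or noise_energy <= 0:
        raise ValueError("energies must be positive")
    return math.log(energy / noise_energy)
```

A frame whose energy equals the noise estimate yields a posterior SNR of 0, and frames quieter than the noise estimate yield negative values.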

Step S204: calculate, from the posterior SNR of the current sound frame, the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame.

In an embodiment of the present invention, the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame is calculated using the following formula:

D(t) = |\log E(t) - \log E(t-1)| \times SNR_{post}(t)    (2)

where D(t) is the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame, \log E(t) is the logarithmic energy of the current sound frame, and \log E(t-1) is the logarithmic energy of the previous sound frame.
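As a minimal sketch of formula (2), the distance can be computed directly from the energies of two adjacent frames and the noise energy of the current frame. The function name is hypothetical, the natural logarithm is an assumption, and all energies are assumed to be positive.

```python
import math


def snr_weighted_distance(energy, prev_energy, noise_energy):
    """Posterior-SNR-weighted energy distance per formula (2).

    D(t) = |log E(t) - log E(t-1)| * SNR_post(t), with
    SNR_post(t) = log(E(t) / E_noise(t)) from formula (1).
    """
    snr_post = math.log(energy / noise_energy)
    return abs(math.log(energy) - math.log(prev_energy)) * snr_post
```

The distance is large only when the energy changes sharply between frames and the current frame stands well above the noise estimate. Note that it becomes negative when the posterior SNR is negative.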

Step S205: calculate the first selection threshold of the current sound frame.

In an embodiment of the present invention, a corresponding first selection threshold needs to be calculated for every sound frame obtained by dividing the acquired sound data. Specifically, the first selection threshold of each sound frame can be calculated using the following formula:

T(t) = D_a(t) \times f(\log E_{noise}(t))    (3)

where T(t) is the first selection threshold of the current sound frame, D_a(t) is the average posterior-SNR-weighted energy distance of the consecutive sound frames up to and including the current sound frame, and f(\log E_{noise}(t)) is a sigmoid function.

It should be noted that D_a(t) is not a constant; it changes from one sound frame to the next. Take the case where the acquired sound data is divided into three sound frames: a first, a second, and a third sound frame. D(1) is the posterior-SNR-weighted energy distance between the first sound frame and its previous sound frame (that is, the product of the logarithmic energy of the first sound frame and the posterior SNR of the first sound frame), D(2) is the posterior-SNR-weighted energy distance between the second sound frame and the first sound frame, and D(3) is the posterior-SNR-weighted energy distance between the third sound frame and the second sound frame. When formula (3) is used to calculate the first selection threshold of the first sound frame, D_a(1) equals D(1); when calculating the first selection threshold of the second sound frame, D_a(2) is the average of D(1) and D(2); when calculating the first selection threshold of the third sound frame, D_a(3) is the average of D(1), D(2), and D(3). D_a(t) is therefore updated as the sound frames are processed.
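The running-average behaviour of D_a(t) described above can be sketched as follows. The source only says that f is a sigmoid function; the plain logistic function used here, with no scaling or offset, is an assumption, as are the function names.

```python
import math


def sigmoid(x):
    """Plain logistic function; the exact sigmoid used in the source is not specified."""
    return 1.0 / (1.0 + math.exp(-x))


def first_selection_thresholds(distances, noise_energies):
    """Per-frame threshold T(t) = D_a(t) * f(log E_noise(t)), formula (3).

    D_a(t) is the running mean of D(1)..D(t), as in the three-frame example.
    Noise energies must be positive.
    """
    thresholds = []
    running_sum = 0.0
    for t, (d, e_noise) in enumerate(zip(distances, noise_energies), start=1):
        running_sum += d
        d_avg = running_sum / t  # D_a(t)
        thresholds.append(d_avg * sigmoid(math.log(e_noise)))
    return thresholds
```

For distances 1, 3, 5 and a constant noise energy of 1, the running means are 1, 2, 3 and the sigmoid term is 0.5, so the thresholds come out as 0.5, 1.0, 1.5.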

Step S206: compare the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame with the first selection threshold of the current sound frame.

Step S207: when the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame is determined to be greater than the first selection threshold of the current sound frame, select the current sound frame.
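Steps S203 to S207 can be put together in one loop. The sketch below is an illustration under assumptions, not the patented implementation: per-frame energies and noise energies are assumed to be already available and positive, the natural logarithm and the plain logistic function are assumed, and the first frame uses a previous logarithmic energy of 0, which matches the description of D(1) as the product of the first frame's logarithmic energy and its posterior SNR. The function name is hypothetical.

```python
import math


def select_frames_adaptive(energies, noise_energies):
    """Return 0-based indices of frames whose D(t) exceeds the adaptive threshold T(t)."""
    selected = []
    running_sum = 0.0
    prev_log_e = 0.0  # assumed "previous" log energy for the first frame
    for t, (e, e_noise) in enumerate(zip(energies, noise_energies), start=1):
        log_e = math.log(e)
        snr_post = log_e - math.log(e_noise)           # formula (1)
        d = abs(log_e - prev_log_e) * snr_post         # formula (2)
        running_sum += d
        d_avg = running_sum / t                        # D_a(t)
        sig = 1.0 / (1.0 + math.exp(-math.log(e_noise)))
        threshold = d_avg * sig                        # formula (3)
        if d > threshold:                              # steps S206 and S207
            selected.append(t - 1)
        prev_log_e = log_e
    return selected
```

A frame whose energy barely changes from the previous one gets a distance near zero and is dropped, while a frame with a sharp rise in energy above the noise level is kept.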

Step S208: calculate the speech recognition score of the sound frames that meet the selection condition.

In an embodiment of the present invention, the moving average method can be used to calculate the speech recognition score of the sound frames that meet the selection condition. Specifically, the score is calculated using the following formula:

M_n = \frac{1}{n_- + n_+} \sum_{m = n_-}^{n_+} f(\alpha \times (n + m))    (4)

where M_n is the calculated speech recognition score, n is the index of the middle sound frame among the selected sound frames, n_- is the index of the starting sound frame among the selected sound frames, n_+ is the index of the ending sound frame among the selected sound frames, \alpha is a preset adjustment parameter, m is a positive integer that varies with the index of the selected sound frames, and f(\alpha \times (n + m)) is the moving-average prediction model.

When formula (4) is used to calculate the speech recognition score of the sound frames that meet the selection condition, M_n is calculated with a frame shift of 10 ms and serves as a measure of the average number of sound frames within the moving-average window.
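Formula (4) can be transcribed literally into Python as shown below. The source does not define the prediction model f beyond naming it, so the sketch takes f as a caller-supplied function; that parameter and the function name are assumptions made only for illustration.

```python
def recognition_score(f, n, n_minus, n_plus, alpha):
    """Literal transcription of formula (4).

    M_n = (1 / (n_minus + n_plus)) * sum of f(alpha * (n + m))
    for m running from n_minus to n_plus inclusive.
    f is a hypothetical stand-in for the moving-average prediction model.
    """
    if n_minus + n_plus == 0:
        raise ValueError("n_minus + n_plus must be non-zero")
    total = sum(f(alpha * (n + m)) for m in range(n_minus, n_plus + 1))
    return total / (n_minus + n_plus)
```

For example, with the identity function as f, n = 2, n_minus = 1, n_plus = 3, and alpha = 1.0, the sum covers the values 3, 4, and 5 and is divided by 4.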

Step S209: when the calculated speech recognition score is greater than the preset score threshold, perform speech recognition on the acquired sound data.

In specific implementations, when the calculated speech recognition score is greater than the preset score threshold, it is determined that the acquired sound data contains speech information, and speech recognition can then be performed on the acquired sound data.

In specific implementations, once the speech information in the acquired sound data has been recognized, the mobile terminal can perform the corresponding operation. For example, when the speech information recognized by the mobile terminal is "open FACEBOOK", the mobile terminal opens FACEBOOK for the user.

In specific implementations, to further exclude sound frames that do not contain speech data, the decision can be made solely by comparing the posterior SNR of each sound frame with a preset second selection threshold. This not only saves computing resources but also further increases the speed of speech recognition; see Fig. 3 for details.

Fig. 3 shows a flowchart of yet another speech recognition method in an embodiment of the present invention. The speech recognition method shown in Fig. 3 may include:

Step S301: divide the acquired sound data into frames to obtain at least two sound frames.

In an embodiment of the present invention, to facilitate analysis and processing of the sound frames, each of the at least two sound frames obtained by dividing the acquired sound data is 25 ms long, and the frame shift between two adjacent sound frames is 1 ms.
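A minimal framing sketch in Python is given below. The 25 ms frame length and 1 ms frame shift are taken from this passage as defaults (the passage on formula (4) mentions a 10 ms frame shift, so the shift is kept as a parameter). The function name, the sample-rate parameter, and the choice to drop a trailing partial frame are assumptions.

```python
def split_into_frames(samples, sample_rate, frame_ms=25.0, shift_ms=1.0):
    """Split a sequence of samples into overlapping fixed-length frames.

    A trailing partial frame is dropped (an assumption, not from the source).
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = max(1, int(round(sample_rate * shift_ms / 1000.0)))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

At a sample rate of 8000 Hz this gives frames of 200 samples that advance by 8 samples each, so consecutive frames overlap heavily.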

Step S302: traverse the at least two sound frames obtained, and calculate the posterior SNR of the current sound frame.

In embodiments of the present invention, the posterior SNR calculated with formula (1) above can be used directly in the subsequent steps to decide whether to select the current sound frame.

It should be noted that, compared with calculating the prior SNR (priori SNR), using the posterior SNR of a sound frame to decide whether to select it is more intuitive and definite. Calculating the prior SNR of each sound frame requires estimating the energy of the clean speech in the current sound frame, and estimating that clean speech energy is quite difficult.

Step S303: compare the posterior SNR of the current sound frame with the preset second selection threshold.

In specific implementations, the second selection threshold can be set according to actual needs.

Step S304: when the posterior SNR of the current frame is determined to be greater than the preset second selection threshold, select the current sound frame.

In specific implementations, when the posterior SNR of the current frame is determined to be greater than the second selection threshold, the current frame may contain speech information, and the current frame is selected. Otherwise, the current frame is discarded and the next sound frame is evaluated.
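The simpler selection rule of steps S302 to S304 can be sketched as follows, assuming per-frame energies and noise energies are already available and positive, and assuming the natural logarithm in formula (1). The function name is hypothetical.

```python
import math


def select_frames_fixed(energies, noise_energies, second_threshold):
    """Return 0-based indices of frames whose posterior SNR exceeds a preset threshold."""
    selected = []
    for index, (e, e_noise) in enumerate(zip(energies, noise_energies)):
        snr_post = math.log(e / e_noise)   # formula (1)
        if snr_post > second_threshold:    # step S304: keep the frame
            selected.append(index)
        # otherwise the frame is discarded and the loop moves to the next one
    return selected
```

Unlike the adaptive rule of Fig. 2, this variant keeps no running state, so each frame costs one logarithm and one comparison.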

Step S305: calculate the speech recognition score of the sound frames that meet the selection condition.

Step S306: when the calculated speech recognition score is greater than the preset score threshold, perform speech recognition on the acquired sound data.

Fig. 4 shows a speech recognition device provided by an embodiment of the present invention. The speech recognition device shown in Fig. 4 may include a framing unit 401, a selection unit 402, a calculation unit 403, and a recognition unit 404, wherein:

The framing unit 401 is adapted to divide acquired sound data into frames to obtain at least two sound frames.

The selection unit 402 is adapted to select, from the at least two sound frames, the sound frames that meet a selection condition. In an embodiment of the present invention, the selection unit 402 is adapted to calculate the posterior SNR of the current sound frame and to select the current sound frame when the calculated posterior SNR is determined to be greater than a preset second selection threshold. In another embodiment of the present invention, the selection unit 402 is adapted to: calculate the posterior SNR of the current sound frame; calculate, from the posterior SNR of the current sound frame, the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame; calculate the first selection threshold of the current sound frame; and select the current sound frame when the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame is greater than the first selection threshold of the current sound frame.

The calculation unit 403 is adapted to calculate the speech recognition score of the sound frames that meet the selection condition.

The recognition unit 404 is adapted to perform speech recognition on the acquired sound data when the calculated speech recognition score is greater than a preset score threshold.

A person of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include ROM, RAM, a magnetic disk, an optical disc, and the like.

The method and system of the embodiments of the present invention have been described in detail above, but the present invention is not limited thereto. Any person skilled in the art may make various changes and modifications without departing from the spirit and scope of the present invention, so the scope of protection of the present invention shall be that defined by the claims.

Claims

1. A speech recognition method, characterized by comprising:

performing speech recognition on the acquired sound data when the calculated speech recognition score is greater than a preset score threshold.

2. The speech recognition method according to claim 1, characterized in that selecting the sound frames that meet the selection condition from the at least two sound frames comprises:

calculating the posterior SNR of the current sound frame;

calculating the first selection threshold of the current sound frame;

3. The speech recognition method according to claim 2, characterized in that the posterior SNR of the current sound frame is calculated using the following formula:

SNR_{post}(t) = \log \frac{E(t)}{E_{noise}(t)}

4. The speech recognition method according to claim 3, characterized in that the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame is calculated using the following formula:

D(t) = |\log E(t) - \log E(t-1)| \times SNR_{post}(t), where D(t) is the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame, \log E(t) is the logarithmic energy of the current sound frame, and \log E(t-1) is the logarithmic energy of the previous sound frame.

5. The speech recognition method according to claim 4, characterized in that the first selection threshold of the current sound frame is calculated using the following formula:

T(t) = D_a(t) \times f(\log E_{noise}(t)), where T(t) is the first selection threshold of the current sound frame, D_a(t) is the average posterior-SNR-weighted energy distance of the consecutive sound frames preceding the current sound frame, and f(\log E_{noise}(t)) is a sigmoid function.

6. The speech recognition method according to claim 1, characterized in that selecting, from the obtained sound frames, the sound frames that meet the preset selection condition comprises:

calculating the posterior SNR of the current sound frame;

7. The speech recognition method according to claim 6, characterized in that the posterior SNR of the current sound frame is calculated using the following formula:

SNR_{post}(t) = \log \frac{E(t)}{E_{noise}(t)}

8. The speech recognition method according to claim 2 or 7, characterized in that the speech recognition score of the sound frames that meet the selection condition is calculated using the following formula:

M_n = \frac{1}{n_- + n_+} \sum_{m = n_-}^{n_+} f(\alpha \times (n + m))

where M_n is the calculated speech recognition score, n is the index of the current sound frame, n_- is the index of the starting sound frame among the selected sound frames, n_+ is the index of the ending sound frame among the selected sound frames, \alpha is a preset adjustment parameter, m is a positive integer that varies with the index of the selected sound frames, and f(\alpha \times (n + m)) is the moving-average prediction model.

9. A speech recognition device, characterized by comprising:

a framing unit, adapted to divide acquired sound data into frames to obtain at least two sound frames;

a selection unit, adapted to select, from the at least two sound frames, the sound frames that meet a selection condition;

a recognition unit, adapted to perform speech recognition on the acquired sound data when the calculated speech recognition score is greater than a preset score threshold.

10. The speech recognition device according to claim 9, characterized in that the selection unit is adapted to: calculate the posterior SNR of the current sound frame; calculate, from the posterior SNR of the current sound frame, the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame; calculate the first selection threshold of the current sound frame; and select the current sound frame when the posterior-SNR-weighted energy distance between the previous sound frame and the current sound frame is greater than the first selection threshold of the current sound frame.

11. The speech recognition device according to claim 9, characterized in that the selection unit is adapted to: calculate the posterior SNR of the current sound frame; and select the current sound frame when the calculated posterior SNR is determined to be greater than a preset second selection threshold.