CN109273000B - Speech recognition method - Google Patents

Speech recognition method

Info

Publication number
CN109273000B
CN109273000B
Authority
CN
China
Prior art keywords
voice recognition
recognition result
voice
corrected
speech recognition
Prior art date
Legal status
Active
Application number
CN201811186096.XA
Other languages
Chinese (zh)
Other versions
CN109273000A (en)
Inventor
马世辉
刘学军
李进波
Current Assignee
Henan Institute of Technology
Original Assignee
Henan Institute of Technology
Priority date
Filing date
Publication date
Application filed by Henan Institute of Technology
Priority to CN201811186096.XA
Publication of CN109273000A
Application granted
Publication of CN109273000B
Legal status: Active

Links

Classifications

    • G10L 15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/04: Speech recognition; segmentation; word boundary detection
    • G10L 15/06: Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • Y02D 30/70: Reducing energy consumption in wireless communication networks

Abstract

The invention discloses a speech recognition method comprising the following steps: recognizing voice information with a first speech recognition method to obtain a first speech recognition result, and recognizing the same voice information with a second speech recognition method to obtain a second speech recognition result; then comparing the first speech recognition result with the second, outputting a speech recognition result according to the comparison, and displaying it. The scheme provides an effective cross-check of the key information in the voice data and increases the success rate of speech recognition.

Description

Speech recognition method
Technical Field
The invention relates to the field of voice recognition, in particular to a voice recognition method.
Background
Speech recognition is an actively developing research direction, and five main problems are involved:
(1) Recognition and understanding of natural language. Continuous speech must first be decomposed into units such as words and phonemes, and rules for understanding the semantics must then be established.
(2) The amount of speech information is large. Speech patterns differ not only between speakers but also for the same speaker; for example, a speaker's speech differs when talking casually and when speaking carefully, and the way a person speaks also changes over time.
(3) Ambiguity of speech. Different words may sound similar when spoken, which is common in both English and Chinese.
(4) The phonetic properties of individual letters, characters and words are affected by context, so that accent, tone, volume, pronunciation speed and so on are changed.
(5) Environmental noise and interference have a serious impact on speech recognition and result in low recognition rates.
To address these problems, researchers introduced deep learning from the machine-learning field into acoustic-model training for speech recognition, and multi-layer neural networks with RBM pre-training greatly improved acoustic-model accuracy. Researchers at Microsoft made breakthrough progress in this direction: after they adopted deep neural network (DNN) models, the speech recognition error rate dropped by 30%, the fastest progress in speech recognition technology in the preceding 20 years. In addition, most mainstream speech recognition decoders now adopt decoding networks based on weighted finite-state transducers (WFSTs), which can compile the language model, lexicon and acoustic units into one large decoding network, greatly improving decoding speed and providing a basis for real-time application of speech recognition.
Nevertheless, recognition accuracy remains a serious problem; in particular, there is no reasonable strategy for self-checking the recognition results: whatever is recognized is output directly, without any evaluation.
In order to solve the above problems, the present invention proposes a speech recognition method.
Disclosure of Invention
In order to solve the above problems, the present invention proposes a speech recognition method comprising:
recognizing the voice information by using a first speech recognition method to obtain a first speech recognition result, and recognizing the voice information by using a second speech recognition method to obtain a second speech recognition result; and
And comparing the first voice recognition result with the second voice recognition result, outputting the voice recognition result according to the comparison result, and displaying the voice recognition result.
Preferably, the voice information is preprocessed after acquisition; the preprocessing comprises fluency detection, endpoint detection, pre-emphasis, framing and windowing;
1) Endpoint detection
The endpoint detection adopts the following mode: set a time threshold T0, a time interval Δt and a sound threshold V0; acquire signals through an audio signal acquisition circuit, continuously collecting the sound signals of N time nodes, where N > T0/Δt;
if the sound signal > V0 at INT(0.6N) of the time nodes, sound is considered detected and the state is set to S = 1, where INT(·) denotes rounding; if the previous state was S = 0 when sound is detected, the start point of the sound is considered detected;
if the sound signal < V0 at INT(0.6N) of the time nodes, no sound is considered detected and the state is set to S = 0; if the previous state was S = 1, the end point of the sound is considered detected;
after endpoint detection, the silence at both ends of the sound signal is cut off;
2) Fluency detection
Cut the speech into a front part and a rear part and sample each part, continuously collecting the sound signals of M time nodes; if the sound signal < V0 at all M time nodes, fluency is considered problematic and that portion of the speech is cut out; what remains after cutting is the effective speech segment. Compute the lengths of the effective speech segments of the front and rear parts, divide the smaller of the two lengths by the total length of the speech to be scored, and compare the quotient with a corresponding threshold: if it is greater than the threshold, the speech is judged fluent; otherwise it is judged disfluent;
3) Pre-emphasis
A high-pass filter H(z) = 1 − αz⁻¹ with a pre-emphasis coefficient α = 0.91 is adopted to compensate signal attenuation and boost the high-frequency part of the signal. The pre-emphasized signal is then framed: the frame length is 15 ms, the speech sampling rate is 11025 Hz, the frame length is 256 samples, and the frame shift is 128 samples;
the signal x(n) of each frame is smoothed using a Hamming window.
Preferably, the first voice recognition method is as follows:
acquiring characteristic parameters from the voice information; the characteristic parameters include pitch, frequency, rate of frequency change, pitch period, gain, and band-pass unvoiced/voiced intensity;
the acquired characteristic parameters are passed through the corresponding ANN model to perform speech recognition and obtain the corresponding words and sentences.
The second voice recognition method comprises the following steps:
1) Acquire characteristic parameters from the voice information, and compute the probability of generating the observation sequence O given the model Λ.
Define the forward variable α_t(i):
α_t(i) = P{O_1, O_2, …, O_t; q_t = S_i | Λ}
that is, the probability, given the model, of having generated the partial observation symbol sequence up to time t and being in state S_i at time t;
Initialization:
α_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N
where π is the initial state distribution, π = {π_i}, π_i = P[q_1 = S_i], 1 ≤ i ≤ N, and B is the observation symbol probability distribution of the states:
B = {b_j(O_k)}, b_j(O_k) = P[observation symbol at time t is O_k | q_t = S_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M;
Induction:
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] · b_j(O_{t+1}), 1 ≤ t ≤ T−1, 1 ≤ j ≤ N;
Termination:
P(O | Λ) = Σ_{i=1}^{N} α_T(i)
where a_ij is an element of the state transition matrix and b_j(O_t) is an element of the observation symbol matrix;
2) The Baum-Welch algorithm is used to find the optimal model λ = argmax{P(O | Λ)};
3) the Viterbi algorithm solves for the optimal state transition sequence;
4) candidate syllables or initials/finals are given according to the λ corresponding to the optimal state sequence;
5) words and sentences are formed by the language model.
Preferably, the first speech recognition method is a large vocabulary speech recognition method based on a preset model, and the second speech recognition method is a speech recognition method based on an auxiliary speech data packet.
Preferably, the method further comprises:
a plurality of voice data packets are preset, the voice data packets are stored in the electronic equipment, the electronic equipment is connected with a processor, and the processor is connected with a server.
Preferably, the specific method for outputting the speech recognition result according to the comparison result comprises the following steps:
S1: Compare the first speech recognition result with the second speech recognition result. Coverage here refers to the proportion of exact repetition: starting from the first character, the ratio of the number of identical characters to the total number of characters. If the coverage of the two results is lower than a set threshold, perform the following steps:
judge whether the first and second speech recognition results have the same number of characters;
1) If they are the same, match them character by character and count the matches; then calculate the similarity R = Q(R1, R2)/Max(|R1|, |R2|), where Q(R1, R2) is the number of identical characters between R1 and R2, i.e. between the first and second speech recognition results, and Max(|R1|, |R2|) is the larger of the two lengths; go to S2;
2) If they are not the same, delete the irrelevant characters of both results, namely the stop characters and the consecutive identical characters, to obtain a corrected first speech recognition result and a corrected second speech recognition result. Judge whether the corrected results have the same number of characters; if so, R = Q(R1, R2)/Max(|R1|, |R2|), where Q(R1, R2) is the number of identical characters between the corrected first and second results and Max(|R1|, |R2|) is the larger of the two lengths; go to S2;
if the corrected first and second results differ in the number of characters, compare them from front to back and calculate the similarity RA:
RA = Q1(R1, R2)/Max(|R1|, |R2|),
where Q1(R1, R2) is the number of identical characters when the corrected results are compared from front to back, and Max(|R1|, |R2|) is the larger of the two lengths;
then compare the corrected results from back to front and calculate the similarity RB:
RB = Q2(R1, R2)/Max(|R1|, |R2|),
where Q2(R1, R2) is the number of identical characters when the corrected results are compared from back to front;
compare RA and RB and set R = Max(RA, RB); go to S2;
S2: If R is smaller than a specified value, discard the recognition result and resample.
The invention has the beneficial effects that:
1) The correct recognition rate is effectively improved;
2) misjudgments are effectively guarded against, and the output of erroneous results is automatically prevented.
Detailed Description
Specific embodiments of the present invention will now be described in order to provide a clearer understanding of the technical features, objects and effects of the present invention.
A speech recognition method, the speech recognition method comprising:
recognizing the voice information by using a first speech recognition method to obtain a first speech recognition result, and recognizing the voice information by using a second speech recognition method to obtain a second speech recognition result; and
And comparing the first voice recognition result with the second voice recognition result, outputting the voice recognition result according to the comparison result, and displaying the voice recognition result.
Preferably, the voice information is preprocessed after acquisition; the preprocessing comprises fluency detection, endpoint detection, pre-emphasis, framing and windowing;
1) Endpoint detection
The endpoint detection adopts the following mode: set a time threshold T0, a time interval Δt and a sound threshold V0; acquire signals through an audio signal acquisition circuit, continuously collecting the sound signals of N time nodes, where N > T0/Δt;
if the sound signal > V0 at INT(0.6N) of the time nodes, sound is considered detected and the state is set to S = 1, where INT(·) denotes rounding; if the previous state was S = 0 when sound is detected, the start point of the sound is considered detected;
if the sound signal < V0 at INT(0.6N) of the time nodes, no sound is considered detected and the state is set to S = 0; if the previous state was S = 1, the end point of the sound is considered detected;
after endpoint detection, the silence at both ends of the sound signal is cut off;
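As a concrete illustration, the following is a minimal Python sketch of this endpoint detector. It assumes the sound signal is available as a NumPy array of sampled levels; the function and variable names (detect_endpoints, samples, v0) are illustrative and not taken from the patent.

```python
import numpy as np

def detect_endpoints(samples, v0, n, ratio=0.6):
    """Slide a window of N time nodes over the signal: declare sound when
    the level exceeds V0 at INT(0.6*N) nodes, silence when it falls below
    V0 at INT(0.6*N) nodes, and record the 0 -> 1 and 1 -> 0 state
    transitions as the start and end points of the sound."""
    need = int(ratio * n)                          # INT(0.6 * N)
    state, start, end = 0, None, None
    for i in range(len(samples) - n + 1):
        loud = np.count_nonzero(np.abs(samples[i:i + n]) > v0)
        if state == 0 and loud >= need:
            state, start = 1, i                    # start point detected
        elif state == 1 and (n - loud) >= need:
            state, end = 0, i + n                  # end point detected
            break
    return start, end

# Cutting off the silence at both ends:
# start, end = detect_endpoints(samples, v0=0.02, n=100)
# speech = samples[start:end] if start is not None else samples
```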
2) Fluency detection
Cut the speech into a front part and a rear part and sample each part, continuously collecting the sound signals of M time nodes; if the sound signal < V0 at all M time nodes, fluency is considered problematic and that portion of the speech is cut out; what remains after cutting is the effective speech segment. Compute the lengths of the effective speech segments of the front and rear parts, divide the smaller of the two lengths by the total length of the speech to be scored, and compare the quotient with a corresponding threshold: if it is greater than the threshold, the speech is judged fluent; otherwise it is judged disfluent;
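A minimal sketch of this fluency check follows, under the same assumptions as above; the 0.4 acceptance threshold is an assumed placeholder, since the patent does not fix the value.

```python
import numpy as np

def effective_length(part, v0, m):
    """Length of a segment after cutting out every run of at least M
    consecutive nodes whose level stays below the sound threshold V0."""
    quiet = np.abs(part) < v0
    drop, run_start = 0, None
    for i, q in enumerate(np.append(quiet, False)):  # sentinel closes a final run
        if q and run_start is None:
            run_start = i
        elif not q and run_start is not None:
            if i - run_start >= m:
                drop += i - run_start                # discard the disfluent gap
            run_start = None
    return len(part) - drop

def is_fluent(samples, v0, m, threshold=0.4):
    """Split the utterance into front and rear halves, take the smaller
    effective length, divide by the total length, compare to a threshold."""
    half = len(samples) // 2
    shorter = min(effective_length(samples[:half], v0, m),
                  effective_length(samples[half:], v0, m))
    return shorter / len(samples) > threshold
```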
3) Pre-emphasis
A high-pass filter H(z) = 1 − αz⁻¹ with a pre-emphasis coefficient α = 0.91 is adopted to compensate signal attenuation and boost the high-frequency part of the signal. The pre-emphasized signal is then framed: the frame length is 15 ms, the speech sampling rate is 11025 Hz, the frame length is 256 samples, and the frame shift is 128 samples;
the signal x(n) of each frame is smoothed using a Hamming window.
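The pre-emphasis, framing and windowing stage can be sketched as follows; the frame constants are the ones quoted above, and the helper name preprocess is illustrative.

```python
import numpy as np

def preprocess(x, alpha=0.91, frame_len=256, frame_shift=128):
    """Apply H(z) = 1 - alpha*z^-1 (alpha = 0.91), cut the signal into
    256-sample frames with a 128-sample shift (the patent quotes an
    11025 Hz sampling rate), and smooth each frame with a Hamming window."""
    # y[n] = x[n] - alpha * x[n-1]; keep the first sample unchanged
    emphasized = np.append(x[0], x[1:] - alpha * x[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames
```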
Preferably, the first voice recognition method is as follows:
acquiring characteristic parameters from the voice information; the characteristic parameters include pitch, frequency, rate of frequency change, pitch period, gain, and band-pass unvoiced/voiced intensity;
the acquired characteristic parameters are passed through the corresponding ANN model to perform speech recognition and obtain the corresponding words and sentences.
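The patent does not specify the network, so the following is only a hedged sketch of this first recognition path: a feature vector built from the listed parameters fed through a small feed-forward ANN whose architecture and weights are illustrative placeholders.

```python
import numpy as np

def ann_recognize(features, w1, b1, w2, b2):
    """Feed a feature vector (pitch, frequency, rate of frequency change,
    pitch period, gain, band-pass unvoiced/voiced intensity) through a
    two-layer feed-forward network and return the index of the
    best-scoring word/sentence unit. Assumed shapes: features (6,),
    w1 (6, H), b1 (H,), w2 (H, K), b2 (K,) for H hidden and K output units."""
    hidden = np.tanh(features @ w1 + b1)   # hidden layer
    logits = hidden @ w2 + b2              # one score per recognizable unit
    return int(np.argmax(logits))
```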
The second voice recognition method comprises the following steps:
1) Acquire characteristic parameters from the voice information, and compute the probability of generating the observation sequence O given the model Λ.
Define the forward variable α_t(i):
α_t(i) = P{O_1, O_2, …, O_t; q_t = S_i | Λ}
that is, the probability, given the model, of having generated the partial observation symbol sequence up to time t and being in state S_i at time t;
Initialization:
α_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N
where π is the initial state distribution, π = {π_i}, π_i = P[q_1 = S_i], 1 ≤ i ≤ N, and B is the observation symbol probability distribution of the states:
B = {b_j(O_k)}, b_j(O_k) = P[observation symbol at time t is O_k | q_t = S_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M;
Induction:
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] · b_j(O_{t+1}), 1 ≤ t ≤ T−1, 1 ≤ j ≤ N;
Termination:
P(O | Λ) = Σ_{i=1}^{N} α_T(i)
where a_ij is an element of the state transition matrix and b_j(O_t) is an element of the observation symbol matrix (a code sketch of this forward computation is given after step 5 below);
2) The Baum-Welch algorithm is used to find the optimal model λ = argmax{P(O | Λ)};
3) the Viterbi algorithm solves for the optimal state transition sequence;
4) candidate syllables or initials/finals are given according to the λ corresponding to the optimal state sequence;
5) words and sentences are formed by the language model.
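The forward computation in step 1) is the standard HMM forward algorithm and can be sketched directly from the formulas above; the array names A, B and pi are the usual HMM notation, not identifiers from the patent.

```python
import numpy as np

def forward_probability(A, B, pi, obs):
    """Compute P(O | Lambda) by the forward algorithm:
    initialization  alpha_1(i) = pi_i * b_i(O_1),
    induction       alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(O_{t+1}),
    termination     P(O | Lambda) = sum_i alpha_T(i).
    A: (N, N) transition matrix, B: (N, M) observation matrix,
    pi: (N,) initial distribution, obs: sequence of observation indices."""
    alpha = pi * B[:, obs[0]]             # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # induction step
    return alpha.sum()                    # termination: P(O | Lambda)
```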
Preferably, the first speech recognition method is a large vocabulary speech recognition method based on a preset model, and the second speech recognition method is a speech recognition method based on an auxiliary speech data packet.
Preferably, the method further comprises:
a plurality of voice data packets are preset, the voice data packets are stored in the electronic equipment, the electronic equipment is connected with a processor, and the processor is connected with a server.
Preferably, the specific method for outputting the speech recognition result according to the comparison result comprises the following steps:
S1: Compare the first speech recognition result with the second speech recognition result. Coverage here refers to the proportion of exact repetition: starting from the first character, the ratio of the number of identical characters to the total number of characters. If the coverage of the two results is lower than a set threshold, perform the following steps:
judge whether the first and second speech recognition results have the same number of characters;
1) If they are the same, match them character by character and count the matches; then calculate the similarity R = Q(R1, R2)/Max(|R1|, |R2|), where Q(R1, R2) is the number of identical characters between R1 and R2, i.e. between the first and second speech recognition results, and Max(|R1|, |R2|) is the larger of the two lengths; go to S2;
2) If they are not the same, delete the irrelevant characters of both results, namely the stop characters and the consecutive identical characters, to obtain a corrected first speech recognition result and a corrected second speech recognition result. Judge whether the corrected results have the same number of characters; if so, R = Q(R1, R2)/Max(|R1|, |R2|), where Q(R1, R2) is the number of identical characters between the corrected first and second results and Max(|R1|, |R2|) is the larger of the two lengths; go to S2;
if the corrected first and second results differ in the number of characters, compare them from front to back and calculate the similarity RA:
RA = Q1(R1, R2)/Max(|R1|, |R2|),
where Q1(R1, R2) is the number of identical characters when the corrected results are compared from front to back, and Max(|R1|, |R2|) is the larger of the two lengths;
then compare the corrected results from back to front and calculate the similarity RB:
RB = Q2(R1, R2)/Max(|R1|, |R2|),
where Q2(R1, R2) is the number of identical characters when the corrected results are compared from back to front;
compare RA and RB and set R = Max(RA, RB); go to S2;
S2: If R is smaller than a specified value, discard the recognition result and resample.
It should be noted that, for simplicity of description, the foregoing method embodiments are expressed as a series of combined actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may be performed in another order or simultaneously. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and elements involved are not necessarily required by the present application.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in the embodiments may be accomplished by computer programs stored in a computer-readable storage medium, which when executed, may include the steps of the embodiments of the methods described above. Wherein the storage medium may be a magnetic disk, an optical disk, a ROM, a RAM, etc.
The foregoing disclosure is illustrative of the present invention and is not to be construed as limiting the scope of the invention, which is defined by the appended claims.

Claims (4)

1. A method of speech recognition, the method comprising:
acquiring voice information;
recognizing the voice information by using a first voice recognition method to obtain a first voice recognition result, and recognizing the voice information by using a second voice recognition method to obtain a second voice recognition result; comparing the first voice recognition result with the second voice recognition result, outputting the voice recognition result according to the comparison result, and displaying the voice recognition result;
the specific method for outputting the voice recognition result according to the comparison result comprises the following steps:
s1: comparing the first voice recognition result with the second voice recognition result, and if the coverage rate of the first voice recognition result and the second voice recognition result is lower than a set threshold value, executing the following steps, wherein the coverage rate refers to the ratio of complete repetition, the comparison is started from the first character, and the ratio of the same character number to the total character number is compared:
judging whether the character numbers of the first voice recognition result and the second voice recognition result are the same;
1) If the first voice recognition result and the second voice recognition result are the same, matching is carried out, and the matching quantity is counted; and calculates the similarity R: r=q (R1, R2)/Max (|r1|, |r2|); q (R1, R2) represents the same number as R1, R2; i.e. the same number of first speech recognition results as the second speech recognition results; max (|r1|, |r2|) represents the maximum value of R1, R2; s2, executing;
2) If not, deleting irrelevant characters of the first voice recognition result and the second voice recognition result, including: deleting the deactivated characters and the continuous identical characters; obtaining a corrected first voice recognition result and a corrected second voice recognition result; judging whether the character number of the corrected first voice recognition result is the same as that of the corrected second voice recognition result, and if so, judging that R=Q (R1, R2)/Max (|R1|, |R2|); q (R1, R2) represents the same number as R1, R2; namely, the same number of corrected first voice recognition results and corrected second voice recognition results; max (|r1|, |r2|) represents the maximum value of R1, R2; s2, executing;
if the number of characters of the corrected first voice recognition result is different from that of the corrected second voice recognition result, comparing the corrected first voice recognition result with the corrected second voice recognition result from front to back respectively, and calculating the similarity RA:
RA=Q1(R1,R2)/Max(|R1|,|R2|);
q1 (R1, R2) representing the same number of the corrected first speech recognition result and the corrected second speech recognition result compared from front to back; max (|r1|, |r2|) represents the maximum value of R1, R2;
comparing the corrected first voice recognition result with the corrected second voice recognition result from back to front, and calculating the similarity RB:
RB=Q2(R1,R2)/Max(|R1|,|R2|);
q2 (R1, R2) representing the same number of corrected first speech recognition results as corrected second speech recognition results compared from back to front; max (|r1|, |r2|) represents the maximum value of R1, R2;
compare RA, RB, r=max (RA, RB); s2, executing;
s2: if R is smaller than the appointed value, discarding the identification result, and resampling.
2. The speech recognition method according to claim 1, wherein
after the voice information is acquired, the voice information is preprocessed;
the preprocessing method comprises fluency detection, endpoint detection, pre-emphasis, framing and windowing;
1) Endpoint detection
The endpoint detection adopts the following mode: set a time threshold T0, a time interval Δt and a sound threshold V0; acquire signals through an audio signal acquisition circuit, continuously collecting the sound signals of N time nodes, where N > T0/Δt;
if the sound signal > V0 at INT(0.6N) of the time nodes, sound is considered detected and the state is set to S = 1, where INT(·) denotes rounding; if the previous state was S = 0 when sound is detected, the start point of the sound is considered detected;
if the sound signal < V0 at INT(0.6N) of the time nodes, no sound is considered detected and the state is set to S = 0; if the previous state was S = 1, the end point of the sound is considered detected;
after endpoint detection, the silence at both ends of the sound signal is cut off;
2) Fluency detection
Cut the speech into a front part and a rear part and sample each part, continuously collecting the sound signals of M time nodes; if the sound signal < V0 at all M time nodes, fluency is considered problematic and that portion of the speech is cut out; what remains after cutting is the effective speech segment; compute the lengths of the effective speech segments of the front and rear parts, divide the smaller of the two lengths by the total length of the speech to be scored, and compare the quotient with a corresponding threshold: if it is greater than the threshold, the speech is judged fluent; otherwise it is judged disfluent;
3) Pre-emphasis
A high-pass filter H(z) = 1 − αz⁻¹ with a pre-emphasis coefficient α = 0.91 is adopted to compensate signal attenuation and boost the high-frequency part of the signal; the pre-emphasized signal is framed: the frame length is 15 ms, the speech sampling rate is 11025 Hz, the frame length is 256 samples, and the frame shift is 128 samples;
the signal x(n) of each frame is smoothed using a Hamming window.
3. A speech recognition method according to claim 1, wherein the first speech recognition method is a large vocabulary speech recognition method based on a predetermined model, and the second speech recognition method is a speech recognition method based on auxiliary speech data packets.
4. A method of speech recognition according to claim 2, wherein the method further comprises:
a plurality of voice data packets are preset and stored in electronic equipment, the electronic equipment is connected with a processor, and the processor is connected with a server.
CN201811186096.XA, filed 2018-10-11 (priority 2018-10-11): Speech recognition method; Active; granted as CN109273000B.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201811186096.XA (CN109273000B) | 2018-10-11 | 2018-10-11 | Speech recognition method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201811186096.XA (CN109273000B) | 2018-10-11 | 2018-10-11 | Speech recognition method

Publications (2)

Publication Number | Publication Date
CN109273000A | 2019-01-25
CN109273000B | 2023-05-12

Family

ID=65196556

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201811186096.XA (Active; granted as CN109273000B) | Speech recognition method | 2018-10-11 | 2018-10-11

Country Status (1)

Country Link
CN (1) CN109273000B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223679A (en) * 2019-06-14 2019-09-10 南京机电职业技术学院 A kind of voice recognition input devices
CN110322883B (en) * 2019-06-27 2023-02-17 上海麦克风文化传媒有限公司 Voice-to-text effect evaluation optimization method
CN110705302B (en) * 2019-10-11 2023-12-12 掌阅科技股份有限公司 Named entity identification method, electronic equipment and computer storage medium
CN110853635B (en) * 2019-10-14 2022-04-01 广东美的白色家电技术创新中心有限公司 Speech recognition method, audio annotation method, computer equipment and storage device
CN111243607A (en) * 2020-03-26 2020-06-05 北京字节跳动网络技术有限公司 Method, apparatus, electronic device, and medium for generating speaker information
CN112364876A (en) * 2020-11-25 2021-02-12 北京紫光青藤微系统有限公司 Efficient bar code binarization method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105529028A (en) * 2015-12-09 2016-04-27 百度在线网络技术(北京)有限公司 Voice analytical method and apparatus

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4734155B2 (en) * 2006-03-24 2011-07-27 株式会社東芝 Speech recognition apparatus, speech recognition method, and speech recognition program
WO2010013371A1 (en) * 2008-07-28 2010-02-04 日本電気株式会社 Dialogue speech recognition system, dialogue speech recognition method, and recording medium for storing dialogue speech recognition program
JP2011203434A (en) * 2010-03-25 2011-10-13 Fujitsu Ltd Voice recognition device and voice recognition method
CN104143330A (en) * 2013-05-07 2014-11-12 佳能株式会社 Voice recognizing method and voice recognizing system
JP6277659B2 (en) * 2013-10-15 2018-02-14 三菱電機株式会社 Speech recognition apparatus and speech recognition method
CN106663421B (en) * 2014-07-08 2018-07-06 三菱电机株式会社 Sound recognition system and sound identification method
KR20170032096A (en) * 2015-09-14 2017-03-22 삼성전자주식회사 Electronic Device, Driving Methdo of Electronic Device, Voice Recognition Apparatus, Driving Method of Voice Recognition Apparatus, and Computer Readable Recording Medium
CN106128462A (en) * 2016-06-21 2016-11-16 东莞酷派软件技术有限公司 Audio recognition method and system
CN106328148B (en) * 2016-08-19 2019-12-31 上汽通用汽车有限公司 Natural voice recognition method, device and system based on local and cloud hybrid recognition
CN108428382A (en) * 2018-02-14 2018-08-21 广东外语外贸大学 It is a kind of spoken to repeat methods of marking and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105529028A (en) * 2015-12-09 2016-04-27 百度在线网络技术(北京)有限公司 Voice analytical method and apparatus

Also Published As

Publication Number | Publication Date
CN109273000A | 2019-01-25

Similar Documents

Publication Publication Date Title
CN109273000B (en) Speech recognition method
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN103928023B (en) A kind of speech assessment method and system
CN107221318B (en) English spoken language pronunciation scoring method and system
CN111862954B (en) Method and device for acquiring voice recognition model
US20220262352A1 (en) Improving custom keyword spotting system accuracy with text-to-speech-based data augmentation
KR20230003056A (en) Speech recognition using non-speech text and speech synthesis
CN106548775B (en) Voice recognition method and system
JP2005043666A (en) Voice recognition device
CN111798840B (en) Voice keyword recognition method and device
US11335324B2 (en) Synthesized data augmentation using voice conversion and speech recognition models
Wand et al. Deep Neural Network Frontend for Continuous EMG-Based Speech Recognition.
CN110265063B (en) Lie detection method based on fixed duration speech emotion recognition sequence analysis
JP2001166789A (en) Method and device for voice recognition of chinese using phoneme similarity vector at beginning or end
CN106653002A (en) Literal live broadcasting method and platform
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN111785302A (en) Speaker separation method and device and electronic equipment
CN107492373B (en) Tone recognition method based on feature fusion
Chen et al. An end-to-end speech recognition algorithm based on attention mechanism
Sajeer et al. Novel approach of implementing speech recognition using neural networks for information retrieval
KR20210059581A (en) Method and apparatus for automatic proficiency evaluation of speech
Aşlyan Syllable Based Speech Recognition
CN111312216B (en) Voice marking method containing multiple speakers and computer readable storage medium
Yeh et al. Taiwanese speech recognition based on hybrid deep neural network architecture
Vyas et al. Study of Speech Recognition Technology and its Significance in Human-Machine Interface

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant