CN103065633B

CN103065633B - Speech recognition decoding efficiency optimization method

Info

Publication number: CN103065633B
Application number: CN201210580290.2A
Authority: CN
Inventors: 鹿晓亮; 赵志伟; 陈旭; 尚丽; 吴晓如; 于振华; 潘青华
Original assignee: iFlytek Co Ltd
Current assignee: Iflytek Medical Technology Co ltd
Priority date: 2012-12-27
Filing date: 2012-12-27
Publication date: 2015-01-14
Anticipated expiration: 2032-12-27
Also published as: CN103065633A

Abstract

The invention relates to a method for optimizing speech recognition decoding efficiency, implemented by the following steps: for every three frames of speech feature vectors, Viterbi dynamic programming is performed within each arc, where each arc can output at most three scores and their corresponding paths, one for each of three consecutive frames; according to the Viterbi algorithm, the three scores and corresponding paths are passed to the successor node of the arc to compete; the winners are retained on the node and, when the next three frames arrive, are expanded onto the node's outgoing arcs; for the last frame of speech feature vectors, the winning path passed to the final node of the decoding network is the optimal path; and the optimal path is backtracked to obtain the corresponding word sequence, which is the recognition result. By adopting this efficiency-optimized frame semi-synchronous method, the invention reduces the amount of memory access during recognition and improves the efficiency of the whole system.

Description

A method for optimizing speech recognition decoding efficiency

Technical field

The present invention relates to a method for optimizing speech recognition decoding efficiency in a continuous speech recognition system, used to increase the number of concurrent channels and the recognition speed of cloud-based speech recognition systems.

Background art

With the spread of voice input features and applications on smart terminals such as mobile phones, users rely on voice input on these terminals in more and more scenarios. Most of these applications are cloud-based: the smart terminal records and compresses the audio, then sends the data to a recognition server in the cloud, and the recognition result is returned to the terminal. For a cloud-based speech recognition system, if the number of concurrent channels and the recognition speed of a single recognition server can be increased, the same number of servers can support more simultaneous users, saving substantial hardware cost for the whole cloud platform. However, to improve recognition accuracy, large-scale language models and acoustic models are usually trained, and loading the decoding network built from these models typically requires tens of gigabytes of memory. The recognition process must query this memory frequently, and, particularly with many concurrent channels, memory read bandwidth can become the bottleneck limiting system efficiency (number of concurrent channels and recognition speed).

As shown in Fig. 1, a current continuous speech recognition system comprises the following parts: endpoint detection, feature extraction, decoding, and result output. Of these modules, the decoding module accounts for the largest share of computation (more than 80%) and reads memory most frequently. It is the module that most affects overall system efficiency (number of concurrent channels and recognition speed), and the core module most in need of efficiency optimization.

Current decoding schemes are based on frame-synchronous Viterbi decoding. The system first uses the acoustic model to expand the semantic network of the language model into a search network at the model-state level, shown schematically in Fig. 2. In this state-node search network, all acoustic model states are arranged repeatedly in time order, so that the column of states at each time point corresponds to one frame of speech feature vectors. During search, the accumulated path probability of each column of state nodes is computed with respect to the input speech frame. When the last speech frame is reached, the state node with the maximum accumulated probability is the optimal node; backtracking through the states from this node yields the optimal decoded state sequence and hence the corresponding word sequence.
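The frame-synchronous search and backtracking described above can be sketched as follows. This is a minimal illustration in the log domain, assuming a small dense trellis rather than a real decoding network; the function name `viterbi` and its argument layout are hypothetical and not taken from the patent.

```python
import math

NEG_INF = -math.inf  # score of an impossible transition or start state


def viterbi(log_trans, log_emit, log_init):
    """Frame-synchronous Viterbi over a state trellis (illustrative only).

    log_trans[i][j]: transition score from state i to state j
    log_emit[t][j]:  likelihood score of frame t under state j
    log_init[j]:     score of starting in state j
    Returns (best_score, best_state_sequence).
    """
    n_frames, n_states = len(log_emit), len(log_init)
    # First trellis column: start score plus the first frame's likelihood.
    score = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    backptr = []
    for t in range(1, n_frames):
        new_score, ptr = [], []
        for j in range(n_states):
            # Best predecessor of state j at frame t.
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    # At the last frame, the state with the maximum accumulated score wins.
    last = max(range(n_states), key=lambda j: score[j])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])  # follow back-pointers towards frame 0
    path.reverse()
    return score[last], path


# Two-state left-to-right toy model with made-up scores.
trans = [[-1, -2], [NEG_INF, -1]]
emit = [[-1, -5], [-4, -1], [-5, -1]]
init = [0, NEG_INF]
print(viterbi(trans, emit, init))  # prints (-6, [0, 1, 1])
```

Every frame touches every state column here, which is the access pattern the rest of the description sets out to reduce.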

An actual decoding network is shown in Fig. 3, where each red dot represents a node in the decoding network and each rectangle represents an arc. Each arc contains 3 states, which correspond to the states in Fig. 2. The algorithm proceeds as follows: (1) for each frame of speech feature vectors, dynamic programming is first performed within each arc, and each arc can output at most one score and its corresponding path; (2) according to the Viterbi algorithm, this score and path are passed to the successor node of the arc to compete, and the winner is retained; (3) the winner retained on the node is expanded onto the node's outgoing arcs when the next frame arrives; (4) for the last frame of speech feature vectors, the winning path passed to the final node (Final) of the decoding network is the optimal path; (5) the optimal path is backtracked to obtain the corresponding word sequence, which is the recognition result.

With existing decoding techniques, each time a frame of feature vectors arrives, every node in the decoding network must access all of its outgoing arcs and pass the winning score and corresponding Viterbi path at that node on to those arcs. For a continuous speech recognition system, particularly one based on a speech cloud, the decoding network can occupy tens of gigabytes of memory, and accessing a node's outgoing arcs means accessing all the memory that those arcs occupy. With many concurrent channels (that is, multiple users using the recognition service on the same recognition server at the same time), hundreds of thousands or even millions of nodes may access different memory locations simultaneously. Memory access on this scale is a challenge for the memory bandwidth of servers in current mainstream configurations. Insufficient memory bandwidth causes stalls during memory access, which reduces the recognition speed of the whole system.

Summary of the invention

Technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a method for optimizing speech recognition decoding efficiency that reduces the number of memory accesses when decoding in a decoding network with a large memory footprint, avoids the bottleneck of insufficient memory bandwidth, and thereby improves the recognition efficiency of a continuous speech recognition system.

Technical solution of the invention: a method for optimizing speech recognition decoding efficiency. Its main difference from the traditional frame-synchronous decoding algorithm is that Viterbi is not performed for every frame of speech feature vectors but once every three frames; this is called the frame semi-synchronous decoding algorithm. It is implemented as follows:

(1) for every three frames of speech feature vectors, Viterbi dynamic programming is first performed within each arc; each arc can output at most three scores and their corresponding paths, one for each of three consecutive frames;

(2) according to the Viterbi algorithm, these three scores and corresponding paths are passed to the successor node of the arc to compete (each against the scores and paths of the same frame);

(3) the winners retained on the node are expanded onto the node's outgoing arcs when the next three frames arrive;

(4) for the last frame of speech feature vectors, the winning path passed to the final node (Final) of the decoding network is the optimal path;

(5) the optimal path is backtracked to obtain the corresponding word sequence, which is the recognition result.

In step (2), the competition is carried out as follows. Each node has one or more arcs connected to it. At a given time t, one or more arcs pass paths to the node (each path carries a score that characterizes how likely the path is), and each arc can pass three paths to the node, corresponding to times t-2, t-1, and t. Among the paths passed over by all arcs, those for the same time compete by score: the path with the highest score is retained and the others are deleted.
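The per-frame competition at a node can be sketched as below. This is a minimal illustration under assumed data structures: the function name `compete_at_node` and the representation of each arc's output as a dictionary from frame index to a (score, path) pair are hypothetical, not part of the patent text.

```python
def compete_at_node(arc_outputs):
    """Keep, for each frame, the best-scoring path among all incoming arcs.

    arc_outputs: one dict per incoming arc, mapping a frame index to a
                 (score, path) pair; an arc may report up to three frames.
    Returns a dict mapping each frame index to the winning (score, path).
    """
    winners = {}
    for outputs in arc_outputs:
        for frame, (score, path) in outputs.items():
            # Paths compete only against paths that end at the same frame.
            if frame not in winners or score > winners[frame][0]:
                winners[frame] = (score, path)
    return winners


# Two arcs arriving at one node at time t = 10 (scores are made up).
arc_a = {8: (-10.0, ["a"]), 9: (-12.0, ["a"]), 10: (-15.0, ["a"])}
arc_b = {9: (-11.0, ["b"]), 10: (-16.0, ["b"])}
winners = compete_at_node([arc_a, arc_b])
```

In this example the node keeps arc `a`'s path for frames 8 and 10 and arc `b`'s path for frame 9, so up to three winners survive per node, one per frame of the block.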

Advantages of the present invention over the prior art: the speech recognition decoding process adopts a frame semi-synchronous method which, for a decoding network with a large memory footprint, effectively reduces the number of memory accesses. When memory access bandwidth is limited, this significantly improves the efficiency of speech recognition decoding, increases the number of concurrent channels and the recognition speed, saves hardware cost for cloud-based speech recognition, and improves the user experience.

Brief description of the drawings

Fig. 1 is a schematic diagram of a continuous speech recognition system;

Fig. 2 is a schematic diagram of an arc comprising 3 states;

Fig. 3 is a simple actual decoding network;

Fig. 4 is a flow chart of the implementation of the present invention.

Detailed description of the embodiments

The present invention adopts a frame semi-synchronous method for efficiency optimization in speech recognition with a large memory footprint (particularly cloud-based speech recognition), reducing the amount of memory access during recognition and thereby improving the efficiency of the whole system.

Compared with the traditional frame-synchronous algorithm, the main difference of the frame semi-synchronous algorithm is that the Viterbi dynamic programming algorithm is run once every three frames. The flow is shown in Fig. 4:

1. First perform the dynamic programming for time t+1; each state is updated as follows:

$$q_{t+1}(2) = \max\left[q_t(1) + a_{12},\; q_t(2) + a_{22}\right] + b_2(o_{t+1})$$

$$q_{t+1}(3) = \max\left[q_t(2) + a_{23},\; q_t(3) + a_{33}\right] + b_3(o_{t+1})$$

Next perform the dynamic programming for time t+2; each state is updated as follows:

$$q_{t+2}(2) = \max\left[q_{t+1}(1) + a_{12},\; q_{t+1}(2) + a_{22}\right] + b_2(o_{t+2})$$

$$q_{t+2}(3) = \max\left[q_{t+1}(2) + a_{23},\; q_{t+1}(3) + a_{33}\right] + b_3(o_{t+2})$$

Then perform the dynamic programming for time t+3; each state is updated as follows:

$$q_{t+3}(2) = \max\left[q_{t+2}(1) + a_{12},\; q_{t+2}(2) + a_{22}\right] + b_2(o_{t+3})$$

$$q_{t+3}(3) = \max\left[q_{t+2}(2) + a_{23},\; q_{t+2}(3) + a_{33}\right] + b_3(o_{t+3})$$

where $q_t(i)$ is the score of the i-th state at frame t; $b_i(o_t)$ is the likelihood score of frame t for the i-th state; and $a_{ij}$ is the transition probability from the i-th state to the j-th state.

2. According to the Viterbi algorithm, these three scores and paths are passed to the successor node of the arc to compete (each against the scores and paths of the same frame);

The competition is carried out as follows. Each node has one or more arcs connected to it. At a given time t, one or more arcs pass paths to the node (each path carries a score that characterizes how likely the path is), and each arc can pass three paths to the node, corresponding to times t-2, t-1, and t. Among the paths passed over by all arcs, those for the same time compete by score: the path with the highest score is retained and the others are deleted.

3. The winners retained on the node are expanded onto the node's outgoing arcs when the next three frames arrive;

4. For the last frame of speech feature vectors, the winning path passed to the final node (Final) of the decoding network is the optimal path;

5. The optimal path is backtracked to obtain the corresponding word sequence, which is the recognition result.
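The three in-arc updates of step 1 can be sketched as a single function that advances a 3-state arc by a block of frames in one visit. This is a minimal illustration, not the patented implementation: the function name `arc_block_update`, the `entry` argument, and the update for state 1 are assumptions, since the text gives updates only for states 2 and 3.

```python
import math

NEG_INF = -math.inf  # score of a path that does not exist


def arc_block_update(q, entry, a, b):
    """Advance one 3-state arc by len(entry) frames in a single visit.

    q:     [q(1), q(2), q(3)], the state scores at frame t
    entry: per-frame scores arriving from the arc's start node
           (assumption: they feed state 1, which the text does not spell out)
    a:     a[i][j], transition score from state i to state j
           (1-based indices; row and column 0 are unused)
    b:     b[k][i], likelihood score of the k-th frame of the block for state i
    Returns (new_q, exits), where exits[k] is q(3) after the k-th frame.
    """
    q1, q2, q3 = q
    exits = []
    for k in range(len(entry)):
        # Every right-hand side uses the previous frame's scores.
        n1 = max(entry[k], q1 + a[1][1]) + b[k][1]      # assumed form for state 1
        n2 = max(q1 + a[1][2], q2 + a[2][2]) + b[k][2]  # update given for state 2
        n3 = max(q2 + a[2][3], q3 + a[3][3]) + b[k][3]  # update given for state 3
        q1, q2, q3 = n1, n2, n3
        exits.append(q3)  # one candidate output per frame, at most three per block
    return [q1, q2, q3], exits


# Made-up scores: every allowed transition and every likelihood costs -1.
a = [[0, 0, 0, 0],
     [0, -1, -1, NEG_INF],
     [0, NEG_INF, -1, -1],
     [0, NEG_INF, NEG_INF, -1]]
b = [[0, -1, -1, -1]] * 3
new_q, exits = arc_block_update([-2, -4, NEG_INF], [NEG_INF] * 3, a, b)
print(new_q, exits)  # prints [-8, -8, -8] [-6, -6, -8]
```

Processing the block in one call yields the same scores as three single-frame calls; the difference is that the arc's data is visited once per block instead of once per frame.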

As the above flow shows, compared with the traditional frame-synchronous decoding algorithm, a node in the decoding network under the frame semi-synchronous algorithm passes paths onward only once every three frames; that is, the memory corresponding to all of the node's outgoing arcs is accessed only once every three frames. The amount of memory access is thus reduced to one third of the original, the memory stalls caused by the memory bandwidth bottleneck are greatly reduced, and the efficiency of the whole recognition system improves substantially.
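The one-third figure follows from simple counting, sketched below for a single active node. The function name `arc_visits_per_node`, the utterance length, and the 10 ms frame shift are illustrative assumptions rather than values from the patent.

```python
import math


def arc_visits_per_node(n_frames, frames_per_pass=1):
    """How many times one active node touches its outgoing arcs over an utterance."""
    return math.ceil(n_frames / frames_per_pass)


frames = 300  # hypothetical utterance: 3 s at an assumed 10 ms frame shift
sync = arc_visits_per_node(frames)                     # frame-synchronous
semi = arc_visits_per_node(frames, frames_per_pass=3)  # frame semi-synchronous
print(sync, semi)
```

When the frame count is not a multiple of three, the final partial block still costs one visit, so the ratio is slightly above one third for short utterances.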

Parts of the present invention not described in detail belong to techniques well known in the art.

The above is only a partial embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for optimizing speech recognition decoding efficiency, characterized in that it is implemented by the following steps:

(2) according to the Viterbi algorithm, these three scores and corresponding paths are passed to the successor node of the arc to compete, producing three new optimal paths that are passed onward until they reach the final node of the decoding network, where the optimal recognition result is produced; said competition means competing with the scores and paths of the corresponding frame;

(3) the winners retained on the node are expanded onto the node's outgoing arcs when the next three frames arrive;

(4) for the last frame of speech feature vectors, the winning path passed to the final node of the decoding network is the optimal path;

2. The method for optimizing speech recognition decoding efficiency according to claim 1, characterized in that the competition in step (2) is carried out as follows: each node has one or more arcs connected to it; at a given time t, one or more arcs pass paths to the node, each path carrying a score that characterizes how likely the path is, and each arc can pass three paths to the node, corresponding to times t-2, t-1, and t; among the paths passed over by all arcs, those for the same time compete by score, the path with the highest score is retained, and the others are deleted.