WO2017177484A1 - Speech recognition decoding method and device - Google Patents
Speech recognition decoding method and device
- Publication number
- WO2017177484A1 (PCT/CN2016/081334)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- acoustic
- frame
- information
- decoding
- model
- Prior art date: 2016-04-11
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/193—Formal grammars, e.g. finite state automata, context free grammars or word networks
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
Definitions
- The invention belongs to the field of speech processing and in particular relates to a method and device for speech recognition decoding.
- Speech recognition is an artificial-intelligence technique that lets a machine transform a speech signal into the corresponding text or command through a process of recognition and understanding.
- In conventional speech recognition, all linguistic information (including the pronunciation sequences of words, the occurrence probabilities of word combinations, and so on) is separately converted into structures with four attributes: "input", "output", "path weight", and "state transition".
- All the converted linguistic information is then composed together and, after global optimization of the network structure, forms a complete speech-recognition search network in which the decoding process searches.
- The construction process is roughly as shown in the figure below (in the examples, the value after "/" denotes a path weight).
- Conventional speech recognition is built on hidden Markov models (HMMs), frame-synchronous decoding, and weighted finite-state transducers (WFSTs), and has three main drawbacks: HMM acoustic modeling is imperfect; frame-by-frame synchronous decoding involves a huge and largely redundant amount of computation; and the WFST under this framework consumes a large amount of computation and memory.
- To solve the above problems, an embodiment of the present invention provides a speech recognition decoding method and apparatus. The technical solution is as follows:
- A speech recognition decoding method, comprising: receiving speech information and extracting acoustic features; and computing the acoustic feature information according to a connectionist temporal classification (CTC) model.
- The acoustic feature information mainly comprises vectors extracted frame by frame from the acoustic information of the sound wave.
- The acoustic information storage structure is a CTC word lattice: the storage structure of the acoustic feature information is represented as a weighted finite-state transducer in which all candidate acoustic model outputs between two adjacent model-output instants are connected pairwise.
- Specifically, after each frame of acoustic features is input, the CTC model yields the occurrence probability of each phoneme frame by frame.
- If a frame in the acoustic feature information is a non-blank-model frame, a linguistic information search is performed using a WFST adapted to the acoustic modeling information and the history is stored; otherwise the frame is discarded.
- The method further includes outputting the speech recognition result through phoneme-synchronous decoding.
- A speech recognition decoding apparatus, comprising:
- a feature extraction module configured to receive speech information and extract acoustic features;
- an acoustic calculation module configured to compute the acoustic feature information according to the CTC model;
- the acoustic feature information mainly comprising vectors extracted frame by frame from the acoustic information of the sound wave.
- The acoustic information storage structure is a CTC word lattice: the storage structure of the acoustic feature information is represented as a weighted finite-state transducer in which all candidate acoustic model outputs between two adjacent model-output instants are connected pairwise.
- Specifically, after each frame of acoustic features is input, the CTC model yields the occurrence probability of each phoneme frame by frame.
- The apparatus further comprises a decoding search module configured to, if a frame in the acoustic feature information is a non-blank-model frame, perform the linguistic information search using a WFST adapted to the acoustic modeling information and store the history, and otherwise discard the frame.
- The apparatus also includes a phoneme decoding module that outputs the speech recognition result through phoneme-synchronous decoding.
- By building a connectionist temporal classification model, the invention makes acoustic modeling more accurate; the improved weighted finite-state transducer makes the model representation more efficient, cutting computation and memory consumption by nearly 50%; and phoneme-synchronous decoding effectively reduces the amount and number of computations in the model search.
- FIG. 1 is a flowchart of a speech recognition decoding method according to the first embodiment of the present invention;
- FIG. 2 is a schematic diagram of the weighted finite-state transducer adapted to the acoustic modeling information according to an embodiment of the present invention;
- FIG. 3 is a schematic diagram of the acoustic information structure provided by an embodiment of the present invention;
- FIG. 4 is a flowchart of a phoneme-synchronous decoding method according to the second embodiment of the present invention;
- FIG. 5 is a schematic structural diagram of speech recognition decoding according to an embodiment of the present invention.
- FIG. 1 shows the flow of the speech recognition decoding method provided by the first embodiment of the present invention, which includes:
- S101: receive speech information and extract acoustic features.
- Feature extraction uses conventional signal processing to turn the acoustic information of the sound wave, frame by frame, into a vector that back-end modeling and decoding use as the input feature.
- S102: compute the acoustic feature information according to the CTC model.
- The acoustic feature information mainly comprises vectors extracted frame by frame from the acoustic information of the sound wave.
- The acoustic information storage structure is a CTC word lattice: the storage structure of the acoustic feature information is represented as a weighted finite-state transducer in which all candidate acoustic model outputs between two adjacent model-output instants are connected pairwise.
- The phoneme information of the audio is modeled on the basis of the temporal classification model.
- Specifically, training data annotated with audio content is collected and, after preprocessing and feature extraction, used as model input and output to train the temporal classification model.
- Trained on massive data, the final CTC model is obtained for use in the model search.
- After each frame of acoustic features is input, the trained model gives the probability of every possible modeling unit, where the modeling unit is the phoneme.
- Specifically, after each frame of acoustic features is input, the CTC model yields phoneme occurrence probabilities frame by frame.
- S103: if a frame in the acoustic feature information is a non-blank-model frame, perform a linguistic information search using the WFST adapted to the acoustic modeling information and store the history; otherwise discard the frame.
- A weighted finite-state transducer is a structure for representing the speech recognition search network.
- For a speech recognition system that uses the CTC model, a corresponding WFST adapted to the acoustic modeling information is designed. The model emphasizes efficiency and saves memory and computing resources. Its structure is shown in FIG. 2, where "<blk>" denotes the blank model in the CTC model, "<eps>" denotes the empty label, "#1" accommodates words with multiple pronunciations in the WFST that represents word pronunciation sequences, "a" denotes an example model in the CTC model, and "..." denotes the other models in the CTC model. Compared with other existing structures of the same kind, this structure reduces the algorithm's computation and memory consumption by about 50% while remaining fully equivalent in linguistic information.
- The method further includes outputting the speech recognition result through phoneme-synchronous decoding.
- This embodiment proposes the CTC word lattice, an efficient acoustic information storage structure that serves as the carrier for the phoneme-synchronous decoding described above.
- This acoustic information structure is represented as a WFST by connecting pairwise all candidate acoustic model outputs between two adjacent model-output instants.
- FIG. 3 shows a construction example of this structure; the example acoustic information corresponding to the structure is given in Table 1.
- By building a connectionist temporal classification model, the embodiment of the invention makes acoustic modeling more accurate; the improved weighted finite-state transducer makes the model representation more efficient, cutting computation and memory consumption by nearly 50%; and phoneme-synchronous decoding effectively reduces the amount and number of computations in the model search.
- The probability output distribution of the CTC model is characterized by isolated sharp peaks: a sentence corresponds to one set of per-frame probability outputs, conventionally plotted with the probability value on the vertical axis and time on the horizontal axis, where peaks of different colors represent the outputs of different models.
- Based on this phenomenon, the present embodiment proposes a novel phoneme-synchronous decoding method to replace conventional frame-by-frame synchronous decoding.
- The phoneme-synchronous decoding method performs the linguistic network search only when a non-blank model output appears; otherwise the acoustic information of the current frame is discarded directly and processing moves to the next frame.
- The algorithm flow is shown in FIG. 4.
- FIG. 4 shows the flow of the phoneme-synchronous decoding method provided by the second embodiment of the present invention, detailed as follows:
- S401: initialize the algorithm;
- S402: determine whether the speech has ended; if so, backtrack and output the decoding result, otherwise go to step S403;
- S403: extract acoustic features;
- S404: compute the acoustic information using the CTC model;
- S405: determine for each frame of the acoustic information whether it is a blank-model frame; if so, discard it directly, otherwise go to step S406;
- S406: perform the linguistic search using the WFST;
- S407: store the linguistic history information;
- S408: after acquiring the linguistic history information, backtrack and output the decoding result.
- This method discards the linguistic network searches corresponding to the large number of redundant blank models without any loss of search space.
- By building a connectionist temporal classification model, the embodiment of the invention makes acoustic modeling more accurate; the improved weighted finite-state transducer makes the model representation more efficient, cutting computation and memory consumption by nearly 50%; and phoneme-synchronous decoding effectively reduces the amount and number of computations in the model search.
- FIG. 5 shows a schematic structural diagram of speech recognition decoding provided by an embodiment of the present invention, detailed as follows:
- a feature extraction module 51 configured to receive speech information and extract acoustic features;
- an acoustic calculation module 52 configured to compute the acoustic feature information according to the CTC model;
- the acoustic feature information mainly comprising vectors extracted frame by frame from the acoustic information of the sound wave.
- The acoustic information storage structure is a CTC word lattice: the storage structure of the acoustic feature information is represented as a weighted finite-state transducer in which all candidate acoustic model outputs between two adjacent model-output instants are connected pairwise.
- Specifically, after each frame of acoustic features is input, the CTC model yields phoneme occurrence probabilities frame by frame.
- The apparatus further comprises a decoding search module 53 that, if a frame in the acoustic feature information is a non-blank-model frame, performs the linguistic information search using the WFST adapted to the acoustic modeling information and stores the history, and otherwise discards the frame.
- The apparatus also includes a phoneme decoding module 54 that outputs the speech recognition result through phoneme-synchronous decoding.
- By building a connectionist temporal classification model, the invention makes acoustic modeling more accurate; the improved weighted finite-state transducer makes the model representation more efficient, cutting computation and memory consumption by nearly 50%; and phoneme-synchronous decoding effectively reduces the amount and number of computations in the model search.
- A person skilled in the art will understand that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium.
- The storage medium mentioned may be a read-only memory, a magnetic disk, an optical disc, or the like.
Abstract
A speech recognition decoding method and device. The method comprises: receiving speech information and extracting acoustic features (S101); computing the acoustic feature information according to a connectionist temporal classification (CTC) model (S102); and, if a frame in the acoustic feature information is a non-blank-model frame, performing a linguistic information search using a weighted finite-state transducer (WFST) adapted to the acoustic modeling information and storing the history, otherwise discarding the frame (S103). Building the CTC model makes acoustic modeling more accurate; the improved WFST makes the model representation more efficient, cutting computation and memory consumption by nearly 50%; and using phoneme synchronization in decoding effectively reduces the amount and number of computations in the model search.
Description
The present invention belongs to the field of speech processing and specifically relates to a speech recognition decoding method and device.
Speech recognition is an artificial-intelligence technique that lets a machine transform a speech signal into the corresponding text or command through a process of recognition and understanding. In conventional speech recognition, all linguistic information (including the pronunciation sequences of words, the occurrence probabilities of word combinations, and so on) is separately converted into structures with four attributes: "input", "output", "path weight", and "state transition". All the converted linguistic information is composed together and, after global optimization of the network structure, forms a complete speech-recognition search network in which the decoding process searches. The construction flow is roughly as shown in the figure below (in the examples, the value after "/" denotes a path weight):
Conventional speech recognition is built on the hidden Markov model, frame-synchronous decoding, and the weighted finite-state transducer (WFST), and has the following main drawbacks:
the modeling power of the hidden Markov model is flawed;
frame-by-frame synchronous decoding involves a huge and redundant amount of computation;
the WFST under this framework consumes a large amount of computation and memory.
SUMMARY OF THE INVENTION
To solve the above problems, embodiments of the present invention provide a speech recognition decoding method and device. The technical solution is as follows:
In a first aspect, a speech recognition decoding method includes:
receiving speech information and extracting acoustic features;
computing the acoustic feature information according to a connectionist temporal classification (CTC) model;
where the acoustic feature information mainly comprises vectors extracted frame by frame from the acoustic information of the sound wave.
The acoustic information storage structure is a CTC word lattice: the storage structure of the acoustic feature information is represented as a WFST in which all candidate acoustic model outputs between two adjacent model-output instants are connected pairwise.
Specifically, after each frame of acoustic features is input, the CTC model yields the occurrence probability of each phoneme frame by frame.
If a frame in the acoustic feature information is a non-blank-model frame, a linguistic information search is performed using a WFST adapted to the acoustic modeling information and the history is stored; otherwise the frame is discarded.
Specifically, the method further includes outputting the speech recognition result through phoneme-synchronous decoding.
In a second aspect, a speech recognition decoding device includes:
a feature extraction module configured to receive speech information and extract acoustic features;
an acoustic calculation module configured to compute the acoustic feature information according to the CTC model;
where the acoustic feature information mainly comprises vectors extracted frame by frame from the acoustic information of the sound wave.
The acoustic information storage structure is a CTC word lattice: the storage structure of the acoustic feature information is represented as a WFST in which all candidate acoustic model outputs between two adjacent model-output instants are connected pairwise.
Specifically, after each frame of acoustic features is input, the CTC model yields the occurrence probability of each phoneme frame by frame.
The device further includes a decoding search module configured to, if a frame in the acoustic feature information is a non-blank-model frame, perform the linguistic information search using a WFST adapted to the acoustic modeling information and store the history, and otherwise discard the frame.
The device also includes a phoneme decoding module that outputs the speech recognition result through phoneme-synchronous decoding.
By building a connectionist temporal classification model, the present invention makes acoustic modeling more accurate; the improved WFST makes the model representation more efficient, cutting computation and memory consumption by nearly 50%; and using phoneme synchronization in decoding effectively reduces the amount and number of computations in the model search.
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a speech recognition decoding method provided by the first embodiment of the present invention;
FIG. 2 is a schematic diagram of the WFST adapted to the acoustic modeling information provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the acoustic information structure provided by an embodiment of the present invention;
FIG. 4 is a flowchart of a phoneme-synchronous decoding method provided by the second embodiment of the present invention;
FIG. 5 is a schematic structural diagram of speech recognition decoding provided by an embodiment of the present invention.
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
FIG. 1 shows the flow of the speech recognition decoding method provided by the first embodiment of the present invention, which specifically includes:
S101: receive speech information and extract acoustic features.
Feature extraction uses conventional signal processing to turn the acoustic information of the sound wave, frame by frame, into a vector for back-end modeling and decoding to use as the input feature.
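As a rough illustration of this front-end step, the sketch below frames a waveform and produces one log-energy vector per frame. It is a minimal sketch, assuming 16 kHz audio, 25 ms windows with a 10 ms shift, and simple band-pooled FFT energies; the function name, frame sizes, and feature choice are illustrative assumptions, not the patent's prescribed front end.

```python
import numpy as np

def extract_features(wave: np.ndarray, sr: int = 16000, win_ms: float = 25.0,
                     hop_ms: float = 10.0, n_bins: int = 40) -> np.ndarray:
    """Turn a sound wave into one feature vector per frame (illustrative)."""
    win, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = max(0, 1 + (len(wave) - win) // hop)
    window = np.hamming(win)
    feats = np.empty((n_frames, n_bins))
    for t in range(n_frames):
        frame = wave[t * hop: t * hop + win] * window
        power = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
        bands = np.array_split(power, n_bins)              # crude band pooling
        feats[t] = np.log([b.sum() + 1e-10 for b in bands])
    return feats  # shape: (n_frames, n_bins), input for the acoustic model
```

Under these assumed settings, a one-second waveform yields 98 frames, matching the frame-by-frame processing described in the following steps.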
S102: compute the acoustic feature information according to the CTC model.
The acoustic feature information mainly comprises vectors extracted frame by frame from the acoustic information of the sound wave.
The acoustic information storage structure is a CTC word lattice: the storage structure of the acoustic feature information is represented as a WFST in which all candidate acoustic model outputs between two adjacent model-output instants are connected pairwise.
The phoneme information of the audio is modeled on the basis of the temporal classification model. Specifically, training data annotated with audio content is collected and, after preprocessing and feature extraction, used as model input and output to train the temporal classification model. Trained on massive data, the final CTC model is obtained for use in the model search. After each frame of acoustic features is input, the trained model gives the probability of every possible modeling unit, where the modeling unit is the phoneme.
Specifically, after each frame of acoustic features is input, the CTC model yields phoneme occurrence probabilities frame by frame.
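For intuition, here is a minimal sketch of the frame-by-frame probability output, assuming some acoustic network has already produced per-frame logits over the phoneme set plus the blank symbol; the toy symbol list and the softmax normalization are illustrative assumptions, not the patent's specified model.

```python
import numpy as np

PHONES = ["<blk>", "a1", "a2", "a4", "a5", "ai1", "ai3"]   # toy CTC symbol set

def ctc_posteriors(logits: np.ndarray) -> np.ndarray:
    """Per-frame phoneme occurrence probabilities via a row-wise softmax."""
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)                 # each row sums to 1

# Example: pretend logits for 3 frames over the toy symbol set.
rng = np.random.default_rng(0)
probs = ctc_posteriors(rng.normal(size=(3, len(PHONES))))
for t, row in enumerate(probs):
    print(f"frame {t}: best={PHONES[int(row.argmax())]} p={row.max():.2f}")
```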
S103: if a frame in the acoustic feature information is a non-blank-model frame, perform the linguistic information search using the WFST adapted to the acoustic modeling information and store the history; otherwise discard the frame.
A weighted finite-state transducer is a structure for representing the speech recognition search network. For a speech recognition system that uses the CTC model, a corresponding WFST adapted to the acoustic modeling information is designed. The model emphasizes efficiency and saves memory and computing resources. Its structure is shown in FIG. 2, where "<blk>" denotes the blank model in the CTC model, "<eps>" denotes the empty label, "#1" accommodates words with multiple pronunciations in the WFST that represents word pronunciation sequences, "a" denotes an example model in the CTC model, and "..." denotes the other models in the CTC model. Compared with other existing structures of the same kind, this structure reduces the algorithm's computation and memory consumption by about 50% while remaining fully equivalent in linguistic information.
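To make the WFST representation concrete, the following is a minimal sketch of a weighted transducer stored as arcs with an input label, an output label, a weight, and a next state, together with a single-step match. The three-state fragment shown (a "<blk>" self-loop, arcs for an example CTC label "a", and a "#1"-disambiguated word arc) is a hypothetical miniature for illustration, not the actual topology of FIG. 2.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arc:
    ilabel: str     # input symbol consumed (a CTC label such as "a", or "<blk>")
    olabel: str     # output symbol emitted ("<eps>" for none, or a word)
    weight: float   # path weight (lower is better, tropical-semiring style)
    nextstate: int

# Hypothetical 3-state fragment: blank self-loop, phone arcs, word arc.
FST = {
    0: [Arc("<blk>", "<eps>", 0.0, 0), Arc("a", "<eps>", 0.5, 1)],
    1: [Arc("a", "<eps>", 0.0, 1), Arc("#1", "ah", 0.3, 2)],
    2: [],
}

def step(state: int, ilabel: str) -> list:
    """All (next state, output, weight) reachable by consuming ilabel."""
    return [(a.nextstate, a.olabel, a.weight)
            for a in FST[state] if a.ilabel == ilabel]

print(step(0, "a"))   # [(1, '<eps>', 0.5)]
```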
Specifically, the method further includes outputting the speech recognition result through phoneme-synchronous decoding.
This embodiment proposes the CTC word lattice, an efficient acoustic information storage structure used as the carrier for the phoneme-synchronous decoding proposed above.
This acoustic information structure is represented as a WFST by connecting pairwise all candidate acoustic model outputs between two adjacent model-output instants. FIG. 3 shows a construction example of this structure; the example acoustic information corresponding to the structure is given in Table 1:
Time | Phone: score |
---|---|
0.4 s | <blk>: 0.2, a2: 0.5, a4: 0.2 |
0.9 s | <blk>: 0.3, a1: 0.6 |
1.5 s | a5: 0.3, ai1: 0.2, ai3: 0.2 |
Table 1: Example acoustic information for the acoustic information structure
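A minimal sketch of how such a CTC word lattice might be held in memory, assuming each model-output instant stores its candidate phones with scores and that the candidates of adjacent instants are connected pairwise as described above; the class name, fields, and the product combination of scores are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class LatticeSlice:
    time: float          # model-output instant, in seconds
    candidates: dict     # phone -> score at this instant

# The lattice of Table 1. Every candidate at one instant is implicitly
# connected to every candidate at the next instant (pairwise connection).
lattice = [
    LatticeSlice(0.4, {"<blk>": 0.2, "a2": 0.5, "a4": 0.2}),
    LatticeSlice(0.9, {"<blk>": 0.3, "a1": 0.6}),
    LatticeSlice(1.5, {"a5": 0.3, "ai1": 0.2, "ai3": 0.2}),
]

def edges(a: LatticeSlice, b: LatticeSlice):
    """Enumerate the pairwise arcs between two adjacent output instants."""
    for p1, s1 in a.candidates.items():
        for p2, s2 in b.candidates.items():
            yield p1, p2, s1 * s2   # illustrative: combine scores by product

print(sum(1 for _ in edges(lattice[0], lattice[1])))   # 3 x 2 = 6 arcs
```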
By building a connectionist temporal classification model, the embodiments of the present invention make acoustic modeling more accurate; the improved WFST makes the model representation more efficient, cutting computation and memory consumption by nearly 50%; and using phoneme synchronization in decoding effectively reduces the amount and number of computations in the model search.
The probability output distribution of the CTC model is characterized by isolated sharp peaks: a sentence corresponds to one set of per-frame probability outputs, conventionally plotted with the probability value on the vertical axis and time on the horizontal axis, where peaks of different colors represent the outputs of different models.
Based on this phenomenon, this embodiment proposes a novel phoneme-synchronous decoding method to replace conventional frame-by-frame synchronous decoding. The phoneme-synchronous decoding method performs the linguistic network search only when a non-blank model output appears; otherwise the acoustic information of the current frame is discarded directly and processing moves to the next frame. The algorithm flow is shown in FIG. 4.
FIG. 4 shows the flow of the phoneme-synchronous decoding method provided by the second embodiment of the present invention, detailed as follows:
S401: initialize the algorithm;
S402: determine whether the speech has ended; if so, backtrack and output the decoding result, otherwise go to step S403;
S403: extract acoustic features;
S404: compute the acoustic information using the CTC model;
S405: determine for each frame of the acoustic information whether it is a blank-model frame; if so, discard it directly, otherwise go to step S406;
S406: perform the linguistic search using the WFST;
S407: store the linguistic history information;
S408: after acquiring the linguistic history information, backtrack and output the decoding result.
This method discards the linguistic network searches corresponding to the large number of redundant blank models without any loss of search space.
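Putting steps S401–S408 together, the following is a minimal sketch of the phoneme-synchronous decoding loop. It reuses the illustrative `extract_features` and `ctc_posteriors` helpers sketched earlier and takes the acoustic network, linguistic search, and backtracking as injected callables; treating a frame whose arg-max is the blank symbol as a blank-model frame, and keeping a simple appended history, are simplifying assumptions rather than the patent's exact criteria.

```python
def phoneme_synchronous_decode(wave, acoustic_net, search, backtrack):
    """Skeleton of FIG. 4: search the linguistic network only for
    non-blank frames; discard blank-model frames outright (S405)."""
    feats = extract_features(wave)                  # S403: acoustic features
    probs = ctc_posteriors(acoustic_net(feats))     # S404: CTC acoustic info
    history = []                                    # linguistic history store
    for frame in probs:                             # runs until speech ends (S402)
        if int(frame.argmax()) == 0:                # index 0 is "<blk>"
            continue                                # blank frame: discard, no search
        hypothesis = search(frame, history)         # S406: WFST linguistic search
        history.append(hypothesis)                  # S407: store history
    return backtrack(history)                       # S408: backtrack and output
```

Because the linguistic search runs only on the typically sparse non-blank frames of the spiky CTC output, most frames cost a single arg-max comparison rather than a network search.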
By building a connectionist temporal classification model, the embodiments of the present invention make acoustic modeling more accurate; the improved WFST makes the model representation more efficient, cutting computation and memory consumption by nearly 50%; and using phoneme synchronization in decoding effectively reduces the amount and number of computations in the model search.
FIG. 5 shows the schematic structural diagram of speech recognition decoding provided by an embodiment of the present invention, detailed as follows:
a feature extraction module 51 configured to receive speech information and extract acoustic features;
an acoustic calculation module 52 configured to compute the acoustic feature information according to the CTC model;
where the acoustic feature information mainly comprises vectors extracted frame by frame from the acoustic information of the sound wave.
The acoustic information storage structure is a CTC word lattice: the storage structure of the acoustic feature information is represented as a WFST in which all candidate acoustic model outputs between two adjacent model-output instants are connected pairwise.
Specifically, after each frame of acoustic features is input, the CTC model yields phoneme occurrence probabilities frame by frame.
The device further includes a decoding search module 53 that, if a frame in the acoustic feature information is a non-blank-model frame, performs the linguistic information search using the WFST adapted to the acoustic modeling information and stores the result, and otherwise discards the frame.
The device also includes a phoneme decoding module 54 that outputs the speech recognition result through phoneme-synchronous decoding.
By building a connectionist temporal classification model, the present invention makes acoustic modeling more accurate; the improved WFST makes the model representation more efficient, cutting computation and memory consumption by nearly 50%; and using phoneme synchronization in decoding effectively reduces the amount and number of computations in the model search.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
- 1. A speech recognition decoding method, characterized in that the method comprises: receiving speech information and extracting acoustic features; computing the acoustic feature information according to a connectionist temporal classification (CTC) model; and, if a frame in the acoustic feature information is a non-blank-model frame, performing a linguistic information search using a weighted finite-state transducer adapted to the acoustic modeling information and storing the history, otherwise discarding the frame.
- 2. The method according to claim 1, characterized in that the method further comprises: outputting a speech recognition result through phoneme-synchronous decoding.
- 3. The method according to claim 1, characterized in that the acoustic feature information mainly comprises vectors extracted frame by frame from the acoustic information of the sound wave.
- 4. The method according to claim 1, characterized in that, after each frame of acoustic features is input, the connectionist temporal classification model yields the occurrence probability of each phoneme frame by frame.
- 5. The method according to claim 1, characterized in that the acoustic information storage structure is a connectionist-temporal-classification word lattice, the storage structure of the acoustic feature information being represented on the basis of the weighted finite-state transducer, with all candidate acoustic model outputs between two adjacent model-output instants connected pairwise.
- 6. A speech recognition decoding device, characterized in that the device comprises: a feature extraction module configured to receive speech information and extract acoustic features; an acoustic calculation module configured to compute the acoustic feature information according to a connectionist temporal classification model; and a decoding search module configured to, if a frame in the acoustic feature information is a non-blank-model frame, perform a linguistic information search using a weighted finite-state transducer adapted to the acoustic modeling information and store the history, and otherwise discard the frame.
- 7. The device according to claim 6, characterized in that the device further comprises: a phoneme decoding module configured to output a speech recognition result through phoneme-synchronous decoding.
- 8. The device according to claim 6, characterized in that the acoustic feature information mainly comprises vectors extracted frame by frame from the acoustic information of the sound wave.
- 9. The device according to claim 6, characterized in that, after each frame of acoustic features is input, the connectionist temporal classification model yields the occurrence probability of each phoneme frame by frame.
- 10. The device according to claim 6, characterized in that the acoustic information storage structure is a connectionist-temporal-classification word lattice, the storage structure of the acoustic feature information being represented on the basis of the weighted finite-state transducer, with all candidate acoustic model outputs between two adjacent model-output instants connected pairwise.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP16894814.9A EP3444806A4 (en) | 2016-04-11 | 2016-05-06 | METHOD AND DEVICE FOR VOICE RECOGNITION-BASED DECODING |
US15/562,173 US20190057685A1 (en) | 2016-04-11 | 2016-05-06 | Method and Device for Speech Recognition Decoding |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610221182.4 | 2016-04-11 | ||
CN201610221182.4A CN105895081A (zh) | 2016-04-11 | 2016-04-11 | 一种语音识别解码的方法及装置 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017177484A1 true WO2017177484A1 (zh) | 2017-10-19 |
Family
ID=57012369
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2016/081334 WO2017177484A1 (zh) | 2016-04-11 | 2016-05-06 | 一种语音识别解码的方法及装置 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20190057685A1 (zh) |
EP (1) | EP3444806A4 (zh) |
CN (1) | CN105895081A (zh) |
WO (1) | WO2017177484A1 (zh) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105895081A (zh) * | 2016-04-11 | 2016-08-24 | 苏州思必驰信息科技有限公司 | 一种语音识别解码的方法及装置 |
CN106782513B (zh) * | 2017-01-25 | 2019-08-23 | 上海交通大学 | 基于置信度的语音识别实现方法及系统 |
CN107680587A (zh) * | 2017-09-29 | 2018-02-09 | 百度在线网络技术(北京)有限公司 | 声学模型训练方法和装置 |
WO2020263034A1 (en) * | 2019-06-28 | 2020-12-30 | Samsung Electronics Co., Ltd. | Device for recognizing speech input from user and operating method thereof |
CN110288972B (zh) * | 2019-08-07 | 2021-08-13 | 北京新唐思创教育科技有限公司 | 语音合成模型训练方法、语音合成方法及装置 |
KR20210079666A (ko) * | 2019-12-20 | 2021-06-30 | 엘지전자 주식회사 | 음향 모델을 학습시키기 위한 인공 지능 장치 |
CN113539242A (zh) | 2020-12-23 | 2021-10-22 | 腾讯科技(深圳)有限公司 | 语音识别方法、装置、计算机设备及存储介质 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102968989A (zh) * | 2012-12-10 | 2013-03-13 | 中国科学院自动化研究所 | 一种用于语音识别的Ngram模型改进方法 |
CN105139864A (zh) * | 2015-08-17 | 2015-12-09 | 北京天诚盛业科技有限公司 | 语音识别方法和装置 |
CN105895081A (zh) * | 2016-04-11 | 2016-08-24 | 苏州思必驰信息科技有限公司 | 一种语音识别解码的方法及装置 |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6315980B2 (ja) * | 2013-12-24 | 2018-04-25 | 株式会社東芝 | デコーダ、デコード方法およびプログラム |
US9530404B2 (en) * | 2014-10-06 | 2016-12-27 | Intel Corporation | System and method of automatic speech recognition using on-the-fly word lattice generation with word histories |
2016
- 2016-04-11: CN application CN201610221182.4A filed → CN105895081A (active, pending)
- 2016-05-06: EP application EP16894814.9A filed → EP3444806A4 (not active, withdrawn)
- 2016-05-06: WO application PCT/CN2016/081334 filed → WO2017177484A1 (active, application filing)
- 2016-05-06: US application US15/562,173 filed → US20190057685A1 (not active, abandoned)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102968989A (zh) * | 2012-12-10 | 2013-03-13 | 中国科学院自动化研究所 | 一种用于语音识别的Ngram模型改进方法 |
CN105139864A (zh) * | 2015-08-17 | 2015-12-09 | 北京天诚盛业科技有限公司 | 语音识别方法和装置 |
CN105895081A (zh) * | 2016-04-11 | 2016-08-24 | 苏州思必驰信息科技有限公司 | 一种语音识别解码的方法及装置 |
Non-Patent Citations (4)
Title |
---|
"AN EMPIRICAL EXPLORATION OF CTC ACOUSTIC MODELS", CONFERENCE:20 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP, 25 March 2016 (2016-03-25), XP032901078 * |
MARTIN WOLLMER ET AL.: "PROBABILISTIC ASR FEATURE EXTRACTION APPLYING CONTEXT-SENSITIVE CONNECTIONIST TEMPORAL CLASSIFICATION NETWORKS", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING , 2013, 31 May 2013 (2013-05-31), XP032507746, ISSN: 1520-6149 * |
See also references of EP3444806A4 * |
YAJIE MIAO ET AL.: "EESEN: END-TO-END SPEECH RECOGNITION USING DEEP RNN MODELS AND WFST-BASED DECODING", AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING, 18 October 2015 (2015-10-18), XP055287634 * |
Also Published As
Publication number | Publication date |
---|---|
CN105895081A (zh) | 2016-08-24 |
EP3444806A1 (en) | 2019-02-20 |
US20190057685A1 (en) | 2019-02-21 |
EP3444806A4 (en) | 2019-12-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2017177484A1 (zh) | 一种语音识别解码的方法及装置 | |
EP3133595B1 (en) | Speech recognition | |
CN108735201B (zh) | 连续语音识别方法、装置、设备和存储介质 | |
CN108682417B (zh) | 语音识别中的小数据语音声学建模方法 | |
WO2018227780A1 (zh) | 语音识别方法、装置、计算机设备及存储介质 | |
CN109065032B (zh) | 一种基于深度卷积神经网络的外部语料库语音识别方法 | |
US20200251097A1 (en) | Named entity recognition method, named entity recognition equipment and medium | |
JP7070894B2 (ja) | 時系列情報の学習システム、方法およびニューラルネットワークモデル | |
US20220262352A1 (en) | Improving custom keyword spotting system accuracy with text-to-speech-based data augmentation | |
EP3469582A1 (en) | Neural network-based voiceprint information extraction method and apparatus | |
WO2021047233A1 (zh) | 一种基于深度学习的情感语音合成方法及装置 | |
CN104575497B (zh) | 一种声学模型建立方法及基于该模型的语音解码方法 | |
CN109509470A (zh) | 语音交互方法、装置、计算机可读存储介质及终端设备 | |
CN112259089B (zh) | 语音识别方法及装置 | |
EP4131255A1 (en) | Method and apparatus for decoding voice data, computer device and storage medium | |
CN103117060A (zh) | 用于语音识别的声学模型的建模方法、建模系统 | |
WO2016099301A1 (en) | System and method of automatic speech recognition using parallel processing for weighted finite state transducer-based speech decoding | |
CN111694940A (zh) | 一种用户报告的生成方法及终端设备 | |
CN104143331A (zh) | 一种添加标点的方法和系统 | |
WO2017166625A1 (zh) | 用于语音识别的声学模型训练方法、装置和电子设备 | |
CN112397053B (zh) | 语音识别方法、装置、电子设备及可读存储介质 | |
CN115312033A (zh) | 基于人工智能的语音情感识别方法、装置、设备及介质 | |
Khursheed et al. | Tiny-crnn: Streaming wakeword detection in a low footprint setting | |
CN109934347A (zh) | 扩展问答知识库的装置 | |
CN113763939B (zh) | 基于端到端模型的混合语音识别系统及方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
REEP | Request for entry into the european phase |
Ref document number: 2016894814 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 16894814 Country of ref document: EP Kind code of ref document: A1 |