WO2017114172A1 - Method and apparatus for constructing a pronunciation dictionary - Google Patents

Method and apparatus for constructing a pronunciation dictionary

Info

Publication number
WO2017114172A1
WO2017114172A1 (PCT/CN2016/110125)
Authority
WO
WIPO (PCT)
Prior art keywords
pronunciation
phoneme sequence
target vocabulary
candidate
vocabulary
Prior art date
Application number
PCT/CN2016/110125
Other languages
English (en)
French (fr)
Inventor
王志铭
李晓辉
李宏言
Original Assignee
阿里巴巴集团控股有限公司
王志铭
李晓辉
李宏言
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司, 王志铭, 李晓辉, 李宏言 filed Critical 阿里巴巴集团控股有限公司
Publication of WO2017114172A1 publication Critical patent/WO2017114172A1/zh


Classifications

    • G: Physics
    • G10: Musical instruments; acoustics
    • G10L: Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
    • G10L 13/00: Speech synthesis; text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; speech synthesisers
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/06: Elementary speech units used in speech synthesisers; concatenation rules
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Definitions

  • The present application relates to the field of computer technology, and in particular to a method and an apparatus for constructing a pronunciation dictionary.
  • Voice interaction technology first appeared in the middle of the twentieth century. In recent years, with the popularization of smartphones, a large number of voice interaction products have emerged one after another, and such products have entered the daily life of ordinary users.
  • For example, a voice input method receives and recognizes the speech uttered by the user and converts it into text, eliminating tedious typing; a caller-announcement feature can output text as speech, so that the user can learn the caller's identity without looking at the screen.
  • The pronunciation dictionary is an important part of a voice interaction system and the bridge between the acoustic model and the language model; its coverage and pronunciation quality have a significant impact on the overall performance of the system.
  • A pronunciation dictionary contains mappings between words and pronunciation phoneme sequences, and such mappings can usually be established with the Grapheme-to-Phoneme (G2P) conversion method.
  • In general, a pronunciation dictionary is reviewed and corrected by experts in linguistics and is relatively fixed in size, so it cannot cover all vocabulary. In practical applications, the G2P method may therefore be used, as needed, to determine the pronunciation phoneme sequence matching a newly added word, i.e. the word's correct pronunciation; the existing pronunciation dictionary is then expanded with the new word and its matched phoneme sequence.
  • The embodiments of the present application provide a method for constructing a pronunciation dictionary, to solve the problem that pronunciation dictionaries constructed according to the prior art are of poor quality.
  • The embodiments of the present application further provide an apparatus for constructing a pronunciation dictionary, to solve the same problem.
  • A method for constructing a pronunciation dictionary, comprising: inputting speech acoustic features of a target vocabulary into a speech recognition decoder, wherein the pronunciation dictionary in the speech recognition decoder includes the target vocabulary and candidate pronunciation phoneme sequences of the target vocabulary; determining, according to the candidate pronunciation phoneme sequences output by the decoder with the speech acoustic features as input, a probability distribution of the target vocabulary over the output candidate pronunciation phoneme sequences; selecting, according to the probability distribution, the pronunciation phoneme sequence that is the correct pronunciation of the target vocabulary from the output candidates; and constructing a pronunciation dictionary according to the correctly pronounced phoneme sequence.
  • A device for constructing a pronunciation dictionary, comprising:
  • a decoding unit, configured to input speech acoustic features of a target vocabulary into a speech recognition decoder, wherein the pronunciation dictionary in the speech recognition decoder includes the target vocabulary and candidate pronunciation phoneme sequences of the target vocabulary;
  • a pronunciation determining unit, configured to determine, according to the candidate pronunciation phoneme sequences output by the speech recognition decoder with the speech acoustic features as input, a probability distribution of the target vocabulary over the output candidate phoneme sequences, and to select, according to the probability distribution, the pronunciation phoneme sequence that is the correct pronunciation of the target vocabulary from the output candidates;
  • a dictionary construction unit, configured to construct a pronunciation dictionary according to the correctly pronounced phoneme sequence.
  • Because the speech acoustic features of the target word whose pronunciation is to be predicted are introduced as one basis for predicting the word's correct pronunciation, rather than relying only on the mapping between words and phoneme sequences, the correct pronunciation of the target vocabulary can be predicted more accurately, and the quality of the pronunciation dictionary constructed from the determined correct pronunciation is improved.
  • FIG. 1 is a schematic flowchart of implementing a method for constructing a pronunciation dictionary according to an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a device for constructing a pronunciation dictionary according to an embodiment of the present invention.
  • The existing pronunciation prediction method is usually based on G2P conversion, which converts a word into a pronunciation phoneme sequence by establishing a mapping between words and pronunciation phonemes. With the G2P method, the phoneme sequence matching a regular word can basically be obtained accurately; however, because the method only uses the mapping between words (character sequences) and pronunciation phonemes, for some special words, such as words containing heteronyms (characters with multiple pronunciations), the accuracy of the phoneme sequence it determines is often lower, which affects the quality of the pronunciation dictionary.
  • Embodiment 1 of the present application provides a method for constructing a pronunciation dictionary.
  • The execution subject of the pronunciation dictionary construction method provided by the embodiments of the present application may be a server or another device different from a server. The execution subject does not limit the present application; for ease of description, the embodiments are described with a server as the execution subject.
  • For ease of description, a word and speech acoustic features that correspond to each other may be denoted as word-speech acoustic features. Similarly, a word (character sequence) and a phoneme sequence that correspond to each other can be represented in the same way, e.g. denoted as word-phoneme sequence.
  • A schematic flowchart of the method is shown in FIG. 1, and it includes the following steps:
  • Step 11: the server inputs the speech acoustic features of the target vocabulary into a speech recognition decoder embedded with a pronunciation dictionary, an acoustic model, and a language model.
  • the target vocabulary may be any vocabulary, such as a Chinese vocabulary, an English vocabulary or a vocabulary of other languages.
  • the target vocabulary may refer to a vocabulary that is not currently included in the pronunciation dictionary, that is, a new vocabulary relative to the pronunciation dictionary.
  • The speech acoustic features of the target vocabulary described in the embodiments of the present application may include, but are not limited to, at least one of Filter Bank features, MFCC (Mel Frequency Cepstrum Coefficient) features, PLP (Perceptual Linear Predictive) features, and the like, extracted from a speech signal produced by speaking the target vocabulary.
  • the voice signal may be, for example, an audio sample corresponding to the target vocabulary.
  • The audio samples corresponding to the target vocabulary can be obtained by, but not limited to, at least one of the following methods:
  • 1. Commissioning a professional speech-data supplier to make manual recordings, thereby obtaining audio samples corresponding to the target vocabulary;
  • 2. Crowdsourcing: the recording task is entrusted, freely and voluntarily, to a non-specific (and usually large) online crowd, thereby obtaining audio samples corresponding to the target vocabulary;
  • 3. Analyzing logs of user feedback. For example, the user first enters the target vocabulary by voice; if the speech recognition system recognizes it incorrectly and the user then types the correct target vocabulary on the keyboard, this series of behaviors can be recorded in the form of logs.
  • the speech acoustic features may be respectively obtained from the audio samples corresponding to the target vocabulary, and the obtained speech acoustic features are respectively input as the speech acoustic features of the target vocabulary into the speech recognition decoder.
  • The operation of the speech recognition decoder mentioned in step 11 is further described below.
  • Generally, a speech recognition decoder is a virtual or physical device that, for an input speech signal (or speech acoustic features), searches, based on an acoustic model, a language model, and a pronunciation dictionary, for the word that could have produced that speech signal (or a speech signal matching those acoustic features) with maximum probability.
  • The goal of decoding a speech signal is to find the word sequence W* (corresponding to the "word" described above) that maximizes the likelihood of the corresponding speech acoustic features X. This is essentially a machine learning problem based on the Bayes criterion, i.e. the Bayes formula is used to compute the optimal word sequence W*, as shown in formula [1.1]:
  • W* = argmax_{W_i} P(W_i | X) = argmax_{W_i} P(X | W_i) P(W_i)        [1.1]
  • where P(X | W_i) is the acoustic model and P(W_i) is the language model.
  • the acoustic model is the probability that the speech acoustic characteristic of the word sequence W i is X.
  • Acoustic models can generally be trained using a large amount of data, including speech acoustic features and corresponding tag sequences.
  • the language model is the probability of occurrence of the word sequence W i corresponding to the vocabulary.
  • the meaning of the probability of occurrence is generally: the probability that each word constituting a vocabulary appears in order according to the order in which the respective words are arranged in the vocabulary.
  • Considering that a word sequence generally corresponds to different pronunciation phoneme sequences (for example, the pronunciation of a certain word, representable by a word sequence, in different regional accents may correspond to different phonemes, and a word containing heteronyms may also correspond to different phonemes), if Q_i^j is assumed to be the j-th pronunciation phoneme sequence corresponding to the word sequence W_i, then formula [1.1] can be changed to:
  • W* = argmax_{W_i} max_j P(X | Q_i^j) P(Q_i^j | W_i) P(W_i)        [1.2]
  • where W_i is a word sequence; P(X | Q_i^j) is the acoustic model; P(W_i) is the language model; and P(Q_i^j | W_i) is the probability that the pronunciation phoneme sequence of the word (represented by W_i) in the pronunciation dictionary is Q_i^j.
  • For the pronunciation-learning problem, further assuming that the word sequence W_i and the corresponding speech acoustic features X are known, the calculation target of formula [1.2] can be converted into finding the best pronunciation phoneme sequence Q* corresponding to W_i. Formula [1.2] can then be further changed to:
  • Q* = argmax_{Q_i^j} P(X | Q_i^j) P(Q_i^j | W_i)        [1.3]
  • In formula [1.3]: Q* is the pronunciation phoneme sequence that maximizes the value on the right-hand side of the equals sign, i.e. the maximum of the probability distribution over the candidate phoneme sequences corresponding to the word sequence W_i; W_i is a word sequence and i is the index of the word; X denotes the speech acoustic features corresponding to W_i; Q denotes a pronunciation phoneme sequence and j is the index of the phoneme sequence, so Q_i^j is the j-th phoneme sequence among those corresponding to word i; and P(X | Q_i^j) is the acoustic model, i.e. the probability that the speech acoustic features corresponding to Q_i^j are X.
  • The acoustic model used in current speech recognition technology is generally obtained by training a Hidden Markov Model-Deep Neural Network (HMM-DNN) hybrid model, or may be obtained by training a DNN model. In the embodiments of the present application, the HMM-DNN hybrid model or the DNN model can be trained in advance on a large number of speech acoustic features to obtain the acoustic model, which is set in the speech recognition decoder described in the embodiments.
  • P(W_i) is the language model. The language model in this embodiment may be an N-Gram model, based on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and is unrelated to any other word. The probability of the whole sentence is then the product of the occurrence probabilities of the individual words, and each occurrence probability can be obtained by counting, from a corpus, the number of times the N words appear together.
  • the language model in this embodiment may also be a language model based on a conditional random field or a deep neural network based strategy.
  • the language model can be pre-generated and set in the speech recognition decoder described in the embodiment of the present application.
  • P(Q_i^j | W_i) is the probability that the pronunciation phoneme sequence of the word (represented by the word sequence W_i) in the given pronunciation dictionary is Q_i^j.
  • The pronunciation dictionary mentioned here may be, for example, a pronunciation dictionary to which each candidate phoneme sequence corresponding to the target vocabulary has been added.
  • A candidate pronunciation phoneme sequence of the target vocabulary is a pronunciation phoneme sequence that may be the correct pronunciation of the target vocabulary. The G2P method may be used (though not exclusively) to generate pronunciation phoneme sequences for the target vocabulary (called "candidate pronunciation phoneme sequences" in the embodiments of the present application), and the target vocabulary together with each generated candidate sequence is added to the pronunciation dictionary.
  • Adding the target vocabulary and each generated candidate phoneme sequence to the pronunciation dictionary may mean adding an entry containing target vocabulary-candidate phoneme sequence to the pronunciation dictionary.
  • When no pronunciation dictionary currently exists, adding the entry to the pronunciation dictionary may mean constructing a pronunciation dictionary from the entry; when a pronunciation dictionary already exists, adding the entry may mean updating the existing dictionary according to the entry to obtain an updated pronunciation dictionary.
  • For ease of description, the embodiments assume a pronunciation dictionary already exists. In that scenario, the target vocabulary is a new word relative to the currently existing pronunciation dictionary.
  • the number of corresponding candidate phoneme sequences generated for the target vocabulary depends on the actual situation.
  • more than ten candidate phoneme sequences can be generated for the target vocabulary "Alibaba." Taking one of the pronunciation phoneme sequences as an example, it can be expressed as "a1/li3/ba1/ba1/".
  • the symbol "/" is used to distinguish different phonemes, that is, the symbols before and after the "/" indicate different phonemes.
  • a1 and li3 are different phonemes.
  • The digit in a phoneme denotes the tone: 1 denotes the first tone, 2 the second tone, 3 the third tone, and 4 the fourth tone.
  • Based on a speech recognition decoder embedded with the above pronunciation dictionary, the acoustic model P(X | Q_i^j) shown in formula [1.3], and the language model P(W_i), inputting the speech acoustic features of the target vocabulary into the decoder can trigger it to decode the acoustic features of the speech samples and output the pronunciation phoneme sequences corresponding to those acoustic features.
  • Step 12: determine the candidate pronunciation phoneme sequences output by the speech recognition decoder with the speech acoustic features described in step 11 as input; determine, according to the statistics of the target vocabulary over the output candidate pronunciation phoneme sequences, the probability distribution of the target vocabulary over the output candidate pronunciation phoneme sequences; and select, according to the probability distribution, the pronunciation phoneme sequence that is the correct pronunciation of the target vocabulary from the output candidates.
  • For example, suppose there are two candidate phoneme sequences corresponding to the target vocabulary T, namely A1A2 and B1B2, and that they have been added to the pronunciation dictionary included in the speech recognition decoder. Suppose further that 100 audio samples of T have been collected, so that the speech acoustic features of each of the 100 samples (100 sets in total) can be obtained; by performing step 11, the 100 sets of speech acoustic features are input, one by one, into the speech recognition decoder embedded with the pronunciation dictionary, the acoustic model, and the language model.
  • The speech recognition decoder then recognizes and decodes the 100 sets of speech acoustic features and can output candidate pronunciation phoneme sequences, such as combinations of A1, A2, B1, and B2. Suppose, for example, that 75 of the 100 sets are mapped to T through the dictionary entry "T-A1A2" and 25 through the entry "T-B1B2"; the resulting probability distribution is then 0.75 for A1A2 and 0.25 for B1B2.
  • the server may determine the candidate phoneme sequence corresponding to the maximum probability value in the probability distribution as the pronunciation phoneme sequence of the correct pronunciation of the target vocabulary.
  • Continuing the example, the server may determine the candidate pronunciation phoneme sequence A1A2, which corresponds to the maximum probability value 0.75 in the probability distribution, as the pronunciation phoneme sequence of the correct pronunciation of T.
  • Step 13 Construct a pronunciation dictionary based on the pronunciation phoneme sequence that is correctly pronounced as the target vocabulary.
  • the server may delete other candidate phoneme sequences corresponding to the target vocabulary other than the pronunciation phoneme sequence that is correctly pronounced as the target vocabulary, for example, from the pronunciation dictionary in which each candidate phoneme sequence corresponding to the target vocabulary is added.
  • the server may reconstruct a new pronunciation dictionary based on the pronunciation phoneme sequence that is correctly pronounced as the target vocabulary.
  • With the method of Embodiment 1 of the present application, because the speech acoustic features of the target word whose pronunciation is to be predicted are introduced as one basis for predicting the word's correct pronunciation, compared with the prior art that relies only on the mapping between words and pronunciation phoneme sequences, the correct pronunciation of the target vocabulary can be predicted more accurately, thereby improving the quality of the pronunciation dictionary.
  • The embodiments of the present application further provide a device for constructing a pronunciation dictionary. A schematic structural diagram of the device is shown in FIG. 2, and it mainly includes the following functional units:
  • a decoding unit 21 configured to input a speech acoustic feature of the target vocabulary into the speech recognition decoder; wherein the pronunciation dictionary in the speech recognition decoder includes: a candidate pronunciation phoneme sequence of the target vocabulary and the target vocabulary;
  • a pronunciation determining unit 22 configured to determine, according to the candidate pronunciation phoneme sequence output by the speech recognition decoder with the speech acoustic feature as an input, a probability distribution of the target vocabulary corresponding to the output candidate phoneme sequence; a probability distribution, selecting, from the output candidate phoneme sequence, a pronunciation phoneme sequence that is the correct pronunciation of the target vocabulary;
  • the dictionary construction unit 23 is configured to construct a pronunciation dictionary according to the correctly pronounced pronunciation phoneme sequence.
  • the apparatus provided by the embodiment of the present application may further include a phoneme sequence processing unit.
  • the unit is configured to obtain a candidate phoneme sequence of the target vocabulary before inputting the phonetic acoustic feature of the target vocabulary into the speech recognition decoder; and adding the target vocabulary and the obtained candidate phoneme sequence to the speech recognition decoder In the pronunciation dictionary.
  • the phoneme sequence processing unit may be specifically configured to obtain a candidate phoneme sequence of the target vocabulary by using the G2P method.
  • the decoding unit 21 may be specifically configured to collect audio samples corresponding to the target vocabulary; obtain the speech acoustic features according to the audio samples; and input the obtained acoustic acoustic features into the In the speech recognition decoder.
  • The pronunciation determining unit 22 may be specifically configured to determine the maximum probability value in the probability distribution, and to select from the output candidate phoneme sequences the candidate corresponding to the maximum probability value as the pronunciation phoneme sequence of the correct pronunciation of the target vocabulary.
  • The dictionary construction unit 23 may be specifically configured to delete, according to the pronunciation phoneme sequence that is the correct pronunciation of the target vocabulary, from the pronunciation dictionary to which the target vocabulary and the obtained candidate pronunciation phoneme sequences were added, the candidate phoneme sequences corresponding to the target vocabulary other than the correctly pronounced one.
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • The computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Abstract

A method and apparatus for constructing a pronunciation dictionary, used to solve the problem that pronunciation dictionaries constructed according to the prior art are of poor quality. The method includes: inputting speech acoustic features of a target vocabulary into a speech recognition decoder (12), where the pronunciation dictionary in the speech recognition decoder includes the target vocabulary and candidate pronunciation phoneme sequences of the target vocabulary; determining, from the candidate pronunciation phoneme sequences output by the decoder, a probability distribution of the target vocabulary over the output candidate pronunciation phoneme sequences; selecting from the output candidates, according to the probability distribution, the pronunciation phoneme sequence that is the correct pronunciation of the target vocabulary (13); and constructing a pronunciation dictionary according to the correctly pronounced phoneme sequence (14).

Description

Method and apparatus for constructing a pronunciation dictionary

Technical field

The present application relates to the field of computer technology, and in particular to a method and an apparatus for constructing a pronunciation dictionary.

Background art

Voice interaction technology first appeared in the middle of the twentieth century. In recent years, with the popularization of smartphones, a large number of voice interaction products have emerged one after another, and voice interaction products have entered the daily life of ordinary users. For example, a voice input method receives and recognizes the speech uttered by the user and converts it into text, eliminating tedious typing; a caller-announcement feature can output text as speech, so that the user can learn the caller's identity without looking at the screen.

In voice interaction technology, the pronunciation dictionary is an important component of a voice interaction system and the bridge connecting the acoustic model and the language model; its coverage and pronunciation quality have a major impact on the overall performance of the system.

A pronunciation dictionary contains mappings between words and pronunciation phoneme sequences, and such mappings can usually be established with the Grapheme-to-Phoneme (G2P) conversion method. In general, a pronunciation dictionary has been reviewed and corrected by experts in linguistics and is relatively fixed in size, so it cannot cover all vocabulary. In practical applications, the G2P method may therefore be used, as needed, to determine the pronunciation phoneme sequence matching a newly added word, i.e. to determine the word's correct pronunciation; the existing pronunciation dictionary is then expanded according to the new word and its matched phoneme sequence.

At present, the G2P method can basically determine the correct pronunciation of regular words accurately. However, for some special words, such as words containing heteronyms (characters with multiple pronunciations), the accuracy of the correct pronunciation determined by this method is often low, which affects the quality of the pronunciation dictionary.
Summary of the invention

The embodiments of the present application provide a method for constructing a pronunciation dictionary, to solve the problem that pronunciation dictionaries constructed according to the prior art are of poor quality.

The embodiments of the present application further provide an apparatus for constructing a pronunciation dictionary, to solve the same problem.

The embodiments of the present application adopt the following technical solutions:

A method for constructing a pronunciation dictionary, comprising:

inputting speech acoustic features of a target vocabulary into a speech recognition decoder, wherein the pronunciation dictionary in the speech recognition decoder includes the target vocabulary and candidate pronunciation phoneme sequences of the target vocabulary;

determining, according to the candidate pronunciation phoneme sequences output by the speech recognition decoder with the speech acoustic features as input, a probability distribution of the target vocabulary over the output candidate pronunciation phoneme sequences;

selecting, according to the probability distribution, the pronunciation phoneme sequence that is the correct pronunciation of the target vocabulary from the output candidate pronunciation phoneme sequences; and

constructing a pronunciation dictionary according to the correctly pronounced pronunciation phoneme sequence.

An apparatus for constructing a pronunciation dictionary, comprising:

a decoding unit, configured to input speech acoustic features of a target vocabulary into a speech recognition decoder, wherein the pronunciation dictionary in the speech recognition decoder includes the target vocabulary and candidate pronunciation phoneme sequences of the target vocabulary;

a pronunciation determining unit, configured to determine, according to the candidate pronunciation phoneme sequences output by the speech recognition decoder with the speech acoustic features as input, a probability distribution of the target vocabulary over the output candidate pronunciation phoneme sequences, and to select, according to the probability distribution, the pronunciation phoneme sequence that is the correct pronunciation of the target vocabulary from the output candidates; and

a dictionary construction unit, configured to construct a pronunciation dictionary according to the correctly pronounced pronunciation phoneme sequence.

At least one of the above technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects: because the speech acoustic features of the target word whose pronunciation is to be predicted are introduced as one basis for predicting the word's correct pronunciation, compared with the prior art that relies only on the mapping between words and pronunciation phoneme sequences, the correct pronunciation of the target vocabulary can be predicted more accurately, improving the quality of the pronunciation dictionary constructed from the determined correct pronunciation.
Brief description of the drawings

The drawings described here are provided for further understanding of the present application and constitute a part of it; the illustrative embodiments of the present application and their description are used to explain the present application and do not constitute an improper limitation of it. In the drawings:

FIG. 1 is a schematic flowchart of a method for constructing a pronunciation dictionary according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of an apparatus for constructing a pronunciation dictionary according to this embodiment.

Detailed description of embodiments

To make the purpose, technical solutions, and advantages of the present application clearer, the technical solutions are described clearly and completely below with reference to the specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the drawings.
Embodiment 1

Existing pronunciation prediction methods are usually based on G2P conversion, which converts a word into a pronunciation phoneme sequence by establishing a mapping between words and pronunciation phonemes. The G2P method can basically obtain the phoneme sequence matching a regular word accurately, but because it only uses the mapping between words (character sequences) and pronunciation phonemes, for some special words, such as words containing heteronyms, the accuracy of the matched phoneme sequence determined by this method is often low, which affects the quality of the pronunciation dictionary.

To solve the problem that the prior art cannot accurately predict the correct pronunciation of words and thus affects the quality of the pronunciation dictionary, Embodiment 1 of the present application provides a method for constructing a pronunciation dictionary.

The execution subject of the method provided by this embodiment may be a server or another device different from a server. The execution subject does not constitute a limitation of the present application; for ease of description, the embodiments are described with a server as the execution subject.

For ease of description, in this embodiment, a word and speech acoustic features that correspond to each other may be denoted as word-speech acoustic features.

Similarly, a word (character sequence) and a phoneme sequence that correspond to each other, as well as speech acoustic features and a phoneme sequence that correspond to each other, can be represented in the same way. For example, a word and a phoneme sequence that correspond to each other can be denoted as word-phoneme sequence.
The method provided by the embodiment of the present application is introduced in detail below.

A schematic flowchart of the method is shown in FIG. 1, and it includes the following steps:

Step 11: the server inputs the speech acoustic features of a target vocabulary into a speech recognition decoder embedded with a pronunciation dictionary, an acoustic model, and a language model.

In this embodiment, the target vocabulary may be any word, such as a Chinese word, an English word, or a word in another language. With respect to the pronunciation dictionary already present in the speech recognition decoder, the target vocabulary may be a word not currently contained in that dictionary, i.e. a new word relative to it.

The speech acoustic features of the target vocabulary described in this embodiment may include, but are not limited to, at least one of Filter Bank features, MFCC (Mel Frequency Cepstrum Coefficient) features, PLP (Perceptual Linear Predictive) features, and the like, extracted from a speech signal produced by speaking the target vocabulary.

In this embodiment, the speech signal may, for example, come from audio samples corresponding to the target vocabulary.

Audio samples corresponding to the target vocabulary can be obtained by, but not limited to, at least one of the following approaches:

1. Commissioning a professional speech-data supplier to make manual recordings, thereby obtaining audio samples corresponding to the target vocabulary;

2. Crowdsourcing: starting from users' real usage experience, the recording task is entrusted, on a free and voluntary basis, to a non-specific (and usually large) online crowd, thereby obtaining audio samples corresponding to the target vocabulary;

3. Analyzing logs of user feedback, thereby obtaining audio samples corresponding to the target vocabulary. For example, in a voice search task, a user first enters the target vocabulary by voice; if the speech recognition system recognizes it incorrectly and the user then types the correct target vocabulary on the keyboard, this series of behaviors can be recorded in the form of logs.

In this embodiment, speech acoustic features may be obtained separately from the audio samples corresponding to the target vocabulary, and the obtained features are then input, as the speech acoustic features of the target vocabulary, into the speech recognition decoder.
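As a rough illustration of the feature-extraction step described above, the following is a minimal NumPy-only sketch of log mel filter-bank ("Filter Bank") features. All numeric choices (16 kHz sample rate, 25 ms frames with 10 ms hop, 24 mel bands) are illustrative assumptions, not values from the patent, and a real system would typically use a dedicated toolkit instead.

```python
import numpy as np

def log_mel_filterbank(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=24):
    """Compute log mel filter-bank features from a raw waveform (toy sketch)."""
    # Split the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular mel filters spanning 0 Hz .. sr/2.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    # Log energy in each mel band: one feature vector per frame.
    return np.log(power @ fbank.T + 1e-10)

# One second of a synthetic 440 Hz tone stands in for an audio sample of the word.
t = np.arange(16000) / 16000.0
feats = log_mel_filterbank(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (98, 24): 98 frames, 24 mel bands
```

Each row of the returned matrix is one frame's feature vector; in the method above, such per-sample feature matrices are what gets fed to the speech recognition decoder.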
The working principle of the speech recognition decoder mentioned in step 11 is further introduced below.

Generally, a speech recognition decoder is a virtual or physical device that, for an input speech signal (or speech acoustic features), searches, according to an acoustic model, a language model, and a pronunciation dictionary, for the word that could have produced that speech signal (or a speech signal matching those acoustic features) with maximum probability.
In the field of speech recognition, the goal of decoding a speech signal is to find the word sequence W* (corresponding to the "word" described above) that maximizes the likelihood of the corresponding speech acoustic features X. This is essentially a machine-learning problem based on the Bayes criterion, i.e. the Bayes formula is used to compute the optimal word sequence W*, as shown in formula [1.1]:

W* = argmax_{W_i} P(W_i | X) = argmax_{W_i} P(X | W_i) P(W_i)        [1.1]

where P(X | W_i) is the acoustic model and P(W_i) is the language model.

The acoustic model is the probability that the speech acoustic features of word sequence W_i are X. An acoustic model can generally be trained on a large amount of data, including speech acoustic features and the corresponding label sequences.

The language model is the probability of occurrence of the word sequence W_i corresponding to a word. This probability generally means the probability that the characters composing the word appear in sequence, in the order in which they are arranged in the word.

Considering that a character sequence generally corresponds to different pronunciation phoneme sequences (for example, the pronunciation of a certain word, representable by a character sequence, in different regional accents may correspond to different phonemes, and a word containing heteronyms may also correspond to different phonemes), if Q_i^j denotes the j-th pronunciation phoneme sequence corresponding to word sequence W_i, then formula [1.1] can be changed to:

W* = argmax_{W_i} max_j P(X | Q_i^j) P(Q_i^j | W_i) P(W_i)        [1.2]

where W_i is a character sequence; P(X | Q_i^j) is the acoustic model; P(W_i) is the language model; and P(Q_i^j | W_i) is the probability that the pronunciation phoneme sequence of the dictionary word (represented by character sequence W_i) is Q_i^j.
For the pronunciation-learning problem, if it is further assumed that the character sequence W_i and the corresponding speech acoustic features X are known, the computation target of formula [1.2] can be converted into finding the best pronunciation phoneme sequence Q* corresponding to W_i. Formula [1.2] then further becomes:

Q* = argmax_{Q_i^j} P(X | Q_i^j) P(Q_i^j | W_i)        [1.3]

In formula [1.3]:

Q* is the pronunciation phoneme sequence that maximizes the value on the right-hand side of the equals sign, i.e. the maximum of the probability distribution over the candidate pronunciation phoneme sequences corresponding to character sequence W_i;

W_i is a character sequence, and i is the index of the word;

X denotes the speech acoustic features corresponding to W_i;

Q denotes a pronunciation phoneme sequence, and j is the index of the pronunciation phoneme sequence, so Q_i^j denotes the j-th pronunciation phoneme sequence among those corresponding to word i;

P(X | Q_i^j) is the acoustic model, i.e. the probability that the speech acoustic features corresponding to pronunciation phoneme sequence Q_i^j are X.
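The selection in formula [1.3] can be sketched numerically. All probability values below are made-up illustrative numbers, not model outputs; the point is only that Q* is the candidate maximizing the product of the acoustic score and the dictionary prior.

```python
# Toy illustration of formula [1.3]: with the word sequence W_i and its acoustic
# features X fixed, pick the candidate phoneme sequence Q maximizing
# P(X | Q) * P(Q | W_i).
acoustic = {"a1/li3/ba1/ba1/": 0.6,   # P(X | Q): acoustic model score (assumed)
            "a4/li3/ba1/ba1/": 0.2}
lexical = {"a1/li3/ba1/ba1/": 0.7,    # P(Q | W_i): dictionary prior (assumed)
           "a4/li3/ba1/ba1/": 0.3}

q_star = max(acoustic, key=lambda q: acoustic[q] * lexical[q])
print(q_star)  # a1/li3/ba1/ba1/
```

Here 0.6 * 0.7 = 0.42 beats 0.2 * 0.3 = 0.06, so the first candidate is selected as Q*.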
At present, the acoustic model used in related speech recognition technology is generally obtained by training a Hidden Markov Model-Deep Neural Network (HMM-DNN) hybrid model, or may be obtained by training a DNN model. In this embodiment, the HMM-DNN hybrid model or the DNN model can be trained in advance on massive speech acoustic features to obtain the acoustic model, which is then set in the speech recognition decoder described in this embodiment.

P(W_i) is the language model. The language model in this embodiment may be an N-Gram model, which is based on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and is unrelated to any other word; the probability of the whole sentence is then the product of the occurrence probabilities of the individual words, and each occurrence probability can be obtained by directly counting, from a corpus, the number of times the N words appear together. The language model in this embodiment may also be a language model based on conditional random fields or on a deep neural network strategy. The language model can be generated in advance and set in the speech recognition decoder described in this embodiment.
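The count-based N-Gram estimate described above can be sketched for N=2 (a bigram model). The three-sentence corpus is a toy stand-in, not data from the patent.

```python
from collections import Counter

# Minimal bigram (N=2) language model estimated by counting:
# P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1} as a context),
# and the sentence probability is the product of the per-word probabilities.
corpus = [["voice", "input", "method"],
          ["voice", "input", "system"],
          ["voice", "search", "task"]]

# Count each word's occurrences as a bigram context (all but the last word).
unigrams = Counter(w for s in corpus for w in s[:-1])
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

def sentence_prob(sentence):
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(sentence_prob(["voice", "input", "method"]))  # (2/3) * (1/2) = 1/3
```

A production language model would add smoothing for unseen N-grams; that is omitted here for brevity.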
P(Q_i^j | W_i) is the probability that the pronunciation phoneme sequence of a word (represented by character sequence W_i) in the given pronunciation dictionary is Q_i^j.
The pronunciation dictionary mentioned here may, for example, be a pronunciation dictionary to which each candidate pronunciation phoneme sequence corresponding to the target vocabulary has been added.

A candidate pronunciation phoneme sequence of the target vocabulary is a pronunciation phoneme sequence that may be the correct pronunciation of the target vocabulary. In this embodiment, the G2P method may be used (though not exclusively) to generate pronunciation phoneme sequences for the target vocabulary (called "candidate pronunciation phoneme sequences" in this embodiment), and the target vocabulary together with each generated candidate sequence is added to the pronunciation dictionary.

Adding the target vocabulary and each generated candidate pronunciation phoneme sequence to the pronunciation dictionary may mean adding an entry containing target vocabulary-candidate pronunciation phoneme sequence to the pronunciation dictionary.

It should be noted that when no pronunciation dictionary currently exists, adding the entry to the pronunciation dictionary may mean constructing a pronunciation dictionary from the entry; when a pronunciation dictionary already exists, adding the entry may mean updating the existing dictionary according to the entry to obtain an updated pronunciation dictionary.

For ease of description, this embodiment assumes that a pronunciation dictionary already exists. In this scenario, the target vocabulary is a new word relative to the existing pronunciation dictionary.

In this embodiment, the number of candidate pronunciation phoneme sequences generated for a target vocabulary depends on the actual situation.

For example, with the G2P method, more than ten candidate pronunciation phoneme sequences can be generated for the target vocabulary "阿里巴巴" (Alibaba). Taking one of them as an example, it can be written as "a1/li3/ba1/ba1/". In this sequence, the symbol "/" separates different phonemes, i.e. the symbols before and after "/" denote different phonemes; for example, a1 and li3 are different phonemes. The digit in a phoneme denotes the tone: 1 denotes the first tone, 2 the second tone, 3 the third tone, and 4 the fourth tone.
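The phoneme-sequence notation just described can be parsed mechanically; the following short sketch splits a sequence on "/" and separates each phoneme's base from its tone digit (the helper name is ours, not the patent's).

```python
# Parsing the phoneme-sequence notation described above: "/" separates phonemes,
# and the trailing digit of each phoneme is its tone (1-4).
def parse_phoneme_sequence(seq):
    phonemes = [p for p in seq.split("/") if p]  # drop the empty trailing field
    return [(p.rstrip("0123456789"), int(p[-1])) for p in phonemes]

print(parse_phoneme_sequence("a1/li3/ba1/ba1/"))
# [('a', 1), ('li', 3), ('ba', 1), ('ba', 1)]
```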
Based on a speech recognition decoder embedded with the above pronunciation dictionary, the acoustic model P(X | Q_i^j) shown in formula [1.3], and the language model P(W_i), inputting the speech acoustic features of the target vocabulary into the decoder can, in this embodiment, trigger the decoder to decode the acoustic features of the speech samples and output the pronunciation phoneme sequences corresponding to those acoustic features.
以下进一步介绍本申请实施例提供的该方法包含的后续步骤。
步骤12:确定语音识别解码器以步骤11中所述的语音声学特征作为输入而输出的候选发音音素序列;并根据目标词汇对应于所述输出的候选发音音素序列的统计规律,确定目标词汇对应于输出的候选发音音素序列的概率分布;根据所述概率分布,从所述输出的候选发音音素序列中,选择作为目标词汇的正确发音的发音音素序列;
比如,若假定目标词汇T对应的候选发音音素序列有2个,分别为A1A2和B1B2,且它们被添加到语音识别解码器包含的发音词典中。进一步地,若假设采集到的T的音频样本有100个,从而可以获得这100个音频样本各自的语音声学特征(共100个语音声学特征),通过执行步骤11,将这100个语音声学特征分别输入到嵌入发音词典、声学模型和语言模型的语音识别解码器中。
那么,语音识别解码器对这100个语音声学特征进行识别解码,可以 输出候选发音音素序列,如输出A1、A2、B1、B2的组合。
进一步地,假设根据设置于该语音识别解码器中的发音词典,确定目标词汇对应于所述输出的候选发音音素序列的统计规律为:
这100个语音声学特征中:有75个语音声学特征是通过发音词典的词条“T-A1A2”映射到T,有25个语音声学特征是通过发音词典的词条“T-B1B2”映射到T。
那么,根据该统计规律,可以得到如下概率分布:
T对应于A1A2的概率为75/100=0.75
T对应于B1B2的概率为25/100=0.25
In general, the server may take the candidate pronunciation phoneme sequence with the largest probability in the distribution as the phoneme sequence of the target word's correct pronunciation.
Continuing the example, the server may take A1A2, the candidate sequence with the largest probability 0.75, as the phoneme sequence of T's correct pronunciation.
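The counting and argmax selection of step 12 can be sketched as follows, representing the decoder's per-sample outputs as a plain list of sequence strings (the helper name and data representation are illustrative assumptions):

```python
from collections import Counter

def select_pronunciation(decoded_sequences):
    """decoded_sequences: the candidate phoneme sequence the decoder mapped
    each audio sample of the target word to.

    Returns (best sequence, probability distribution over sequences).
    """
    counts = Counter(decoded_sequences)
    total = sum(counts.values())
    dist = {seq: n / total for seq, n in counts.items()}
    best = max(dist, key=dist.get)   # the largest probability wins
    return best, dist
```

Feeding in 75 occurrences of "A1A2" and 25 of "B1B2" reproduces the 0.75/0.25 distribution of the example and selects A1A2.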
Step 13: Construct the pronunciation dictionary from the phoneme sequence that is the correct pronunciation of the target word.
Specifically, the server may, for example, delete from the pronunciation dictionary (to which the candidate sequences of the target word were added) every candidate sequence of the target word other than the one that is its correct pronunciation. Alternatively, the server may construct a new pronunciation dictionary from the correct-pronunciation phoneme sequence.
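The pruning variant of step 13 can be sketched as below, assuming the dictionary is held as a mapping from word to its list of candidate sequences (the representation and function name are illustrative, not prescribed by the application):

```python
def prune_dictionary(lexicon, word, correct_seq):
    """Keep only the confirmed correct pronunciation for `word`.

    lexicon: dict mapping word -> list of candidate phoneme sequences.
    Entries for other words are left untouched.
    """
    if word in lexicon:
        lexicon[word] = [s for s in lexicon[word] if s == correct_seq]
    return lexicon
```

After pruning, looking up the target word yields only the phoneme sequence selected in step 12.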
With the method provided in Embodiment 1 of this application, the speech acoustic features of the target word whose pronunciation is to be predicted are introduced as one of the bases for predicting the word's correct pronunciation. Compared with the prior art, which relies only on the mapping between words and phoneme sequences as the basis for prediction, the correct pronunciation of the target word can therefore be predicted more accurately, improving the quality of the pronunciation dictionary.
Embodiment 2
To solve the problem that the prior art yields phoneme sequences of low accuracy for matching words, an embodiment of this application provides an apparatus for constructing a pronunciation dictionary. A schematic structural diagram of this word-pronunciation prediction apparatus is shown in Figure 3; it mainly comprises the following functional units:
a decoding unit 21, configured to input the speech acoustic features of a target word into a speech recognition decoder, where the pronunciation dictionary in the decoder includes the target word and the target word's candidate pronunciation phoneme sequences;
a pronunciation determining unit 22, configured to determine, from the candidate pronunciation phoneme sequences output by the decoder with the speech acoustic features as input, the probability distribution of the target word over the output candidate sequences, and to select from them, according to that distribution, the phoneme sequence that is the correct pronunciation of the target word;
a dictionary building unit 23, configured to construct a pronunciation dictionary from the correct-pronunciation phoneme sequence.
In one implementation, the apparatus provided by the embodiments of this application may further include a phoneme sequence processing unit, configured to obtain the candidate pronunciation phoneme sequences of the target word before its speech acoustic features are input into the speech recognition decoder, and to add the target word and the obtained candidate sequences to the pronunciation dictionary in the decoder.
In one implementation, the phoneme sequence processing unit may specifically be configured to obtain the candidate pronunciation phoneme sequences of the target word using a G2P method.
In one implementation, the decoding unit 21 may specifically be configured to collect audio samples corresponding to the target word, obtain the speech acoustic features from those audio samples, and input the obtained features into the speech recognition decoder.
In one implementation, the pronunciation determining unit 22 may specifically be configured to determine the largest probability value in the distribution, and to select, from the output candidate sequences, the candidate sequence corresponding to that largest value as the phoneme sequence of the target word's correct pronunciation.
In one implementation, the dictionary building unit 23 may specifically be configured to delete, from the pronunciation dictionary to which the target word and the obtained candidate sequences were added, every candidate sequence of the target word other than the correct-pronunciation phoneme sequence, according to the phoneme sequence that is the target word's correct pronunciation.
With the apparatus provided in Embodiment 2 of this application, the speech acoustic features of the target word whose pronunciation is to be predicted are introduced as one of the bases for predicting the word's correct pronunciation; compared with the prior art, which relies only on the mapping between words and phoneme sequences as the basis for prediction, the correct pronunciation of the target word can therefore be predicted more accurately.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are merely embodiments of this application and are not intended to limit it. Various modifications and variations of this application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of this application shall fall within the scope of its claims.

Claims (14)

  1. A method for constructing a pronunciation dictionary, characterized in that the method comprises:
    inputting speech acoustic features of a target word into a speech recognition decoder, wherein the pronunciation dictionary in the speech recognition decoder comprises the target word and candidate pronunciation phoneme sequences of the target word;
    determining, according to the candidate pronunciation phoneme sequences output by the speech recognition decoder with the speech acoustic features as input, a probability distribution of the target word over the output candidate pronunciation phoneme sequences;
    selecting, according to the probability distribution and from the output candidate pronunciation phoneme sequences, the pronunciation phoneme sequence that is the correct pronunciation of the target word;
    constructing a pronunciation dictionary according to the correct-pronunciation phoneme sequence.
  2. The method according to claim 1, characterized in that, before the speech acoustic features are input into the speech recognition decoder, the method further comprises:
    obtaining candidate pronunciation phoneme sequences of the target word;
    adding the target word and the obtained candidate pronunciation phoneme sequences to the pronunciation dictionary in the speech recognition decoder.
  3. The method according to claim 2, characterized in that obtaining candidate pronunciation phoneme sequences of the target word comprises:
    obtaining the candidate pronunciation phoneme sequences of the target word using a grapheme-to-phoneme (G2P) method.
  4. The method according to claim 1, characterized in that the acoustic model embedded in the speech recognition decoder is obtained by training a deep neural network.
  5. The method according to claim 1, characterized in that inputting the speech acoustic features of the target word into the speech recognition decoder comprises:
    collecting audio samples corresponding to the target word;
    obtaining the speech acoustic features from the audio samples;
    inputting the obtained speech acoustic features into the speech recognition decoder.
  6. The method according to claim 1, characterized in that selecting, according to the probability distribution and from the output candidate pronunciation phoneme sequences, the pronunciation phoneme sequence that is the correct pronunciation of the target word comprises:
    determining the largest probability value in the probability distribution;
    selecting, from the output candidate pronunciation phoneme sequences, the candidate pronunciation phoneme sequence corresponding to the largest probability value as the pronunciation phoneme sequence of the target word's correct pronunciation.
  7. The method according to any one of claims 1 to 6, characterized in that constructing a pronunciation dictionary according to the correct-pronunciation phoneme sequence comprises:
    deleting, according to the pronunciation phoneme sequence that is the target word's correct pronunciation, from the pronunciation dictionary to which the target word and the obtained candidate pronunciation phoneme sequences were added, the candidate pronunciation phoneme sequences of the target word other than the correct-pronunciation phoneme sequence.
  8. An apparatus for constructing a pronunciation dictionary, characterized in that the apparatus comprises:
    a decoding unit, configured to input speech acoustic features of a target word into a speech recognition decoder, wherein the pronunciation dictionary in the speech recognition decoder comprises the target word and candidate pronunciation phoneme sequences of the target word;
    a pronunciation determining unit, configured to determine, according to the candidate pronunciation phoneme sequences output by the speech recognition decoder with the speech acoustic features as input, a probability distribution of the target word over the output candidate pronunciation phoneme sequences, and to select, according to the probability distribution and from the output candidate pronunciation phoneme sequences, the pronunciation phoneme sequence that is the correct pronunciation of the target word;
    a dictionary building unit, configured to construct a pronunciation dictionary according to the correct-pronunciation phoneme sequence.
  9. The apparatus according to claim 8, characterized in that the apparatus further comprises:
    a phoneme sequence processing unit, configured to obtain candidate pronunciation phoneme sequences of the target word before its speech acoustic features are input into the speech recognition decoder, and to add the target word and the obtained candidate pronunciation phoneme sequences to the pronunciation dictionary in the speech recognition decoder.
  10. The apparatus according to claim 9, characterized in that the phoneme sequence processing unit is specifically configured to:
    obtain the candidate pronunciation phoneme sequences of the target word using a grapheme-to-phoneme (G2P) method.
  11. The apparatus according to claim 8, characterized in that the acoustic model embedded in the speech recognition decoder is obtained by training a deep neural network.
  12. The apparatus according to claim 8, characterized in that:
    the decoding unit is specifically configured to collect audio samples corresponding to the target word, obtain the speech acoustic features from the audio samples, and input the obtained speech acoustic features into the speech recognition decoder.
  13. The apparatus according to claim 8, characterized in that the pronunciation determining unit is specifically configured to:
    determine the largest probability value in the probability distribution;
    select, from the output candidate pronunciation phoneme sequences, the candidate pronunciation phoneme sequence corresponding to the largest probability value as the pronunciation phoneme sequence of the target word's correct pronunciation.
  14. The apparatus according to any one of claims 8 to 13, characterized in that:
    the dictionary building unit is specifically configured to delete, according to the pronunciation phoneme sequence that is the target word's correct pronunciation, from the pronunciation dictionary to which the target word and the obtained candidate pronunciation phoneme sequences were added, the candidate pronunciation phoneme sequences of the target word other than the correct-pronunciation phoneme sequence.
PCT/CN2016/110125 2015-12-29 2016-12-15 Method and apparatus for constructing a pronunciation dictionary WO2017114172A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201511016459.1 2015-12-29
CN201511016459.1A CN106935239A (zh) 2015-12-29 Method and apparatus for constructing a pronunciation dictionary

Publications (1)

Publication Number Publication Date
WO2017114172A1 true WO2017114172A1 (zh) 2017-07-06

Family

ID=59224572

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/110125 WO2017114172A1 (zh) 2015-12-29 2016-12-15 一种发音词典的构建方法及装置

Country Status (2)

Country Link
CN (1) CN106935239A (zh)
WO (1) WO2017114172A1 (zh)



Also Published As

Publication number Publication date
CN106935239A (zh) 2017-07-07


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16880962

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16880962

Country of ref document: EP

Kind code of ref document: A1