CN105551485A - Audio file retrieval method and system - Google Patents

Audio file retrieval method and system

Info

Publication number: CN105551485A
Application number: CN201510882391.9A
Authority: CN (China)
Prior art keywords: word, file, correlation, model, text file
Other languages: Chinese (zh)
Inventors: 王建社, 柳林, 冯翔, 胡国平
Original assignee: 讯飞智元信息科技有限公司


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/3332 - Query translation
    • G06F16/3334 - Selection or weighting of terms from queries, including natural language queries
    • G06F16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 - Retrieval characterised by using metadata automatically derived from the content
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Abstract

The invention discloses an audio file retrieval method and system. The method comprises the following steps: training a user interest model corresponding to a search keyword; obtaining each audio file to be retrieved; performing speech transcription on each audio file to obtain a transcription result; obtaining, according to the transcription result, a text file corresponding to each audio file and multi-knowledge-source features of each word in the text file; re-estimating the confidence of each word using the multi-knowledge-source features, and filtering out meaningless words and sentences in the text file; calculating the relevance of each text file to the user interest model according to the confidence re-estimation results; and displaying the retrieved audio files according to the relevance. The method and system improve both the efficiency and the accuracy of audio file retrieval.

Description

Voice file retrieval method and system

Technical Field

[0001] The present invention relates to the field of speech signal processing, and in particular to a voice file retrieval method and system.

Background Art

[0002] With the continuous development of speech processing technology, technicians in more and more applications have attempted to obtain needed information from speech data, for example retrieving, from massive amounts of speech data, the voice files required for a specific application scenario. There are two main traditional approaches to retrieving useful files from a large number of voice files:

[0003] The first is to have people listen to the voice files manually and then pick out the highly relevant, useful files; this consumes a great deal of manpower and material resources and is inefficient.

[0004] The second is to first transcribe the voice files into text files and then search the text files. Because of complex noise environments, far-field conditions and other factors, the accuracy of speech transcription cannot yet be well guaranteed, so when voice files are retrieved this way the transcription results usually need to be checked manually to ensure retrieval accuracy, which likewise consumes much manpower and is inefficient.

Summary of the Invention

[0005] The present invention provides a voice file retrieval method and system to solve the problems of low efficiency and poor accuracy caused by speech transcription errors in existing voice file retrieval.

[0006] To this end, the present invention provides the following technical solutions:

[0007] A voice file retrieval method, comprising:

[0008] training a user interest model corresponding to a search keyword;

[0009] obtaining each voice file to be retrieved;

[0010] performing speech transcription on the voice file to obtain a transcription result;

[0011] obtaining, according to the transcription result, a text file corresponding to the voice file and multi-knowledge-source features of each word in the text file;

[0012] re-estimating the confidence of each word using the multi-knowledge-source features, and filtering out meaningless words and sentences in the text file;

[0013] calculating the relevance of each text file to the user interest model according to the confidence re-estimation results;

[0014] displaying information on the retrieved voice files according to the relevance.

[0015] Preferably, the search keyword is one or more search keywords entered by the user at retrieval time, or one or more keywords collected in advance from corpora of specific scenarios.

[0016] Preferably, training the user interest model corresponding to the search keyword comprises:

[0017] collecting corpora containing the search keyword;

[0018] calculating a word vector for each word in the corpora;

[0019] training a regression model with the word vectors, and using the regression model as the user interest model.

[0020] Preferably, the transcription result is in word-level confusion network format, and the confusion network stores, for each word, its time position in the voice file, acoustic model score, language model score and original confidence;

[0021] the multi-knowledge-source features comprise at least two of the following features: word posterior probability; posterior probability difference of competing words; language model score; frame-averaged acoustic model score.

[0022] Preferably, the method further comprises:

[0023] segmenting each word in the confusion network to obtain the phoneme information corresponding to the word;

[0024] the multi-knowledge-source features further comprise any one or more of the following: the phoneme posterior probability and state frame variance corresponding to each word; word position coefficient; word length; whether the word is a stop word; duration; number of competing words; short-time average energy.

[0025] Preferably, performing the confidence estimation on each word in the text file comprises:

[0026] generating a multi-dimensional feature vector for each word from the multi-knowledge-source features;

[0027] calculating the confidence of the word using a pre-trained regression model and the word's multi-dimensional feature vector.

[0028] Preferably, calculating the relevance of each text file to the user interest model according to the confidence re-estimation results comprises:

[0029] for each text file, calculating a word vector for each word in the text file;

[0030] taking each word's confidence re-estimation result as that word's weight, and computing a weighted average of the word vectors of all words appearing in the text file to obtain the vector of the text file;

[0031] calculating the relevance of the text file to the user interest model from the vector of the text file.

[0032] Preferably, displaying information on the retrieved voice files according to the relevance comprises:

[0033] displaying, in descending order of relevance, information on the voice files whose relevance is greater than a set threshold; or [0034] displaying, in descending order of relevance, information on a set number of voice files.

[0035] Preferably, the method further comprises:

[0036] setting relevance thresholds for different importance levels;

[0037] determining the importance level of each voice file according to the relevance of each text file to the user interest model and the relevance thresholds;

[0038] displaying the importance level information of the voice file when displaying the voice file information.

[0039] A voice file retrieval system, comprising:

[0040] a model training module, configured to train a user interest model corresponding to a search keyword;

[0041] a voice file acquisition module, configured to obtain each voice file to be retrieved;

[0042] a speech transcription module, configured to perform speech transcription on the voice file to obtain a transcription result;

[0043] a text file generation module, configured to obtain, according to the transcription result, a text file corresponding to the voice file;

[0044] a feature acquisition module, configured to obtain the multi-knowledge-source features of each word in the text file;

[0045] a confidence re-estimation module, configured to re-estimate the confidence of each word using the multi-knowledge-source features;

[0046] a filtering module, configured to filter out meaningless words and sentences in the text file;

[0047] a relevance calculation module, configured to calculate the relevance of each text file to the user interest model according to the confidence re-estimation results;

[0048] a display module, configured to display information on the retrieved voice files according to the relevance.

[0049] Preferably, the model training module comprises:

[0050] a corpus collection unit, configured to collect corpora containing the search keyword;

[0051] a word vector calculation unit, configured to calculate a word vector for each word in the corpora;

[0052] a training unit, configured to train a regression model with the word vectors and use the regression model as the user interest model.

[0053] Preferably, the transcription result is in word-level confusion network format, and the confusion network stores, for each word, its time position in the voice file, acoustic model score, language model score and original confidence; the multi-knowledge-source features comprise at least two of the following: word posterior probability; posterior probability difference of competing words; language model score; frame-averaged acoustic model score;

[0054] the confidence re-estimation module comprises:

[0055] a multi-dimensional feature vector generation unit, configured to generate a multi-dimensional feature vector for each word from the multi-knowledge-source features;

[0056] a confidence calculation unit, configured to calculate the confidence of the word using a pre-trained regression model and the word's multi-dimensional feature vector.

[0057] Preferably, the relevance calculation module comprises:

[0058] a word vector calculation unit, configured to calculate, for each text file, a word vector for each word in the text file;

[0059] a file vector calculation unit, configured to take each word's confidence re-estimation result as that word's weight, and compute a weighted average of the word vectors of all words appearing in the text file to obtain the vector of the text file;

[0060] a relevance calculation unit, configured to calculate the relevance of the text file to the user interest model from the vector of the text file.

[0061] Preferably, the display module is specifically configured to display, in descending order of relevance, the voice files whose relevance is greater than a set threshold, or to display a set number of voice files in descending order of relevance.

[0062] Preferably, the system further comprises:

[0063] a setting module, configured to set relevance thresholds for different importance levels;

[0064] a level determination module, configured to determine the importance level of each voice file according to the relevance of each text file to the user interest model and the relevance thresholds;

[0065] the display module is further configured to display the importance level information of the voice file when displaying the voice file information. The voice file retrieval method and system provided by the embodiments of the present invention address the fact that text files obtained by speech transcription contain a certain number of transcription errors: multi-knowledge-source features of each word in the transcribed text file are extracted, the confidence of each word is re-estimated using these features, meaningless words and sentences in the text file are filtered out, the relevance of each text file to the user interest model is calculated from the confidence re-estimation results, and the retrieved voice files are displayed according to the relevance, which effectively reduces the impact of transcription errors on file ranking. The voice file retrieval method and system of the embodiments of the present invention not only greatly improve the efficiency of voice file retrieval but also ensure the accuracy of the retrieval results.

Brief Description of the Drawings

[0066] To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Apparently, the drawings described below are merely some embodiments recorded in the present invention, and a person of ordinary skill in the art may further derive other drawings from them.

[0067] FIG. 1 is a flowchart of a voice file retrieval method according to an embodiment of the present invention;

[0068] FIG. 2 is a schematic structural diagram of a voice file retrieval system according to an embodiment of the present invention;

[0069] FIG. 3 is a schematic structural diagram of the relevance calculation module in an embodiment of the present invention;

[0070] FIG. 4 is another schematic structural diagram of a voice file retrieval system according to an embodiment of the present invention.

Detailed Description of the Embodiments

[0071] To enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the accompanying drawings and implementations.

[0072] As shown in FIG. 1, the flowchart of the voice file retrieval method according to an embodiment of the present invention includes the following steps:

[0073] Step 101: train a user interest model corresponding to a search keyword.

[0074] It should be noted that the search keyword may be one or more search keywords entered by the user at retrieval time, or one or more search keywords collected in advance from corpora of specific scenarios; the embodiments of the present invention do not limit this.

[0075] The user interest model may be a regression model, for example an SVM (Support Vector Machine) model or an RNN (Recurrent Neural Network) model. When training the regression model, existing word embedding techniques may be used to compute word vector representations of the search keywords, and the regression model may be trained dynamically together with word vectors, taken from the text to be retrieved, that are unrelated to the search terms; the result serves as the final user interest model. Specifically, corpora containing the search keyword may be found both in a large pre-prepared corpus and in the text of the speech to be retrieved and used as positive samples, while some corpora unrelated to the search keyword are randomly sampled as negative samples; these sample corpora are then converted into word vectors with a word embedding method, and the regression model is trained with these positive and negative word vectors.
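
A rough sketch of the training step just described: sample sentences are mapped to vectors and a linear SVM is fitted on positive samples (corpora containing the search keyword) and negative samples (unrelated corpora). The `embed` lookup table, the averaging of word vectors into a sentence vector, and the use of scikit-learn's LinearSVC in place of the "regression model" are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumption: a linear SVM stands in for the regression model

def sentence_vector(words, embed, dim=100):
    """Average the word vectors of one sample sentence; embed maps word -> np.ndarray."""
    vecs = [embed[w] for w in words if w in embed]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train_interest_model(pos_corpus, neg_corpus, embed, dim=100):
    """pos_corpus: tokenized sentences containing the search keyword (positive samples);
    neg_corpus: randomly sampled unrelated sentences (negative samples)."""
    X = np.vstack([sentence_vector(s, embed, dim) for s in pos_corpus + neg_corpus])
    y = [1] * len(pos_corpus) + [0] * len(neg_corpus)
    return LinearSVC().fit(X, y)
```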

[0076] Step 102: obtain each voice file to be retrieved.

[0077] Step 103: perform speech transcription on the voice file to obtain a transcription result.

[0078] Specifically, large-scale speech transcription technology may be used to transcribe the voice file and obtain the transcription result.

[0079] In this embodiment of the present invention, the transcription result is in word-level confusion network format, which contains not only the best candidate word but also multiple competing candidate words. The confusion network stores, for each word, its time position in the voice file, acoustic model score, language model score, original confidence and other information, so that the multi-knowledge-source features of each word can be obtained later. The original confidence may be calculated from the posterior probability of each word.

[0080] It should be noted that, in practical applications, the maximum number of competing candidate words retained at the same position of each word may be set, for example to 15. These competing candidate words may be selected in descending order of original confidence up to the set number, or all candidate words whose confidence exceeds a set threshold may be selected. Moreover, the original confidences of all competing candidate words at the same position sum to 1.
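
A minimal sketch of the candidate pruning and renormalization described in this paragraph; the slot representation (a list of word/confidence pairs between two adjacent nodes) is an assumed data structure, not one prescribed by the patent.

```python
def prune_slot(candidates, max_keep=15, min_conf=None):
    """candidates: (word, raw_confidence) pairs competing at one slot of the
    confusion network. Keep the top `max_keep` by confidence, or all candidates
    above `min_conf` if a threshold is given, then renormalize so the kept
    confidences again sum to 1."""
    ranked = sorted(candidates, key=lambda wc: wc[1], reverse=True)
    kept = [wc for wc in ranked if wc[1] > min_conf] if min_conf is not None else ranked[:max_keep]
    total = sum(c for _, c in kept) or 1.0
    return [(w, c / total) for w, c in kept]

# Example: prune_slot([("科大讯飞", 0.6), ("科大", 0.3), ("讯飞", 0.1)], max_keep=2)
# keeps the two best candidates and rescales their confidences to 2/3 and 1/3.
```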

[0081] Step 104: obtain, according to the transcription result, the text file corresponding to the voice file and the multi-knowledge-source features of each word in the text file.

[0082] Specifically, the text file corresponding to the voice file can be obtained by decoding the confusion network.
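
A one-line illustration of the decoding step, under the same assumed slot representation as in the sketch above: the text file is formed by the best candidate of every slot in time order. The actual decoder may be more involved; this is only a sketch.

```python
def decode_best_path(confusion_network):
    """confusion_network: list of slots, each a list of (word, confidence) pairs.
    Returns the word sequence formed by the best candidate of every slot."""
    return [max(slot, key=lambda wc: wc[1])[0] for slot in confusion_network]
```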

[0083] The multi-knowledge-source features may include at least two of the following: word posterior probability; posterior probability difference of competing words; language model score; frame-averaged acoustic model score. Of course, to make the subsequent confidence re-estimation more accurate, the extracted multi-knowledge-source features may further include any one or more of the following: the phoneme posterior probability and state frame variance corresponding to each word; word position coefficient; word length; whether the word is a stop word; duration; number of competing words; short-time average energy; and so on.

[0084] These features are described below in turn:

[0085] (1) Word posterior probability: the posterior probability of the current word;

[0086] (2) Posterior probability difference of competing words: the difference between the posterior probabilities of the two best candidate words between two adjacent nodes of the confusion network;

[0087] (3) Language model score: the N-gram language model score of the current word;

[0088] (4) Frame-averaged acoustic model score: the acoustic model score of the current word divided by the total number of feature frames of the word. For example, when acoustic features of the speech (such as MFCCs) are extracted with a frame shift of 10 ms, 1 second of speech yields about 100 feature frames; by this calculation, if the word "科大讯飞" occupies 0.7 seconds of the voice file, it corresponds to 70 frames;

[0089] (5) Phoneme posterior probability: the average of the posterior probabilities of the phonemes corresponding to the current word;

[0090] (6) State frame variance: the variance of the number of feature frames over the states corresponding to the current word;

[0091] (7) Word position coefficient: the position i of the current word in the sentence divided by the total number of words N of the sentence containing the word;

[0092] (8) Word length: the total number of characters contained in the current word;

[0093] (9) Whether the current word is a stop word;

[0094] (10) Duration: the length of time the current word lasts;

[0095] (11) Number of competing words: the total number of words between two adjacent nodes of the confusion network;

[0096] (12) Short-time average energy: the short-time average energy of the segment of the voice file corresponding to the current word.

[0097] It should be noted that forced alignment (FA) may be applied to each word in the confusion network to obtain state-level segmentation (the state is the smallest modelling unit of speech; a word generally contains several phonemes, and each phoneme contains several states), that is, the posterior probability of each state; the posterior probability of each phoneme is then the mean of the posterior probabilities of all states of that phoneme.

[0098] Step 105: re-estimate the confidence of each word using the multi-knowledge-source features, and filter out meaningless words and sentences in the text file.

[0099] Specifically, a multi-dimensional feature vector may be generated for each word from the above multi-knowledge-source features, and the confidence of the word may then be calculated using a pre-trained regression model (an SVM model is taken as an example below) and the word's multi-dimensional feature vector.

[0100] Taking the two knowledge-source features of word posterior probability and posterior probability difference of competing words as an example, the process of generating a multi-dimensional feature vector (18-dimensional here) for each word is described below.

[0101] For convenience of description, the features are presented in the order of their indices in the multi-dimensional feature vector:

[0102] 1) Dimensions 1 to 9: the word posterior probabilities WPP(i-1), WPP2(i-1), WPP3(i-1), WPP(i), WPP2(i), WPP3(i), WPP(i+1), WPP2(i+1), WPP3(i+1), where i is the position of the current word in the sentence; the posterior probability WPP(i) of word i is defined as follows:

[0103] [Formula (1), given only as an image in the original, defines WPP(i) in terms of the forward and backward probabilities of word i and the scores p(j) of the candidate words in Ω.]

[0104] p(i) = pac(i) · plm(i) (2)

[0105] where αt(i) denotes the forward probability of word i at time t, βt(i) denotes the backward probability of word i at time t, and the forward and backward probabilities are computed with the existing forward-backward algorithm; Ω denotes the set of all candidate words appearing at time t; pac(i) is the acoustic model score of word i and plm(i) is the language model score of word i.

[0106] 2) Dimensions 10 to 18: posterior probability differences of competing words

[0107] to [0111] [The formulas defining the posterior probability differences of the competing words are given only as images in the original and are not reproduced here.]

[0112] where i is the position of the current word in the sentence, and the subscripts onebest and twobest denote the first and second candidates respectively.

[0113] The multi-knowledge-source feature vector is scored on the previously trained SVM model to obtain the score Sword:

[0114] Sword = w1 · x + b1 (3)

[0115] In the above formula, w1 is the normal vector of the SVM classification plane, x is the input multi-knowledge-source feature vector, and b1 is a bias parameter (a constant); w1 and b1 are trained in advance from positive and negative word samples.

[0116] Since the output of a standard SVM classifier is not given in the form of a probability, while this embodiment of the present invention needs the SVM classifier to produce a new confidence for the keyword, a transformation must be applied to the SVM output to obtain a score in probability form. Existing methods may be used to transform the SVM output; one of them is to apply a sigmoid transformation:

[0117] [Formula (4), given only as an image in the original, applies the sigmoid transformation with parameters A and B to the score Sword to obtain WPPword.]

[0118] where WPPword is the re-estimated confidence of the word, and the variables A and B are transformation parameters trained with the maximum likelihood criterion.
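
A compact sketch of formulas (3) and (4) as described above: the word's feature vector is scored against the trained SVM plane and the raw score is squashed into a probability-shaped confidence with the sigmoid transformation. The exact sign convention inside the exponential is an assumption, since formula (4) is only shown as an image in the original.

```python
import numpy as np

def reestimate_confidence(x, w1, b1, A, B):
    """x: the word's multi-knowledge-source feature vector (18- or 32-dimensional);
    w1, b1: normal vector and bias of the pre-trained SVM plane (formula (3));
    A, B: sigmoid parameters trained with the maximum likelihood criterion."""
    s_word = float(np.dot(w1, x) + b1)            # formula (3): raw SVM score
    return 1.0 / (1.0 + np.exp(A * s_word + B))   # formula (4): sigmoid-transformed confidence
```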

[0119] Next, taking the above 12 features as an example, the process of generating a multi-dimensional feature vector (32-dimensional here) for each word is described.

[0120] For convenience of description, the features are presented in the order of their indices in the multi-dimensional feature vector:

[0121] 1) Dimensions 1 to 9: the word posterior probabilities WPP(i-1), WPP2(i-1), WPP3(i-1), WPP(i), WPP2(i), WPP3(i), WPP(i+1), WPP2(i+1), WPP3(i+1), where i is the position of the current word in the sentence; for the definition of the posterior probability WPP(i) of word i, refer to formulas (1) and (2) above.

[0122] 2) Dimensions 10 to 18: posterior probability differences of competing words

[0123] to [0127] [The formulas defining these posterior probability differences are given only as images in the original and are not reproduced here.]

[0128] where i is the position of the current word in the sentence, and the subscripts onebest and twobest denote the first and second candidates respectively.

[0129] 3) Dimensions 19 to 21: the N-gram language model scores of the words, Plm(i-1), Plm(i), Plm(i+1);

[0130] 4) Dimensions 22 to 24: the frame-averaged acoustic model scores Pac(i-1)/N(i-1), Pac(i)/N(i), Pac(i+1)/N(i+1), where N(i) denotes the number of speech frames corresponding to word i;

[0131] 5) Dimension 25: the phoneme posterior probability PPPi of the word

[0132] to [0133] [The formulas defining PPPi are given only as images in the original; each phoneme's posterior is accumulated over its aligned frames and PPPi averages over the phonemes of word i.]

[0134] Here a deep neural network (such as an RNN) is used to model the acoustic distribution of the phonemes; its input is the acoustic features and its output is the phoneme posterior probabilities, and M in the above formulas denotes the output dimension of the neural network. For Chinese, M corresponds to 40 toneless phonemes plus sil (silence) and sp (inter-word pause), 42 phonemes in total. In the above formulas, Nphone denotes the total number of phonemes corresponding to word i, and p(phj|ot) is the posterior probability that the phoneme is j when the current speech frame is ot; the start and end frames of the current (to-be-re-estimated) phoneme are obtained during speech transcription, and the start and end frames of the s-th state within the current phoneme are obtained from the state-level segmentation of the word.
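
The sketch below shows one plausible reading of the 25th feature: average the frame-level DNN phoneme posteriors over each phoneme's aligned frames, then average over the word's phonemes. Since the original formulas are only shown as images, the array shapes and the exact averaging order are assumptions.

```python
import numpy as np

def word_phoneme_posterior(frame_posteriors, phoneme_spans):
    """frame_posteriors: (T, M) array of frame-level DNN outputs, M covering the
    40 toneless phonemes plus sil and sp (42 in total for Chinese);
    phoneme_spans: (phoneme_id, start_frame, end_frame) triples for the word,
    taken from the alignment produced during transcription."""
    per_phoneme = [frame_posteriors[s:e + 1, pid].mean() for pid, s, e in phoneme_spans]
    return float(np.mean(per_phoneme)) if per_phoneme else 0.0
```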

[0135] 6) Dimension 26: the state frame variance σsframe

[0136] to [0137] [The formulas defining the state frame variance are given only as images in the original and are not reproduced here.]

[0138] In the above formulas, Ns denotes the number of states corresponding to the current word, and Fs denotes the number of frames assigned to the s-th state of the current word (the per-state frame counts, and their average, are obtained from the state-level segmentation of the word).

[0139] 7) Dimension 27: the word position coefficient iloc/Nw, where iloc denotes the position index of the current word in the sentence and Nw denotes the total number of words in the current sentence;

[0140] 8) Dimension 28: word length, i.e. the number of characters contained in the current word;

[0141] 9) Dimension 29: whether the current word is a stop word, 1 if it is and 0 otherwise;

[0142] 10) Dimension 30: the duration of the current word, in seconds;

[0143] 11) Dimension 31: the total number of competing words of the current word, i.e. the total number of arcs between two adjacent nodes of the confusion network;

[0144] 12) Dimension 32: the short-time average energy of the segment of the voice file corresponding to the current keyword.

[0145] For the process of re-estimating the confidence of each word with the 32-dimensional feature vector generated from the above multi-knowledge-source features, refer to the description of formulas (3) and (4) above, which is not repeated here.

[0146] The filtering-out of meaningless words and sentences in the text file mentioned above may use dependency parsing to analyse the syntax of the transcribed text, convert the parsing result into word vectors (for example one-hot vectors), use these word vectors as features together with a classifier (such as an SVM) to classify the words in the transcribed text, and filter out meaningless words (such as modal particles) and sentences according to the classification result.
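
A simplified sketch of the filtering step just described, assuming a dependency parser and a pre-trained classifier are already available; encoding each word only by a one-hot vector of its dependency label is a reduction of what the text describes, for illustration only.

```python
import numpy as np

def filter_meaningless(words, dep_labels, label_index, clf):
    """words: words of the transcribed text; dep_labels: their dependency-parse
    labels; label_index: dict mapping a label to its one-hot position;
    clf: pre-trained classifier (e.g. an SVM) whose predict() returns 1 for
    meaningful words and 0 for meaningless ones such as modal particles."""
    kept = []
    for word, label in zip(words, dep_labels):
        onehot = np.zeros(len(label_index))
        if label in label_index:
            onehot[label_index[label]] = 1.0
        if clf.predict(onehot.reshape(1, -1))[0] == 1:
            kept.append(word)
    return kept
```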

[0147] It should be noted that the two processes above, re-estimating the confidence of each word and filtering out meaningless words and sentences in the text file, can be performed in either order: the confidence of each word may be re-estimated first and the meaningless words and sentences filtered out afterwards, or the meaningless words and sentences may be filtered out first and the confidence of each word re-estimated afterwards.

[0148] Step 106: calculate the relevance of each text file to the user interest model according to the confidence re-estimation results.

[0149] First, for the filtered text file, existing word embedding techniques are used to calculate the word vector of each word in the filtered text file, denoted V.

[0150] Then, each word's confidence re-estimation result is taken as that word's weight, and a weighted average of the word vectors of all words appearing in the text file is computed to obtain the vector of the text file:

[0151] to [0152] [Formula (10), given only as an image in the original, computes Vdoc as the confidence-weighted average of the word vectors Vi.]

[0153] In the above formula, Nword is the total number of words contained in the filtered text file, WPPi denotes the confidence of the i-th word, Vi denotes the word vector of the i-th word, and Vdoc denotes the vector of the filtered text file.
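
A sketch of the document vector of formula (10) and the relevance score of formula (11) below. Dividing the weighted sum by the sum of the confidences is an assumption, since formula (10) itself is only shown as an image in the original.

```python
import numpy as np

def document_vector(word_vectors, confidences):
    """Confidence-weighted average of the word vectors of the filtered text file:
    each word vector Vi is weighted by its re-estimated confidence WPPi."""
    V = np.asarray(word_vectors, dtype=float)   # shape (Nword, dim)
    w = np.asarray(confidences, dtype=float)    # shape (Nword,)
    return (w[:, None] * V).sum(axis=0) / max(w.sum(), 1e-9)

def relevance(doc_vec, w2, b2):
    """Relevance of the text file to the user interest model, formula (11)."""
    return float(np.dot(w2, doc_vec) + b2)
```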

[0154] Finally, the relevance between the current text file and the user interest model (an SVM model is taken as an example) is calculated:

[0155] Sdoc = w2 · Vdoc + b2 (11)

[0156] where the parameter w2 is the normal vector of the SVM classification plane and b2 is a bias parameter (a constant), both trained from a large amount of training data.

[0157] Further, the relevance values output by the above SVM may be normalized so that the retrieved files can be ranked more intuitively.

[0158] Step 107: display information on the retrieved voice files according to the relevance.

[0159] Specifically, information on the voice files whose relevance is greater than a set threshold may be displayed in descending order of relevance, or information on a set number of voice files may be displayed in descending order of relevance.

[0160] In addition, the file relevance scores may be divided by thresholds corresponding to different levels to obtain the importance level of the original voice file, such as "high", "medium" and "low", and the displayed voice file information is then shown to the user together with its level information.

[0161] It should be noted that the displayed voice file information may be information such as the subject name, summary or link of the voice file; the embodiments of the present invention do not limit this.
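
Putting paragraphs [0159] and [0160] together, the sketch below ranks the retrieved files by relevance, optionally cuts the list by a threshold or a fixed count, and attaches an importance level; the threshold values in the default table are placeholders, not values taken from the patent.

```python
def rank_and_label(files, scores, min_score=None, top_n=None,
                   level_thresholds=((0.8, "high"), (0.5, "medium"), (0.0, "low"))):
    """files: voice-file descriptors (subject name, summary, link, ...);
    scores: their normalized relevance values. Sort by descending relevance,
    optionally keep only scores above `min_score` or only the `top_n` best,
    and attach an importance level read off the threshold table."""
    def level(s):
        return next((lab for thr, lab in level_thresholds if s >= thr), level_thresholds[-1][1])
    ranked = sorted(zip(files, scores), key=lambda fs: fs[1], reverse=True)
    if min_score is not None:
        ranked = [fs for fs in ranked if fs[1] > min_score]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [(f, s, level(s)) for f, s in ranked]
```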

[0162] The voice file retrieval method provided by the embodiments of the present invention addresses the fact that text files obtained by speech transcription contain a certain number of transcription errors: multi-knowledge-source features of each word in the transcribed text file are extracted, the confidence of each word is re-estimated using these features, meaningless words and sentences in the text file are filtered out, the relevance of each text file to the user interest model is calculated from the confidence re-estimation results, and the retrieved voice files are displayed according to the relevance, which effectively reduces the impact of transcription errors on file ranking. The voice file ranking method of the embodiments of the present invention not only greatly improves the efficiency of voice file retrieval but also ensures the accuracy of the retrieval results.

[0163] Correspondingly, an embodiment of the present invention further provides a voice file retrieval system; FIG. 2 is a schematic structural diagram of the system.

[0164] In this embodiment, the system comprises:

[0165] a model training module 201, configured to train a user interest model corresponding to a search keyword;

[0166] a voice file acquisition module 202, configured to obtain each voice file to be retrieved;

[0167] a speech transcription module 203, configured to perform speech transcription on the voice file to obtain a transcription result;

[0168] a text file generation module 204, configured to obtain, according to the transcription result, a text file corresponding to the voice file;

[0169] a feature acquisition module 205, configured to obtain the multi-knowledge-source features of each word in the text file;

[0170] a confidence re-estimation module 206, configured to re-estimate the confidence of each word in the text file using the multi-knowledge-source features;

[0171] a filtering module 207, configured to filter out meaningless words and sentences in the text file;

[0172] a relevance calculation module 208, configured to calculate the relevance of each text file to the user interest model according to the confidence re-estimation results;

[0173] a display module 209, configured to display information on the retrieved voice files according to the relevance.

[0174] It should be noted that, in practical applications, the search keyword may be one or more search keywords entered by the user at retrieval time, or one or more search keywords collected in advance from corpora of specific scenarios; the embodiments of the present invention do not limit this.

[0175] The user interest model may be a regression model. When training the regression model, the model training module 201 may use existing word embedding techniques to compute word vector representations of the search keywords and train the regression model dynamically together with word vectors, taken from the text to be retrieved, that are unrelated to the search terms, with the result serving as the final user interest model. Correspondingly, one specific structure of the model training module 201 may include the following units:

[0176] a corpus collection unit, configured to collect corpora containing the search keyword;

[0177] a word vector calculation unit, configured to calculate a word vector for each word in the corpora;

[0178] a training unit, configured to train a regression model with the word vectors and use the regression model as the user interest model.

[0179] In this embodiment of the present invention, the transcription result is in word-level confusion network format, which contains not only the best candidate word but also multiple competing candidate words. The confusion network stores, for each word, its time position in the voice file, acoustic model score, language model score and original confidence. In addition, the multi-knowledge-source features comprise at least two of the following: word posterior probability; posterior probability difference of competing words; language model score; frame-averaged acoustic model score. Of course, to make the subsequent confidence re-estimation more accurate, the multi-knowledge-source features may further comprise any one or more of the following: the phoneme posterior probability and state frame variance corresponding to each word; word position coefficient; word length; whether the word is a stop word; duration; number of competing words; short-time average energy; and so on. These features have been described in detail above and are not repeated here.

[0180] Correspondingly, the confidence re-estimation module 206 may generate a multi-dimensional feature vector for each word from the above multi-knowledge-source features and then calculate the confidence of the word using a pre-trained regression model (an SVM model is taken as an example) and the word's multi-dimensional feature vector. One specific structure of the confidence re-estimation module 206 may include a multi-dimensional feature vector generation unit and a confidence calculation unit, where the multi-dimensional feature vector generation unit is configured to generate a multi-dimensional feature vector for each word from the multi-knowledge-source features, and the confidence calculation unit is configured to calculate the confidence of the word using the pre-trained regression model and the word's multi-dimensional feature vector.

[0181] As shown in FIG. 3, which is a schematic structural diagram of the relevance calculation module in an embodiment of the present invention, the module comprises:

[0182] a word vector calculation unit 31, configured to calculate, for each text file, a word vector for each word in the text file;

[0183] a file vector calculation unit 32, configured to take each word's confidence re-estimation result as that word's weight, and compute a weighted average of the word vectors of all words appearing in the text file to obtain the vector of the text file;

[0184] a relevance calculation unit 33, configured to calculate the relevance of the text file to the user interest model from the vector of the text file.

[0185] For the specific calculation processes of the above units, refer to the description in the method embodiments above, which is not repeated here.

[0186] The display module 209 may display information on the retrieved voice files according to the relevance. In practical applications, the voice file information may be displayed in descending order of relevance; for example, information on all voice files whose relevance is greater than a set threshold may be displayed, or information on a set number of voice files may be displayed. The voice file information may be information such as the subject name, summary or link of the voice file; the embodiments of the present invention do not limit this.

[0187] FIG. 4 is another schematic structural diagram of a voice file retrieval system according to an embodiment of the present invention.

[0188] The difference from the embodiment shown in FIG. 2 is that, in this embodiment, the system further comprises a setting module 401 and a level determination module 402. The setting module 401 is configured to set relevance thresholds for different importance levels; the level determination module 402 is configured to determine the importance level of each voice file according to the relevance of each text file to the user interest model and the relevance thresholds.

[0189] Correspondingly, in this embodiment, the display module 209 is configured not only to display information on the retrieved voice files but also to display the importance level information of the voice file when displaying the voice file information.

[0190] The voice file retrieval system provided by the embodiments of the present invention addresses the fact that text files obtained by speech transcription contain a certain number of transcription errors: multi-knowledge-source features of each word in the transcribed text file are extracted, the confidence of each word is re-estimated using these features, meaningless words and sentences in the text file are filtered out, the relevance of each text file to the user interest model is calculated from the confidence re-estimation results, and the retrieved voice files are displayed according to the relevance, which effectively reduces the impact of transcription errors on file ranking. The voice file retrieval system of the embodiments of the present invention not only greatly improves the efficiency of voice file retrieval but also ensures the accuracy of the retrieval results.

[0191] The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant points, refer to the corresponding description of the method embodiments. The system embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of the embodiments. A person of ordinary skill in the art can understand and implement them without creative effort.

[0192] The embodiments of the present invention have been described in detail above; specific implementations are used herein to set forth the present invention, and the description of the above embodiments is only intended to help understand the method and system of the present invention. Meanwhile, a person of ordinary skill in the art may make changes to the specific implementations and the scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (15)

1. 一种语音文件检索方法,其特征在于,包括: 训练对应检索关键词的用户兴趣模型; 获取待检索的各语音文件; 对所述语音文件进行语音转写,得到转写结果; 根据所述转写结果获得所述语音文件对应的文本文件及所述文本文件中各词的多知识源特征; 利用所述多知识源特征对各词进行置信度重估,并滤除所述文本文件中无意义的词句; 根据置信度重估结果计算各文本文件与所述用户兴趣模型的相关度; 根据所述相关度展示检索出的语音文件信息。 1. A speech file retrieval method, comprising: training a user interest model corresponding to the search keyword; acquiring each voice file to be retrieved; voice of the audio file transfer, result transfer obtained; in accordance with the the results obtained above transfer the audio file corresponding to the text file and text file for each word of the multiple knowledge sources characteristics; for each word confidence revaluation wherein said multiple use of the knowledge source, and filtering the text file meaningless words; calculated for each text file with the user interest model based on the confidence level associated revaluation; display according to the degree of correlation information retrieved voice file.
2. 根据权利要求1所述的方法,其特征在于,所述检索关键词是用户在检索时输入的一个或多个检索关键词,或者是预先从一些特定情景语料中搜集得到的一个或多个关键词。 2. The method according to claim 1, wherein said one or more search keyword is a search keyword input by the user at the time of retrieving, from a predetermined or certain situations a corpus or collection obtained key words.
3. 根据权利要求1所述的方法,其特征在于,所述训练对应检索关键词的用户兴趣模型包括: 收集包含所述检索关键词的语料; 计算所述语料中各词的词向量; 利用所述词向量训练回归模型,将所述回归模型作为用户兴趣模型。 3. The method according to claim 1, characterized in that the training of the search keyword corresponding to the user interest model comprising: collecting a corpus containing the search keyword; computing the word vectors for each word in the corpus; using the word training vector regression model, the regression model as a user interest model.
4. 根据权利要求1所述的方法,其特征在于,所述转写结果为词级混淆网络格式,所述混淆网络中保存有每个词在语音文件中的时间位置、声学模型得分、语言模型得分和原始置信度; 所述多知识源特征包括以下特征中的至少两种:词后验概率;竞争词的后验概率差;语言模型得分;帧平均声学模型得分。 4. The method according to claim 1, wherein said word-level transcription result is confusion network format, a network confusion time position of each word stored in the voice file, the acoustic model score, language and raw confidence score model; multiple knowledge sources said characteristic comprises at least two of: the word posterior probability; posterior probability of differential competition words; language model score; frame average acoustic model score.
5. 根据权利要求4所述的方法,其特征在于,所述方法还包括: 对所述混淆网络中的各词进行切分,得到该词对应的音素信息; 所述多知识源特征还包括以下任意一种或多种:各词对应的音素后验概率、状态帧方差;词位置系数;词长;是否为停止词;时长;竞争词个数;短时平均能量。 The method according to claim 4, characterized in that, said method further comprising: for each word in the confusion network be segmented to obtain the corresponding phoneme information word; wherein said source further comprises a plurality of knowledge any one or more of: the word corresponding to each phoneme posterior probability state frame variance; coefficient word position; wordlength; whether stop words; long; the number of words competition; short-term average energy.
6. 根据权利要求4或5所述的方法,其特征在于,所述对所述文本文件中各词进行置信度评估包括: 根据所述多知识源特征为各词生成一组多维特征向量; 利用预先训练的回归模型及各词的多维特征向量计算该词的置信度。 The method according to claim 4 or claim 5, wherein the performing comprises the confidence evaluation for each word in the text file: generating a set of multi-dimensional feature vector for the word based on the knowledge source wherein said plurality; calculate confidences term use of pre-trained multi-dimensional feature vector and each word of the regression model.
7. 根据权利要求6所述的方法,其特征在于,所述根据置信度重估结果计算各文本文件与所述用户兴趣模型的相关度包括: 对于每个文本文件,计算所述文本文件中各词的词向量; 将各词的置信度重估结果作为该词的权重,对所述文本文件中出现的所有词的词向量进行加权平均,得到所述文本文件的向量: 根据所述文本文件的向量计算所述文本文件与所述用户兴趣模型的相关度。 7. The method according to claim 6, wherein said calculating a correlation degree of each text file with the user interest model comprising a revaluation confidence: for each text file, the text file is calculated word vector of each word; revaluation the confidence weight of each word as the word weight, the word vectors for all words appearing in the text file weighted average vector of the resulting text file: according to the text calculating a vector file and a text file to the user interest correlation model.
8. 根据权利要求1至7任一项所述的方法,其特征在于,所述根据所述相关度展示检索出的语音文件信息包括: 按照相关度从大到小依次展示相关度大于设定阈值的语音文件信息;或者按照相关度从大到小依次展示设定个数的语音文件信息。 1 8. The method according to any one of claims 7, wherein said display according to the degree of correlation of voice files retrieved information comprises: decreasing order of degree of correlation is greater than the set show relevant threshold information of the voice file; or descending order according to the degree of correlation is set to show the number of the voice file information.
9. 根据权利要求8所述的方法,其特征在于,所述方法还包括: 设定针对不同重要性级别的相关度阈值; 根据各文本文件与所述用户兴趣模型的相关度及所述相关度阈值确定各语音文件的重要性级别; 在展示所述语音文件信息时,展示所述语音文件的重要性级别信息。 9. The method according to claim 8, characterized in that, said method further comprising: setting a threshold value for the correlation of different importance levels; the correlation and correlation with each of the text files according to user interest model importance level threshold is determined for each voice file; the audio file while displaying information, information showing the importance level of the voice file.
10. 一种语音文件检索系统，其特征在于，包括: 模型训练模块，用于训练对应检索关键词的用户兴趣模型; 语音文件获取模块，用于获取待检索的各语音文件; 语音转写模块，用于对所述语音文件进行语音转写，得到转写结果; 文本文件生成模块，用于根据所述转写结果获得所述语音文件对应的文本文件; 特征获取模块，用于获取所述文本文件中各词的多知识源特征; 置信度重估模块，用于利用所述多知识源特征对各词进行置信度重估; 过滤模块，用于滤除所述文本文件中无意义的词句; 相关度计算模块，用于根据置信度重估结果计算各文本文件与所述用户兴趣模型的相关度; 展示模块，用于根据所述相关度展示检索出的语音文件信息。 10. An audio file retrieval system, characterized by comprising: a model training module, configured to train a user interest model corresponding to a search keyword; an audio file acquisition module, configured to acquire each audio file to be retrieved; a transcription module, configured to transcribe the audio files to obtain transcription results; a text file generation module, configured to obtain the text file corresponding to each audio file according to the transcription result; a feature acquisition module, configured to acquire the multi-knowledge-source features of each word in the text file; a confidence re-evaluation module, configured to re-evaluate the confidence of each word using the multi-knowledge-source features; a filtering module, configured to filter out meaningless words and sentences in the text file; a correlation computation module, configured to compute the correlation between each text file and the user interest model according to the confidence re-evaluation results; a display module, configured to display the retrieved audio file information according to the correlation.
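For orientation only, the sketch below wires the module boundaries of claim 10 together, reusing the helper functions sketched after claims 3, 4, 6, 7 and 8. The transcription step is left as an external callable, and the filtering cutoff is an arbitrary illustrative value, not a value taken from the patent.

```python
# Hypothetical end-to-end wiring of the claim-10 modules, reusing earlier sketches.
def retrieve(keyword_sentences, background_sentences, audio_files,
             transcribe, confidence_regressor,
             feature_order=("word_posterior", "posterior_gap", "lm_score", "frame_avg_acoustic")):
    # Model training module.
    w2v, interest_model = train_user_interest_model(keyword_sentences, background_sentences)
    results = []
    for audio in audio_files:                              # audio file acquisition module
        arcs = transcribe(audio)                           # transcription module (external ASR)
        feats = [multi_knowledge_features(a, []) for a in arcs]   # feature acquisition module
        confs = word_confidences(feats, confidence_regressor, feature_order)  # re-evaluation module
        # Filtering module: drop low-confidence words (0.3 is an illustrative cutoff);
        # words missing from the word-vector vocabulary would also need skipping in practice.
        kept = [(a.word, c) for a, c in zip(arcs, confs) if c > 0.3]
        if not kept:
            continue
        vec = text_file_vector([w for w, _ in kept], [c for _, c in kept], w2v)  # text file vector
        score = correlation_with_interest_model(vec, interest_model)            # correlation module
        results.append((audio, score))
    return rank_results(results, top_n=10)                 # display module input
```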
11. 根据权利要求10所述的系统，其特征在于，所述模型训练模块包括: 语料收集单元，用于收集包含所述检索关键词的语料; 词向量计算单元，用于计算所述语料中各词的词向量; 训练单元，用于利用所述词向量训练回归模型，将所述回归模型作为用户兴趣模型。 11. The system according to claim 10, characterized in that the model training module comprises: a corpus collection unit, configured to collect a corpus containing the search keyword; a word vector computation unit, configured to compute the word vector of each word in the corpus; a training unit, configured to train a regression model with the word vectors and use the regression model as the user interest model.
12. 根据权利要求10所述的系统，其特征在于，所述转写结果为词级混淆网络格式，所述混淆网络中保存有每个词在语音文件中的时间位置、声学模型得分、语言模型得分和原始置信度;所述多知识源特征包括以下特征中的至少两种:词后验概率;竞争词的后验概率差;语言模型得分;帧平均声学模型得分; 所述置信度重估模块包括: 多维特征向量生成单元，用于根据所述多知识源特征为各词生成一组多维特征向量; 置信度计算单元，用于利用预先训练的回归模型及各词的多维特征向量计算该词的置信度。 12. The system according to claim 10, characterized in that the transcription result is in word-level confusion network format, the confusion network storing, for each word, its time position in the audio file, acoustic model score, language model score and original confidence; the multi-knowledge-source features comprise at least two of the following: word posterior probability; posterior probability difference of competing words; language model score; frame-averaged acoustic model score; the confidence re-evaluation module comprises: a multi-dimensional feature vector generation unit, configured to generate a multi-dimensional feature vector for each word from the multi-knowledge-source features; a confidence computation unit, configured to compute the confidence of each word with a pre-trained regression model and the multi-dimensional feature vector of the word.
13. 根据权利要求10所述的系统，其特征在于，所述相关度计算模块包括: 词向量计算单元，用于对于每个文本文件，计算所述文本文件中各词的词向量; 文件向量计算单元，用于将各词的置信度重估结果作为该词的权重，对所述文本文件中出现的所有词的词向量进行加权平均，得到所述文本文件的向量; 相关度计算单元，用于根据所述文本文件的向量计算所述文本文件与所述用户兴趣模型的相关度。 13. The system according to claim 10, characterized in that the correlation computation module comprises: a word vector computation unit, configured to compute, for each text file, the word vector of each word in the text file; a file vector computation unit, configured to take the confidence re-evaluation result of each word as the weight of that word and compute a weighted average of the word vectors of all words appearing in the text file to obtain the vector of the text file; a correlation computation unit, configured to compute the correlation between the text file and the user interest model from the vector of the text file.
14. 根据权利要求10至13任一项所述的系统，其特征在于，所述展示模块具体用于按照相关度从大到小依次展示相关度大于设定阈值的语音文件，或者按照相关度从大到小依次展示设定个数的语音文件。 14. The system according to any one of claims 10 to 13, characterized in that the display module is specifically configured to display, in descending order of correlation, the audio files whose correlation is greater than a set threshold, or to display, in descending order of correlation, a set number of audio files.
15. 根据权利要求14所述的系统，其特征在于，所述系统还包括: 设定模块，用于设定针对不同重要性级别的相关度阈值; 级别确定模块，用于根据各文本文件与所述用户兴趣模型的相关度及所述相关度阈值确定各语音文件的重要性级别; 所述展示模块，还用于在展示所述语音文件信息时，展示所述语音文件的重要性级别信息。 15. The system according to claim 14, characterized in that the system further comprises: a setting module, configured to set correlation thresholds for different importance levels; a level determination module, configured to determine the importance level of each audio file according to the correlation between each text file and the user interest model and the correlation thresholds; the display module being further configured to display the importance level information of each audio file when displaying the audio file information.
CN201510882391.9A 2015-11-30 2015-11-30 Audio file retrieval method and system CN105551485A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510882391.9A CN105551485A (en) 2015-11-30 2015-11-30 Audio file retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510882391.9A CN105551485A (en) 2015-11-30 2015-11-30 Audio file retrieval method and system

Publications (1)

Publication Number Publication Date
CN105551485A true CN105551485A (en) 2016-05-04

Family

ID=55830634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510882391.9A CN105551485A (en) 2015-11-30 2015-11-30 Audio file retrieval method and system

Country Status (1)

Country Link
CN (1) CN105551485A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0651372A2 (en) * 1993-10-27 1995-05-03 AT&T Corp. Automatic speech recognition (ASR) processing using confidence measures
GB2364814A (en) * 2000-07-12 2002-02-06 Canon Kk Speech recognition
CN101021856A (en) * 2006-10-11 2007-08-22 鲍东山 Distributing speech searching system
CN101510222A (en) * 2009-02-20 2009-08-19 北京大学 Multilayer index voice document searching method and system thereof
CN102023994A (en) * 2009-09-22 2011-04-20 株式会社理光 Device for retrieving voice file and method thereof
CN102314876A (en) * 2010-06-29 2012-01-11 株式会社理光 Speech retrieval method and system
CN103793515A (en) * 2014-02-11 2014-05-14 安徽科大讯飞信息科技股份有限公司 Service voice intelligent search and analysis system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
唐杰: "基于内容的音频检索技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 (Tang Jie: "Research on Content-Based Audio Retrieval Technology", China Master's Theses Full-text Database, Information Science and Technology) *

Similar Documents

Publication Publication Date Title
US8478591B2 (en) Phonetic variation model building apparatus and method and phonetic recognition system and method thereof
JP4705023B2 (en) Speech recognition device, speech recognition method, and a program
CN102405495B (en) Audio classification for information retrieval using sparse features
JP2004005600A (en) Method and system for indexing and retrieving document stored in database
JP2003036093A (en) Speech input retrieval system
CN101930735B (en) Speech emotion recognition equipment and speech emotion recognition method
CN101382937A (en) Multimedia resource processing method based on speech recognition and on-line teaching system thereof
CN102231278B (en) Method and system for realizing automatic addition of punctuation marks in speech recognition
JP2004133880A (en) Method for constructing dynamic vocabulary for speech recognizer used in database for indexed document
CN102651217A (en) Method and equipment for voice synthesis and method for training acoustic model used in voice synthesis
Gharavian et al. Speech emotion recognition using FCBF feature selection method and GA-optimized fuzzy ARTMAP neural network
Ferrer et al. A prosody-based approach to end-of-utterance detection that does not require speech recognition
JP6066354B2 (en) The methods and apparatus of the reliability calculation
US8494850B2 (en) Speech recognition using variable-length context
US8195459B1 (en) Augmentation and calibration of output from non-deterministic text generators by modeling its characteristics in specific environments
CN101567189B (en) Device, method and system for correcting voice recognition result
Akbacak et al. Open-vocabulary spoken term detection using graphone-based hybrid recognition systems
CN105023573B (en) Using auditory attention cues speech syllable / vowel / phoneme boundary detection
CN102280106A (en) Voice network search method and apparatus for a mobile communication terminal,
KR101309042B1 (en) Apparatus for multi domain sound communication and method for multi domain sound communication using the same
CN104200804B (en) Many types of emotion recognition information for a coupling HCI
US8321218B2 (en) Searching in audio speech
CN101645064B (en) Superficial natural spoken language understanding system and method thereof
CN101447185B (en) Audio frequency rapid classification method based on content
Metze et al. Language independent search in MediaEval's Spoken Web Search task

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination