CN102314876B - Speech retrieval method and system - Google Patents


Info

Publication number
CN102314876B
CN102314876B (granted publication of application CN201010212269A; earlier publication CN102314876A)
Authority
CN
China
Prior art keywords
retrieval
feature
confidence degree
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201010212269
Other languages
Chinese (zh)
Other versions
CN102314876A (en)
Inventor
史达飞
鲁耀杰
王磊
尹悦燕
郑继川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Priority to application CN201010212269
Publication of CN102314876A
Application granted
Publication of CN102314876B
Current legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a speech retrieval method and system. The speech retrieval method comprises the steps of: receiving a retrieval input from a user; extracting a plurality of retrieval-input speech features using multiple groups of acoustic models and language models, and acquiring a first confidence degree for each retrieval-input speech feature; retrieving with each of the retrieval-input speech features to obtain, for each feature, a retrieval result list together with a second confidence degree and a search engine score for every result record in the list; calculating a retrieval score for each result record of each speech feature from the first confidence degree, the second confidence degree and the search engine score of that feature, and normalizing the retrieval scores; re-sorting each retrieval result list according to the normalized retrieval scores; and merging the re-sorted retrieval result lists of all features to obtain a final retrieval list.

Description

Speech retrieval method and system
Technical field
Embodiments of the invention relate to a speech retrieval method and a speech retrieval system, and more specifically to a speech retrieval method and system that retrieve using a plurality of speech features.
Background art
In recent years, traditional text-based retrieval has become less and less able to satisfy people's increasingly diverse needs. With the development of speech recognition technology, speech retrieval based on speech recognition has attracted more and more attention. However, because a user's pronunciation at retrieval time may differ from the standard pronunciation assumed by the retrieval system, the currently used approach of retrieving by a single speech feature (for example, the word feature) suffers from a high speech recognition error rate and places very high demands on the pronunciation of the user's retrieval input.
United States patent application US 2009/0030894 A1 (patent document 1) discloses a speech indexing and retrieval method. The method comprises: receiving a retrieval input formed of one or more search terms; judging whether each search term belongs to an "in-vocabulary" set or an "out-of-vocabulary" set; selecting one or more indexes for retrieval according to the set type of the search term; merging the retrieval results for each search term; and merging the retrieval results of all search terms. By dividing the retrieval input into a plurality of search terms, judging the vocabulary set type of each search term, selecting different search engines according to the judgment, and merging twice, the indexing and retrieval method of patent document 1 improves the accuracy of speech recognition.
United States patent application US 2009/0132251 A1 (patent document 2) also discloses a speech retrieval method. In that method, sub-word text units are used as the retrieval feature for speech retrieval, which improves both the retrieval speed and the accuracy of speech recognition.
However, existing speech retrieval methods still have problems such as low retrieval precision and the results desired by the user not being ranked near the front of the retrieval results. In addition, when a new language or newly emerging vocabulary needs to be retrieved, existing speech retrieval methods require the retrieval system to be rebuilt, which means an enormous workload.
Summary of the invention
In view of the above problems, it is desirable to provide a speech retrieval method and system that can improve retrieval precision.
According to one aspect of the present invention, a speech retrieval method is provided. The speech retrieval method comprises: receiving a retrieval input from a user; extracting a plurality of retrieval-input speech features from the retrieval input using multiple groups of acoustic models and language models, and obtaining a first confidence degree for each retrieval-input speech feature; retrieving with each of the plurality of retrieval-input speech features to obtain, for each retrieval-input speech feature, a retrieval result list together with a second confidence degree and a search engine score for every result record in the list; calculating a retrieval score for every result record of each speech feature from the first confidence degree, the second confidence degree and the search engine score of that feature, and normalizing the retrieval scores; re-sorting each retrieval result list according to the normalized retrieval scores; and merging the re-sorted retrieval result lists of all features to obtain a final retrieval list.
According to another aspect of the present invention, a speech retrieval system is provided. The speech retrieval system comprises: an input module for receiving a retrieval input from a user; a decoder module for extracting a plurality of retrieval-input speech features from the retrieval input using multiple groups of acoustic models and language models and obtaining a first confidence degree for each retrieval-input speech feature; a retrieval module for retrieving with each of the plurality of retrieval-input speech features extracted by the decoder module, to obtain, for each retrieval-input speech feature, a retrieval result list together with a second confidence degree and a search engine score for every result record in the list; a re-sorting module for calculating a retrieval score for every result record of each speech feature from the first confidence degree, the second confidence degree and the search engine score of that feature, normalizing the retrieval scores, and re-sorting each retrieval result list according to the normalized retrieval scores; and a merging module for merging the re-sorted retrieval result lists of all features to obtain a final retrieval list.
By utilizing a plurality of speech features, the speech retrieval method and system of the present invention can obtain better results than a speech retrieval method or system that uses a single speech feature. Moreover, by using the confidence degrees to re-sort the retrieval results, the influence of low-confidence speech recognition results on speech retrieval is reduced.
In addition, the speech retrieval method and system of the present invention are applicable to multilingual speech retrieval and to the retrieval of newly emerging vocabulary.
Description of the drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart describing a speech retrieval method according to an embodiment of the invention;
Fig. 2 is a flowchart describing an example of a step of the method shown in Fig. 1 according to an embodiment of the invention;
Fig. 3 is a flowchart describing another example of a step of the method shown in Fig. 1 according to an embodiment of the invention;
Fig. 4 is a block diagram of a speech retrieval system according to an embodiment of the invention;
Fig. 5 is a block diagram of a speech retrieval system according to another embodiment of the invention;
Fig. 6 is a block diagram of a speech retrieval system according to a further embodiment of the invention.
Embodiment
The preferred embodiments of the present invention are described in detail below with reference to the drawings. Note that in this specification and the drawings, substantially identical steps and elements are denoted by the same reference numerals, and repeated explanation of these steps and elements is omitted.
Fig. 1 is a flowchart describing a speech retrieval method 1000 according to an embodiment of the invention. The speech retrieval method 1000 is described below with reference to Fig. 1.
In step S1010 of Fig. 1, a retrieval input is received from the user. In step S1020, multiple groups of acoustic models and language models are used to extract a plurality of retrieval-input speech features from the retrieval input and to obtain a first confidence degree for each retrieval-input speech feature. A speech feature may be an acoustic feature, a phoneme feature, a sub-word text unit feature, a word feature, a speech recognition result, and so on. A phoneme is the smallest speech unit a human can utter; permutations and combinations of phonemes form words. The phoneme set of every language is finite, and the English International Phonetic Alphabet is one of the most commonly used phoneme sets. Digital speech can be converted into a phoneme sequence by speech recognition technology. Taking the English word "banana" as an example, the digital speech for "banana" can be converted by speech recognition into "B AA N AA N AA HH", where "B" is a phoneme. A sub-word text unit feature is a reasonable combination of phoneme features (usually more than two phonemes) that does not by itself constitute the pronunciation of a whole word. The sub-word text units of a language also form a finite set, larger than the set of phonemes. Speech recognition technology can convert speech into a sequence of sub-word text units; again taking "banana" as an example, the digital speech can be converted into "B-AA N-AA N-AA-HH", where "B-AA" is a sub-word text unit. A speech recognition result is the human-readable text into which speech recognition technology converts a digital speech file. The recognition result of each speech feature produced by speech recognition is not one hundred percent accurate.
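As a minimal sketch of the three feature granularities described above (phonemes, sub-word text units, words), the following illustrates how one word could be mapped to a phoneme sequence and then grouped into sub-word units. The lexicon and the two-phoneme grouping rule are assumptions for illustration only, not the patent's decoder.

```python
# Hypothetical illustration of phoneme and sub-word text unit features.
# `lexicon` stands in for a real recognizer's pronunciation output.

def phonemes(word, lexicon):
    """Look up the phoneme sequence for a word (recognizer stand-in)."""
    return lexicon[word]

def subword_units(phones, size=2):
    """Group consecutive phonemes into sub-word units of `size` phonemes."""
    return ["-".join(phones[i:i + size]) for i in range(0, len(phones), size)]

lexicon = {"banana": ["B", "AA", "N", "AA", "N", "AA", "HH"]}  # assumed entry
ph = phonemes("banana", lexicon)
print(ph)                 # ['B', 'AA', 'N', 'AA', 'N', 'AA', 'HH']
print(subword_units(ph))  # ['B-AA', 'N-AA', 'N-AA', 'HH']
```

The same word thus yields three parallel feature streams (phonemes, sub-word units, and the word itself), each of which can be indexed and retrieved separately.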
An acoustic model is a probability model relating words to the acoustic features of digital speech files, established by speech recognition technology using digital speech files and the corresponding manual annotations as training material. The acoustic model is an important input of a speech recognition engine, so acoustic models must be trained in advance before speech recognition is performed. Different acoustic models can be obtained by using different training speech material.
A language model is a statistical model trained on a large amount of text material according to the frequency of occurrence and ordering of the words in that material. Embodiments of the present invention are described taking three kinds of language models as an example: a phoneme language model, a sub-word text unit language model and a word language model. Acoustic models and language models can be used in many fields such as natural language processing, machine learning, text annotation and full-text retrieval.
Next, in step S1030, retrieval is performed with each of the plurality of retrieval-input speech features, to obtain, for each retrieval-input speech feature, a retrieval result list together with a confidence degree and a search engine score for every result record in the list. To distinguish it from the first confidence degree of the speech feature described above, the confidence degree of a result record is called the second confidence degree. Different search engines are used to retrieve different input speech features, and a plurality of search engines may be used for the same input speech feature.
The confidence degree is an output of decoding in the speech recognition process. As mentioned above, since the recognition result of each speech feature produced by speech recognition is not one hundred percent accurate, and the decoding process of speech recognition is in fact a process of probabilistic output, each decoding output can carry a confidence degree that represents the probability that the decoding is correct. The value of the confidence degree usually varies between 0.0 and 1.0. The concrete method of calculating the confidence degree does not limit the scope of the present invention; the first and second confidence degrees can be calculated by any method known in the art. For example, the first or second confidence degree CL of a decoded speech feature sequence E1, E2, ..., En can be computed by the following formula 1:
CL = ∏_{i=1}^{n} P(E_i | E_1, E_2, ..., E_{i-1})    (Formula 1)
Here, one group consisting of an acoustic model and a language model is used to decode the speech file, so as to transform the speech file into a sequence of speech features E1, E2, E3, E4, ..., En. P(Ei | E1, E2, ..., Ei-1) represents the probability that the i-th position is Ei given that the sequence E1, E2, ..., Ei-1 has occurred. In addition, acoustic features can also be utilized when calculating the confidence degree. For example, the node confidences obtained when decoding with the hidden Markov model (HMM) of the acoustic features can be combined with the confidence mentioned above to form a new confidence degree. In speech recognition, HMMs are mainly used to build acoustic models (see http://www.hudong.com/wiki/ hidden Markov model). A node of an HMM generally represents a phoneme (phone), and the HMM output includes transition probabilities between the nodes of the sequence; as mentioned above, these transition probabilities can be combined with the confidence from the language model to form a new confidence degree.
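Formula 1 can be sketched directly as a product of conditional probabilities. The probability values below are made up for illustration; in practice they would come from the language model's conditional distributions.

```python
def sequence_confidence(cond_probs):
    """Formula 1: confidence of a decoded feature sequence as the product
    of the conditional probabilities P(E_i | E_1, ..., E_{i-1}).
    cond_probs[i] is that probability for position i (values in (0, 1])."""
    conf = 1.0
    for p in cond_probs:
        conf *= p
    return conf

# Illustrative (made-up) probabilities for a three-feature sequence:
print(sequence_confidence([0.9, 0.8, 0.5]))  # ≈ 0.36
```

Note that the product shrinks quickly with sequence length, which is one reason scores from different engines are normalized before being compared (step S1040 below).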
In step S1040, the retrieval score of every result record of each speech feature is calculated from the first confidence degree, the second confidence degree and the search engine score of that feature. In addition, since the scores of different search engines are not comparable, the retrieval scores also need to be normalized in this step. The concrete method of normalizing the retrieval scores does not limit the scope of the present invention either; the normalization can be carried out by any method known in the art. For example, according to a linear model, the minimum score is mapped to 0.0 and the maximum score to 1.0; or, according to a statistical model, the minimum of the statistical distribution is mapped to 0.0 and its total to 1.0; or, according to a Gaussian statistical model, the scores are shifted and scaled so that the mean becomes 0.0 and the variance 1.0.
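Two of the normalization options named above can be sketched as follows: the linear (min-max) model and the Gaussian (zero-mean, unit-variance) model. These are standard formulations, not code from the patent.

```python
def minmax_normalize(scores):
    """Linear model: map the minimum score to 0.0 and the maximum to 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # degenerate case: all scores equal
    return [(s - lo) / (hi - lo) for s in scores]

def zscore_normalize(scores):
    """Gaussian statistical model: shift to mean 0.0, scale to unit variance."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    std = var ** 0.5 or 1.0  # avoid division by zero when all scores equal
    return [(s - mean) / std for s in scores]

print(minmax_normalize([2.0, 5.0, 8.0]))  # [0.0, 0.5, 1.0]
```

After either transformation, scores produced by different search engines live on a common scale and can be compared and merged.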
Then, in step S1050, each retrieval result list is re-sorted according to the normalized retrieval scores. Finally, in step S1060, the re-sorted retrieval result lists of all features are merged to obtain the final retrieval list. The retrieval processing performed for the retrieval input can be called online processing.
In the present embodiment, the retrieval input can be either a speech input or a text input. When the user's input is a text input, a dictionary can be used to extract the plurality of retrieval-input speech features from the retrieval input, and the first confidence degree is set to a value indicating that the extracted speech feature matches the actually input speech feature completely; for example, the first confidence degree is set to 1.0.
The steps of method 1000 are described in detail below with reference to the drawings. Fig. 2 is a flowchart describing an example of a step of the method 1000 shown in Fig. 1 according to an embodiment of the invention. An example of a concrete implementation of step S1030 in Fig. 1 is described below with reference to Fig. 2.
In the method shown in Fig. 2, in order to improve the retrieval speed, an index can be used to retrieve the plurality of retrieval-input speech features in the voice record set. The index of the plurality of retrieval-input speech features can be obtained in the manner shown in Fig. 2. Specifically, first, in step S2010, a speech file is read from the voice record set. Then, in step S2020, similarly to step S1020, multiple groups of acoustic models and language models are used to extract a plurality of file speech features from the speech file, and the confidence degree of each file speech feature is calculated as the second confidence degree. Next, in step S2030, each file speech feature is associated with the speech file it belongs to, its position within that speech file and its second confidence degree, so as to improve the retrieval speed. Finally, in step S2040, the association of the file speech feature with its speech file, its position in the speech file and its second confidence degree is stored as an index. Table 1 schematically shows the generated index, where E denotes a speech feature, AM an acoustic model and LM a language model.
[Table 1 appears as an image in the original document.]
Table 1: phonetic feature index and description
Table 1 shows an example of the index obtained according to the present embodiment. As shown in Table 1, a speech feature, the position of the feature in a speech file (that is, the start and end times at which the speech feature occurs in the speech file containing it) and the confidence degree of the speech feature in that speech file are associated to form the index of the speech feature. A speech feature may match speech files in several voice sets, and may also match several speech segments within one speech file. The processing of generating the index can be called offline processing.
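The offline indexing flow of steps S2010-S2040 can be sketched as an inverted index keyed by feature, each entry recording the file, the start and end times, and the second confidence degree. `fake_extract` below is a made-up stand-in for the decoder, not the patent's implementation.

```python
from collections import defaultdict

def build_index(files, extract):
    """Offline indexing sketch (steps S2010-S2040): for each voice file,
    extract (feature, start, end, confidence) tuples and store them in an
    inverted index keyed by feature."""
    index = defaultdict(list)
    for path in files:
        for feature, start, end, conf in extract(path):
            index[feature].append((path, start, end, conf))
    return index

# Made-up decoder output for one file:
def fake_extract(path):
    return [("B-AA", 0.00, 0.12, 0.91), ("N-AA", 0.12, 0.25, 0.84)]

idx = build_index(["rec1.wav"], fake_extract)
print(idx["B-AA"])  # [('rec1.wav', 0.0, 0.12, 0.91)]
```

At query time, looking up a retrieval-input feature in `idx` returns all matching segments together with their stored second confidence degrees, which is why the confidences need not be recomputed online.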
At retrieval time, a retrieval-input speech feature is matched against the corresponding file speech features contained in the index. In the following, the file speech features are also called the speech features to be retrieved. For example, a retrieval-input phoneme feature corresponds to the phoneme features to be retrieved, a retrieval-input sub-word text unit feature corresponds to the sub-word text unit features to be retrieved, and a retrieval-input word feature corresponds to the word features to be retrieved. Table 2 shows an exemplary illustration of the retrieval correspondence according to an embodiment of the invention, where "E11 (phoneme, AM1, LM1)" denotes the phoneme feature extracted from the retrieval input using acoustic model 1 and language model 1, and "E11 (phoneme, dictionary)" denotes the phoneme feature converted using a dictionary when the user's input is a text input rather than a speech input. A "√" indicates that a correspondence exists and retrieval is needed, so that a retrieval list can be obtained for that feature. During retrieval, the confidence degrees have already been obtained through the index by the time the retrieval list is obtained, which improves the retrieval speed.
[Table 2 appears as an image in the original document.]
Table 2: retrieval correspondence
Fig. 3 is a flowchart describing another example of a step of the method 1000 shown in Fig. 1 according to an embodiment of the invention. An example of a concrete implementation of calculating, in step S1040 of Fig. 1, the retrieval score of every result record of a speech feature from the first confidence degree, the second confidence degree and the search engine score of that feature is described below with reference to Fig. 3. As shown in Fig. 3, in step S3010 a record is obtained from the retrieval result list and its retrieval score TSi is set to 0.0, where i is the position of the record in the list. In step S3020, the confidence degree of one retrieval-input speech feature is obtained, and in step S3025 the result record is scanned to check whether this retrieval-input speech feature is present in the result record. If the retrieval-input speech feature is present in the result record, the process advances to step S3030; otherwise it advances to step S3040.
In step S3030, the retrieval score is updated as TSi += Si × CLq × CLr, where Si is the search engine score of the result record, CLq is the first confidence degree of the retrieval-input speech feature, and CLr is the second confidence degree of the current result record for that retrieval-input speech feature. After that, in step S3045, it is judged whether there are more retrieval-input speech features. If so, the process returns to step S3020; otherwise it advances to step S3050.
In step S3040, the total score is updated as TSi += Si, where Si is the search engine score of the result record, and in step S3045 it is judged whether there are more retrieval-input speech features. If so, the process returns to step S3020; otherwise it advances to step S3050.
Next, in step S3050, TSi is saved, and in step S3055 it is judged whether there are more result records. If so, the process returns to step S3010; otherwise it advances to step S3060. Finally, in step S3060, the retrieval list is re-sorted using TSi. Alternatively, the re-sorting may also be performed after normalization.
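The scoring loop of Fig. 3 can be sketched as follows, applying exactly the rule stated above: TSi += Si × CLq × CLr when a query feature appears in the record, and TSi += Si when it does not. The record/feature data structures (dicts keyed by feature, mapping to the second confidence) are an assumed representation for illustration.

```python
def rescore(results, query_features):
    """Re-scoring sketch following Fig. 3 (steps S3010-S3060).
    results: list of {"score": Si, "features": {feature: CLr}} (assumed shape).
    query_features: {feature: CLq}, the first confidences of the query."""
    scored = []
    for rec in results:
        ts = 0.0  # S3010: initialize TSi
        for feat, cl_q in query_features.items():
            if feat in rec["features"]:                      # S3025/S3030
                ts += rec["score"] * cl_q * rec["features"][feat]  # Si*CLq*CLr
            else:                                            # S3040
                ts += rec["score"]                           # Si only
        scored.append((ts, rec))                             # S3050: save TSi
    scored.sort(key=lambda pair: pair[0], reverse=True)      # S3060: re-sort
    return scored

results = [
    {"score": 1.0, "features": {"B-AA": 0.9}},
    {"score": 1.0, "features": {}},
]
ranked = rescore(results, {"B-AA": 0.8})
print([round(ts, 2) for ts, _ in ranked])  # [1.0, 0.72]
```

As the flowchart prescribes, a matching feature contributes its engine score discounted by both confidences, while a non-matching feature contributes the undiscounted engine score.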
A concrete implementation of step S1060 in Fig. 1 according to another embodiment of the present invention is described below. Since the scoring standards of different search engines differ, the retrieval result lists need to be weighted according to the concrete requirements.
In step S1060, the re-sorted retrieval result lists are weighted, and then the weighted retrieval lists are merged to obtain the final retrieval list. The simplest merging method is linear combination. In linear combination, each feature search engine can be given a weight, and the weights are required to sum to 1.0. The linear combination can be carried out according to formula 2, where weight_i is the weight of the i-th feature search engine. In practical applications the weight values are often obtained by training; during training, attention is paid to whether precision or recall should be emphasized (for the same speech feature, precision and recall are inversely related). n is the number of feature search engines. If a result record appears in the i-th search engine, then Score_i is the score of this record in that search engine; otherwise, Score_i is 0.0. NewScore represents the new score.
NewScore = Σ_{i=1}^{n} weight_i · Score_i    (Formula 2)
Another merging method is CombMNZ, shown in formula 3. "CombMNZ" denotes the new score, "SUM(Individual Similarities)" denotes the sum of the scores of the same record in the different search engines, and "Number of Nonzero Similarities" denotes the number of feature search engines that contain this record.
CombMNZ = SUM(Individual Similarities) × Number of Nonzero Similarities    (Formula 3)
Other merging methods, such as Borda-fuse and Bayes-fuse, can also be used; the concrete merging method does not limit the scope of the present invention.
For different search engines, the new scores obtained by weighting follow a unified standard and are therefore comparable, so the weighted lists can be merged according to these scores to obtain the final retrieval list.
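Formulas 2 and 3 can be sketched side by side. Each per-engine result list is represented here as a dict mapping a record identifier to its normalized score; this representation, and the sample scores, are assumptions for illustration.

```python
def linear_merge(lists, weights):
    """Formula 2 sketch: NewScore = sum_i weight_i * Score_i.
    `lists[i]` maps record id -> normalized score from engine i; a record
    absent from an engine contributes Score_i = 0.0. Weights should sum to 1.0."""
    merged = {}
    for weight, scores in zip(weights, lists):
        for rec_id, s in scores.items():
            merged[rec_id] = merged.get(rec_id, 0.0) + weight * s
    return merged

def comb_mnz(lists):
    """Formula 3 sketch: CombMNZ = SUM(individual scores) * number of
    engines that gave the record a nonzero score."""
    merged = {}
    for scores in lists:
        for rec_id, s in scores.items():
            total, count = merged.get(rec_id, (0.0, 0))
            merged[rec_id] = (total + s, count + (1 if s > 0 else 0))
    return {rec_id: total * count for rec_id, (total, count) in merged.items()}

a = {"r1": 0.6, "r2": 0.4}  # engine 1 (normalized scores, made up)
b = {"r1": 0.8}             # engine 2
print(linear_merge([a, b], [0.5, 0.5]))
print(comb_mnz([a, b]))
```

Note how CombMNZ boosts "r1", which appears in both engines, relative to "r2", which appears in only one; this reward for cross-engine agreement is the method's distinguishing property.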
In addition, the speech retrieval method according to an embodiment of the invention is also applicable to multilingual speech retrieval. The voice record set is often very large and may sometimes contain speech in several languages. If language models trained on different languages are added when the speech features are extracted, speech features of the different languages can be obtained, and the multilingual problem can be handled easily by the speech retrieval method of the above embodiments. For example, when an acoustic model and a language model trained on Japanese speech are used to extract speech features from Japanese speech, the confidence degrees of the extracted features will be very high, whereas when the same models are used to process Chinese or English speech, the confidence degrees of the extracted features will be very low. Therefore, according to the retrieval scores, the Japanese result records will naturally be ranked near the front in the final merged list.
In addition, the speech retrieval method according to an embodiment of the invention can dynamically extend the languages and the vocabulary of a speech retrieval system. Extending the languages and the vocabulary is a very important task for a speech retrieval system: in practical applications, the voice record set is very complex and grows every day, and a common cause of deteriorating search results is that many new words or files in new languages have joined the voice record set. In the speech retrieval method according to the above embodiments, it suffices to train new acoustic models and language models with the new vocabulary or the new language, use the newly trained models to extract speech features from the speech files, and generate the index. When the user inputs a retrieval input containing the new words or the new language, the newly produced speech features join the retrieval, take effect in the re-sorting, and, after normalization and merging, contribute to the final result list. Since the confidence degrees obtained for speech features extracted in this way are high, their rank in the retrieval results is improved. Moreover, because the new acoustic models and language models used when new vocabulary or a new language is added are trained only on the new vocabulary or speech and thus cover fewer words, the processing speed is relatively fast.
A speech retrieval system according to an embodiment of the invention is described below with reference to Fig. 4. Fig. 4 is a block diagram of a speech retrieval system 400 according to an embodiment of the invention. As shown in Fig. 4, the speech retrieval system 400 of the present embodiment comprises an input module 410, a decoder module 420, a retrieval module 430, a re-sorting module 440 and a merging module 450. The modules of the speech retrieval system 400 respectively carry out the steps/functions of the speech retrieval method in the embodiments described above, and for brevity they are not described in detail again.
For example, the input module 410 receives a retrieval input from the user. The decoder module 420 uses multiple groups of acoustic models and language models to extract a plurality of retrieval-input speech features from the retrieval input and to obtain the first confidence degree of each retrieval-input speech feature. The retrieval module 430 retrieves with each of the plurality of retrieval-input speech features extracted by the decoder module 420, to obtain, for each retrieval-input speech feature, a retrieval result list together with the second confidence degree and the search engine score of every result record in the list. The re-sorting module 440 calculates the retrieval score of every result record of each speech feature from the first confidence degree, the second confidence degree and the search engine score of that feature, normalizes the retrieval scores, and re-sorts each retrieval result list according to the normalized retrieval scores. The merging module 450 merges the re-sorted retrieval result lists of all features to obtain the final retrieval list.
Fig. 5 is a block diagram of a speech retrieval system 500 according to another embodiment of the invention. In the speech retrieval system 500 shown in Fig. 5, components identical to those of the speech retrieval system 400 of Fig. 4 are denoted by the same reference numerals.
In the speech retrieval system 500 of this embodiment, the input module 510 not only receives the retrieval input from the user but also reads speech files from the voice record set. The decoder module 520 uses multiple groups of acoustic models and language models to extract a plurality of retrieval-input speech features from the retrieval input and obtain the first confidence degree of each retrieval-input speech feature, and also uses the multiple groups of acoustic models and language models to extract a plurality of file speech features from the speech files and calculate the confidence degree of each file speech feature as the second confidence degree. The speech retrieval system 500 may further comprise an index module 560. The index module 560 associates each file speech feature extracted by the decoder module with the speech file it belongs to, its position within that speech file and its second confidence degree, and stores the association as an index. The retrieval module 430 uses the index stored in the index module 560 to retrieve the plurality of retrieval-input speech features in the voice record set. In addition, similarly to the speech retrieval system 400, the speech retrieval system 500 also comprises the re-sorting module 440, which calculates the retrieval score of every result record of each speech feature from the first confidence degree, the second confidence degree and the search engine score of that feature, normalizes the scores and re-sorts each retrieval result list, and the merging module 450, which merges the re-sorted retrieval result lists of all features to obtain the final retrieval list; these are not described again here.
By utilizing a plurality of speech features, the speech retrieval system of the present invention can obtain better results than a speech retrieval system that uses only one speech feature. Moreover, by using the confidence degrees to reorder the retrieval results, the impact of low-confidence speech recognition results on speech retrieval is reduced.
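The confidence-based rescoring and reordering can be sketched as follows. The multiplicative combination of the three scores and all names are illustrative assumptions; the patent does not fix a particular formula:

```python
def rescore_and_reorder(first_conf, results):
    """Sketch of the reordering step: combine the first confidence degree
    of the query-side feature with each record's second confidence degree
    and search engine score, normalize over the list, and sort descending.
    The multiplicative combination is an assumption for illustration."""
    scored = [(doc, first_conf * second_conf * engine_score)
              for doc, second_conf, engine_score in results]
    total = sum(score for _, score in scored) or 1.0
    normalized = [(doc, score / total) for doc, score in scored]
    return sorted(normalized, key=lambda item: item[1], reverse=True)

# Hypothetical result records: (document, second confidence, engine score)
ranked = rescore_and_reorder(0.8, [("doc1", 0.9, 2.0), ("doc2", 0.5, 3.0)])
print(ranked)  # doc1 ranks first despite doc2's higher engine score
```

Note how the higher second confidence of doc1 outweighs doc2's higher raw search engine score, which is exactly the intended effect of down-weighting low-confidence recognition results.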
In addition, the acoustic models and language models, the voice record set, and the obtained index may be stored in an external memory. The speech retrieval system may also comprise an output module for outputting the retrieval results to the user.
In addition, the speech retrieval system in embodiments of the invention may be constituted in a centralized manner (as shown in Fig. 4 and Fig. 5, for example) or in a distributed manner. Fig. 6 shows a block diagram of a speech retrieval system 600 according to another embodiment of the invention. In Fig. 6, the speech retrieval system shown in Fig. 5 is constituted in a distributed manner. For example, the input module 510 is located in a distributed apparatus 610, while the decoding module 520, the indexing module 560, the retrieval module 430, the reordering module 440 and the merging module 450 are located in a distributed apparatus 620. The distributed apparatuses 610 and 620 are separate devices and may be remote from each other, connected for example via a network 630. Of course, the above modules may also be combined in other combinations or sub-combinations and distributed among remotely located devices.
In addition, a plurality of the speech retrieval systems shown in Fig. 4 and/or Fig. 5 may be interconnected via a network.
It should be noted that the steps of the methods shown in Figs. 1-3 need not be performed in the order shown. Some steps may be reversed or performed in parallel. For example, after a plurality of retrieval-input speech features are extracted from the retrieval input using multiple groups of acoustic models and language models and the first confidence degree of each retrieval-input speech feature is obtained in step S1020, steps S1030 to S1050 may be performed in parallel for the extracted retrieval-input speech features.
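Since steps S1030 to S1050 operate on each retrieval-input speech feature independently, they can run concurrently once step S1020 has produced the features. A sketch (the feature names and the placeholder function are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_rescore_reorder(feature):
    """Hypothetical stand-in for steps S1030-S1050 applied to a single
    retrieval-input speech feature: retrieve, rescore, and reorder."""
    return f"result list for {feature}"

# Features assumed to have been produced by step S1020
features = ["word_feature", "phoneme_feature", "subword_feature"]

# The per-feature pipelines share no state, so they may run in parallel;
# Executor.map preserves the input order in its results.
with ThreadPoolExecutor() as pool:
    result_lists = list(pool.map(retrieve_rescore_reorder, features))
print(result_lists)
```

The resulting per-feature lists are then merged by the final step regardless of the order in which the parallel tasks completed.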
Those of ordinary skill in the art will recognize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are executed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
It should be appreciated by those skilled in the art that various modifications, combinations, sub-combinations and replacements may be made to the present invention depending on design requirements and other factors, as long as they fall within the scope of the appended claims and their equivalents.

Claims (8)

1. the method for a speech retrieval may further comprise the steps:
Reception is from user's retrieval input;
Utilize to organize acoustic models and a plurality of retrieval input phonetic features of language model extraction from described retrieval is inputted more and obtain each and retrieve the first degree of confidence of inputting phonetic feature;
Respectively described a plurality of retrievals input phonetic features are retrieved, to obtain the second degree of confidence and the search engine score corresponding to every outcome record in the result for retrieval tabulation of each retrieval input phonetic feature and the tabulation of described result for retrieval;
Calculate according to the first degree of confidence, the second degree of confidence and the search engine score of each phonetic feature this phonetic feature every outcome record the retrieval score and carry out normalization;
According to normalized retrieval score, each result for retrieval tabulation is resequenced; And
The tabulation of result for retrieval after the rearrangement of each feature is merged to obtain final retrieval tabulation,
Wherein saidly respectively described a plurality of retrievals input phonetic features are retrieved, are comprised corresponding to the second degree of confidence and the search engine score of every outcome record in the result for retrieval tabulation of each retrieval input phonetic feature and the tabulation of described result for retrieval obtaining:
Utilize the described a plurality of retrieval input phonetic features of index retrieval in the voice record set,
Wherein obtaining described index comprises:
Read voice document from described voice record set;
The degree of confidence of utilizing described many group acoustic models and language model to extract a plurality of file voice features from described voice document and calculating each file voice feature is as described the second degree of confidence;
Each file voice feature and the voice document at its place, position and described the second degree of confidence in described voice document are associated; And
The storage related information is as index.
2. the method for claim 1 wherein saidly merges to obtain final retrieval tabulation to the tabulation of the result for retrieval after the rearrangement of each feature and comprises:
Tabulation is weighted to the result for retrieval after the described rearrangement; And
Each retrieval tabulation after the weighting is merged to obtain final retrieval tabulation.
3. The method of claim 1, wherein the retrieval-input speech features used for retrieval correspond to the file speech features contained in the index.
4. the method for claim 1, wherein
Described many group acoustic models and language model are corresponding at least a language.
5. the method for claim 1, wherein
Described many group acoustic models comprise mutually different vocabulary with language model.
6. the method for claim 1, wherein
Described retrieval is input as phonetic entry and/or literal input,
When described retrieval is input as literal when input, described the first degree of confidence is set to represent the value that the phonetic feature of the phonetic feature that extracts and actual input mates fully.
7. the method for claim 1, wherein said phonetic feature is acoustic feature, phoneme feature, inferior character features, word feature or voice identification result.
8. A speech retrieval system, comprising:
an input module for receiving a retrieval input from a user and for reading voice files from a voice record set;
a decoding module for extracting a plurality of retrieval-input speech features from the retrieval input by using multiple groups of acoustic models and language models and obtaining the first confidence degree of each retrieval-input speech feature, and for extracting a plurality of file speech features from the voice files by using the multiple groups of acoustic models and language models and calculating the confidence degree of each file speech feature as the second confidence degree;
an indexing module for associating each file speech feature extracted by the decoding module with the voice file to which it belongs, its position in the voice file, and the second confidence degree, and for storing the association information as an index;
a retrieval module for performing retrieval with each of the plurality of retrieval-input speech features extracted by the decoding module respectively, to obtain a retrieval result list corresponding to each retrieval-input speech feature as well as the second confidence degree and a search engine score of every result record in the retrieval result list, wherein the retrieval module retrieves the plurality of retrieval-input speech features in the voice record set by using the index stored in the indexing module;
a reordering module for calculating the retrieval score of every result record of each speech feature according to the first confidence degree, the second confidence degree and the search engine score of the speech feature, normalizing the retrieval scores, and reordering each retrieval result list according to the normalized retrieval scores; and
a merging module for merging the reordered retrieval result lists of the respective features to obtain a final retrieval list.
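The weighted merging recited in claims 1 and 2 can be sketched as follows. The additive accumulation of weighted scores and all names are illustrative assumptions, since the claims do not prescribe a particular merge formula:

```python
def merge_lists(feature_lists, weights):
    """Sketch of weighted merging: weight each feature's reordered
    result list, accumulate the weighted scores per document, and sort
    the combined list descending to obtain the final retrieval list."""
    combined = {}
    for results, weight in zip(feature_lists, weights):
        for doc, score in results:
            combined[doc] = combined.get(doc, 0.0) + weight * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical reordered, normalized per-feature result lists
list_a = [("doc1", 0.6), ("doc2", 0.4)]   # e.g. word-feature results
list_b = [("doc2", 0.7), ("doc3", 0.3)]   # e.g. phoneme-feature results
merged = merge_lists([list_a, list_b], [0.5, 0.5])
print(merged)  # doc2 ranks first: it is supported by both features
```

A document returned by several features accumulates evidence from each of them, which is why doc2 overtakes doc1 in the merged list.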
CN 201010212269 2010-06-29 2010-06-29 Speech retrieval method and system Active CN102314876B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010212269 CN102314876B (en) 2010-06-29 2010-06-29 Speech retrieval method and system


Publications (2)

Publication Number Publication Date
CN102314876A CN102314876A (en) 2012-01-11
CN102314876B true CN102314876B (en) 2013-04-10

Family

ID=45427986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010212269 Active CN102314876B (en) 2010-06-29 2010-06-29 Speech retrieval method and system

Country Status (1)

Country Link
CN (1) CN102314876B (en)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365849B (en) * 2012-03-27 2016-06-15 富士通株式会社 Keyword retrieval method and apparatus
CN103077718B (en) * 2013-01-09 2015-11-25 华为终端有限公司 Method of speech processing, system and terminal
CN103559881B (en) * 2013-11-08 2016-08-31 科大讯飞股份有限公司 Keyword recognition method that languages are unrelated and system
CN104978366B (en) * 2014-04-14 2018-09-25 深圳市北科瑞声科技股份有限公司 Voice data index establishing method based on mobile terminal and system
CN105551485B (en) * 2015-11-30 2020-04-21 讯飞智元信息科技有限公司 Voice file retrieval method and system
CN107562220A (en) * 2017-08-15 2018-01-09 百度在线网络技术(北京)有限公司 Input recommendation method, apparatus, computer equipment and the computer-readable recording medium of information
CN108335696A (en) 2018-02-09 2018-07-27 百度在线网络技术(北京)有限公司 Voice awakening method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1409842A (en) * 1999-10-28 2003-04-09 佳能株式会社 Pattern matching method and apparatus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9405823B2 (en) * 2007-07-23 2016-08-02 Nuance Communications, Inc. Spoken document retrieval using multiple speech transcription indices


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Kenney Ng, "Towards Robust Methods for Spoken Document Retrieval," ICSLP-1998, Paper 1088, 1998. *
M. A. Siegler et al., "Experiments in Spoken Document Retrieval at CMU," Text REtrieval Conference (TREC-6), 1997. *
Pedro J. Moreno et al., "A Boosting Approach for Confidence Scoring," EUROSPEECH-2001, 2001. *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI617928B (en) * 2014-02-25 2018-03-11 英特爾股份有限公司 Systems, apparatuses, and methods for feature searching
US10282465B2 (en) 2014-02-25 2019-05-07 Intel Corporation Systems, apparatuses, and methods for deep learning of feature detectors with sparse coding
US10296660B2 (en) 2014-02-25 2019-05-21 Intel Corporation Systems, apparatuses, and methods for feature searching

Also Published As

Publication number Publication date
CN102314876A (en) 2012-01-11

Similar Documents

Publication Publication Date Title
CN102314876B (en) Speech retrieval method and system
Schuster et al. Japanese and korean voice search
JP6813591B2 (en) Modeling device, text search device, model creation method, text search method, and program
Bangalore et al. Computing consensus translation from multiple machine translation systems
CN107818164A (en) A kind of intelligent answer method and its system
CN107391614A (en) A kind of Chinese question and answer matching process based on WMD
Parlak et al. Spoken term detection for Turkish broadcast news
CN102693279B (en) Method, device and system for fast calculating comment similarity
WO2003010754A1 (en) Speech input search system
CN101952824A (en) Method and information retrieval system that the document in the database is carried out index and retrieval that computing machine is carried out
CN104199965A (en) Semantic information retrieval method
CN109949799B (en) Semantic parsing method and system
Chen et al. Automatic key term extraction from spoken course lectures using branching entropy and prosodic/semantic features
CN102246169A (en) Assigning an indexing weight to a search term
Hakkinen et al. N-gram and decision tree based language identification for written words
CN110164447A (en) A kind of spoken language methods of marking and device
Sen et al. Bangla natural language processing: A comprehensive analysis of classical, machine learning, and deep learning-based methods
KR101410601B1 (en) Spoken dialogue system using humor utterance and method thereof
Mandal et al. Language identification of bengali-english code-mixed data using character & phonetic based lstm models
Nishimura et al. Automatic n-gram language model creation from web resources
CN103336803B (en) A kind of computer generating method of embedding name new Year scroll
HaCohen-Kerner et al. Language and gender classification of speech files using supervised machine learning methods
Su et al. Improved TF-IDF weight method based on sentence similarity for spoken dialogue system
Khoury Microtext normalization using probably-phonetically-similar word discovery
Dinarelli et al. Concept segmentation and labeling for conversational speech

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant