CN101510222A - Multilayer index voice document searching method and system thereof - Google Patents


Info

Publication number
CN101510222A
CN101510222A
Authority
CN
China
Prior art keywords
speech
document
voice
index
phone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA200910131828XA
Other languages
Chinese (zh)
Other versions
CN101510222B (en)
Inventor
吴玺宏
迟惠生
曲天书
万广鲁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN200910131828XA priority Critical patent/CN101510222B/en
Publication of CN101510222A publication Critical patent/CN101510222A/en
Application granted granted Critical
Publication of CN101510222B publication Critical patent/CN101510222B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a multilayer-index voice document retrieval method and system, belonging to the technical field of information retrieval. The method comprises the following steps: (1) extracting features from a multimedia stream to obtain a speech feature sequence; (2) searching the speech feature sequence with a speech recognition decoder to obtain a word lattice and a best recognition result; (3) constructing a two-layer word and syllable index database from the word lattice and the best recognition result; and (4) searching the index database for documents relevant to a given query term and returning them to the user. The system comprises an automatic speech recognition module for automatically transcribing the words in voice documents; an automatic index construction module for building the dual index over the recognition results; and a voice document retrieval module for finding the documents relevant to a given query term in the index database and returning them to the user. Compared with the prior art, the method and system realize fast and accurate search over multimedia data.

Description

A multilayer-index voice document retrieval method and system
Technical field
The present invention relates to a multilayer-index voice document retrieval method and system that use the speech information in multimedia documents for automatic cataloguing and retrieval. It belongs to the technical field of information retrieval and can be applied at television stations, radio stations, multimedia web sites, and the like.
Background technology
With the rapid development of digital multimedia technology, the amount of multimedia information people face is growing explosively. As Internet speeds keep improving, more and more multimedia data exists on the Internet, and the construction of digital libraries is maturing day by day. How to obtain the required knowledge quickly and accurately from massive multimedia data is therefore a problem demanding a prompt solution.
In recent years, information retrieval technology has become an important means for people to obtain information and has profoundly changed how humans acquire knowledge. Most current multimedia retrieval systems still depend on metadata attached to the multimedia files or on associated text, and do not use the content information contained in the media itself; this wastes the bulk of the information preserved in the files. For the many multimedia collections that carry no text at all, such methods are ineffective.
Among the various multimedia forms, speech is the closest to text. People have always conveyed and preserved information through speech, and since Edison invented the phonograph an enormous amount of spoken-document data has accumulated. Yet for lack of a fast and accurate retrieval method, a large number of spoken documents cannot be used effectively.
The voice document retrieval technology that the present invention concerns aims precisely to exploit the content information in spoken documents, so that users can find the documents they need efficiently and accurately. Because of the close relation between speech and text, voice document retrieval has significant value in multimedia retrieval.
Most current voice document retrieval systems build their index with words as the indexing unit, which suffers from the out-of-vocabulary (OOV) problem: when a query term is an OOV word, no corresponding document can be found. Some researchers have recently begun building indexes from subword units instead; using subword units alone, however, the returned documents are not accurate enough. The present invention therefore proposes a method combining words and phones, so that both the precision and the recall of retrieval perform well.
Summary of the invention
The object of the present invention is to provide a multilayer-index voice document retrieval method and system. The invention introduces a Chinese speech recognition decoder based on weighted finite-state transducers (WFSTs); the decoder uses static network expansion, which raises recognition speed so that large amounts of speech data can be processed quickly, enabling fast and accurate search over multimedia documents. The system can also be used to detect sensitive content in large audio and video collections quickly and accurately, and in the construction of digital libraries.
The main current demands on voice document retrieval are automatic cataloguing of spoken documents and fast retrieval over them. Given the current state of the art, meeting these demands faces the following technical difficulties, which are exactly the problems this work focuses on solving; solving them scientifically and soundly is also where the innovation and contribution of this work lie.
Obtaining voice content quickly and accurately
Automatic recognition of speech content is an extremely difficult task. On the one hand, the acoustic environment in audio varies widely: speech in quiet conditions, speech in outdoor noise, speech over background music, and so on. On the other hand, the speakers in broadcast programs also vary widely, from standard-pronouncing announcers to ordinary people, including speakers with dialectal accents. Segmenting, classifying, and clustering the speech in a program according to acoustic environment and speaker is therefore an essential step before automatic transcription. For such a complex and changeable task it is vital to design a robust, speaker-independent large-vocabulary continuous speech recognition system; a proofreading aid based on confidence scoring further guarantees the accuracy of the transcripts. For practicality, the present invention also optimizes the continuous speech recognition system so that it runs in real time (processing time no greater than the duration of the speech itself) with only a slight drop in accuracy.
Indexing voice content quickly
As speech databases grow and audio data on the Internet becomes ubiquitous, multimedia files must be catalogued and organized automatically and quickly so that users can search them conveniently. Unlike indexing text, the output of speech recognition is not entirely correct; this must be taken into account during index construction by choosing a suitable index representation. The index database must be built quickly and automatically, and it must also support fast retrieval.
Finding the spoken documents of interest quickly
Facing massive amounts of speech data, finding the needed spoken documents becomes harder and harder; ingenious and fast methods are required so that the corresponding documents can be located quickly in the index database. In addition, the retrieved documents must be accurate, i.e., strongly correlated with the keywords. The key issue is how to represent the similarity between query terms and documents exactly.
The technical scheme of the present invention is as follows:
1. Automatic speech recognition module
The function of automatic speech recognition is to transcribe the words in an audio file automatically, replacing the traditional "listen and write" way of producing transcripts. Its input is a multimedia file containing audio, or an audio file; its output is the corresponding text. The module performs the following sequential steps:
(1) extract the audio stream from the media stream;
(2) analyze the audio stream and segment it automatically into small units belonging to different acoustic environments;
(3) recognize the clustered segments with a large-vocabulary continuous speech recognition system, outputting a word lattice and a best recognition result. The lattice preserves, for each word, its start time, end time, and model score.
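The lattice entries described in step (3) can be sketched as follows. This is a minimal illustration, not the patent's data structure: the patent only specifies that each word in the lattice carries a start time, an end time, and a model score, and all names here are invented.

```python
from dataclasses import dataclass

@dataclass
class LatticeArc:
    """One word hypothesis in the recognizer's output lattice."""
    word: str
    start: float   # start time in seconds
    end: float     # end time in seconds
    score: float   # combined acoustic + language model log score

# A toy lattice: competing hypotheses over the same time span.
lattice = [
    LatticeArc("北京", 0.00, 0.42, -11.2),
    LatticeArc("背景", 0.00, 0.42, -13.7),   # acoustically confusable rival
    LatticeArc("大学", 0.42, 0.90, -9.8),
]

def one_best(arcs):
    """Keep, for each time span, only the highest-scoring arc."""
    best = {}
    for a in arcs:
        key = (a.start, a.end)
        if key not in best or a.score > best[key].score:
            best[key] = a
    return [best[k] for k in sorted(best)]

print([a.word for a in one_best(lattice)])  # ['北京', '大学']
```

In the real system the 1-best result is produced by the decoder itself; this per-span selection merely illustrates the relationship between the lattice and the best result.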
2. Automatic spoken-document index construction module
This module builds the index over the speech recognition results, i.e., over the best recognition result and the word lattice output by the recognizer, using suitable techniques so that the retrieval system can search quickly. In the present system the recognition results are indexed at both the word and the phone level. The concrete steps are:
(1) compute a confidence score for each word in the output lattice;
(2) for each word, index its term frequency (TF), inverse document frequency (IDF), start time, and end time;
(3) from the best word-level recognition result and the pronunciation dictionary, obtain the phone representation sequence of the result;
(4) build an index over this phone sequence.
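The dual-layer construction in steps (1)-(4) can be sketched as below. This is an illustrative simplification under stated assumptions: lattice confidence scoring and time stamps are omitted, the phone layer is keyed on phone bigrams, and all identifiers (`build_dual_index`, `tf_idf`, the toy documents and lexicon) are invented, not from the patent.

```python
import math
from collections import defaultdict

def build_dual_index(docs, lexicon):
    """Build a word-layer and a phone-layer index.

    `docs` maps a document id to its 1-best word sequence;
    `lexicon` maps a word to its phone list.
    """
    word_index = defaultdict(dict)   # word -> {doc_id: term frequency}
    phone_index = defaultdict(set)   # phone bigram -> set of doc_ids
    for doc_id, words in docs.items():
        for w in words:
            word_index[w][doc_id] = word_index[w].get(doc_id, 0) + 1
        # phone layer: index the phone string of the whole 1-best result
        phones = [p for w in words for p in lexicon.get(w, [])]
        for i in range(len(phones) - 1):
            phone_index[(phones[i], phones[i + 1])].add(doc_id)
    return word_index, phone_index

def tf_idf(word_index, word, doc_id, n_docs):
    """TF x IDF weight of `word` in `doc_id`, as indexed in step (2)."""
    postings = word_index.get(word, {})
    if doc_id not in postings:
        return 0.0
    return postings[doc_id] * math.log(n_docs / len(postings))

docs = {"d1": ["北京", "大学", "北京"], "d2": ["上海", "大学"]}
lexicon = {"北京": ["b", "ei3", "j", "ing1"], "大学": ["d", "a4", "x", "ve2"],
           "上海": ["sh", "ang4", "h", "ai3"]}
wi, pi = build_dual_index(docs, lexicon)
print(round(tf_idf(wi, "北京", "d1", len(docs)), 3))  # 2 * ln(2/1) ≈ 1.386
```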
3. Voice document retrieval module
The retrieval module, given a query term, quickly finds the relevant documents in the index database and returns them to the user:
(1) analyze the input query term to obtain the search keyword sequence;
(2) feed the keyword sequence to the system and obtain the returned results;
(3) if the keyword sequence contains out-of-vocabulary words, convert them to phone-sequence form and retrieve the documents through the phone-layer index.
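The OOV fallback in step (3) can be sketched as follows, under the same simplifying assumption of a phone-bigram index; the toy indexes and every name here are illustrative inventions, not the patent's implementation.

```python
def search(query_words, word_index, phone_index, lexicon):
    """Retrieve documents for a query; fall back to the phone layer
    for words absent from the word index."""
    results = set()
    for w in query_words:
        if w in word_index:                       # in-vocabulary: word layer
            results |= set(word_index[w])
        elif w in lexicon:                        # OOV for the index: phone layer
            phones = lexicon[w]
            bigrams = list(zip(phones, phones[1:]))
            hits = [phone_index.get(bg, set()) for bg in bigrams]
            if hits and all(hits):
                results |= set.intersection(*hits)
    return results

# Toy indexes; in the system these come from the index construction module.
word_index = {"大学": {"d1", "d2"}}
phone_index = {("b", "ei3"): {"d1"}, ("ei3", "j"): {"d1"}, ("j", "ing1"): {"d1"}}
lexicon = {"北京": ["b", "ei3", "j", "ing1"]}   # word missing from word_index

print(search(["北京"], word_index, phone_index, lexicon))  # {'d1'}
```

Requiring every query bigram to match approximates exact phone-string lookup; a real system would score partial matches, e.g. with the phone confusion matrix of Fig. 11.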
Description of drawings
Fig. 1 is the flow chart of the voice document retrieval system;
Fig. 2 is the main interface of the demo retrieval system;
Fig. 3 is the speech recognition framework;
Fig. 4 is the computation of Mel-frequency cepstral coefficients;
Fig. 5 is a hidden Markov model (HMM) schematic;
Fig. 6 is the overall structure of the decoder based on weighted finite-state transducers (WFSTs);
Fig. 7 is a schematic of a WFST incorporating language model information;
Fig. 8 is a schematic of a WFST incorporating language model, pronunciation dictionary, and acoustic model information;
Fig. 9 is the Viterbi algorithm;
Fig. 10 is the inverted-entry structure of a word confusion network;
Fig. 11 is a phone confusion matrix schematic;
Fig. 12 is a schematic of four-phone-string keyword extraction.
Embodiment
1. Automatic speech recognition module
The automatic speech recognition framework is shown in Fig. 3; each module of the system is introduced in detail below.
1.1 Feature extraction
The purpose of feature extraction is to derive features that better capture the stable, useful information in speech for automatic recognition. A fundamental characteristic of the speech signal is its short-time stationarity, and short-time analysis is the basis of speech feature extraction. Before extracting features the signal is usually pre-emphasized to boost its high-frequency components and compensate for the channel's attenuation of them; the signal is then split into frames (typically 25 ms long with a 10 ms shift) and smoothed with a Hamming window.
The acoustic feature most commonly used for automatic speech recognition is the Mel-frequency cepstral coefficient (MFCC). Motivated by research findings on the human auditory system, this feature is derived from human auditory perception and better matches the nonlinear psychoacoustics of hearing. The MFCC computation is shown in Fig. 4.
The features adopted in the present system are 12 MFCC cepstral coefficients plus energy, together with their first- and second-order differences, forming a 39-dimensional feature vector. In addition, to remove convolutional channel noise, the system applies cepstral mean normalization (CMN) on top of the MFCC features to compensate for the channel.
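The assembly of the 39-dimensional vector with CMN can be sketched as below. This is a minimal illustration under stated assumptions: the MFCC computation itself (Fig. 4) is omitted, a simple two-point difference stands in for the regression-window deltas real front ends use, and all function names are invented.

```python
def deltas(frames):
    """First-order differences of a per-frame feature sequence
    (two-point difference, edges clamped)."""
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, len(frames) - 1)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out

def cmn(frames):
    """Cepstral mean normalization: subtract the per-dimension mean
    over the utterance to compensate for convolutional channel effects."""
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

def make_39d(static_13):
    """static_13: per-frame 13-dim vectors (12 MFCCs + energy).
    Returns 39-dim vectors: CMN-normalized statics + delta + delta-delta."""
    static_13 = cmn(static_13)
    d1 = deltas(static_13)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static_13, d1, d2)]

frames = [[float(t + d) for d in range(13)] for t in range(4)]  # dummy MFCCs
feats = make_39d(frames)
print(len(feats), len(feats[0]))  # 4 39
```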
1.2 Acoustic model and language model
A statistical automatic speech recognition system recognizes speech with pattern recognition methods built on statistical models. These models are collectively called the knowledge base of the recognizer and comprise the acoustic model (AM), the language model (LM), and the pronunciation model (PM).
An automatic speech recognition system usually assumes the speech signal to be an encoding of a sequence of symbols, so recognizing a given signal amounts to a decoding process. To identify the underlying symbol sequence effectively, the continuous waveform is first converted, exploiting the short-time stationarity of speech, into a series of equal-length discrete vectors by the feature extraction module, and this vector series is assumed to characterize the corresponding waveform exactly. The recognizer's task is then to realize the mapping from speech feature vectors to the underlying symbol sequence. In this process the acoustic model uses large amounts of speech data to model the acoustic differences between symbol units, while the language model defines the language of symbol sequences, modeling what the recognizer is allowed to output. Moreover, for a given language the symbol units are defined at different levels - in Chinese, words, characters, syllables, and initials/finals - and the pronunciation model realizes the mapping between these levels.
As stated above, the acoustic model captures the acoustic differences between symbol units. The hidden Markov model (HMM) is currently the most popular way of modeling the time-varying characteristics of the speech signal. It describes the statistics of the signal through two interrelated stochastic processes: a hidden Markov chain with a finite number of states, and the stochastic process of observation vectors associated with each state of the chain. The features of a segment of a time-varying signal such as speech are described by the observation process of the corresponding state, the evolution of the signal over time is described by the transition probabilities between states of the hidden chain, and the motion of the vocal organs is hidden behind the state chain; this is the fundamental reason why the statistically based HMM has become a powerful tool for speech signal processing. Because speech is a time series, a left-to-right model topology is generally adopted, as shown in Fig. 5. The parameters of an HMM comprise the initial state distribution, the state transition probabilities, and the observation probability distribution of each state (usually modeled with a GMM). The classical algorithm for estimating these parameters, i.e., training the HMM, is the Baum-Welch algorithm, a recursive procedure also called the forward-backward algorithm; it is based on the maximum likelihood (ML) criterion and is an instance of the EM algorithm.
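The forward recursion underlying the forward-backward algorithm can be illustrated on a toy left-to-right HMM. This is a generic textbook sketch, not the patent's training code; the transition and observation values are arbitrary, and a discrete observation table stands in for the GMM output densities.

```python
def forward(init, trans, obs_prob, T):
    """Forward pass: alpha[t][j] = P(o_1..o_t, q_t = j).
    trans[i][j] is the i->j transition probability;
    obs_prob[j][t] is the probability of the t-th observation in state j."""
    n = len(init)
    alpha = [[init[j] * obs_prob[j][0] for j in range(n)]]
    for t in range(1, T):
        alpha.append([
            sum(alpha[t - 1][i] * trans[i][j] for i in range(n)) * obs_prob[j][t]
            for j in range(n)
        ])
    return alpha

# 3-state left-to-right topology: self-loops plus forward jumps only.
init = [1.0, 0.0, 0.0]
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 1.0]]
obs_prob = [[0.9, 0.2, 0.1],   # P(o_t | state j) for a fixed toy observation sequence
            [0.1, 0.7, 0.3],
            [0.0, 0.1, 0.6]]
alpha = forward(init, trans, obs_prob, 3)
print(round(sum(alpha[-1]), 6))  # total likelihood = 0.11772
```

Baum-Welch combines this forward pass with a symmetric backward pass to re-estimate the parameters under the ML criterion.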
The following table lists some parameters and settings of the acoustic model training for automatic speech recognition, including the training data.
Training of the acoustic model
Modeling unit: 204 toned monophone units; cross-word context-dependent triphone modeling.
Model structure: each modeling unit comprises 3 left-to-right states, with transitions permitted between states.
Model clustering: decision-tree-based state clustering, yielding 8000 state classes (senones).
Observation output distributions: each senone is modeled with a mixture of 32 Gaussian components.
Training data: about 720 hours of speech in total, comprising the 863 standard databases; the 863 dialect databases (Chongqing, Xiamen, Shanghai, Guangzhou, Harbin); and speech databases recorded by the laboratory.
Model training: Baum-Welch algorithm with parallel training on a cluster; gender-dependent acoustic modeling, training separate male and female acoustic models.
In a speech recognition system, the language model supplies in advance the prior probability of the sequences the decoder may output, which is important for limiting the search space and resolving ambiguity during decoding. The widely used language model is the N-gram, which assumes that the probability of the current word depends only on the preceding N-1 words, called the history of the current word. As N grows, the number of model parameters rises sharply and more training text is required. Considering data sparsity and trainability, N is usually set to 3, giving the trigram language model, which can be regarded as a second-order Markov process. Language model training estimates the model parameters by maximum likelihood from the counts of word triples in the corpus. Even with N = 3, data sparsity still causes some word sequences never to appear in the corpus, so the counts must be smoothed; common smoothing methods include backoff, discounting, Good-Turing smoothing, and Witten-Bell smoothing. The following table lists some parameters and settings of the language model training, including the training data.
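The trigram estimation and backoff described above can be illustrated with a toy example. A fixed-penalty ("stupid") backoff stands in for the discounted backoff named in the text, since proper Good-Turing discounting would obscure the sketch; the 0.4 penalty, corpus, and all names are illustrative assumptions.

```python
from collections import Counter

def train_trigram(corpus):
    """Maximum-likelihood unigram/bigram/trigram counts over sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            uni[toks[i]] += 1
            bi[(toks[i - 1], toks[i])] += 1
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
    return uni, bi, tri

def p_backoff(w, h2, h1, uni, bi, tri):
    """P(w | h2 h1): trigram ML estimate, backing off to the bigram and
    unigram with a fixed 0.4 penalty when the history is unseen."""
    if tri[(h2, h1, w)] > 0:
        return tri[(h2, h1, w)] / bi[(h2, h1)]
    if bi[(h1, w)] > 0:
        return 0.4 * bi[(h1, w)] / uni[h1]
    return 0.4 * 0.4 * uni[w] / sum(uni.values())

corpus = [["我", "爱", "北京"], ["我", "爱", "上海"]]
uni, bi, tri = train_trigram(corpus)
print(p_backoff("北京", "我", "爱", uni, bi, tri))  # 0.5
```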
Training of the language model
Corpus: 1.6 GB of text in total, comprising nine years of People's Daily text; Xinhuanet text; and web data (from the Sina web site and some Olympic and travel-related sites).
Preprocessing: HTML and XML conversion, text normalization, corpus balancing, word segmentation.
Smoothing method: backoff with Good-Turing smoothing.
Model scale: a 250 MB language model comprising 64,275 unigrams, 19,966,258 bigrams, and 24,724,142 trigrams.
Model performance: repeated tests show an average branching factor of 300 for this language model.
The pronunciation model, i.e., the pronunciation dictionary, establishes the mapping between linguistic units of different levels. In the present system the acoustic model captures the differences between pronunciation units and the language model describes word-level and semantic information, while the pronunciation dictionary realizes the one-to-one or one-to-many mapping from "word" to "sound". We built a dictionary of 64,275 entries, each with a single pronunciation, and ensured that its entries are consistent with those of the language model. Part of the dictionary is shown below:
北京大学 (Peking University)  b ei3 j ing1 d a4 x ve2
北京市 (Beijing)  b ei3 j ing1 sh ib4
挑战杯 (Challenge Cup)  t iao3 zh an4 b ei1
The left column contains the entries, corresponding to the entries of the language model; the right column gives each entry's pronunciation, corresponding to the modeling units of the acoustic model. One entry may have several pronunciations, and a probability may be specified for each pronunciation variant.
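The word-to-phone mapping realized by the dictionary can be sketched as follows, mirroring the example entries above; the representation of entries as (phone list, probability) pairs and all function names are illustrative assumptions.

```python
def to_phone_sequence(words, pron_dict):
    """Map a word sequence to its phone sequence via the pronunciation
    dictionary, taking the most probable variant for multi-pronunciation
    entries."""
    phones = []
    for w in words:
        prons = pron_dict[w]                       # list of (phones, probability)
        best = max(prons, key=lambda p: p[1])
        phones.extend(best[0])
    return phones

pron_dict = {
    "北京大学": [(["b", "ei3", "j", "ing1", "d", "a4", "x", "ve2"], 1.0)],
    "北京市":   [(["b", "ei3", "j", "ing1", "sh", "ib4"], 1.0)],
    "挑战杯":   [(["t", "iao3", "zh", "an4", "b", "ei1"], 1.0)],
}
print(to_phone_sequence(["北京大学"], pron_dict))
```

This is the same conversion used by the index construction module to obtain the phone representation of the best recognition result.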
1.3 Speech recognition decoder based on weighted finite-state transducers
The final search task of speech recognition is carried out by the decoder, a crucial part of the recognition system. The decoder's task is to use the acoustic model, language model, and related knowledge to search for the optimal path given the input feature sequence, obtaining the best word sequence for the speech.
The present system adopts a decoder based on weighted finite-state transducers (WFSTs), which remedies several drawbacks of decoders based on a lexical prefix tree. First, prefix-tree decoders expand the tree dynamically, which carries a large time cost; as speech recognition moves toward applications, usage environments place ever higher demands on recognition speed, so avoiding the cost of building the search network at run time and reducing the time complexity of recognition has become an important benchmark for decoders. Second, prefix-tree decoders are not very extensible: many recent research results cannot be applied directly in a single decoding pass and must instead be applied by rescoring the word lattice of a first pass, which makes the whole system complicated and dilutes the advantages of many techniques in the second pass.
A WFST-based decoder first represents the various statistical models as WFSTs and merges them, then optimizes the search network using automata theory, thereby building a search network with the trigram grammar applied statically while controlling the network's size through optimization algorithms. Because this avoids the cost of operating a dynamic search network, it holds a speed advantage over dynamic network expansion.
Fig. 6 shows the overall structure of the decoder, which divides into two main modules: search network construction and search. The construction module expands the search network statically from the input models: the context-dependent hidden Markov acoustic model, the pronunciation dictionary, the N-gram language model, and so on. Network expansion comprises two main steps, model merging and network optimization; the optimization is realized mainly through determinization, minimization, and factoring of the network. After these steps a fully expanded and optimized static search network is obtained.
The search module then searches this network for the input speech to find the optimal target word string. It adopts an online frame-synchronous Viterbi beam search: for each input frame the algorithm performs acoustic model scoring, path expansion, and path pruning in turn, and traces back at intervals to produce recognition results in two forms, the best result and the word lattice.
Because network construction and search are independent, adding further knowledge sources only requires changing the construction part; other knowledge sources or methods can thus be added flexibly, improving the system's extensibility.
WFST definitions
The definitions of the weighted finite-state acceptor and the weighted finite-state transducer are both based on the semiring algebraic structure. A semiring (K, ⊕, ⊗, 0, 1) is a ring structure that may lack subtraction: it has two closed associative operations ⊕ and ⊗ on K, whose identity elements are 0 and 1 respectively, and ⊗ distributes over ⊕. For example, (ℕ, +, ×, 0, 1) is a semiring. In speech recognition the weights usually represent probabilities, so the appropriate structure is the probability semiring (ℝ₊, +, ×, 0, 1).
A weighted finite-state acceptor (WFSA) is an extension of the finite-state automaton. A WFSA over a semiring K is A = (Σ, Q, E, i, F, λ, ρ), comprising an alphabet Σ, a finite state set Q, a finite transition set E, an initial state i ∈ Q, a set of final states F ⊆ Q, an initial weight λ, and a final weight function ρ. A transition t = (t⁻, l(t), w(t), t⁺) ∈ E represents a move from source state t⁻ to destination state t⁺, carrying a label l(t) and a weight w(t); in speech recognition the weight generally represents a probability or the logarithm of a probability.
A path in a WFSA is a connected sequence of transitions t₁…tₙ satisfying tᵢ⁺ = tᵢ₊₁⁻ for i = 1…n−1. A path that starts at the initial state i and reaches some final state f ∈ F is a successful path. The label of a path is the in-order concatenation of the labels of all its transitions, and its weight is the ⊗-product of the initial weight, the transition weights, and the final weight of its end state: λ ⊗ w(t₁) ⊗ … ⊗ w(tₙ) ⊗ ρ(f).
A weighted finite-state transducer (WFST) extends the WFSA further: the single label on each transition of a WFSA is replaced by a pair of an input symbol i and an output symbol o, so that a WFST defines a function from symbol sequences to weights. Formally, a WFST over a semiring K is T = (Σ, Ω, Q, E, i, F, λ, ρ), comprising an input alphabet Σ, an output alphabet Ω, a finite state set Q, a transition set E ⊆ Q × (Σ ∪ {ε}) × (Ω ∪ {ε}) × K × Q, an initial state i ∈ Q, a set of final states F ⊆ Q, an initial weight λ, and a final weight function ρ. A transition t = (t⁻, lᵢ(t), lₒ(t), w(t), t⁺) represents a move from source state t⁻ to destination state t⁺, carrying an input symbol lᵢ(t), an output symbol lₒ(t), and a weight w(t). Paths, path input labels, and path weights are defined as for the WFSA; the output label of a path is the in-order concatenation of the output symbols of all its transitions.
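The path-weight and output-label definitions can be made concrete with a tiny example in the probability semiring (⊕ = +, ⊗ = ×). The tuple encoding of transitions and the go/bus toy path below are illustrative assumptions, not taken from the patent's figures.

```python
def run_path(path, lam, rho):
    """Weight and output string of one successful WFST path in the
    probability semiring. Each transition is (src, in, out, weight, dst);
    "eps" stands for the empty output symbol ε."""
    w = lam                          # start from the initial weight λ
    out = []
    state = path[0][0]
    for (src, i, o, wt, dst) in path:
        assert src == state, "transitions must be connected"
        w *= wt                      # ⊗ accumulates along the path
        if o != "eps":
            out.append(o)            # concatenate non-ε output symbols
        state = dst
    return w * rho[state], out       # terminate with the final weight ρ(f)

# Toy lexicon-plus-LM path: phones in, the word "go" out, weight 0.5 ~ P(go).
path = [
    (0, "g", "eps", 1.0, 1),
    (1, "ou", "go", 0.5, 2),
]
w, out = run_path(path, lam=1.0, rho={2: 1.0})
print(w, out)  # 0.5 ['go']
```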
Each module of the WFST-based speech recognition decoder is introduced below.
Model merging
The task of model merging is to fuse the language model, pronunciation dictionary, acoustic model, and related information into an initial search network.
This process is illustrated with an example. Suppose the system dictionary contains two words, go and bus, and the language model is a bigram. Fig. 7 shows the WFST containing only the language model: P(bus) denotes the unigram score of the word bus, B(bus) denotes its backoff weight, and P(bus|go) denotes the bigram score of bus with go as its predecessor. As can be seen, this search network describes purely the language model knowledge.
Fig. 8 adds the pronunciation dictionary and acoustic model information; its difference from Fig. 7 is that the transition unit is the phone rather than the word. Each word is mapped through the pronunciation dictionary to its phone sequence, and thus to the corresponding acoustic model units, introducing the acoustic model information. Fig. 8 therefore merges linguistic and acoustic information into one complete search network, whose weights are the fused language and acoustic scores, whose input symbols are acoustic model units, and whose output symbols are words.
Definiteization of network
Adopt each step synthetic be optimized of the definite words of weighting algorithm to search network in this system, the fundamental purpose of definiteization is the path of repeating for removal in synthetic network, and significantly reduces network size with this.Thereby the time complexity when reducing search.
Network minimization

After the composed network has been determinized, a minimization operation is applied. In this system, besides reducing the number of states in the search network, minimization also helps recognition performance to a certain extent.

The key step of minimization is pushing weights toward the initial state, i.e. incorporating linguistic and acoustic scores into the network as early as possible. This makes path scores during decoding more accurate and thereby improves recognition performance.
Network factorization

A path in which every state except the first and the last has exactly one incoming edge and one outgoing edge is called a linear path. After the operations above, the network may contain many linear paths. Factorization replaces each linear path with a single new edge and keeps a mapping table that records the mapping between each new edge and the linear path it replaces. This significantly reduces the scale of the network and its space complexity.
After the above operations, the system obtains a search network that can be used for speech recognition decoding.
Speech recognition decoding searches this network for the optimal path given the input features. This system uses the time-synchronous Viterbi algorithm; Fig. 9 is a schematic of the algorithm, in which the horizontal axis represents time and the vertical axis represents the states. The optimal search path is obtained by this algorithm.
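The time-synchronous Viterbi search can be sketched as follows. This toy version keeps full path backpointers in memory and does no pruning; the states, transition probabilities and observation likelihoods are invented for the example.

```python
import math

def viterbi(obs_logprobs, transitions, start, finals):
    """Time-synchronous Viterbi over a weighted state network.
    obs_logprobs: per-frame dicts {state: log P(frame | state)};
    transitions: {state: [(next_state, log_transition_prob), ...]}."""
    best = {start: (0.0, [start])}          # state -> (log score, path)
    for frame in obs_logprobs:
        nxt = {}
        for state, (score, path) in best.items():
            for dst, tw in transitions.get(state, []):
                if dst not in frame:        # no observation score this frame
                    continue
                cand = score + tw + frame[dst]
                if dst not in nxt or cand > nxt[dst][0]:
                    nxt[dst] = (cand, path + [dst])
        best = nxt
    return max((v for s, v in best.items() if s in finals),
               key=lambda v: v[0])

# Invented two-state example: transitions and per-frame observation scores.
trans = {'s': [('a', math.log(0.6)), ('b', math.log(0.4))],
         'a': [('a', math.log(0.7)), ('b', math.log(0.3))],
         'b': [('b', math.log(1.0))]}
obs = [{'a': math.log(0.9), 'b': math.log(0.1)},
       {'a': math.log(0.2), 'b': math.log(0.8)}]
```

A real decoder stores compact backpointers in the backtrack table and prunes aggressively rather than keeping whole paths per state.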
Because the search space is very large, keeping all possible paths would make the time complexity expand sharply, so pruning is necessary: after each frame is decoded, only part of the paths are kept and the rest are discarded. This system prunes at multiple levels (the word level, the phone level and the Gaussian state level), so that pruning preserves the correct path as far as possible. To speed up pruning, the system also introduces histogram pruning. These pruning techniques greatly improve the real-time factor of speech recognition.
During speech recognition, each output word must be saved in a backtrack table. As decoding proceeds, the backtrack table keeps growing and may exceed the capacity of the system. In this system, the backtrack table therefore outputs a temporary word lattice at fixed time intervals and simultaneously frees the memory it occupies.
Two. Automatic speech document index construction module

1. Confidence calculation

In speech recognition, confidence is used to assess the reliability of the recognition output. A confidence value usually lies between 0 and 1 and represents the degree to which a given word in the recognition result, or a whole recognized sentence, is believed to be correct. A good confidence score largely determines the range of applications of speech recognition.
This system constructs a confusion network by N-best list alignment and obtains confidence scores by the N-best voting method.
Generation of the N-best list
The N-best list used here is a word-dependent N-best list. The algorithm first performs a backward best-path search over the word lattice (a directed acyclic graph), obtaining for each node n the combined acoustic and language model score h(n) of the best path from n to the tail node </s>; it then performs a forward A* search using the backward score h(n) as the heuristic function.
The forward A* algorithm maintains two heaps. A heap here is an array data structure of fixed length, generally 2^n − 1 where n is the number of layers of the heap. A max-heap is defined by H(k) > H(2k+1) and H(k) > H(2k+2): the root element is the maximum, and the minimum element necessarily lies in the last layer. The heap structure used here additionally guarantees that the last element is the minimum. When elements are added to or deleted from the heap, it is adjusted to preserve the heap structure. The first heap in the algorithm is the "partial path" heap P, which stores the partial paths currently being expanded; the other is the "full path" heap F.
To guarantee word-dependence, a duplicate check is added to the heap operations: if the word sequences of two paths are identical, their scores are compared and only the higher-scoring path is kept.
(1) Initialize the partial path heap P by putting the start node <s> into the heap.
(2) Pop the top element of the partial path heap and expand all possible successor paths, computing the combined acoustic and language model score g(n) of each path; the acoustic score is stored at each node, and the language model score is obtained from the current node and the two preceding nodes of the path. If an expanded path contains </s>, put it into F; otherwise put it into P. P is ordered by H(n) = g(n) + h(n), and F is ordered by g(n).
(3) If P exceeds its maximum capacity, discard the k smallest elements that exceed the capacity.
(4) If P is empty, the algorithm ends.
(5) If F is full, the algorithm ends; otherwise go to step (2).
(6) Output the full paths in F until F is empty.
The result of the above algorithm is the N-best list of the word lattice, where N is controlled by the capacity of F.
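The backward-then-forward A* N-best search described above can be sketched compactly, under two simplifying assumptions: scores are additive costs (lower is better), and Python's heapq replaces the fixed-length max-heaps; the word-dependent duplicate check is omitted for brevity.

```python
import heapq

def nbest_paths(arcs, start, goal, n):
    """Forward A* N-best over a DAG word lattice.
    arcs: {node: [(next_node, word, cost), ...]} with costs >= 0.
    The heuristic h(node) is the exact remaining cost from a
    backward pass, mirroring the backward search described above."""
    # Backward pass: best remaining cost from each node to the goal.
    h = {goal: 0.0}
    for node in reversed(_topo(arcs, start)):
        for dst, _, c in arcs.get(node, []):
            if dst in h:
                h[node] = min(h.get(node, float("inf")), c + h[dst])
    # Forward A*: pop partial paths by g + h; complete paths are results.
    heap = [(h.get(start, float("inf")), 0.0, start, ())]
    results = []
    while heap and len(results) < n:
        f, g, node, words = heapq.heappop(heap)
        if node == goal:
            results.append((g, list(words)))
            continue
        for dst, w, c in arcs.get(node, []):
            heapq.heappush(
                heap, (g + c + h.get(dst, float("inf")), g + c, dst, words + (w,)))
    return results

def _topo(arcs, start):
    """Topological order of the lattice nodes reachable from start."""
    seen, order = set(), []
    def dfs(u):
        if u in seen:
            return
        seen.add(u)
        for v, _, _ in arcs.get(u, []):
            dfs(v)
        order.append(u)
    dfs(start)
    return list(reversed(order))
```

With an exact heuristic, the paths leave the heap in order of total cost, so the first N complete paths popped are the N-best list.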
Generation of confidence scores

For each full path in the generated N-best list, a simple time alignment is performed against the time segmentation of the best path; the probability of occurrence of each word within each time segment is then computed and taken as the confidence of that word. The principles are:

(1) If a word hypothesis on an N-best path falls entirely within a word hypothesis of the best path at the corresponding time, it is merged into that time segment.
(2) If a word hypothesis on an N-best path does not fall entirely within any word hypothesis of the best path, it is merged into the best-path segment with which its overlap is largest.
(3) If, after an N-best path has been aligned with the best path, some word hypothesis of the best path has nothing aligned with it, an empty edge is added. The aligned N-best results are then voted on: all distinct word hypotheses within each time segment are collected and their occurrence counts computed. The "confidence" of each word hypothesis is simply its occurrence count divided by N.
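The voting step can be sketched as follows. For brevity this version merges every hypothesis by the maximum-overlap rule, which subsumes the full-containment case, and omits the empty-edge handling; all words and times are invented.

```python
def vote_confidence(best_path, nbest, N):
    """Confidence by N-best voting: align each N-best hypothesis to the
    time segments of the best path by maximum time overlap, then set
    each word's confidence to its vote count divided by N.
    Hypotheses are (word, start_time, end_time) triples."""
    def overlap(a, b):
        return max(0.0, min(a[2], b[2]) - max(a[1], b[1]))

    votes = [dict() for _ in best_path]       # one ballot box per segment
    for path in nbest:
        for hyp in path:
            # merge into the best-path segment it overlaps most
            k = max(range(len(best_path)),
                    key=lambda i: overlap(best_path[i], hyp))
            votes[k][hyp[0]] = votes[k].get(hyp[0], 0) + 1
    return [{w: c / N for w, c in box.items()} for box in votes]
```

Each returned dict maps the competing word hypotheses of one time segment to their voted confidences.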
2. Improved TF/IDF keyword weights
In traditional keyword weighting, the most basic quantities are the term frequency (tf) and the document frequency (df). However, because the confusion network index introduces a large number of candidate hypotheses, obtaining tf and df by direct counting would make the weight estimates inaccurate. Confidence therefore needs to be introduced into the calculation of these statistics. Here a confusion network document is denoted D_i and the whole collection D; each word node in each document is denoted w_{i,j} = (s, e, c), where w_{i,j} marks the position of the word w (i meaning the word occurs in the i-th document, j meaning it is the j-th word of that document), s and e are the start and end times of the word hypothesis, and c is its confidence score. The improved tf (mtf) is computed as:

mtf(w, D_i) = Σ_{w_{i,j} ∈ D_i, w_{i,j} = w} w_{i,j}.c

In this formula w_{i,j} = w means that the j-th word of the i-th document is w; the confidence scores of all hypotheses of this word in the document are summed to give the tf value. The rationale is obvious: it guarantees that word hypotheses with high confidence influence tf more strongly, while hypotheses with lower confidence still exert some influence.
The definition of df must also be revised. In probabilistic terms, suppose a word w has several hypotheses w_{i,j} in one confusion network document, with confidences w_{i,j}.c. Taking the confidence of a hypothesis as the probability that the hypothesis occurs, the probability that a hypothesis does not occur is:

P(w_{i,j} does not occur) = 1 − P(w_{i,j} occurs) = 1 − w_{i,j}.c

The probability that the word does not occur anywhere in this document is then:

P(w does not occur in document i) = ∏_j P(w_{i,j} does not occur) = ∏_j (1 − w_{i,j}.c)

In summary, the probability that this word occurs in this document should be:

P_A(w, D_i) = P(w occurs in document i) = 1 − ∏_j P(w_{i,j} does not occur) = 1 − ∏_j (1 − w_{i,j}.c)

According to this formula, the improved df is computed as:

df(w) = Σ_i P_A(w, D_i)

This df scheme is denoted mdf1.
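The improved tf and df formulas translate directly into code; a minimal sketch, with each document represented as a list of (word, start, end, confidence) hypotheses:

```python
def mtf(word, doc):
    """Improved term frequency: sum the confidences of all hypotheses
    of `word` in one confusion-network document."""
    return sum(c for w, _, _, c in doc if w == word)

def occurrence_prob(word, doc):
    """P_A(w, D_i) = 1 - prod_j(1 - c_j): the probability that at least
    one hypothesis of `word` in the document is correct."""
    p_absent = 1.0
    for w, _, _, c in doc:
        if w == word:
            p_absent *= 1.0 - c
    return 1.0 - p_absent

def mdf(word, docs):
    """Improved document frequency (scheme mdf1): the expected number
    of documents containing `word`."""
    return sum(occurrence_prob(word, d) for d in docs)
```

A word seen twice with confidence 0.5 thus contributes tf = 1.0 but only probability 0.75 of being present at all, which is the distinction the two statistics capture.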
3. Index based on the word confusion network

Besides the final result, the recognition module also produces the word confusion network and the confidence of each lattice node. Building an index from the confusion network requires introducing this confidence information, which can be computed by N-best voting or from posterior probabilities. To use vector-space keyword weighting, only the improved TF and IDF need to be kept; to rerank retrieval results by temporal information, the time position and confidence of each word hypothesis must also be kept.

Figure 10 shows the inverted-entry structure used for keyword weighting. As the figure shows, the index structure stores the improved IDF value of each keyword together with the corresponding document information, as well as a linked list of the keyword's positions within each document. With this index structure, the relevant documents, and the positions of keywords within them, can be found quickly.
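A minimal in-memory analogue of the inverted-entry structure of Figure 10 can be sketched as follows, storing for each keyword a per-document postings list of positions, times and confidences plus the improved df computed as above; the exact field layout of the real index is an assumption here.

```python
def build_inverted_index(docs):
    """Build inverted entries from confusion-network documents.
    docs: list of documents; each document is a list of
    (word, start, end, confidence) hypotheses in time order."""
    index = {}
    for i, doc in enumerate(docs):
        for j, (w, s, e, c) in enumerate(doc):
            entry = index.setdefault(w, {"df": 0.0, "postings": {}})
            # postings: doc id -> list of (position j, start, end, confidence)
            entry["postings"].setdefault(i, []).append((j, s, e, c))
    # improved df: sum over documents of 1 - prod(1 - c)
    for entry in index.values():
        for hyps in entry["postings"].values():
            p_absent = 1.0
            for _, _, _, c in hyps:
                p_absent *= 1.0 - c
            entry["df"] += 1.0 - p_absent
    return index
```

A retrieval pass can then fetch a keyword's entry, score documents from the postings confidences, and use the stored times for the temporal reranking described later.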
4. Phone-based index

One problem that retrieval based on speech recognition output may face is that recall can be low when the recognition rate is not high. Many researchers have therefore begun to retrieve with purely acoustic sub-word units such as phones and syllables.

From the recognition point of view, recognizing sub-word units directly instead of words gives a far worse recognition rate than word units, so the errors this introduces may outweigh the benefit brought by fuzzy matching. This system therefore does not recognize with sub-word units directly, but converts the best word-level recognition result into a phone recognition result. The sub-word string obtained this way still cannot escape the dictionary restriction, so the question of how to overcome that restriction must be answered. The answer lies in the ambiguity of sub-word units: because their number is limited, recognition errors can be modeled explicitly, yielding a richer representation of the result and thereby improving robustness to out-of-vocabulary words and recognition errors.
Based on the initial-final structure of Chinese syllables, this system adopts tonal initial and final units as its sub-word units; throughout this description the term "phone" is used uniformly to denote these units.
The phone confusion matrix

Query expansion first requires a phone confusion matrix. The phone confusion matrix is a model of the recognizer's behavior: it describes which other phones a given phone is easily misrecognized as under a given recognizer. Figure 11 shows part of a phone confusion matrix.

In the figure, the column headers represent the correct phones and the row headers the recognition results; each number is the probability that the correct phone is recognized as the result of the corresponding row. The gray levels in the figure are assigned according to the probabilities. As can be seen, a phone is most likely to be recognized correctly, but there is also some probability of its being recognized as other phones.

A phone confusion matrix can be built in several ways: from a large corpus, or from the models themselves. The method this system adopts is: recognize a corpus with the recognizer and output the phone string of the recognition result, while annotating the reference transcript with pinyin using a pinyin annotation tool; then align the two results by minimizing the edit distance and count the phone confusions. Strictly speaking, what this yields is not the acoustic similarity of two phones, since the language model also influences the process; but because the index construction likewise converts recognition results into phone strings, the method models the decoder's behavior well.
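The corpus-based construction can be sketched as follows: align reference and hypothesized phone strings by minimum edit distance and count the confusions. For brevity this sketch counts only substitutions and matches, ignoring insertions and deletions, and the phone labels in the test are invented.

```python
def confusion_matrix(pairs):
    """Estimate a phone confusion matrix from (reference, hypothesis)
    phone-string pairs, each aligned by minimum edit distance."""
    counts = {}
    for ref, hyp in pairs:
        for r, h in _align(ref, hyp):
            if r is not None and h is not None:   # substitution or match
                row = counts.setdefault(r, {})
                row[h] = row.get(h, 0) + 1
    # normalize each row to probabilities
    return {r: {h: n / sum(row.values()) for h, n in row.items()}
            for r, row in counts.items()}

def _align(ref, hyp):
    """Standard DP edit-distance alignment; yields (ref, hyp) pairs,
    with None marking an insertion or deletion."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                d[i][j] = j
            elif j == 0:
                d[i][j] = i
            else:
                d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                              d[i-1][j-1] + (ref[i-1] != hyp[j-1]))
    out, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            out.append((ref[i-1], hyp[j-1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            out.append((ref[i-1], None)); i -= 1
        else:
            out.append((None, hyp[j-1])); j -= 1
    return reversed(out)
```

Row r of the result approximates P(recognized as h | true phone is r), which is what the expansion step below consumes.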
Four-phone string index

Before building a phone-based index, the length of the phone strings used for indexing must be considered. Indexing on single phones is clearly unreasonable: there are very few phones, and almost every document would contain all of them. The longer the indexed phone string, the higher the retrieval precision and the lower the recall; and once the indexed phone string exceeds six phones, performance drops rapidly.

Based on an analysis of the characteristics of Chinese, we consider four-phone strings the best choice for Chinese spoken document retrieval. First, every Chinese character is composed of an initial and a final, and by the preceding analysis initials and finals are almost never confused with one another, so phone strings of even length are a reasonable choice for Chinese. Second, Chinese words, and search keywords in particular, are predominantly two characters long; with most search keywords being two-character words, modeling units longer than four phones find little use. Choosing four-phone strings as the modeling unit therefore suits the characteristics of Chinese.
The concrete method is:

First, while the best speech recognition result is generated, output the phone sequence corresponding to the result; then, with a window length of 4 and a shift of 2, extract all four-phone substrings, and take these consecutive four-phone substrings in order as the "keyword" set of the string. These keywords are the indexing units; Figure 12 is a schematic of four-phone keyword extraction.

Then, build inverted entries from these keywords and compute tf and df. The same keyword generation is applied to the input query, and retrieval with the basic vector space model yields the baseline result.
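The extraction of four-phone keywords (window length 4, shift 2) is a simple sliding window; the phone sequence in the example is invented.

```python
def four_phone_keywords(phones, win=4, shift=2):
    """Extract four-phone index keywords by sliding a window of
    length `win` over the phone sequence with step `shift`."""
    return [tuple(phones[i:i + win])
            for i in range(0, len(phones) - win + 1, shift)]
```

With shift 2 each window starts on a syllable boundary (initial + final per character), so every keyword covers exactly two characters' worth of phones.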
Three. Fast voice document retrieval module

The function of the voice document retrieval module is, given a query, to find the relevant documents in the index database quickly and return them to the user.

1. Query analysis

The keywords input to the system take many forms: a single word, several words separated by spaces, or several words run together. The input query must therefore first be analyzed and its keywords extracted.
Chinese word segmentation based on the maximum entropy model

The input query may be a character sequence without space separators, so this system first performs Chinese word segmentation on the input character sequence.

Formally, a word is a stable combination of characters: the more often adjacent characters co-occur in context, the more likely they are to form a word. The frequency or probability of adjacent character co-occurrence therefore reflects well the credibility of the characters forming a word. One can count the frequencies of adjacent co-occurring character pairs in a corpus and compute their mutual information, defined for two characters X and Y as a function of their adjacent co-occurrence probability. The mutual information captures the tightness of the bond between the characters; when the tightness exceeds a certain threshold, the character pair may be considered to form a word. This method only needs the character-pair frequencies counted from the corpus and requires no segmentation dictionary, so it is also called dictionary-free segmentation or statistical word extraction.

The maximum entropy model accepts what is known and makes no assumption about, and has no bias toward, what is unknown: unknown situations are taken to be uniformly distributed, i.e. of maximum entropy. Introducing the maximum entropy model into Chinese word segmentation combines the speed and efficiency of dictionary-matching segmentation with the advantages of dictionary-free segmentation, namely recognizing new words from context and resolving ambiguities automatically.
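The mutual-information criterion can be sketched as follows (the maximum entropy segmenter itself is not reproduced; single ASCII letters stand in for Chinese characters, and giving unseen bigrams a count of 1 is a smoothing assumption of this sketch).

```python
import math

def pointwise_mi(bigram_count, count_x, count_y, total):
    """PMI of adjacent characters X, Y: log P(XY) / (P(X) P(Y)).
    High PMI means the pair is tightly bound and may form a word."""
    p_xy = bigram_count / total
    return math.log(p_xy / ((count_x / total) * (count_y / total)))

def segment_by_mi(text, bigrams, unigrams, total, threshold):
    """Greedy dictionary-free segmentation: keep adjacent characters
    together whenever their PMI exceeds the threshold."""
    if not text:
        return []
    words, cur = [], text[0]
    for a, b in zip(text, text[1:]):
        if pointwise_mi(bigrams.get(a + b, 1),
                        unigrams[a], unigrams[b], total) > threshold:
            cur += b          # bond is tight: extend the current word
        else:
            words.append(cur)
            cur = b           # bond is loose: start a new word
    words.append(cur)
    return words
```

In the test below, 'a' and 'b' co-occur 9 times out of 100 (PMI = log 9), while 'bc' is unseen, so the boundary falls between 'b' and 'c'.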
Query expansion based on the phone confusion matrix

Because out-of-vocabulary words may occur in the input query, this system retrieves with a combination of word and phone matching: in-vocabulary words are searched as words, while out-of-vocabulary words are searched by phone. This preserves the precision of word-level retrieval while guaranteeing the system's recall when out-of-vocabulary words occur.

First, the method described under the four-phone string index is used to obtain the four-phone sequence of the out-of-vocabulary query word; the sequence is then expanded with the phone confusion matrix. The basic idea is to find the other four-phone strings (including the string itself) with which each four-phone string is most likely to be confused, add them to the search keywords, and discount their tf by the degree of confusion. Because the confusable strings of a single four-phone string may be very numerous, a threshold t must be set to exclude the confusable strings of low probability.
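With a precomputed confusion table over four-phone strings, query expansion reduces to a lookup and a threshold; the confusion probabilities and phone strings in the test are invented.

```python
def expand_query(four_phones, confusions, t=0.1):
    """Expand an OOV query's four-phone strings with their most
    confusable alternatives. confusions maps a four-phone string to
    {alternative: confusion probability}; each alternative's tf
    contribution is discounted by that probability, and alternatives
    below the threshold t are dropped."""
    expanded = []
    for fp in four_phones:
        for alt, p in confusions.get(fp, {fp: 1.0}).items():
            if p >= t:
                expanded.append((alt, p))   # p is the tf discount factor
    return expanded
```

The discounted pairs feed straight into the vector-space scoring, each alternative contributing p times a normal keyword's tf.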
Stop word filtering

In word-based retrieval systems, stop words are words whose frequency of occurrence is too high for them to carry much retrieval meaning, such as "is" and "too". This system keeps a stop word list against which the input query is filtered.

After this step, the system has obtained a keyword sequence combining four-phone strings and words.
2. Searching the index database

After the keyword sequence is obtained, the system looks up the index entries corresponding to the query words and the query four-phone sequences in the index database. From the keyword-document relevance stored in each index entry, the relevance score between each input keyword and a document is obtained; the relevance scores of all keywords for a document are then summed to give the preliminary relevance score between that document and the input query.
Extensive results show that the relative positions of keywords within a document are very important for determining the relevance of the document to a query. A similar idea is adopted here, with the difference that relevance is modified by time intervals rather than by differences of word positions, and that the top results of the returned set are further reranked.

The basic idea of reranking by temporal information is: if two keywords that are adjacent in the query are also adjacent in the confusion network document, the tf scores of the two keywords should be raised appropriately. At the same time, because a confusion network is a lattice rather than a linear form, there is a problem: the time intervals of word hypotheses may overlap. In that case the two query words cannot both be present; they are mutually exclusive, and it is then reasonable to lower the document weight of the two keywords appropriately. With this in mind, a method of correcting the tf weight is proposed here. Suppose two word hypotheses w_{i,j} and w*_{i,j} belong to the same document D_i. Define overlap(·,·) as the length of the time overlap of two word hypotheses and dist(·,·) as their minimum distance in time; define overlap*(·,·) as, when two word hypotheses overlap, the length of the non-overlapping part of the total time span covered by the two words. When two word hypotheses overlap, dist(·,·) is negative. The relevance score of the document is corrected with the following formula when dist(w_{i,j}, w*_{i,j}) < md:

sc = [(w_{i,j}.c + w*_{i,j}.c) / 2] · (md + dist(w_{i,j}, w*_{i,j})) / (2·md)

This formula requires a preset distance threshold md: when the minimum distance between the two words is greater than md, the words are considered independent of each other and their score is not modified. When the two words overlap, a discount is applied:

sc = [(w_{i,j}.c + w*_{i,j}.c) / 2] · overlap(w_{i,j}, w*_{i,j}) / (1 + overlap*(w_{i,j}, w*_{i,j}))

With the new tf estimate and the original weighting, the relevance scores of the top-n documents are recomputed and the documents reordered; the reordering remains restricted to the top n results.
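The two correction formulas can be sketched in one function; hypotheses are (start, end, confidence) triples, and computing overlap* as the non-overlapping part of the combined time span is one reading of the definition above.

```python
def corrected_score(h1, h2, md):
    """Temporal correction of the pairwise relevance contribution of
    two query-adjacent word hypotheses h = (start, end, confidence)."""
    s1, e1, c1 = h1
    s2, e2, c2 = h2
    avg = (c1 + c2) / 2.0
    ov = max(0.0, min(e1, e2) - max(s1, s2))     # overlap(.,.)
    if ov > 0:
        # overlapping hypotheses are mutually exclusive: discount
        span = max(e1, e2) - min(s1, s2)
        return avg * ov / (1.0 + (span - ov))    # span - ov = overlap*(.,.)
    dist = max(s1, s2) - min(e1, e2)             # gap between the words
    if dist >= md:
        return avg                               # independent: unchanged
    return avg * (md + dist) / (2.0 * md)        # close in time: boost
```

The closer the two hypotheses are in time (small positive dist), the closer the factor (md + dist)/(2·md) stays to 1; overlapping hypotheses are discounted by how much of their combined span disagrees.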
By querying the index database we finally obtain the documents relevant to the query, and it remains only to return them to the user.

In a practical voice document retrieval system, the emphasis is on how to obtain an accurate textual representation of the speech documents, and how to represent the relevance of keywords and documents accurately.
Although specific embodiments of the invention and the accompanying drawings are disclosed for the purpose of illustration, to aid the understanding and implementation of the invention, those skilled in the art will appreciate that various replacements, variations and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should therefore not be limited to the preferred embodiments and the content disclosed in the drawings.

Claims (10)

1. A multilayer index voice document retrieval method, comprising the steps of:
1) performing feature extraction on a multimedia stream to obtain a speech feature sequence;
2) searching the speech feature sequence with a speech recognition decoder based on weighted finite state transducers to obtain a word lattice and a best recognition result;
3) constructing a two-layer word and syllable index database from the word lattice and the best recognition result;
4) searching the index database for documents relevant to a given query and returning them to the user.
2. The method of claim 1, wherein the word lattice contains the start time, end time and model score of each word.
3. The method of claim 2, wherein the method by which the speech recognition decoder searches the speech feature sequence for the best recognition result comprises network determinization, network minimization and network factorization; the network factorization method is: replacing each linear path in the word network with a new edge, and establishing a mapping table for storing the mapping between each new edge and the linear path it replaces.
4. The method of claim 1 or 3, wherein the speech recognition decoder periodically outputs a temporary word lattice and simultaneously frees the memory it occupies.
5. The method of claim 2, wherein the two-layer index database is established by:
1) computing the confidence c of each word w in the word lattice output by speech recognition;
2) computing the term frequency and document frequency of each word w from the confidence c;
3) building an index from the term frequency and inverse document frequency of each word w;
4) converting the best recognition result into a phone string, extracting all four-phone substrings from the phone string with a window length of 4 and a shift of 2, and taking these consecutive four-phone substrings in order as the four-phone sequence of the phone string;
5) constructing a phone-layer index from the phone sequences.
6. The method of claim 5, wherein the confidence is computed as follows: first, an N-best alignment method generates the N-best list of each word; then each full path in the N-best list is time-aligned to the time segmentation of the best path: if a word hypothesis on an N-best path falls entirely within a word hypothesis of the best path at the corresponding time, it is merged into that time segment; if a word hypothesis does not fall entirely within any word hypothesis of the best path, it is merged into the best-path segment with which its overlap is largest; if, after an N-best path has been aligned with the best path, some word hypothesis of the best path has nothing aligned with it, an empty edge is added; finally, the aligned N-best results are voted on, all distinct word hypotheses within each time segment are counted, and the occurrence probability is taken as the confidence of the word.
7. The method of claim 6, wherein the term frequency and document frequency of each word w are computed from the confidence c as follows: the term frequency is computed by the formula mtf(w, D_i) = Σ_{w_{i,j} ∈ D_i, w_{i,j} = w} w_{i,j}.c, where i indicates that the word occurs in the i-th document, j indicates that it is the j-th word of the i-th document, and D_i is a confusion network document; the document frequency is computed by the formula df(w) = Σ_i P_A(w, D_i), where P_A(w, D_i) = P(w occurs in document i) = 1 − ∏_j (1 − w_{i,j}.c).
8. The method of claim 7, wherein documents relevant to a given query are searched for in the index database by:
1) analyzing the input query to obtain a search keyword sequence;
2) looking up in the index database the index entries corresponding to the keywords in the keyword sequence;
3) if the keyword sequence contains an out-of-vocabulary word, converting it into phone sequence form and looking up the corresponding index entry in the phone layer;
4) obtaining the relevance score between each input keyword and a document from the keyword-document relevance stored in the index entries;
5) summing the relevance scores of all keywords for a document to obtain the preliminary relevance score between the document and the input query;
6) sorting by relevance score and returning the top n documents to the user.
9. The method of claim 8, wherein the relevance scores of the top n documents are corrected and the documents reranked before being returned to the user, as follows: for two word hypotheses W and W* contained in the same document, if the minimum distance between the two hypotheses is greater than zero and less than a preset threshold md, the relevance score of the document is corrected by the formula sc = [(w_{i,j}.c + w*_{i,j}.c) / 2] · (md + dist(w_{i,j}, w*_{i,j})) / (2·md); if the minimum distance between W and W* equals zero, the score is corrected by the formula sc = [(w_{i,j}.c + w*_{i,j}.c) / 2] · overlap(w_{i,j}, w*_{i,j}) / (1 + overlap*(w_{i,j}, w*_{i,j})); where overlap(·,·) denotes the length of the time overlap of the two word hypotheses, dist(·,·) denotes their minimum distance in time, dist(·,·) being negative when the two hypotheses overlap, and overlap*(·,·) denotes, when the two word hypotheses overlap, the length of the non-overlapping part of the total time span covered by the two words.
10. A multilayer index voice document retrieval system, comprising: a Chinese automatic speech recognition module based on weighted finite state transducers, for automatically recognizing the words in voice files; an automatic voice document index construction module, for constructing a two-layer index of the speech recognition results; and a voice document retrieval module, for searching the index database for documents relevant to a given query and returning them to the user.
CN200910131828XA 2009-02-20 2009-04-08 Multilayer index voice document searching method Expired - Fee Related CN101510222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910131828XA CN101510222B (en) 2009-02-20 2009-04-08 Multilayer index voice document searching method

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN200910078176.8 2009-02-20
CN200910078176 2009-02-20
CN200910131828XA CN101510222B (en) 2009-02-20 2009-04-08 Multilayer index voice document searching method

Publications (2)

Publication Number Publication Date
CN101510222A true CN101510222A (en) 2009-08-19
CN101510222B CN101510222B (en) 2012-05-30



Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7912699B1 (en) * 2004-08-23 2011-03-22 At&T Intellectual Property Ii, L.P. System and method of lattice-based search for spoken utterance retrieval
US7809568B2 (en) * 2005-11-08 2010-10-05 Microsoft Corporation Indexing and searching speech with text meta-data
JP4845523B2 (en) * 2006-01-31 2011-12-28 マイクロソフト コーポレーション Character processing apparatus, method, program, and recording medium
US20080162125A1 (en) * 2006-12-28 2008-07-03 Motorola, Inc. Method and apparatus for language independent voice indexing and searching
US7818170B2 (en) * 2007-04-10 2010-10-19 Motorola, Inc. Method and apparatus for distributed voice searching

Cited By (66)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996195B (en) * 2009-08-28 2012-07-11 中国移动通信集团公司 Searching method and device of voice information in audio files and equipment
CN102549652A (en) * 2009-09-09 2012-07-04 歌乐株式会社 Information retrieving apparatus, information retrieving method and navigation system
CN102549652B (en) * 2009-09-09 2013-08-07 歌乐株式会社 Information retrieving apparatus
CN102270450A (en) * 2010-06-07 2011-12-07 株式会社曙飞电子 System and method of multi model adaptation and voice recognition
CN102270450B (en) * 2010-06-07 2014-04-16 株式会社曙飞电子 System and method of multi model adaptation and voice recognition
CN102122506A (en) * 2011-03-08 2011-07-13 天脉聚源(北京)传媒科技有限公司 Method for recognizing voice
CN102117335B (en) * 2011-03-25 2014-01-22 天脉聚源(北京)传媒科技有限公司 Method for retrieving multimedia information
CN102117335A (en) * 2011-03-25 2011-07-06 天脉聚源(北京)传媒科技有限公司 Method for retrieving multimedia information
CN103092888A (en) * 2011-11-07 2013-05-08 联想(北京)有限公司 Electronic device and content processing method
CN102521262A (en) * 2011-11-21 2012-06-27 广东国笔科技股份有限公司 Data processing equipment, system and method for realizing voice intelligent indexing
CN103164403B (en) * 2011-12-08 2016-03-16 深圳市北科瑞声科技有限公司 The generation method and system of video index data
CN103164403A (en) * 2011-12-08 2013-06-19 深圳市北科瑞声科技有限公司 Generation method of video indexing data and system
CN103366734A (en) * 2012-03-31 2013-10-23 佳能株式会社 Method and device for checking voice recognition results, voice recognition system and audio monitoring system
CN103366734B (en) * 2012-03-31 2015-11-25 佳能株式会社 The voice recognition result method of inspection and equipment, voice recognition and audio monitoring systems
CN103514170A (en) * 2012-06-20 2014-01-15 中国移动通信集团安徽有限公司 Speech-recognition text classification method and device
CN106782607A (en) * 2012-07-03 2017-05-31 谷歌公司 Determine hot word grade of fit
CN104284219A (en) * 2013-07-11 2015-01-14 Lg电子株式会社 Mobile terminal and method of controlling the mobile terminal
US9639251B2 (en) 2013-07-11 2017-05-02 Lg Electronics Inc. Mobile terminal and method of controlling the mobile terminal for moving image playback
CN103730115B (en) * 2013-12-27 2016-09-07 北京捷成世纪科技股份有限公司 A kind of method and apparatus detecting keyword in voice
CN103730115A (en) * 2013-12-27 2014-04-16 北京捷成世纪科技股份有限公司 Method and device for detecting keywords in voice
CN103956166A (en) * 2014-05-27 2014-07-30 华东理工大学 Multimedia courseware retrieval system based on voice keyword recognition
CN106663423B (en) * 2014-10-06 2021-02-26 英特尔公司 System and method for automatic speech recognition using real-time word lattice generation with word history
CN106663423A (en) * 2014-10-06 2017-05-10 英特尔公司 System and method of automatic speech recognition using on-the-fly word lattice generation with word histories
CN105760399A (en) * 2014-12-19 2016-07-13 华为软件技术有限公司 Data retrieval method and device
CN104751847A (en) * 2015-03-31 2015-07-01 刘畅 Data acquisition method and system based on overprint recognition
US11664020B2 (en) 2015-11-06 2023-05-30 Alibaba Group Holding Limited Speech recognition method and apparatus
CN106683677A (en) * 2015-11-06 2017-05-17 阿里巴巴集团控股有限公司 Method and device for recognizing voice
US10741170B2 (en) 2015-11-06 2020-08-11 Alibaba Group Holding Limited Speech recognition method and apparatus
CN105551485A (en) * 2015-11-30 2016-05-04 讯飞智元信息科技有限公司 Audio file retrieval method and system
CN106935239A (en) * 2015-12-29 2017-07-07 阿里巴巴集团控股有限公司 The construction method and device of a kind of pronunciation dictionary
CN106056207A (en) * 2016-05-09 2016-10-26 武汉科技大学 Natural language-based robot deep interacting and reasoning method and device
CN106056207B (en) * 2016-05-09 2018-10-23 武汉科技大学 A kind of robot depth interaction and inference method and device based on natural language
CN106021531A (en) * 2016-05-25 2016-10-12 北京云知声信息技术有限公司 Method, system and device for book inquiry through voice
CN107704461A (en) * 2016-07-26 2018-02-16 中国科学院自动化研究所 A kind of intelligent Road information retrieval method based on data analysis
CN107704461B (en) * 2016-07-26 2020-04-24 中国科学院自动化研究所 Intelligent road condition information retrieval method based on data analysis
CN107766318A (en) * 2016-08-17 2018-03-06 北京金山安全软件有限公司 Keyword extraction method and device and electronic equipment
CN107766318B (en) * 2016-08-17 2021-03-16 北京金山安全软件有限公司 Keyword extraction method and device and electronic equipment
CN106328146A (en) * 2016-08-22 2017-01-11 广东小天才科技有限公司 Video subtitle generation method and apparatus
CN107203616A (en) * 2017-05-24 2017-09-26 苏州百智通信息技术有限公司 The mask method and device of video file
CN107239571A (en) * 2017-06-28 2017-10-10 浪潮金融信息技术有限公司 Index structuring method based on multidimensional data space technology
CN107239571B (en) * 2017-06-28 2021-04-09 浪潮金融信息技术有限公司 Index construction method based on multidimensional data space technology
CN107818785A (en) * 2017-09-26 2018-03-20 平安普惠企业管理有限公司 A kind of method and terminal device that information is extracted from multimedia file
CN107908722A (en) * 2017-11-14 2018-04-13 华东师范大学 Reverse k rankings querying method based on distance
CN107908722B (en) * 2017-11-14 2021-10-12 华东师范大学 Reverse k ranking query method based on distance
CN108280179A (en) * 2018-01-22 2018-07-13 百度在线网络技术(北京)有限公司 Method and system, terminal and the computer readable storage medium of audio advertisement detection
CN108597497B (en) * 2018-04-03 2020-09-08 中译语通科技股份有限公司 Subtitle voice accurate synchronization system and method and information data processing terminal
CN108597497A (en) * 2018-04-03 2018-09-28 中译语通科技股份有限公司 A kind of accurate synchronization system of subtitle language and method, information data processing terminal
CN108647190A (en) * 2018-04-25 2018-10-12 北京华夏电通科技有限公司 A kind of speech recognition text is inserted into the method, apparatus and system of notes document
CN108682415A (en) * 2018-05-23 2018-10-19 广州视源电子科技股份有限公司 voice search method, device and system
CN109036391A (en) * 2018-06-26 2018-12-18 华为技术有限公司 Audio recognition method, apparatus and system
CN108831439A (en) * 2018-06-27 2018-11-16 广州视源电子科技股份有限公司 Audio recognition method, device, equipment and system
CN108962271A (en) * 2018-06-29 2018-12-07 广州视源电子科技股份有限公司 Add to weigh finite state converter merging method, device, equipment and storage medium
CN109361823A (en) * 2018-11-01 2019-02-19 深圳市号互联科技有限公司 A kind of intelligent interaction mode that voice is mutually converted with text
CN109887498A (en) * 2019-03-11 2019-06-14 西安电子科技大学 Highway mouth term of courtesy methods of marking
CN110415705B (en) * 2019-08-01 2022-03-01 苏州奇梦者网络科技有限公司 Hot word recognition method, system, device and storage medium
CN110415705A (en) * 2019-08-01 2019-11-05 苏州奇梦者网络科技有限公司 A kind of hot word recognition methods, system, device and storage medium
CN110784603A (en) * 2019-10-18 2020-02-11 深圳供电局有限公司 Intelligent voice analysis method and system for offline quality inspection
CN110867179A (en) * 2019-11-12 2020-03-06 云南电网有限责任公司德宏供电局 File storage and retrieval method and system based on voice recognition, IKAnalyzer word segmentation and hdfs
CN110968245A (en) * 2019-12-05 2020-04-07 深圳乐华高科实业有限公司 Operation method for controlling office software through voice
CN110968245B (en) * 2019-12-05 2023-11-10 深圳乐华高科实业有限公司 Operation method for controlling office software through voice
CN111081245A (en) * 2019-12-24 2020-04-28 杭州纪元通信设备有限公司 Call center menu system based on voice recognition
CN111985231A (en) * 2020-08-07 2020-11-24 中移(杭州)信息技术有限公司 Unsupervised role recognition method and device, electronic equipment and storage medium
CN111985231B (en) * 2020-08-07 2023-12-26 中移(杭州)信息技术有限公司 Unsupervised role recognition method and device, electronic equipment and storage medium
CN112040163A (en) * 2020-08-21 2020-12-04 上海阅目科技有限公司 Hard disk video recorder supporting audio analysis
CN112040163B (en) * 2020-08-21 2023-07-07 上海阅目科技有限公司 Hard disk video recorder supporting audio analysis
CN113470627A (en) * 2021-07-02 2021-10-01 因诺微科技(天津)有限公司 MVGG-CTC-based keyword search method

Also Published As

Publication number Publication date
CN101510222B (en) 2012-05-30

Similar Documents

Publication Publication Date Title
CN101510222B (en) Multilayer index voice document searching method
CN108304375B (en) Information identification method and equipment, storage medium and terminal thereof
Chelba et al. Retrieval and browsing of spoken content
Arisoy et al. Turkish broadcast news transcription and retrieval
EP2269148B1 (en) Intra-language statistical machine translation
US8131539B2 (en) Search-based word segmentation method and device for language without word boundary tag
Parlak et al. Spoken term detection for Turkish broadcast news
CN109637537B (en) Method for automatically acquiring annotated data to optimize user-defined awakening model
US20070179784A1 (en) Dynamic match lattice spotting for indexing speech content
WO2003010754A1 (en) Speech input search system
CN104199965A (en) Semantic information retrieval method
Chien et al. Topic-based hierarchical segmentation
Dinarelli et al. Discriminative reranking for spoken language understanding
Arisoy et al. Syntactic and sub-lexical features for Turkish discriminative language models
CN104199825A (en) Information inquiry method and system
Turunen et al. Indexing confusion networks for morph-based spoken document retrieval
Arisoy et al. Discriminative language modeling with linguistic and statistically derived features
Ghannay et al. Acoustic Word Embeddings for ASR Error Detection.
Madnani et al. Multiple alternative sentence compressions for automatic text summarization
Wang et al. Improving handwritten Chinese text recognition by unsupervised language model adaptation
CN109783648B (en) Method for improving ASR language model by using ASR recognition result
Avram et al. Romanian speech recognition experiments from the robin project
Oger et al. On-demand new word learning using world wide web
Wang Mandarin spoken document retrieval based on syllable lattice matching
Whittaker et al. Vocabulary independent speech recognition using particles

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120530

Termination date: 20180408