CN100565532C

CN100565532C - Multimedia resource retrieval method based on audio content retrieval

Info

Publication number: CN100565532C
Application number: CNB2008100620738A
Authority: CN
Inventors: 叶睿智
Original assignee: Individual
Current assignee: Hangzhou Micro Network Co Ltd
Priority date: 2008-05-28
Filing date: 2008-05-28
Publication date: 2009-12-02
Anticipated expiration: 2028-05-28
Also published as: CN101281534A

Abstract

The invention discloses a multimedia resource retrieval method based on audio content retrieval. The method comprises the following steps: 1) a preprocessing server converts video and audio into standard speech to be recognized; 2) a speech recognition server trains an acoustic model from a training corpus and matches the speech to be recognized against the acoustic model to obtain a semantic text index; 3) an index server stores and organizes the keyword index and matches search conditions against it to obtain retrieval results. The invention uses keyword detection in audio to obtain the semantic information inherent in audio and video resources and indexes that information in text form. This provides a more comprehensive and reliable index of audio and video resource information, allows a retrieval system to match multimedia resources more accurately, and locates the exact position at which a search term occurs in the audio or video.

Description

Multimedia resource retrieval method based on audio content retrieval

Technical field

The present invention relates to a multimedia resource retrieval method, and in particular to a method based on audio content retrieval for searching resources in video and audio form, finding the resources that contain the queried information, and giving the position of that information within each resource.

Background technology

In the current era of digitization and networking, multimedia data has become a major part of the data transmitted over the Internet. Multimedia content such as audio, images and video currently accounts for 15% of Internet content, and this figure is growing rapidly. High-capacity, high-speed storage systems provide the basic guarantee for mass storage of audio and video, and audio and video are used ever more widely across industries. Obtaining useful information from massive audio and video data, that is, managing and retrieving audio and video information resources, is becoming increasingly important, and audio and video have become one of the resources that network users retrieve most often. Current mainstream search engines such as Google, Yahoo and Baidu handle retrieval of text content on the Internet well, but for audio and video they still rely on matching the peripheral text associated with a multimedia resource (for example the resource file name, resource tags, or resource introduction text). As a result, the content of the audio and video resources themselves is not well recognized, and useful resources that lack a clear textual description are ignored by search engines. Commonly used information retrieval systems, such as digital library systems and knowledge management systems, face the same problem: multimedia resources are increasingly important information carriers, yet effective retrieval methods are lacking. One approach is to use speech recognition to extract, from the speech portion of audio and video resources, information that can be expressed as text, and then to index those resources by means of text retrieval.

Keyword recognition means identifying given keywords in a continuous, unconstrained stream of natural speech. It covers two aspects: keyword detection and keyword verification. Keyword detection determines which of the pre-entered keywords the speech contains and is a multi-class decision problem. Keyword verification answers "yes" or "no" as to whether the speech contains a given keyword and is a binary decision problem. In the present invention, keyword recognition refers specifically to keyword detection.

Mel-frequency cepstral coefficients (MFCC) are based on Fourier analysis and cepstral analysis and reflect the frequency-domain characteristics of an audio signal. A Fourier transform is applied to the sample points of a short-time audio frame to obtain the energy of that frame at each frequency. The whole frequency band is divided into n subbands, and the total energy in each of the n subbands is computed, which gives the n Mel coefficients of the short-time audio frame. Computing the corresponding cepstral coefficients of the extracted Mel coefficients gives the Mel cepstral coefficients. Cepstral analysis is a nonlinear signal processing technique that forms the basis of homomorphic system theory; it was designed for signals combined by convolution and was later applied to speech signal processing.
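The computation described above can be sketched as follows. This is a minimal illustration that follows the simplified description in the text (summed energy per subband, then a cepstral transform), not a production front end, which typically uses overlapping triangular filters. The Mel-scale formula, the subband count, the sample rate and all function names are assumptions made for illustration; none of them is specified in the source.

```python
import cmath
import math


def hz_to_mel(hz):
    # Commonly used Mel-scale mapping; the source does not state a formula.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 of one short-time frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_subband_energies(spectrum, sample_rate, n_bands):
    """Sum spectral energy into n_bands subbands of equal width on the Mel scale."""
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    energies = [0.0] * n_bands
    for k, power in enumerate(spectrum):
        hz = k * nyquist / (len(spectrum) - 1)
        band = min(int(hz_to_mel(hz) / max_mel * n_bands), n_bands - 1)
        energies[band] += power
    return energies


def mel_cepstrum(frame, sample_rate=8000, n_bands=8, n_coeffs=5):
    """Log subband energies followed by a DCT-II give the cepstral coefficients."""
    energies = mel_subband_energies(power_spectrum(frame), sample_rate, n_bands)
    log_e = [math.log(e + 1e-10) for e in energies]  # small floor avoids log(0)
    return [
        sum(v * math.cos(math.pi * c * (b + 0.5) / n_bands)
            for b, v in enumerate(log_e))
        for c in range(n_coeffs)
    ]
```

Each call to the hypothetical `mel_cepstrum` turns one frame of samples into one feature vector; a sequence of such vectors is what an acoustic model would consume.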

The hidden Markov model (HMM) is a probabilistic model based on transition probabilities and output probabilities. It treats speech as a stochastic process made up of an observable symbol sequence, where the symbol sequence is the output of the state sequence of the speech production system. When hidden Markov models are used for recognition, a speech production model is built for each speaker, and the state transition probability matrix and the symbol output probability matrix are obtained by training. During recognition, the maximum probability of the unknown speech over the state transition process is computed, and the decision is made according to the model that yields the maximum probability. Text-independent speaker recognition generally uses an ergodic HMM, whereas text-dependent speaker recognition generally uses a left-to-right HMM. An HMM requires no time normalization, which saves computation time and storage at decision time.
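The decision rule described above (score the unknown observation sequence against each trained model and pick the model with the maximum probability) can be sketched with the Viterbi algorithm over discrete symbols. This is an illustrative sketch only: the two toy left-to-right models, their probabilities, and the names `viterbi_log_score`, `best_model` and `MODELS` are invented for the example and do not come from the source.

```python
import math


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi_log_score(obs, start_p, trans_p, emit_p):
    """Return (best log probability, best state path) for a discrete HMM."""
    n_states = len(start_p)
    scores = [_log(start_p[s]) + _log(emit_p[s][obs[0]]) for s in range(n_states)]
    back = []
    for o in obs[1:]:
        new_scores, pointers = [], []
        for s in range(n_states):
            # Best predecessor state for reaching state s at this step.
            prev = max(range(n_states),
                       key=lambda r: scores[r] + _log(trans_p[r][s]))
            new_scores.append(scores[prev] + _log(trans_p[prev][s])
                              + _log(emit_p[s][o]))
            pointers.append(prev)
        scores = new_scores
        back.append(pointers)
    last = max(range(n_states), key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path


# Toy left-to-right models: (start, transition, emission). Values are made up.
MODELS = {
    "keyword_a": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]),
    "keyword_b": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.1, 0.9], [0.8, 0.2]]),
}


def best_model(obs, models):
    """Pick the model whose best state path has the highest probability."""
    return max(models, key=lambda name: viterbi_log_score(obs, *models[name])[0])
```

Working in the log domain avoids numerical underflow on long observation sequences, which is why the scores are sums of logarithms rather than products of probabilities.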

Keyword recognition based on hidden Markov models is a very important aspect of speech retrieval and plays a key role in retrieving specific content from speech. Because of current limits on the robustness and practicality of speech recognition technology, recognizers built on continuous speech recognition for large vocabularies and arbitrary keywords do not achieve the desired effect and cannot satisfy the requirements of speech retrieval well. Keyword recognition, by contrast, is a comparatively reliable technique and has good prospects in speech retrieval applications.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a multimedia resource retrieval method based on audio content retrieval.

The method comprises the following steps:

1) a preprocessing server converts video and audio into standard speech to be recognized;

2) a speech recognition server trains an acoustic model from a training corpus, and matches the speech to be recognized against the acoustic model to obtain a semantic text index;

3) an index server stores and organizes the keyword index, and matches search conditions against it to obtain retrieval results.

The preprocessing server converts video and audio into standard speech to be recognized as follows. The video and audio contain one or more segments of spoken speech. An audio separation technique extracts the audio data from the input video resource, keeping the separated audio consistent with the original video along the time axis. The audio data is then processed by digital noise reduction: portions whose energy is too low and speech segments that contain noise signals are converted to silence. After conversion, the audio is output as standard speech to be recognized.
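The low-energy part of this preprocessing step can be illustrated with a simple frame-energy gate. This is a minimal sketch under stated assumptions: the source does not name a noise reduction algorithm, and the frame length, the threshold and the function name `silence_low_energy` are hypothetical. Detection of segments that contain noise signals is not covered here.

```python
def silence_low_energy(samples, frame_len=160, threshold=1e-3):
    """Zero out frames whose mean energy falls below the threshold.

    The output has the same length as the input, so every sample keeps its
    position on the time axis relative to the original audio or video.
    """
    out = list(samples)
    for start in range(0, len(out), frame_len):
        frame = out[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy < threshold:
            out[start:start + len(frame)] = [0.0] * len(frame)
    return out
```

Replacing samples with zeros rather than deleting them is the design choice that matters: it keeps later time stamps valid for locating keywords in the original resource.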

The speech recognition server trains the acoustic model from the training corpus as follows. The training corpus consists of broadcast speech read aloud in standard Mandarin Chinese. The training corpus passes through a feature extraction module to obtain speech features, the feature type being Mel cepstral coefficients, and the speech features pass through acoustic model training to obtain a hidden Markov acoustic model.

The speech to be recognized is matched against the acoustic model to obtain the semantic text index as follows. The speech to be recognized passes through the feature extraction module to obtain its Mel cepstral coefficient features. A path searcher reads these features and performs shortest-path recognition over the paths of the hidden Markov acoustic model to obtain the semantic text corresponding to each speech segment. Combined with the speech timeline information, a semantic text index comprising the semantic text and time endpoint data is output.

The index server stores and organizes the keyword index and matches search conditions to obtain retrieval results as follows. The semantic text index output by the speech recognition server is converted into an inverted index, yielding inverted index entries whose primary key is the semantic keyword and whose value is the sequence of positions at which the keyword occurs; these entries are stored in the index database. When a retrieval is performed, a keyword or combination of keywords in text form is input to the retrieval server, the inverted index entries are read according to the search keyword, and the sequence of positions at which the keyword occurs is output.

The semantic text index is a two-tuple comprising the text keyword and the start-point and end-point data of the time period of one occurrence of the keyword in the speech. The keyword index is a three-tuple comprising the keyword, the document number of the video or audio resource to which the keyword corresponds, and the start-point and end-point data of the time period of one occurrence of the keyword in the speech. The retrieval result describes a series of resource files that contain the queried keyword and, within each such resource file, a series of time periods of the speech segments in which the keyword occurs.
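The three data shapes described above can be written down as follows. This is only a sketch of one possible representation: the type names, field names and the use of seconds for time points are assumptions for illustration, and the source does not prescribe a storage format.

```python
from collections import defaultdict
from typing import NamedTuple


class TimeSpan(NamedTuple):
    start: float  # start point of one occurrence (seconds assumed)
    end: float    # end point of the same occurrence


class SemanticIndexEntry(NamedTuple):
    """Two-tuple: keyword plus the time span of one occurrence."""
    keyword: str
    span: TimeSpan


class KeywordIndexEntry(NamedTuple):
    """Three-tuple: keyword, resource document number, time span."""
    keyword: str
    resource_id: str
    span: TimeSpan


def attach_resource(resource_id, semantic_entries):
    """Turn recognizer output for one resource into keyword index entries."""
    return [KeywordIndexEntry(e.keyword, resource_id, e.span)
            for e in semantic_entries]


def retrieval_result(entries, keyword):
    """Group the time spans of a keyword by the resource that contains it."""
    grouped = defaultdict(list)
    for e in entries:
        if e.keyword == keyword:
            grouped[e.resource_id].append(e.span)
    return {rid: sorted(spans) for rid, spans in grouped.items()}
```

The returned mapping mirrors the retrieval result described in the text: a set of resource files and, for each file, the time periods in which the keyword is spoken.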

The invention uses keyword detection in audio to obtain the semantic information inherent in audio and video resources and indexes that information in text form. This provides a more comprehensive and reliable index of audio and video resource information, allows a retrieval system to match multimedia resources more accurately, and locates the exact position at which a search term occurs in the audio or video.

Description of drawings

Fig. 1 is an overall flowchart of the audio and video content retrieval system according to the present invention;

Fig. 2 is a flowchart of audio and video preprocessing according to the present invention;

Fig. 3 is a flowchart of keyword recognition according to the present invention;

Fig. 4 is a flowchart of index merging according to the present invention.

Embodiment

The multimedia resource retrieval method based on audio content retrieval comprises the following steps:

1) A preprocessing server converts video and audio into standard speech to be recognized. As shown in Fig. 1, video data 1-1 and audio data 1-2 are input to the preprocessing server S1, and the standard corpus to be recognized 1-3 is obtained through preprocessing.

2) A speech recognition server trains an acoustic model from a training corpus, and matches the speech to be recognized against the acoustic model to obtain a semantic text index. As shown in Fig. 1, the training corpus 1-4 is input to the speech recognition server S2, which trains the acoustic model and stores it in S2. The corpus to be recognized 1-3 and the acoustic model are input together to the speech recognition server S2, and matching yields the semantic text index information 1-5 of the corpus to be recognized 1-3.

3) An index server stores and organizes the keyword index, and matches search conditions against it to obtain retrieval results. As shown in Fig. 1, the index information is input to the index server S3 and merged into the inverted index database. When a retrieval is performed, the search condition 1-6 is input to the index server S3; S3 matches the keywords of the search condition in the index database to obtain matching records, merges the matching records, and finally returns the retrieval result 1-7.

The preprocessing server converts video and audio into standard speech to be recognized as follows. The video and audio contain one or more segments of spoken speech. An audio separation technique extracts the audio data from the input video resource, keeping the separated audio consistent with the original video along the time axis. The audio data is then processed by digital noise reduction: portions whose energy is too low and speech segments that contain noise signals are converted to silence. After conversion, the audio is output as standard speech to be recognized. As shown in Fig. 2, the input video data 2-1 enters the speech data extraction module 2-2, which yields the corresponding audio data. Both the directly input audio data 2-3 and the audio data extracted from video enter the noise reduction module 2-4, which finally outputs the speech to be recognized 2-5 that can be used for recognition.

The speech recognition server trains the acoustic model from the training corpus as follows. The training corpus consists of broadcast speech read aloud in standard Mandarin Chinese. The training corpus passes through a feature extraction module to obtain speech features, the feature type being Mel cepstral coefficients, and the speech features pass through acoustic model training to obtain a hidden Markov acoustic model. As shown in Fig. 3, the training corpus 3-1 passes through the speech preprocessing module 3-2 and the feature extraction module 3-3 to obtain the Mel cepstral coefficient feature data of the sample speech. This feature data is input to the engine training and recognition module 3-4, which trains the hidden Markov acoustic model 3-5.

The speech to be recognized is matched against the acoustic model to obtain the semantic text index as follows. The speech to be recognized passes through the feature extraction module to obtain its Mel cepstral coefficient features. A path searcher reads these features and performs shortest-path recognition over the paths of the hidden Markov acoustic model to obtain the semantic text corresponding to each speech segment. Combined with the speech timeline information, a semantic text index comprising the semantic text and time endpoint data is output. As shown in Fig. 3, the speech to be recognized 3-6 passes through the speech preprocessing module 3-2 and the feature extraction module 3-3 to obtain its Mel cepstral coefficient feature data. This feature data and the hidden Markov acoustic model 3-5 pass together through the path search and matching module 3-7 to produce the recognized semantic text index 3-8. Its form is the two-tuple ＜KW, Ref＞, where KW is the text keyword and Ref is the start-point and end-point data of the time period of one occurrence of the keyword in the speech.
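How the timeline information might be combined with recognized text to form ＜KW, Ref＞ pairs can be sketched as follows. The sketch assumes that the path search yields one label per feature frame and that frames are spaced by a fixed shift; both are assumptions, as are the filler label `None` for non-keyword frames and the function name `frames_to_index`. The source does not describe this bookkeeping in detail.

```python
def frames_to_index(frame_labels, frame_shift=0.01, filler=None):
    """Collapse per-frame labels into (keyword, (start, end)) pairs.

    frame_labels: one label per frame, with `filler` marking non-keyword frames.
    frame_shift:  assumed time between consecutive frames, in seconds.
    """
    entries = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # A run ends at the last frame or where the label changes.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            label = frame_labels[start]
            if label != filler:
                ref = (round(start * frame_shift, 6), round(i * frame_shift, 6))
                entries.append((label, ref))
            start = i
    return entries
```

Because the preprocessing step keeps the audio aligned with the original resource on the time axis, the start and end points produced here can be used directly to seek within the original audio or video.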

The index server stores and organizes the keyword index and matches search conditions to obtain retrieval results as follows. The semantic text index output by the speech recognition server is converted into an inverted index, yielding inverted index entries whose primary key is the semantic keyword and whose value is the sequence of positions at which the keyword occurs; these entries are stored in the index database. As shown in Fig. 4, the indexing keyword KW is first extracted from a single newly added index entry output by the speech recognition module (in the form ＜KW, Ref＞). The inverted index entry ＜KW, ＜Ref1, Ref2, ..., Refn＞＞ is then looked up by KW in the existing index database, and the newly added index entry is merged with it to obtain the new entry ＜KW, ＜Ref1, Ref2, ..., Refn, Refn+1＞＞; the merging process takes deduplication of the index into account. Finally, the new entry is written back to the index database. When a retrieval is performed, a keyword or combination of keywords in text form is input to the retrieval server, the inverted index entries are read according to the search keyword, and the sequence of positions at which the keyword occurs is output. As shown in Fig. 4, the inverted index entry ＜KW, ＜Ref1, Ref2, ..., Refn＞＞ is looked up by the search keyword KW in the existing index database and returned as the retrieval result.
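The merge and lookup flow of Fig. 4 can be sketched with an in-memory dictionary standing in for the index database. This is a minimal sketch: the dictionary, the tuple layout of each reference (resource number, start, end) and the names `merge_entry` and `lookup` are assumptions for illustration, and a real index database would persist entries rather than hold them in memory.

```python
def merge_entry(index, keyword, ref):
    """Merge one new (keyword, ref) pair into its inverted index entry.

    index:   dict mapping keyword -> list of refs (the inverted index)
    ref:     position information, assumed here to be (resource_id, start, end)
    A ref that is already present is skipped, which deduplicates the entry.
    """
    refs = index.setdefault(keyword, [])
    if ref not in refs:
        refs.append(ref)
    return index


def lookup(index, keyword):
    """Return the position sequence of a keyword, or an empty list if absent."""
    return list(index.get(keyword, []))


# Usage: build the index from recognizer output, then query it.
index = {}
merge_entry(index, "news", ("doc1", 1.0, 1.5))
merge_entry(index, "news", ("doc2", 0.5, 1.0))
merge_entry(index, "news", ("doc1", 1.0, 1.5))  # duplicate, ignored
```

After these calls, `lookup(index, "news")` returns the two distinct references in insertion order, and a keyword that was never indexed yields an empty list.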

Claims

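1. A multimedia resource retrieval method based on audio content retrieval, characterized by comprising the following steps:

1) a preprocessing server converts video and audio into standard speech to be recognized;

2) a speech recognition server trains an acoustic model from a training corpus, and matches the speech to be recognized against the acoustic model to obtain a semantic text index;

3) an index server stores and organizes the keyword index, and matches search conditions against it to obtain retrieval results;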

wherein the preprocessing server converts video and audio into standard speech to be recognized as follows: the video and audio contain one or more segments of spoken speech; an audio separation technique extracts the audio data from the input video resource, keeping the separated audio consistent with the original video along the time axis; the audio data is processed by digital noise reduction, in which portions whose energy is too low and speech segments that contain noise signals are converted to silence; and after conversion, the audio is output as standard speech to be recognized;

wherein the speech recognition server trains the acoustic model from the training corpus as follows: the training corpus consists of broadcast speech read aloud in standard Mandarin Chinese; the training corpus passes through a feature extraction module to obtain speech features, the feature type being Mel cepstral coefficients; and the speech features pass through acoustic model training to obtain a hidden Markov acoustic model;

wherein the speech to be recognized is matched against the acoustic model to obtain the semantic text index as follows: the speech to be recognized passes through the feature extraction module to obtain its Mel cepstral coefficient features; a path searcher reads these features and performs shortest-path recognition over the paths of the hidden Markov acoustic model to obtain the semantic text corresponding to each speech segment; and, combined with the speech timeline information, a semantic text index comprising the semantic text and time endpoint data is output;

wherein the index server stores and organizes the keyword index and matches search conditions to obtain retrieval results as follows: the semantic text index output by the speech recognition server is converted into an inverted index, yielding inverted index entries whose primary key is the semantic keyword and whose value is the sequence of positions at which the keyword occurs, which are stored in the index database; and when a retrieval is performed, a keyword or combination of keywords in text form is input to the retrieval server, the inverted index entries are read according to the search keyword, and the sequence of positions at which the keyword occurs is output.

2. The multimedia resource retrieval method based on audio content retrieval according to claim 1, characterized in that the semantic text index is a two-tuple comprising the text keyword and the start-point and end-point data of the time period of one occurrence of the keyword in the speech.

3. The multimedia resource retrieval method based on audio content retrieval according to claim 1, characterized in that the keyword index is a three-tuple comprising the keyword, the document number of the video or audio resource to which the keyword corresponds, and the start-point and end-point data of the time period of one occurrence of the keyword in the speech.

4. The multimedia resource retrieval method based on audio content retrieval according to claim 1, characterized in that the retrieval result describes a series of resource files that contain the queried keyword and, within each such resource file, a series of time periods of the speech segments in which the keyword occurs.