CN103247316A - Method and system for constructing index in voice frequency retrieval

Info

Publication number: CN103247316A
Application number: CN2012100315341A
Authority: CN
Inventors: 黄石磊; 刘轶; 程刚; 曹文晓
Original assignee: PKU-HKUST SHENZHEN INSTITUTE; SHENGANG MANUFACTURE-LEARNING-RESEARCH BASE INDUSTRY DEVELOPMENT CENTER; SHENZHEN BEIKE RUISHENG TECHNOLOGY Co Ltd
Current assignee: PKU-HKUST SHENZHEN INSTITUTE; SHENGANG MANUFACTURE-LEARNING-RESEARCH BASE INDUSTRY DEVELOPMENT CENTER; SHENZHEN BEIKE RUISHENG TECHNOLOGY Co Ltd
Priority date: 2012-02-13
Filing date: 2012-02-13
Publication date: 2013-08-14
Anticipated expiration: 2032-02-13
Also published as: CN103247316B

Abstract

A method for building an index in audio retrieval comprises the following steps: an audio collection device obtains audio data; the audio collection device calculates an index value of the audio data and sends the audio data and its index value to a server; and the server builds the index according to the received audio data and index value. A system for building an index in audio retrieval is also provided. The method and system reduce the additional hardware cost incurred when the capacity of an audio retrieval system is expanded.

Description

Method and system for building an index in audio retrieval

[Technical Field]

The present invention relates to the field of multimedia information processing, and in particular to a method and system for building an index in audio retrieval.

[Background]

Audio is an important information carrier. Audio retrieval is a technique that searches a large number of audio files by keyword and returns relevant results. The keyword may be text or an audio fragment. In content-based audio retrieval, the characteristic parameters of each audio file must be extracted and an index corresponding to the speech must be generated, which is a very computationally expensive operation.

In conventional audio retrieval methods, an audio resource library is built in advance on a centralized server. The retrieval client obtains an input audio fragment or text keyword and sends it to the server. On receipt, the server either computes the feature code of the audio fragment using a speech recognition algorithm or uses the text keyword directly, searches the audio sample library for matching audio resources, and sends them to the retrieval client.

However, although several servers can share the computation, index building in conventional audio retrieval is processed centrally on servers: the servers build the index only after they have received the audio data. When the volume of audio data is large, especially in environments such as call centers that generate large amounts of speech data every day, index building consumes substantial server computing resources. When the business expands, servers must be added, which raises the additional hardware cost of capacity expansion and makes expansion inconvenient.

[Summary of the Invention]

In view of this, it is necessary to provide a method for building an index for audio retrieval that allows easy capacity expansion.

A method for building an index in audio retrieval comprises the following steps:

an audio collection device obtains audio data;

the audio collection device calculates an index value of the audio data, and sends the audio data and the index value of the audio data to a server;

the server builds an index according to the received audio data and the index value of the audio data.

Preferably, the index includes a global identifier corresponding to the audio data.

Preferably, there are a plurality of audio collection devices;

the step in which the server builds the index according to the received audio data and the index value of the audio data is specifically:

the server first filters out audio resources with identical index values, then builds the index according to the filtered audio data and its index values, and stores the audio data in an audio resource library.

Preferably, the step in which the audio collection device calculates the index value of the audio data is specifically:

the audio collection device preprocesses the audio data and extracts acoustic characteristic parameters;

the audio collection device performs speaker segmentation and speech segmentation on the audio data;

the audio collection device calculates the index value of the segmented audio data according to the acoustic characteristic parameters, a preset acoustic model, a language model, and a pronunciation dictionary.

Preferably, the step in which the audio collection device performs speaker segmentation and speech segmentation on the audio data further comprises:

detecting silence in the audio data, segmenting the audio accordingly, and classifying the segmented audio data by speaker category.

Preferably, the step in which the audio collection device calculates the index value of the segmented audio data according to the acoustic characteristic parameters, preset acoustic model, language model, and pronunciation dictionary is specifically:

generating a pinyin lattice by speech recognition decoding, according to the speech characteristic parameters of the segmented audio data, a pinyin language model, the acoustic model, and a first pronunciation dictionary;

generating a word lattice according to the pinyin lattice, a word-based language model, and a second pronunciation dictionary;

generating the index value of the segmented audio data according to the word lattice.

Preferably, the method further comprises:

a retrieval client obtains a retrieval request;

the retrieval client determines whether the retrieval request contains an audio fragment and, if so, extracts the audio fragment from the retrieval request, calculates the index value of the audio fragment, and sends it to the server;

the server looks up in the index the audio data corresponding to the index value and returns it to the retrieval client.

In addition, it is also necessary to provide a system for building an index for audio retrieval that allows easy capacity expansion.

A system for building an index in audio retrieval comprises an audio collection device and a server, wherein the audio collection device comprises:

an audio acquisition module, configured to obtain audio data;

an index value calculation module, configured to calculate the index value of the audio data and send the audio data and the index value of the audio data to the server;

and the server comprises:

an index building module, configured to build an index according to the received audio data and the index value of the audio data.

Preferably, there are a plurality of audio collection devices;

the index building module is further configured to filter out audio resources with identical index values, build the index according to the filtered audio data and its index values, and store the audio data in an audio resource library.

Preferably, the index value calculation module is further configured to preprocess the audio data and extract acoustic characteristic parameters; perform speaker segmentation and speech segmentation on the audio data; and calculate the index value of the segmented audio data according to the acoustic characteristic parameters, a preset acoustic model, a language model, and a pronunciation dictionary.

Preferably, the index value calculation module is further configured to detect silence in the audio data, segment the audio accordingly, and classify the segmented audio data by speaker category.

Preferably, the index value calculation module is further configured to generate a pinyin lattice by speech recognition decoding, according to the speech characteristic parameters of the segmented audio data, a pinyin language model, the acoustic model, and a first pronunciation dictionary; generate a word lattice according to the pinyin lattice, a word-based language model, and a second pronunciation dictionary; and generate the index value of the segmented audio data according to the word lattice.

Preferably, the system further comprises a retrieval client, configured to obtain a retrieval request, determine whether the retrieval request contains an audio fragment and, if so, extract the audio fragment from the retrieval request, calculate the index value of the audio fragment, and send it to the server;

the server further comprises an audio retrieval module, configured to look up in the index the audio data corresponding to the index value and return it to the retrieval client.

In the above method and system for building an index in audio retrieval, the audio collection device obtains the audio data, calculates its index value, and uploads the index value and the audio data to the server. The server then builds the index from the index value and the audio data. The work of calculating index values is thereby moved from the server to the audio collection devices. For example, in a call center, the audio collection device may be a customer service agent's terminal PC, and each terminal PC can process, on the same day, the audio data it collected that day. When the capacity of the call center system grows, it can be expanded by adding terminal PCs without adding servers, so no extra expense is incurred. The method therefore reduces the additional hardware cost of the audio retrieval system during capacity expansion and makes expansion easier.

[Brief Description of the Drawings]

Fig. 1 is a flowchart of the method for building an index in audio retrieval in one embodiment;

Fig. 2 is a flowchart of the step in which the audio collection device calculates the index value of the audio data in one embodiment;

Fig. 3 is a flowchart of the step in which the audio collection device calculates the index value of the segmented audio data according to the acoustic characteristic parameters, preset acoustic model, language model, and pronunciation dictionary in one embodiment;

Fig. 4 is a flowchart of the audio retrieval steps in one embodiment;

Fig. 5 is a structural diagram of the system for building an index in audio retrieval in one embodiment;

Fig. 6 is a structural diagram of the system for building an index in audio retrieval in another embodiment.

[Detailed Description]

As shown in Fig. 1, in one embodiment, a method for building an index in audio retrieval comprises the following steps:

Step S102: the audio collection device obtains audio data.

The audio data may be speech, music, and so on. The audio collection device may obtain audio data by capturing the user's voice through an audio input device such as a microphone or a sound card output buffer, or by reading an audio file.

Step S104: the audio collection device calculates the index value of the audio data, and sends the audio data and its index value to the server.

In one embodiment, the audio collection device may be a terminal device with a certain amount of computing power, which can not only capture speech but also process the audio. Examples include the terminal PC of an operator in a call center's machine room or a user's smartphone client on a mobile network.

The audio collection device generates the index value corresponding to the audio data from the features of the audio data, and then sends the index value together with the audio data to the server. In one embodiment, the audio collection device sends the index value and audio data to the server by delayed transmission: when the audio collection device detects that the server is busy, it first buffers the acquired audio data and the generated index value on the device in the form of a local index, and uploads this local index once the server load is low.
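The delayed transmission described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `DeferredUploader` and the two callables `server_busy` (a load probe) and `upload` (the transport call) are hypothetical stand-ins, since the source does not specify how server load is detected or how data is transmitted.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Entry = Tuple[str, bytes]  # (index value, raw audio bytes)


class DeferredUploader:
    """Buffers (index value, audio) pairs on the device while the server is busy."""

    def __init__(self, server_busy: Callable[[], bool],
                 upload: Callable[[List[Entry]], None]) -> None:
        self._server_busy = server_busy   # hypothetical load probe
        self._upload = upload             # hypothetical transport call
        self._local_index: Deque[Entry] = deque()

    def submit(self, index_value: str, audio: bytes) -> None:
        """Add an entry to the local index, then try to upload."""
        self._local_index.append((index_value, audio))
        self.flush()

    def flush(self) -> int:
        """Upload the whole local index if the server is idle; return count sent."""
        if not self._local_index or self._server_busy():
            return 0
        batch = list(self._local_index)
        self._upload(batch)
        self._local_index.clear()
        return len(batch)

    def pending(self) -> int:
        """Number of entries still buffered on the device."""
        return len(self._local_index)
```

In this sketch a device would call `submit` for each processed recording and call `flush` again later (for example on a timer) to drain anything that was buffered while the server was busy.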

Step S106: the server builds the index according to the received audio data and the index value of the audio data.

After receiving the audio data uploaded by the audio collection device and the corresponding index value, the server may first assign the audio data a global identifier; the index that is then built may include this global identifier. The index value may be associated with the global identifier as a key-value pair. All of the index information together constitutes the "global index".

In one embodiment, there may be a plurality of audio collection devices. The server may first filter out audio resources with identical index values, then build the index according to the filtered audio data and its index values, and store the audio data in the audio resource library.

In one embodiment, the audio collection device sends the index value and audio data to the server by delayed transmission. The audio collection device stores the buffered index values and audio data as a local index, while the index stored on the server is the global index. After receiving a local index sent later by an audio collection device, the server filters out the part of the local index that duplicates the global index and then adds the filtered local index to the global index. Filtering out duplicate index entries reduces the storage pressure on the server.
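The server-side behaviour above (assigning a global identifier, keeping a key-value mapping from index value to identifier, and dropping duplicates when a local index is merged) might look roughly like this. The class and attribute names are illustrative assumptions, and sequential integers are only one possible choice of global identifier.

```python
import itertools
from typing import Dict, Iterable, Tuple


class GlobalIndex:
    """Server-side index: index value -> global identifier -> audio data."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)           # assumed identifier scheme
        self.by_index_value: Dict[str, int] = {}  # key-value pairs of the index
        self.audio_store: Dict[int, bytes] = {}   # stands in for the audio resource library

    def merge_local(self, local_index: Iterable[Tuple[str, bytes]]) -> int:
        """Merge a device's local index, skipping index values already present.

        Returns the number of entries actually added to the global index.
        """
        added = 0
        for index_value, audio in local_index:
            if index_value in self.by_index_value:
                continue  # duplicate index value: filtered out, nothing stored
            global_id = next(self._ids)
            self.by_index_value[index_value] = global_id
            self.audio_store[global_id] = audio
            added += 1
        return added
```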

In the above method for building an index in audio retrieval, the audio collection device obtains the audio data, calculates its index value, and uploads the index value and the audio data to the server. The server then builds the index from the index value and the audio data. The work of calculating index values is thereby moved from the server to the audio collection devices. For example, in a call center, the audio collection device may be a customer service agent's terminal PC, and each terminal PC can process, on the same day, the audio data it collected that day. When the capacity of the call center system grows, it can be expanded by adding terminal PCs without adding servers, so no extra expense is incurred. The method therefore reduces the additional hardware cost of the audio retrieval system during capacity expansion and makes expansion easier.

In one embodiment, as shown in Fig. 2, the step in which the audio collection device calculates the index value of the audio data may specifically be:

Step S202: the audio collection device preprocesses the audio data and extracts acoustic characteristic parameters.

Step S204: the audio collection device performs speaker segmentation and speech segmentation on the audio data.

Step S206: the audio collection device calculates the index value of the segmented audio data according to the acoustic characteristic parameters, the preset acoustic model, the language model, and the pronunciation dictionary.

In step S202, the audio collection device may preprocess the audio data by applying at least one of filtering, pre-emphasis, framing, windowing, and zero padding. After preprocessing, the audio data may be transcoded, segmented, and given simple labels that associate the speech with the corresponding customer and customer service agent information, and then saved in a client-side database on the audio collection device.
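Three of the preprocessing operations named above (pre-emphasis, framing with zero padding, and windowing) can be sketched in plain Python. The coefficient 0.97 and the Hamming window are common conventions chosen here as assumptions; the source does not state which filter coefficient or window function is used.

```python
import math
from typing import List


def pre_emphasize(samples: List[float], alpha: float = 0.97) -> List[float]:
    """y[n] = x[n] - alpha * x[n-1]; the first sample is passed through."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]


def frame_signal(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split into overlapping frames; short trailing frames are zero-padded."""
    frames = []
    for start in range(0, len(samples), hop):
        frame = samples[start:start + frame_len]
        frame = frame + [0.0] * (frame_len - len(frame))  # zero padding
        frames.append(frame)
    return frames


def hamming(frame: List[float]) -> List[float]:
    """Apply a Hamming window to one frame."""
    n = len(frame)
    if n == 1:
        return list(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]
```

A typical use would chain them: `[hamming(f) for f in frame_signal(pre_emphasize(x), 400, 160)]`, where 400 and 160 samples correspond to 25 ms frames with a 10 ms hop at 16 kHz, again as an illustrative choice.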

The characteristic parameters of the audio data may be obtained by performing feature extraction on the framed speech. For example, the acoustic characteristic parameters may be extracted using the conventional MFCC (Mel-frequency cepstral coefficient) extraction method.
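A full MFCC pipeline (FFT, mel filter bank, logarithm, DCT) is too long to reproduce here, but the mel-scale mapping at its core is short. The sketch below uses one widely used variant of the mel formula, m = 2595 * log10(1 + f / 700); the source does not say which variant or how many filters are used, so the function names and parameters are assumptions.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to the mel scale (common 2595/700 variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(low_hz: float, high_hz: float, n_filters: int) -> List[float]:
    """Center frequencies (Hz) of n_filters filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]
```

The centers are evenly spaced in mel and therefore increasingly far apart in Hz, which is what gives MFCC features finer resolution at low frequencies.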

In step S204, the audio collection device may detect silence in the audio data, segment the audio accordingly, and classify the segmented audio data by speaker category.

A silent period is a time span in a continuous audio signal during which the amplitude stays below a threshold. Silence detection may be performed on the audio data using a preset silent-period length, which allows a long stretch of speech to be divided into several segments by sentence. For example, in a call center, the audio collection device first classifies the customer's speech by speaker category. A GMM (Gaussian mixture model) may be used to classify the speech against the models of several speakers in an existing model library, for example into male, female, and neutral voices; alternatively, users may be numbered in advance and distinguished by user number.
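The silence-based segmentation can be illustrated with a simple amplitude-threshold scan. This is a sketch under assumptions: it works on raw sample amplitudes rather than frame energies, and the parameter names `threshold` and `min_silence` (the preset silent-period length, in samples) are hypothetical. Speaker classification with a GMM is not shown.

```python
from typing import List, Optional, Tuple


def split_on_silence(samples: List[float], threshold: float,
                     min_silence: int) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of the non-silent segments.

    A run of at least `min_silence` consecutive samples whose absolute
    amplitude is below `threshold` counts as a silent period and closes
    the current segment. `end` is exclusive.
    """
    segments: List[Tuple[int, int]] = []
    seg_start: Optional[int] = None  # start of the segment being built
    last_loud = -1                   # index of the last sample at/above threshold
    for i, x in enumerate(samples):
        if abs(x) >= threshold:
            if seg_start is None:
                seg_start = i
            last_loud = i
        elif seg_start is not None and i - last_loud >= min_silence:
            segments.append((seg_start, last_loud + 1))
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, last_loud + 1))
    return segments
```

Short pauses inside a sentence (fewer than `min_silence` quiet samples) do not split the segment, which matches the idea of cutting long speech by sentence rather than by every dip in amplitude.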

In step S206, as shown in Fig. 3, the step in which the audio collection device calculates the index value of the segmented audio data according to the acoustic characteristic parameters, preset acoustic model, language model, and pronunciation dictionary is specifically:

Step S302: generate a pinyin lattice by speech recognition decoding, according to the speech characteristic parameters of the segmented audio data, the pinyin language model, the acoustic model, and the first pronunciation dictionary.

The pinyin language model and acoustic model corresponding to the speaker category of the segmented audio data may be selected from a pinyin language model library and an acoustic model library. For example, if the customer's voice is male, the pinyin language model and acoustic model for male voices are selected.

The pinyin language model is a statistical language model whose unit is the pinyin syllable. The first pronunciation dictionary describes the relationship between each unit (syllable) of the pinyin language model and the initials and finals of the acoustic model. The basic acoustic units, which may include initials and finals, phonemes, and syllables, are the basic modeling units of the acoustic model. Preferably, the acoustic model may be an HMM (hidden Markov model), and speech recognition decoding may be performed with the Viterbi algorithm to generate the pinyin lattice. Each node in the pinyin lattice represents one pinyin syllable and is labeled with the start or end time of that syllable; the connections between nodes are labeled with the acoustic probability and the linguistic probability of the syllable.
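For readers unfamiliar with Viterbi decoding over an HMM, the following is a minimal textbook sketch. It is a simplification of what the text describes: it returns only the single best state path rather than a lattice, it uses discrete observation symbols instead of acoustic feature vectors, and all probability tables are supplied by the caller as hypothetical inputs.

```python
import math
from typing import Dict, List, Tuple


def viterbi(observations: List[str], states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> Tuple[List[str], float]:
    """Return the most probable state path and its log-probability."""
    def log(p: float) -> float:
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log-prob of the best path ending in s at time t, predecessor)
    best = [{s: (log(start_p[s]) + log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + log(trans_p[p][s])) for p in states),
                key=lambda item: item[1])
            column[s] = (score + log(emit_p[s][obs]), prev)
        best.append(column)

    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]
```

A lattice-producing decoder would keep several competing hypotheses per time step, together with their times and scores, instead of only the best predecessor.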

Step S304: generate a word lattice according to the pinyin lattice, the word-based language model, and the second pronunciation dictionary.

The second pronunciation dictionary describes the relationship between each unit (Chinese word) of the word-based language model and the syllables of the pinyin language model (that is, the nodes of the pinyin lattice generated by Viterbi decoding). Each node in the generated word lattice represents one Chinese word and is labeled with the start or end time of that word; the connections between nodes are labeled with the acoustic probability and the linguistic probability of the word. For words with multiple pronunciations, the node must also be labeled with the sequence number of the pronunciation used.
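The role of the second pronunciation dictionary can be illustrated with a small sketch that maps runs of syllables to candidate words. It is deliberately simplified: it takes a single syllable sequence rather than a full pinyin lattice, ignores times and probabilities, and the `lexicon` argument (word to syllable list) is a hypothetical stand-in for the dictionary.

```python
from typing import Dict, List, Tuple


def words_over_syllables(syllables: List[str],
                         lexicon: Dict[str, List[str]]) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) edges for every lexicon word whose
    pronunciation matches a contiguous run of the syllable sequence."""
    edges = []
    for start in range(len(syllables)):
        for word, pron in lexicon.items():
            end = start + len(pron)
            if syllables[start:end] == list(pron):
                edges.append((start, end, word))
    return edges
```

Overlapping edges such as a one-syllable word and a two-syllable word starting at the same position are all kept; choosing among them is the job of the word-based language model.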

Further, a confidence score may be calculated for each word in the word lattice. The confidence is the score of a word in the word lattice and may be computed from information such as the acoustic model probability, the pinyin syllable probability, the word probability, the duration of the word, and the number of candidates in the word lattice. The confidence is an estimate of how accurate the speech recognition result is. For example, a stretch of speech that is ambiguous or contains a word with multiple pronunciations can easily be recognized as several different Chinese words; the confidence expresses the estimated accuracy of each recognized word. The higher the confidence, the less interference there was during recognition and the more reliable the recognized word.

Step S306: generate the index value of the segmented audio data according to the word lattice.

The word lattice may be used directly as the index value of the audio data, or a hash value of the word lattice may be computed with a preset hash function and used as the index value.
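The hash-based option can be sketched as below. The source only says "a preset hash function"; SHA-256 from the Python standard library, the edge tuple layout, and the JSON canonicalization are all assumptions made for the sake of a runnable example.

```python
import hashlib
import json
from typing import List, Tuple

# (start_node, end_node, word, start_time_s, end_time_s): hypothetical layout
Edge = Tuple[int, int, str, float, float]


def lattice_index_value(edges: List[Edge]) -> str:
    """Hash a word lattice into a fixed-length index value (SHA-256 hex digest).

    Edges are sorted first so that the same lattice always serializes
    to the same bytes regardless of the order in which edges were produced.
    """
    canonical = json.dumps(sorted(edges), ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because a hash only matches on exact equality, this option relies on the collection device and the retrieval client producing identical lattices for identical audio.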

In one embodiment, after building the index according to the received audio data and its index value, the server may also compress the audio data and the index to save storage space on the server.

In one embodiment, after building the index according to the received audio data and its index value, the server may also store the index in the form of an inverted index.
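An inverted index maps each term to the items that contain it. The sketch below assumes each stored recording is represented by its global identifier and the list of words taken from its word lattice; the function name and input shape are illustrative, not taken from the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_inverted_index(
        docs: Iterable[Tuple[int, List[str]]]) -> Dict[str, Set[int]]:
    """Map each word to the set of global identifiers whose lattice contains it."""
    inverted: Dict[str, Set[int]] = defaultdict(set)
    for global_id, words in docs:
        for word in words:
            inverted[word].add(global_id)
    return dict(inverted)
```

With this layout, a text keyword query reduces to a dictionary lookup followed by fetching the audio for each returned identifier.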

In one embodiment, as shown in Fig. 4, the method for building an index in audio retrieval further comprises the following audio retrieval steps:

Step S402: the retrieval client obtains a retrieval request.

Step S404: the retrieval client determines whether the retrieval request contains an audio fragment and, if so, extracts the audio fragment from the retrieval request, calculates the index value of the audio fragment, and sends it to the server.

Step S406: the server looks up in the index the audio data corresponding to the index value and returns it to the retrieval client.

The retrieval client obtains the audio fragment and calculates its index value in the same way that the audio collection device obtains audio data and calculates its index value. This ensures that identical audio data yields identical index values on the retrieval client and on the audio collection device.

In one embodiment, after receiving the index value uploaded by the retrieval client, the server searches the index to obtain the global identifier of the audio data corresponding to the index value, then uses this global identifier to fetch the corresponding audio data from the audio resource library and returns it to the retrieval client.
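The two-stage lookup described above (index value to global identifier, then global identifier to audio data) is sketched here with plain dictionaries standing in for the index and the audio resource library; both containers and the function name are assumptions for illustration.

```python
from typing import Dict, Optional


def retrieve_audio(index_value: str,
                   index: Dict[str, int],
                   audio_store: Dict[int, bytes]) -> Optional[bytes]:
    """Resolve index value -> global identifier -> audio data.

    Returns None when the index value is unknown or the audio is missing.
    """
    global_id = index.get(index_value)
    if global_id is None:
        return None
    return audio_store.get(global_id)
```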

In one embodiment, the server also periodically synchronizes the pinyin language model, acoustic model, word-based language model, and pronunciation dictionaries on the audio collection device and the retrieval client. After synchronization, the audio collection device and the retrieval client use the same algorithm and parameters when calculating index values, which ensures that identical audio data yields identical index values on both.

As shown in Fig. 5, in one embodiment, a system for building an index in audio retrieval comprises an audio collection device 100 and a server 200, wherein the audio collection device 100 comprises:

An audio acquisition module 102, configured to obtain audio data.

The audio data may be speech, music, and so on. The audio acquisition module 102 may obtain audio data by capturing the user's voice through an audio input device such as a microphone or a sound card output buffer, or by reading an audio file.

An index value calculation module 104, configured to calculate the index value of the audio data and send the audio data and its index value to the server 200.

In one embodiment, the audio collection device 100 may be a terminal device with a certain amount of computing power, which can not only capture speech through the audio acquisition module 102 but also process the audio through the index value calculation module 104. Examples include the terminal PC of an operator in a call center's machine room or a user's smartphone client on a mobile network.

The index value calculation module 104 generates the index value corresponding to the audio data from the features of the audio data, and then sends the index value together with the audio data to the server 200.

In one embodiment, the index value calculation module 104 sends the index value and audio data to the server 200 by delayed transmission. When the index value calculation module 104 detects that the server 200 is busy, it first buffers the acquired audio data and the generated index value on the audio collection device 100 in the form of a local index, and uploads this local index once the load on the server 200 is low.

The server 200 comprises:

An index building module 202, configured to build the index according to the received audio data and the index value of the audio data.

After receiving the audio data uploaded by the index value calculation module 104 and the corresponding index value, the server 200 may first assign the audio data a global identifier; the index that is then built may include this global identifier. The index value may be associated with the global identifier as a key-value pair.

In one embodiment, there may be a plurality of audio collection devices 100. The index building module 202 may also be configured to filter out audio resources with identical index values, build the index according to the filtered audio data and its index values, and store the audio data in the audio resource library.

In the above system for building an index in audio retrieval, the audio collection device obtains the audio data, calculates its index value, and uploads the index value and the audio data to the server. The server then builds the index from the index value and the audio data. The work of calculating index values is thereby moved from the server to the audio collection devices. For example, in a call center, the audio collection device may be a customer service agent's terminal PC, and each terminal PC can process, on the same day, the audio data it collected that day. When the capacity of the call center system grows, it can be expanded by adding terminal PCs without adding servers, so no extra expense is incurred. The system therefore reduces the additional hardware cost of the audio retrieval system during capacity expansion and makes expansion easier.

In one embodiment, the index value calculation module 104 may also be configured to preprocess the audio data and extract acoustic characteristic parameters; perform speaker segmentation and speech segmentation on the audio data; and calculate the index value of the segmented audio data according to the acoustic characteristic parameters, the preset acoustic model, the language model, and the pronunciation dictionary.

The index value calculation module 104 may also be configured to preprocess the audio data by applying at least one of filtering, pre-emphasis, framing, windowing, and zero padding. After preprocessing, the audio data may be transcoded, segmented, and given simple labels that associate the speech with the corresponding customer and customer service agent information, and then saved in a client-side database on the audio collection device.

The index value calculation module 104 may also be configured to obtain the characteristic parameters by performing feature extraction on the framed speech. For example, the acoustic characteristic parameters may be extracted using the conventional MFCC (Mel-frequency cepstral coefficient) extraction method.

The index value calculation module 104 may also be configured to detect silence in the audio data, segment the audio accordingly, and classify the segmented audio data by speaker category.

Further, the index value calculation module 104 may also be configured to generate a pinyin lattice by speech recognition decoding, according to the speech characteristic parameters of the segmented audio data, the pinyin language model, the acoustic model, and the first pronunciation dictionary; generate a word lattice according to the pinyin lattice, the word-based language model, and the second pronunciation dictionary; and generate the index value of the segmented audio data according to the word lattice.

In one embodiment, the index building module 202 may also compress the audio data and the index to save storage space on the server 200.

In one embodiment, the index building module 202 may also be configured to store the built index in the form of an inverted index.

In one embodiment, as shown in Fig. 6, the system for building an index in audio retrieval further comprises a retrieval client 300, configured to obtain a retrieval request, determine whether the retrieval request contains an audio fragment and, if so, extract the audio fragment from the retrieval request, calculate the index value of the audio fragment, and send it to the server 200.

The server 200 further comprises:

An audio retrieval module 204, configured to look up in the index the audio data corresponding to the index value and return it to the retrieval client 300.

The retrieval client 300 obtains the audio fragment and calculates its index value in the same way that the audio collection device 100 (through the audio acquisition module 102 and the index value calculation module 104) obtains audio data and calculates its index value. This ensures that identical audio data yields identical index values on the retrieval client 300 and on the audio collection device 100.

In one embodiment, after receiving the index value uploaded by the retrieval client 300, the server 200 searches the index to obtain the global identifier of the audio data corresponding to the index value, then uses this global identifier to fetch the corresponding audio data from the audio resource library and returns it to the retrieval client 300.

In one embodiment, the server 200 also periodically synchronizes the pinyin language model, acoustic model, word-based language model, and pronunciation dictionaries on the audio collection device 100 and the retrieval client 300. After synchronization, the audio collection device 100 and the retrieval client 300 use the same algorithm and parameters when calculating index values, which ensures that identical audio data yields identical index values on both.

The embodiments above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the claims. It should be noted that a person of ordinary skill in the art can make a number of variations and improvements without departing from the inventive concept, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims

1. A method for constructing an index in audio retrieval, comprising the following steps:

an audio collecting device obtains audio data;

2. The method for constructing an index in audio retrieval according to claim 1, characterized in that said index comprises an overall identification corresponding to said audio data.

3. The method for constructing an index in audio retrieval according to claim 1, characterized in that there are a plurality of said audio collecting devices;

4. The method for constructing an index in audio retrieval according to claim 1, characterized in that the step in which said audio collecting device calculates the index value of said audio data is specifically:

5. The method for constructing an index in audio retrieval according to claim 4, characterized in that the step in which said audio collecting device performs speaker segmentation and speech segmentation on said audio data further comprises:

6. The method for constructing an index in audio retrieval according to claim 4, characterized in that the step in which said audio collecting device calculates the index value of the segmented audio data according to said acoustic feature parameters, a preset acoustic model, a language model, and a pronunciation dictionary is specifically:

7. The method for constructing an index in audio retrieval according to any one of claims 1 to 6, characterized in that said method further comprises:

a retrieval client obtains a retrieval request;

8. A system for constructing an index in audio retrieval, characterized in that it comprises an audio collecting device and a server, wherein said audio collecting device comprises:

an audio acquisition module, configured to obtain audio data;

and said server comprises:

9. The system for constructing an index in audio retrieval according to claim 8, characterized in that said index comprises an overall identification corresponding to said audio data.

10. The system for constructing an index in audio retrieval according to claim 8, characterized in that there are a plurality of said audio collecting devices;

11. The system for constructing an index in audio retrieval according to claim 8, characterized in that said index value calculation module is further configured to: preprocess said audio data and extract acoustic feature parameters; perform speaker segmentation and speech segmentation on said audio data; and calculate the index value of the segmented audio data according to said acoustic feature parameters, a preset acoustic model, a language model, and a pronunciation dictionary.

12. The system for constructing an index in audio retrieval according to claim 11, characterized in that said index value calculation module is further configured to detect silence in said audio data, segment the audio, and classify the segmented audio data by speaker category.

13. The system for constructing an index in audio retrieval according to claim 11, characterized in that said index value calculation module is further configured to: generate a phonetic grid through speech recognition decoding according to the speech feature parameters of the segmented audio data, a phonetic language model, an acoustic model, and a first pronunciation dictionary; generate a word grid according to said phonetic grid, a word-based language model, and a second pronunciation dictionary; and generate the index value of the segmented audio data according to said word grid.

14. The system for constructing an index in audio retrieval according to any one of claims 8 to 13, characterized in that it further comprises a retrieval client, configured to obtain a retrieval request, judge whether said retrieval request contains an audio fragment, and, if so, extract the audio fragment from said retrieval request, calculate the index value of said audio fragment, and then send the index value to the server;