CN103247316B - Method and system for building an index in audio retrieval - Google Patents

Method and system for building an index in audio retrieval

Info

Publication number
CN103247316B
Authority
CN
China
Prior art keywords
voice data
audio
index
index value
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210031534.1A
Other languages
Chinese (zh)
Other versions
CN103247316A (en)
Inventor
黄石磊
刘轶
程刚
曹文晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PKU-HKUST SHENZHEN INSTITUTE
SHENGANG MANUFACTURE-LEARNING-RESEARCH BASE INDUSTRY DEVELOPMENT CENTER
SHENZHEN BEIKE RUISHENG TECHNOLOGY Co Ltd
Original Assignee
PKU-HKUST SHENZHEN INSTITUTE
SHENGANG MANUFACTURE-LEARNING-RESEARCH BASE INDUSTRY DEVELOPMENT CENTER
SHENZHEN BEIKE RUISHENG TECHNOLOGY Co Ltd
Application filed by PKU-HKUST SHENZHEN INSTITUTE, SHENGANG MANUFACTURE-LEARNING-RESEARCH BASE INDUSTRY DEVELOPMENT CENTER, and SHENZHEN BEIKE RUISHENG TECHNOLOGY Co Ltd
Priority to CN201210031534.1A
Publication of CN103247316A
Application granted
Publication of CN103247316B
Expired - Fee Related

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

A method for building an index in audio retrieval comprises the following steps: an audio collection device obtains audio data; the audio collection device calculates the index value of the audio data and sends the audio data and its index value to a server; the server builds an index according to the received audio data and its index value. A system for building an index in audio retrieval is also provided. The method and system reduce the additional hardware cost incurred when an audio retrieval system is scaled up.

Description

Method and system for building an index in audio retrieval
[technical field]
The present invention relates to the field of multimedia information, and in particular to a method and system for building an index in audio retrieval.
[background technology]
Audio is an important information carrier. Audio retrieval is a technique that searches a large number of audio files by keyword and returns the relevant results. The keyword may be text or an audio fragment. Content-based audio retrieval requires extracting the feature parameters of each audio file and generating an index corresponding to the speech, which is a computationally expensive operation.
Conventional audio retrieval methods build the audio resource library in advance on a centralized server. The retrieval client obtains the input audio fragment or text keyword and sends it to the server. On receipt, the server either computes the feature code of the audio fragment with a speech recognition algorithm or uses the text keyword, searches the audio sample library for the matching audio resources, and sends them to the retrieval client.
However, although several servers can share the computation, index building in conventional audio retrieval is mainly processed centrally on servers: the index is built only after the servers receive the audio data, which requires many servers. When there is a large amount of audio data, especially in environments such as call centers that produce large volumes of speech data every day, index building consumes substantial server computing resources. Expanding the business then requires adding servers, which raises the additional hardware cost of capacity expansion and makes the system hard to scale.
[summary of the invention]
In view of this, it is necessary to provide an index building method for audio retrieval that is easy to scale.
A method for building an index in audio retrieval comprises the following steps:
An audio collection device obtains audio data;
The audio collection device calculates the index value of the audio data, and sends the audio data and its index value to a server;
The server builds an index according to the received audio data and the index value of the audio data.
Preferably, the index comprises a global identifier corresponding to the audio data.
Preferably, there are multiple audio collection devices;
The step in which the server builds an index according to the received audio data and the index value of the audio data is specifically:
The server first filters out audio resources with identical index values, then builds the index according to the filtered audio data and its index values, and stores the audio data in an audio resource library.
Preferably, the step in which the audio collection device calculates the index value of the audio data is specifically:
The audio collection device preprocesses the audio data and extracts acoustic feature parameters;
The audio collection device performs speaker segmentation and speech segmentation on the audio data;
The audio collection device calculates the index value of the segmented audio data according to the acoustic feature parameters, a preset acoustic model, a language model, and a pronunciation dictionary.
Preferably, the step in which the audio collection device performs speaker segmentation and speech segmentation on the audio data further comprises:
Detecting silence in the audio data, segmenting the audio accordingly, and classifying the segmented audio data by speaker category.
Preferably, the step in which the audio collection device calculates the index value of the segmented audio data according to the acoustic feature parameters, the preset acoustic model, the language model, and the pronunciation dictionary is specifically:
Generating a pinyin lattice by speech recognition decoding, according to the speech feature parameters of the segmented audio data, a pinyin language model, the acoustic model, and a first pronunciation dictionary;
Generating a word lattice according to the pinyin lattice, a word-based language model, and a second pronunciation dictionary;
Generating the index value of the segmented audio data according to the word lattice.
Preferably, the method further comprises:
A retrieval client obtains a retrieval request;
The retrieval client determines whether the retrieval request contains an audio fragment and, if so, extracts the audio fragment from the retrieval request, calculates its index value, and sends the index value to the server;
The server looks up the audio data corresponding to the index value in the index and returns it to the retrieval client.
In addition, it is also necessary to provide an index building system for audio retrieval that is easy to scale.
A system for building an index in audio retrieval comprises an audio collection device and a server, the audio collection device comprising:
An audio acquisition module, configured to obtain audio data;
An index value calculation module, configured to calculate the index value of the audio data and send the audio data and its index value to the server;
The server comprises:
An index building module, configured to build an index according to the received audio data and the index value of the audio data.
Preferably, the index comprises a global identifier corresponding to the audio data.
Preferably, there are multiple audio collection devices;
The index building module is further configured to filter out audio resources with identical index values, build the index according to the filtered audio data and its index values, and store the audio data in an audio resource library.
Preferably, the index value calculation module is further configured to preprocess the audio data and extract acoustic feature parameters; perform speaker segmentation and speech segmentation on the audio data; and calculate the index value of the segmented audio data according to the acoustic feature parameters, a preset acoustic model, a language model, and a pronunciation dictionary.
Preferably, the index value calculation module is further configured to detect silence in the audio data, segment the audio accordingly, and classify the segmented audio data by speaker category.
Preferably, the index value calculation module is further configured to generate a pinyin lattice by speech recognition decoding, according to the speech feature parameters of the segmented audio data, a pinyin language model, the acoustic model, and a first pronunciation dictionary; generate a word lattice according to the pinyin lattice, a word-based language model, and a second pronunciation dictionary; and generate the index value of the segmented audio data according to the word lattice.
Preferably, the system further comprises a retrieval client, configured to obtain a retrieval request, determine whether the retrieval request contains an audio fragment and, if so, extract the audio fragment from the retrieval request, calculate its index value, and send the index value to the server;
The server further comprises an audio retrieval module, configured to look up the audio data corresponding to the index value in the index and return it to the retrieval client.
In the above method and system for building an index in audio retrieval, the audio collection device obtains the audio data, calculates its index value, and uploads the index value and the audio data to the server. The server then builds the index from the index value and the audio data. The work of calculating index values is thereby moved from the server to the audio collection devices. For example, in a call center, the audio collection device can be the terminal PC of a customer service agent, and each terminal PC can process the audio data it collects on the same day. When the call center expands, capacity grows with the number of terminal PCs and no additional servers are needed, so no extra expense is incurred. The method therefore reduces the additional hardware cost of expanding an audio retrieval system and makes it easier to scale.
[accompanying drawing explanation]
Fig. 1 is a flowchart of the method for building an index in audio retrieval in one embodiment;
Fig. 2 is a flowchart of the step in which the audio collection device calculates the index value of the audio data in one embodiment;
Fig. 3 is a flowchart of the step in which the audio collection device calculates the index value of the segmented audio data according to the acoustic feature parameters, the preset acoustic model, the language model, and the pronunciation dictionary in one embodiment;
Fig. 4 is a flowchart of the audio retrieval step in one embodiment;
Fig. 5 is a structural diagram of the system for building an index in audio retrieval in one embodiment;
Fig. 6 is a structural diagram of the system for building an index in audio retrieval in another embodiment.
[embodiment]
As shown in Fig. 1, in one embodiment, a method for building an index in audio retrieval comprises the following steps:
Step S102, the audio collection device obtains audio data.
The audio data can be speech, music, and so on. The audio collection device may obtain audio data by capturing the user's speech through an audio input device such as a microphone or through the sound card output buffer, or by reading an audio file.
Step S104, the audio collection device calculates the index value of the audio data, and sends the audio data and its index value to the server.
In one embodiment, the audio collection device can be a terminal device with a certain amount of computing power, which can not only capture speech but also process the audio. Examples include retrieval clients such as the terminal PC of an operator in a call center machine room or a user's smartphone on a mobile network.
The audio collection device analyzes the features of the audio data, generates the corresponding index value, and then sends the index value to the server together with the audio data. In one embodiment, the audio collection device sends the index value and the audio data to the server by delayed sending. When the audio collection device detects that the server is busy, it first caches the acquired audio data and the generated index value locally in the form of a local index, and uploads this local index once the server load has dropped.
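The delayed sending described above can be sketched as a small buffer on the collection device. This is only an illustration under stated assumptions: the class name `DelayedUploader` and the two callables `server_is_busy` and `upload` are hypothetical, and the source does not specify how the device detects server load or transmits data.

```python
from collections import deque


class DelayedUploader:
    """Buffers (index_value, audio) pairs locally while the server is busy."""

    def __init__(self, server_is_busy, upload):
        # server_is_busy: hypothetical callable, True while the server reports high load
        # upload: hypothetical callable that sends a list of (index_value, audio) pairs
        self.server_is_busy = server_is_busy
        self.upload = upload
        self.local_index = deque()  # the cached "local index"

    def submit(self, index_value, audio_bytes):
        """Cache a new entry, then try to upload everything cached so far."""
        self.local_index.append((index_value, audio_bytes))
        self.flush()

    def flush(self):
        """Upload the whole local index, but only when the server is not busy."""
        if self.local_index and not self.server_is_busy():
            batch = list(self.local_index)
            self.local_index.clear()
            self.upload(batch)
```

A device would call `submit` after computing each index value and call `flush` again on a timer, so that cached entries are sent once the load drops.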
Step S106, the server builds an index according to the received audio data and the index value of the audio data.
After the server receives the audio data and the corresponding index value uploaded by the audio collection device, it may first assign the audio data a global identifier; the index that is built may then include this global identifier. The index value is associated with the global identifier as a key-value pair. All of the index information together forms the "global index".
In one embodiment, there can be multiple audio collection devices. The server may first filter out audio resources with identical index values, then build the index according to the filtered audio data and its index values, and store the audio data in the audio resource library.
In one embodiment, the audio collection device sends the index value and the audio data to the server by delayed sending. The audio collection device stores the cached index values and audio data as a local index, while the index stored on the server is the global index. After the server receives a local index sent late by an audio collection device, it filters out the part of the local index that duplicates the global index and then adds the filtered local index to the global index. Filtering out duplicate index entries reduces the storage pressure on the server.
In the above method for building an index in audio retrieval, the audio collection device obtains the audio data, calculates its index value, and uploads the index value and the audio data to the server. The server then builds the index from the index value and the audio data. The work of calculating index values is thereby moved from the server to the audio collection devices. For example, in a call center, the audio collection device can be the terminal PC of a customer service agent, and each terminal PC can process the audio data it collects on the same day. When the call center expands, capacity grows with the number of terminal PCs and no additional servers are needed, so no extra expense is incurred. The method therefore reduces the additional hardware cost of expanding an audio retrieval system and makes it easier to scale.
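The key-value association between index value and global identifier might look like the following minimal sketch. The class `GlobalIndex`, its attribute names, and the use of sequential integers as global identifiers are assumptions made for illustration; the source does not say how identifiers are generated or stored.

```python
import itertools


class GlobalIndex:
    """Server-side index: index value -> global identifier -> audio data."""

    def __init__(self):
        self._next_id = itertools.count(1)  # assumed: sequential integer identifiers
        self.index = {}        # index value -> global identifier (the key-value pairs)
        self.audio_store = {}  # global identifier -> audio bytes (the resource library)

    def add(self, index_value, audio_bytes):
        """Assign a global identifier to the audio and record it under its index value."""
        global_id = next(self._next_id)
        self.audio_store[global_id] = audio_bytes
        self.index[index_value] = global_id
        return global_id
```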
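A sketch of merging a late-arriving local index into the global index while filtering duplicates is given here. Both indexes are simplified to plain dictionaries from index value to audio bytes, and the function name `merge_local_index` is hypothetical.

```python
def merge_local_index(global_index, local_index):
    """Add local entries whose index value is not yet in the global index.

    Both arguments are dicts mapping index value -> audio bytes (a simplification).
    Returns the list of index values that were actually added.
    """
    added = []
    for index_value, audio_bytes in local_index.items():
        if index_value in global_index:
            continue  # duplicate: already indexed, possibly by another device
        global_index[index_value] = audio_bytes
        added.append(index_value)
    return added
```

Because duplicates are skipped before anything is stored, the same recording uploaded by two devices occupies server storage only once.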
In one embodiment, as shown in Fig. 2, the step in which the audio collection device calculates the index value of the audio data can be specifically:
Step S202, the audio collection device preprocesses the audio data and extracts acoustic feature parameters.
Step S204, the audio collection device performs speaker segmentation and speech segmentation on the audio data.
Step S206, the audio collection device calculates the index value of the segmented audio data according to the acoustic feature parameters, the preset acoustic model, the language model, and the pronunciation dictionary.
In step S202, the audio collection device preprocesses the audio data by applying at least one of the following operations: filtering, pre-emphasis, framing, windowing, and zero padding. After preprocessing, the audio data can be transcoded, segmented, and given simple labels so that the speech is associated with the corresponding customer and customer service agent information, and then saved in the client database of the audio collection device.
The feature parameters of the audio data are obtained by performing feature extraction on the framed speech. For example, the acoustic feature parameters of the audio data may be extracted with the conventional MFCC (Mel-frequency cepstral coefficient) extraction method.
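The preprocessing operations named in step S202 can be illustrated in pure Python as follows. The parameter values (a pre-emphasis coefficient of 0.97, 400-sample frames with a 160-sample hop, a Hamming window, padding to 512 samples) are common choices assumed here for the sketch; the source does not specify any of them.

```python
import math


def preprocess(samples, frame_len=400, hop=160, alpha=0.97, fft_size=512):
    """Pre-emphasize, frame, window, and zero-pad a list of float samples."""
    if not samples:
        return []
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = [samples[0]] + [
        samples[n] - alpha * samples[n - 1] for n in range(1, len(samples))
    ]
    # Hamming window coefficients (assumed window type)
    window = [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    frames = []
    # Framing: overlapping frames of frame_len samples, advanced by hop samples
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = [emphasized[start + n] * window[n] for n in range(frame_len)]
        frame += [0.0] * (fft_size - frame_len)  # zero padding up to fft_size
        frames.append(frame)
    return frames
```

Each returned frame is ready for a spectral transform, which is the usual next step before computing features such as MFCCs.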
In step S204, the audio collection device detects silence in the audio data, segments the audio accordingly, and classifies the segmented audio data by speaker category.
A silent period is a time span in a continuous audio signal during which the amplitude stays below a threshold. Silence detection is performed on the audio data using a preset silence duration, and can split a long stretch of speech into multiple segments by sentence. For example, in a call center, the audio collection device first classifies the customer's speech by speaker category. A GMM (Gaussian mixture model) can be used to classify the speech against the models of several speakers in an existing model library; the speech can also be classified as male, female, or neutral voice; alternatively, users can be numbered in advance and distinguished by user number.
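A minimal sketch of amplitude-threshold silence detection follows. It works on individual samples for brevity, whereas a practical detector would more likely work on frame energies; the function name and the default `threshold` and `min_silence` values are assumptions.

```python
def split_on_silence(samples, threshold=0.02, min_silence=3):
    """Return (start, end) sample ranges of the non-silent segments.

    A run of at least `min_silence` consecutive samples whose absolute
    amplitude is below `threshold` is treated as a silent period.
    """
    segments = []
    start = None   # start of the current speech segment, if any
    quiet_run = 0  # length of the current run of quiet samples
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            quiet_run += 1
            if start is not None and quiet_run == min_silence:
                # The segment ended where the quiet run began.
                segments.append((start, i - min_silence + 1))
                start = None
        else:
            quiet_run = 0
            if start is None:
                start = i
    if start is not None:
        # Close the last segment, trimming any short trailing quiet run.
        segments.append((start, len(samples) - quiet_run))
    return segments
```

Short pauses (fewer than `min_silence` quiet samples) stay inside a segment, so a brief hesitation does not split a sentence.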
In step S206, as shown in Fig. 3, the step in which the audio collection device calculates the index value of the segmented audio data according to the acoustic feature parameters, the preset acoustic model, the language model, and the pronunciation dictionary is specifically:
Step S302, a pinyin lattice is generated by speech recognition decoding, according to the speech feature parameters of the segmented audio data, the pinyin language model, the acoustic model, and the first pronunciation dictionary.
The pinyin language model and acoustic model corresponding to the speaker category of the segmented audio data can be selected from the pinyin language model library and the acoustic model library. For example, if the customer's voice is male, the pinyin language model and acoustic model for male voices are selected.
The first pronunciation dictionary describes the relationship between each unit (syllable) in the pinyin language model and the initials and finals in the acoustic model.
The pinyin language model is a statistical language model whose unit is the pinyin syllable. The basic acoustic units, which are the basic modeling units of the acoustic model, can include initials and finals, phonemes, and syllables. Preferably, the acoustic model can be an HMM (hidden Markov model), and speech recognition decoding is performed with the Viterbi algorithm to generate the pinyin lattice. In the pinyin lattice, each node represents a pinyin syllable and is labeled with the start or end time of that syllable, and the connections between nodes are labeled with the acoustic probability and the linguistic probability of the syllable.
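For readers unfamiliar with Viterbi decoding, a toy version over a discrete HMM is sketched here. It returns only the single most probable state sequence, not a full lattice of alternatives as the decoder in the text produces, and all states, observation symbols, and probabilities passed to it are made up for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Choose the predecessor state that maximizes the path probability.
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            column[s] = (prob, prev)
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, prob
```

A lattice-producing decoder keeps several competing hypotheses per time step instead of only the maximum, which is what allows later stages to rescore alternatives.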
Step S304, a word lattice is generated according to the pinyin lattice, the word-based language model, and the second pronunciation dictionary.
The second pronunciation dictionary describes the relationship between each unit (Chinese word) in the word-based language model and the syllables in the pinyin language model (that is, the nodes of the pinyin lattice generated during Viterbi decoding). In the generated word lattice, each node represents a Chinese word and is labeled with the start or end time of that word, and the connections between nodes are labeled with the acoustic probability and the linguistic probability of the word. For a polyphonic word, the node must also be labeled with the sequence number of the pronunciation used.
Further, the confidence of each word in the word lattice can be calculated. The confidence is the score of a word in the word lattice and can be computed from information such as the acoustic model probability, the pinyin syllable probability, the word probability, the word duration, and the number of candidates in the word lattice. Confidence can be used to judge the estimated accuracy of speech recognition. For example, a stretch of speech that is indistinct or contains polyphonic characters is easily recognized as several different Chinese words; the confidence expresses the estimated accuracy of each recognized word. The higher the confidence, the less interference there was during recognition and the more reliable the recognized word.
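One possible in-memory representation of the word lattice just described is sketched with dataclasses. All class and field names are hypothetical, and storing both a start and an end time per node is a choice made for this sketch (the text requires only one of them).

```python
from dataclasses import dataclass, field


@dataclass
class LatticeNode:
    word: str                  # the Chinese word this node represents
    start_time: float          # seconds
    end_time: float            # seconds
    pronunciation_id: int = 0  # distinguishes the readings of a polyphonic word


@dataclass
class LatticeEdge:
    src: int              # index of the source node
    dst: int              # index of the destination node
    acoustic_prob: float
    language_prob: float


@dataclass
class WordLattice:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        """Append a node and return its index."""
        self.nodes.append(node)
        return len(self.nodes) - 1

    def add_edge(self, src, dst, acoustic_prob, language_prob):
        """Connect two nodes, labeling the connection with both probabilities."""
        self.edges.append(LatticeEdge(src, dst, acoustic_prob, language_prob))
```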
Step S306, the index value of the segmented audio data is generated according to the word lattice.
The word lattice can be used directly as the index value of the audio data, or a hash value computed from the word lattice with a preset hash function can be used as the index value.
In one embodiment, after building the index according to the received audio data and its index value, the server can also compress the audio data and the index to save server storage space.
In one embodiment, after building the index according to the received audio data and its index value, the server also stores the index as an inverted index.
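The hash-based variant can be sketched as follows. SHA-256 stands in for the "preset hash function", which the source does not name, and the lattice is simplified to a collection of (word, start time, end time) tuples.

```python
import hashlib


def lattice_index_value(lattice_nodes):
    """Hash a word lattice given as (word, start_time, end_time) tuples."""
    # Sort first so that the same lattice always serializes the same way.
    canonical = "|".join(
        f"{word}:{start:.2f}:{end:.2f}" for word, start, end in sorted(lattice_nodes)
    )
    # SHA-256 is an assumed stand-in for the preset hash function.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Canonicalizing before hashing matters here because the collection device and the retrieval client must derive the same index value from the same audio.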
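An inverted index maps each term to the items that contain it. A minimal sketch, assuming each audio segment has been reduced to a list of recognized words keyed by its global identifier (the function name and data layout are illustrative only):

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each recognized word to the sorted global identifiers containing it.

    `documents` maps a global identifier to the list of words recognized
    in that audio segment (a simplification of the word lattice).
    """
    inverted = defaultdict(set)
    for global_id, words in documents.items():
        for word in words:
            inverted[word].add(global_id)
    return {word: sorted(ids) for word, ids in inverted.items()}
```

With this layout a text keyword query becomes a single dictionary lookup rather than a scan over every stored segment.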
In one embodiment, as shown in Fig. 4, the method for building an index in audio retrieval further comprises the following audio retrieval steps:
Step S402, the retrieval client obtains a retrieval request.
Step S404, the retrieval client determines whether the retrieval request contains an audio fragment and, if so, extracts the audio fragment from the retrieval request, calculates its index value, and sends the index value to the server.
Step S406, the server looks up the audio data corresponding to the index value in the index and returns it to the retrieval client.
The retrieval client obtains the audio fragment and calculates its index value in the same way that the audio collection device obtains audio data and calculates its index value. This ensures that identical audio data yields the same index value on the retrieval client and on the audio collection device.
In one embodiment, after receiving the index value uploaded by the retrieval client, the server searches the index to obtain the global identifier of the audio data corresponding to the index value, then uses the global identifier to fetch the corresponding audio data from the audio resource library, and returns it to the retrieval client.
In one embodiment, the server also periodically synchronizes the pinyin language model, the acoustic model, the word-based language model, and the pronunciation dictionaries on the audio collection devices and retrieval clients. After synchronization, the audio collection devices and retrieval clients use the same algorithm and parameters when calculating index values, which ensures that identical audio data yields the same index value on both.
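The two-stage lookup described above (index value to global identifier, then global identifier to audio data) can be sketched with two dictionaries. The function and parameter names are hypothetical, and returning `None` for an unknown index value is an assumption, since the source does not describe the no-match case.

```python
def retrieve(index_value, index, audio_store):
    """Look up audio data by index value via its global identifier.

    index: dict mapping index value -> global identifier
    audio_store: dict mapping global identifier -> audio bytes
    Returns None when the index value is unknown (assumed behavior).
    """
    global_id = index.get(index_value)
    if global_id is None:
        return None
    return audio_store.get(global_id)
```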
As shown in Fig. 5, in one embodiment, a system for building an index in audio retrieval comprises an audio collection device 100 and a server 200, wherein the audio collection device 100 comprises:
An audio acquisition module 102, configured to obtain audio data.
The audio data can be speech, music, and so on. The audio acquisition module 102 may obtain audio data by capturing the user's speech through an audio input device such as a microphone or through the sound card output buffer, or by reading an audio file.
An index value calculation module 104, configured to calculate the index value of the audio data and send the audio data and its index value to the server 200.
In one embodiment, the audio collection device 100 can be a terminal device with a certain amount of computing power, which can not only capture speech through the audio acquisition module 102 but also process the audio through the index value calculation module 104. Examples include retrieval clients such as the terminal PC of an operator in a call center machine room or a user's smartphone on a mobile network.
The index value calculation module 104 analyzes the features of the audio data, generates the corresponding index value, and then sends the index value to the server 200 together with the audio data.
In one embodiment, the index value calculation module 104 sends the index value and the audio data to the server 200 by delayed sending. When the index value calculation module 104 detects that the server 200 is busy, it first caches the acquired audio data and the generated index value in the audio collection device 100 in the form of a local index, and uploads this local index once the load on the server 200 has dropped.
The server 200 comprises:
An index building module 202, configured to build an index according to the received audio data and the index value of the audio data.
After the server 200 receives the audio data and the corresponding index value uploaded by the index value calculation module 104, it may first assign the audio data a global identifier; the index that is built may then include this global identifier. The index value is associated with the global identifier as a key-value pair.
In one embodiment, there can be multiple audio collection devices 100. The index building module 202 can also be configured to filter out audio resources with identical index values, build the index according to the filtered audio data and its index values, and store the audio data in the audio resource library.
In one embodiment, the audio collection device sends the index value and the audio data to the server by delayed sending. The audio collection device stores the cached index values and audio data as a local index, while the index stored on the server is the global index. After the server receives a local index sent late by an audio collection device, it filters out the part of the local index that duplicates the global index and then adds the filtered local index to the global index. Filtering out duplicate index entries reduces the storage pressure on the server.
In the above system for building an index in audio retrieval, the audio collection device obtains the audio data, calculates its index value, and uploads the index value and the audio data to the server. The server then builds the index from the index value and the audio data. The work of calculating index values is thereby moved from the server to the audio collection devices. For example, in a call center, the audio collection device can be the terminal PC of a customer service agent, and each terminal PC can process the audio data it collects on the same day. When the call center expands, capacity grows with the number of terminal PCs and no additional servers are needed, so no extra expense is incurred. The system therefore reduces the additional hardware cost of expanding an audio retrieval system and makes it easier to scale.
In one embodiment, the index value calculation module 104 can also be configured to preprocess the audio data and extract acoustic feature parameters; perform speaker segmentation and speech segmentation on the audio data; and calculate the index value of the segmented audio data according to the acoustic feature parameters, the preset acoustic model, the language model, and the pronunciation dictionary.
The index value calculation module 104 can also be configured to preprocess the audio data by applying at least one of the following operations: filtering, pre-emphasis, framing, windowing, and zero padding. After preprocessing, the audio data can be transcoded, segmented, and given simple labels so that the speech is associated with the corresponding customer and customer service agent information, and then saved in the client database of the audio collection device.
The index value calculation module 104 can also be configured to obtain the feature parameters by performing feature extraction on the framed speech. For example, the acoustic feature parameters of the audio data may be extracted with the conventional MFCC (Mel-frequency cepstral coefficient) extraction method.
The index value calculation module 104 can also be configured to detect silence in the audio data, segment the audio accordingly, and classify the segmented audio data by speaker category.
A silent period is a time span in a continuous audio signal during which the amplitude stays below a threshold. Silence detection is performed on the audio data using a preset silence duration, and can split a long stretch of speech into multiple segments by sentence. For example, in a call center, the audio collection device first classifies the customer's speech by speaker category. A GMM (Gaussian mixture model) can be used to classify the speech against the models of several speakers in an existing model library; the speech can also be classified as male, female, or neutral voice; alternatively, users can be numbered in advance and distinguished by user number.
Further, the index value calculation module 104 can also be configured to generate a pinyin lattice by speech recognition decoding, according to the speech feature parameters of the segmented audio data, the pinyin language model, the acoustic model, and the first pronunciation dictionary; generate a word lattice according to the pinyin lattice, the word-based language model, and the second pronunciation dictionary; and generate the index value of the segmented audio data according to the word lattice.
The pinyin language model and acoustic model corresponding to the speaker category of the segmented audio data can be selected from the pinyin language model library and the acoustic model library. For example, if the customer's voice is male, the pinyin language model and acoustic model for male voices are selected.
The phonetic language model is a statistical language model whose units are pinyin syllables. The first pronunciation dictionary represents the relation between each unit (syllable) of the phonetic language model and the initials and finals of the acoustic model. The basic acoustic units, which may include initials and finals, phonemes, and syllables, are the basic modeling units of the acoustic model. Preferably, the acoustic model may be an HMM (Hidden Markov Model), and speech recognition decoding is performed with the Viterbi algorithm to generate the phonetic grid. In the phonetic grid, each node represents a pinyin syllable and is marked with the start or end time of that syllable; the connections between nodes are marked with the acoustic probability and the linguistic probability of the syllable.
The second pronunciation dictionary represents the relation between each unit (Chinese word) of the word-based language model and the syllables of the phonetic language model (that is, the nodes of the phonetic grid generated during Viterbi decoding). In the generated word grid, each node represents a Chinese word and is marked with the start or end time of that word; the connections between nodes are marked with the acoustic probability and the linguistic probability of the Chinese word. For a word with multiple pronunciations, the node must also be marked with the sequence number of the pronunciation type of that word.
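The mapping from syllable nodes to word nodes through the second pronunciation dictionary can be illustrated with a simplified sketch. The names `LatticeNode` and `words_from_syllables` are hypothetical, and several simplifications are assumed: the input is a single linear syllable sequence rather than a full grid with alternative paths, and the acoustic and linguistic probabilities on the connections are omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatticeNode:
    label: str     # a pinyin syllable or a Chinese word
    start: float   # start time in seconds
    end: float     # end time in seconds

def words_from_syllables(syllable_nodes, pron_dict):
    """Find every run of adjacent syllable nodes that spells a dictionary word.

    pron_dict maps a tuple of syllables to the list of candidate words
    with that pronunciation (a stand-in for the second pronunciation dictionary).
    """
    word_nodes = []
    n = len(syllable_nodes)
    for i in range(n):
        for j in range(i + 1, n + 1):
            key = tuple(node.label for node in syllable_nodes[i:j])
            for word in pron_dict.get(key, []):
                # The word node spans from the first to the last syllable.
                word_nodes.append(
                    LatticeNode(word, syllable_nodes[i].start,
                                syllable_nodes[j - 1].end))
    return word_nodes
```

Because homophones map to the same syllable tuple, one syllable run can produce several competing word nodes with the same time span, which is why the word grid carries probabilities for choosing among them.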
Further, a confidence score may be calculated for each word in the word grid. The confidence is the score of each word in the word grid and may be calculated from information such as the acoustic model probability, the pinyin syllable probability, the word probability, the word duration, and the number of candidates in the word grid. The confidence can be used to judge the estimated accuracy of the speech recognition. For example, a speech segment that is indistinct or contains a polyphone is easily recognized as several different Chinese words; the confidence represents the estimated accuracy of the recognized Chinese word. The higher the confidence, the less interference there was during recognition and the more reliable the recognized Chinese word.
In one embodiment, the index construct module 202 may also compress the voice data and the index, thereby saving storage space on the server 200.
In one embodiment, the index construct module 202 may also be used to store the built index as an inverted index.
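Inverted-index storage can be sketched as follows. The sketch assumes, for illustration only, that the index value of each audio item can be treated as a list of recognized words and that each item has a global identifier; the function names `build_inverted_index` and `lookup` are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """documents maps a global audio ID to the list of recognized words.

    Returns a mapping from each word to the set of audio IDs containing it.
    """
    index = defaultdict(set)
    for audio_id, words in documents.items():
        for word in words:
            index[word].add(audio_id)
    return index

def lookup(index, query_words):
    """Return the IDs of audio items that contain every query word."""
    postings = [index.get(w, set()) for w in query_words]
    return set.intersection(*postings) if postings else set()
```

With this layout a query touches only the posting sets of its own words instead of scanning every stored audio item.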
In one embodiment, as shown in Figure 6, the system for building an index in audio retrieval further comprises a retrieval client 300 for obtaining a retrieval request. The retrieval client judges whether the retrieval request contains an audio fragment; if so, it extracts the audio fragment from the retrieval request, calculates the index value of the audio fragment, and sends the index value to the server 200.
The server 200 further comprises:
An audio retrieval module 204, for looking up in the index the voice data corresponding to the index value, and returning it to the retrieval client 300.
The method by which the retrieval client 300 obtains the audio fragment and calculates its index value is the same as the method by which the index calculation module 102 obtains voice data and calculates the index value of the voice data. This ensures that identical voice data yields the same index value in the retrieval client 300 and in the index calculation module 102.
In one embodiment, after receiving the index value uploaded by the retrieval client 300, the server 200 searches the index to obtain the global identifier of the voice data corresponding to the index value, then obtains the voice data corresponding to that global identifier from the audio resource library, and returns it to the retrieval client 300.
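The server-side flow (filtering out audio with an identical index value when building the index, then resolving an index value to a global identifier and finally to the stored audio) can be sketched as follows. The class name `AudioServer`, the in-memory dictionaries, and the `audio-N` identifier format are assumptions made for illustration; index values are treated as opaque hashable keys.

```python
class AudioServer:
    """Toy in-memory stand-in for the index and the audio resource library."""

    def __init__(self):
        self.index = {}         # index value -> global identifier
        self.audio_store = {}   # global identifier -> audio bytes
        self._next_id = 0

    def add(self, index_value, audio):
        """Store audio unless an item with the same index value already exists."""
        if index_value in self.index:
            # Duplicate index value: keep the existing entry, discard the new audio.
            return self.index[index_value]
        global_id = f"audio-{self._next_id}"
        self._next_id += 1
        self.index[index_value] = global_id
        self.audio_store[global_id] = audio
        return global_id

    def retrieve(self, index_value):
        """Two-step lookup: index value -> global identifier -> audio data."""
        global_id = self.index.get(index_value)
        if global_id is None:
            return None
        return self.audio_store[global_id]
```

Keeping the global identifier as an intermediate key lets the index stay small while the audio itself lives in a separate store.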
In one embodiment, the server 200 also periodically synchronizes the phonetic language model, the acoustic model, the word-based language model, and the pronunciation dictionaries in the audio collecting device 100 and the retrieval client 300. After synchronization, the audio collecting device 100 and the retrieval client 300 can use the same algorithm and parameters when calculating the index value of voice data, which ensures that identical voice data yields the same index value in the retrieval client 300 and in the audio collecting device 100.
The embodiments described above express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the concept of the invention, all of which fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (14)

1. A method for building an index in audio retrieval, comprising the following steps:
An audio collecting device obtains voice data;
The audio collecting device calculates an index value of the voice data, and sends the voice data and the index value of the voice data to a server;
The server builds an index according to the received voice data and the index value of the voice data; specifically, the server first filters out audio resources having identical index values, then builds the index according to the filtered voice data and the index value of the voice data, and stores the voice data in an audio resource library.
2. The method for building an index in audio retrieval according to claim 1, characterized in that the index comprises a global identifier corresponding to the voice data.
3. The method for building an index in audio retrieval according to claim 1, characterized in that there are a plurality of audio collecting devices.
4. The method for building an index in audio retrieval according to claim 1, characterized in that the step in which the audio collecting device calculates the index value of the voice data is specifically:
The audio collecting device pre-processes the voice data and extracts acoustic feature parameters;
The audio collecting device performs speaker segmentation and speech segmentation on the voice data;
The audio collecting device calculates the index value of the segmented voice data according to the acoustic feature parameters, a preset acoustic model, a language model, and a pronunciation dictionary.
5. The method for building an index in audio retrieval according to claim 4, characterized in that the step in which the audio collecting device performs speaker segmentation and speech segmentation on the voice data further comprises:
Detecting silence in the voice data, segmenting the audio, and classifying the segmented voice data by speaker category.
6. The method for building an index in audio retrieval according to claim 4, characterized in that the step in which the audio collecting device calculates the index value of the segmented voice data according to the acoustic feature parameters, the preset acoustic model, the language model, and the pronunciation dictionary is specifically:
Generating a phonetic grid by speech recognition decoding, according to the speech feature parameters of the segmented voice data, a phonetic language model, an acoustic model, and a first pronunciation dictionary;
Generating a word grid according to the phonetic grid, a word-based language model, and a second pronunciation dictionary;
Generating the index value of the segmented voice data according to the word grid.
7. The method for building an index in audio retrieval according to any one of claims 1 to 6, characterized in that the method further comprises:
A retrieval client obtains a retrieval request;
The retrieval client judges whether the retrieval request contains an audio fragment; if so, it extracts the audio fragment from the retrieval request, calculates the index value of the audio fragment, and sends the index value to the server;
The server looks up in the index the voice data corresponding to the index value, and returns it to the retrieval client.
8. A system for building an index in audio retrieval, characterized in that it comprises an audio collecting device and a server, the audio collecting device comprising:
An audio acquisition module, for obtaining voice data;
An index value computing module, for calculating an index value of the voice data, and sending the voice data and the index value of the voice data to the server;
The server comprising:
An index construct module, for building an index according to the received voice data and the index value of the voice data; specifically, the index construct module is further used for filtering out audio resources having identical index values, building the index according to the filtered voice data and the index value of the voice data, and storing the voice data in an audio resource library.
9. The system for building an index in audio retrieval according to claim 8, characterized in that the index comprises a global identifier corresponding to the voice data.
10. The system for building an index in audio retrieval according to claim 8, characterized in that there are a plurality of audio collecting devices.
11. The system for building an index in audio retrieval according to claim 8, characterized in that the index value computing module is further used for pre-processing the voice data and extracting acoustic feature parameters; performing speaker segmentation and speech segmentation on the voice data; and calculating the index value of the segmented voice data according to the acoustic feature parameters, a preset acoustic model, a language model, and a pronunciation dictionary.
12. The system for building an index in audio retrieval according to claim 11, characterized in that the index value computing module is further used for detecting silence in the voice data, segmenting the audio, and classifying the segmented voice data by speaker category.
13. The system for building an index in audio retrieval according to claim 11, characterized in that the index value computing module is further used for generating a phonetic grid by speech recognition decoding, according to the speech feature parameters of the segmented voice data, a phonetic language model, an acoustic model, and a first pronunciation dictionary; generating a word grid according to the phonetic grid, a word-based language model, and a second pronunciation dictionary; and generating the index value of the segmented voice data according to the word grid.
14. The system for building an index in audio retrieval according to any one of claims 8 to 13, characterized in that it further comprises a retrieval client for obtaining a retrieval request, judging whether the retrieval request contains an audio fragment, and, if so, extracting the audio fragment from the retrieval request, calculating the index value of the audio fragment, and sending the index value to the server;
The server further comprises an audio retrieval module, for looking up in the index the voice data corresponding to the index value, and returning it to the retrieval client.
CN201210031534.1A 2012-02-13 2012-02-13 The method and system of index building in a kind of audio retrieval Expired - Fee Related CN103247316B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210031534.1A CN103247316B (en) 2012-02-13 2012-02-13 The method and system of index building in a kind of audio retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210031534.1A CN103247316B (en) 2012-02-13 2012-02-13 The method and system of index building in a kind of audio retrieval

Publications (2)

Publication Number Publication Date
CN103247316A CN103247316A (en) 2013-08-14
CN103247316B true CN103247316B (en) 2016-03-16

Family

ID=48926792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210031534.1A Expired - Fee Related CN103247316B (en) 2012-02-13 2012-02-13 The method and system of index building in a kind of audio retrieval

Country Status (1)

Country Link
CN (1) CN103247316B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978366B (en) * 2014-04-14 2018-09-25 深圳市北科瑞声科技股份有限公司 Voice data index establishing method based on mobile terminal and system
CN106782546A (en) * 2015-11-17 2017-05-31 深圳市北科瑞声科技有限公司 Audio recognition method and device
CN108564968A (en) * 2018-04-26 2018-09-21 广州势必可赢网络科技有限公司 A kind of method and device of evaluation customer service
CN113536026B (en) * 2020-04-13 2024-01-23 阿里巴巴集团控股有限公司 Audio searching method, device and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW436745B (en) * 1999-01-16 2001-05-28 Tian Jung Nan Intelligent digital surveillance system
CN101281745A (en) * 2008-05-23 2008-10-08 深圳市北科瑞声科技有限公司 Interactive system for vehicle-mounted voice

Also Published As

Publication number Publication date
CN103247316A (en) 2013-08-14

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160316

Termination date: 20210213