CN103247316A - Method and system for constructing index in voice frequency retrieval - Google Patents

Method and system for constructing index in voice frequency retrieval

Info

Publication number
CN103247316A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012100315341A
Other languages
Chinese (zh)
Other versions
CN103247316B (en)
Inventor
黄石磊
刘轶
程刚
曹文晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PKU-HKUST SHENZHEN INSTITUTE
SHENGANG MANUFACTURE-LEARNING-RESEARCH BASE INDUSTRY DEVELOPMENT CENTER
SHENZHEN BEIKE RUISHENG TECHNOLOGY Co Ltd
Original Assignee
PKU-HKUST SHENZHEN INSTITUTE
SHENGANG MANUFACTURE-LEARNING-RESEARCH BASE INDUSTRY DEVELOPMENT CENTER
SHENZHEN BEIKE RUISHENG TECHNOLOGY Co Ltd
Application filed by PKU-HKUST SHENZHEN INSTITUTE, SHENGANG MANUFACTURE-LEARNING-RESEARCH BASE INDUSTRY DEVELOPMENT CENTER, SHENZHEN BEIKE RUISHENG TECHNOLOGY Co Ltd filed Critical PKU-HKUST SHENZHEN INSTITUTE
Priority to CN201210031534.1A priority Critical patent/CN103247316B/en
Publication of CN103247316A publication Critical patent/CN103247316A/en
Application granted granted Critical
Publication of CN103247316B publication Critical patent/CN103247316B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

A method for constructing an index in audio retrieval comprises the following steps: an audio acquisition device obtains audio data; the audio acquisition device calculates an index value of the audio data and sends the audio data and the index value to a server; and the server constructs the index according to the received audio data and index value. A system for constructing an index in audio retrieval is also provided. The method and system reduce the additional hardware cost of expanding the capacity of an audio retrieval system.

Description

Method and system for constructing an index in audio retrieval
[Technical Field]
The present invention relates to the field of multimedia information processing, and in particular to a method and system for constructing an index in audio retrieval.
[Background Art]
Audio is an important information carrier. Audio retrieval is a technique that searches a large number of audio files by keyword and returns relevant results. The keyword may be text or an audio fragment. In content-based audio retrieval, the feature parameters of each audio file must be extracted and an index corresponding to the speech generated, which is a computationally expensive operation.
In conventional audio retrieval methods, an audio resource library is built in advance on a centralized server. The retrieval client obtains the input audio fragment or text keyword and sends it to the server. On receipt, the server either calculates the feature code of the audio fragment with a speech recognition algorithm or uses the text keyword, searches the audio sample library for audio resources that match, and returns them to the retrieval client.
However, although several servers can share the computing task, index construction in conventional audio retrieval is mainly performed centrally on the servers: the servers build the index only after they have received the audio data. When the volume of audio data is large, particularly in environments such as call centers that produce a large amount of speech data every day, index construction consumes a large amount of server computing resources. When the business grows, servers must be added, which raises the additional hardware cost of capacity expansion and makes expansion difficult.
[Summary of the Invention]
In view of this, it is necessary to provide a method for constructing an index for audio retrieval that is easy to scale.
A method for constructing an index in audio retrieval comprises the following steps:
an audio acquisition device obtains audio data;
the audio acquisition device calculates an index value of the audio data, and sends the audio data and the index value of the audio data to a server;
the server constructs an index according to the received audio data and the index value of the audio data.
Preferably, the index comprises a global identifier corresponding to the audio data.
Preferably, there are a plurality of audio acquisition devices;
the step in which the server constructs the index according to the received audio data and the index value of the audio data is specifically:
the server first filters out audio resources with identical index values, then constructs the index according to the filtered audio data and the index value of the audio data, and stores the audio data in an audio resource library.
Preferably, the step in which the audio acquisition device calculates the index value of the audio data is specifically:
the audio acquisition device preprocesses the audio data and extracts acoustic feature parameters;
the audio acquisition device performs speaker segmentation and speech segmentation on the audio data;
the audio acquisition device calculates the index value of the segmented audio data according to the acoustic feature parameters and a preset acoustic model, language model, and pronunciation dictionary.
Preferably, the step in which the audio acquisition device performs speaker segmentation and speech segmentation on the audio data further comprises:
detecting silence in the audio data, segmenting the audio accordingly, and classifying the segmented audio data by speaker category.
Preferably, the step in which the audio acquisition device calculates the index value of the segmented audio data according to the acoustic feature parameters and the preset acoustic model, language model, and pronunciation dictionary is specifically:
generating a pinyin lattice by speech recognition decoding, according to the speech feature parameters of the segmented audio data, a pinyin language model, the acoustic model, and a first pronunciation dictionary;
generating a word lattice according to the pinyin lattice, a word-based language model, and a second pronunciation dictionary;
generating the index value of the segmented audio data according to the word lattice.
Preferably, the method further comprises:
a retrieval client obtains a retrieval request;
the retrieval client determines whether the retrieval request contains an audio fragment and, if so, extracts the audio fragment from the retrieval request, calculates the index value of the audio fragment, and sends the index value to the server;
the server looks up in the index the audio data corresponding to the index value, and returns it to the retrieval client.
In addition, it is necessary to provide a system for constructing an index for audio retrieval that is easy to scale.
A system for constructing an index in audio retrieval comprises an audio acquisition device and a server, the audio acquisition device comprising:
an audio acquisition module, configured to obtain audio data;
an index value calculation module, configured to calculate the index value of the audio data and to send the audio data and the index value of the audio data to the server;
the server comprising:
an index construction module, configured to construct the index according to the received audio data and the index value of the audio data.
Preferably, the index comprises a global identifier corresponding to the audio data.
Preferably, there are a plurality of audio acquisition devices;
the index construction module is further configured to filter out audio resources with identical index values, construct the index according to the filtered audio data and the index value of the audio data, and store the audio data in an audio resource library.
Preferably, the index value calculation module is further configured to preprocess the audio data and extract acoustic feature parameters; perform speaker segmentation and speech segmentation on the audio data; and calculate the index value of the segmented audio data according to the acoustic feature parameters and a preset acoustic model, language model, and pronunciation dictionary.
Preferably, the index value calculation module is further configured to detect silence in the audio data, segment the audio accordingly, and classify the segmented audio data by speaker category.
Preferably, the index value calculation module is further configured to generate a pinyin lattice by speech recognition decoding, according to the speech feature parameters of the segmented audio data, a pinyin language model, the acoustic model, and a first pronunciation dictionary; generate a word lattice according to the pinyin lattice, a word-based language model, and a second pronunciation dictionary; and generate the index value of the segmented audio data according to the word lattice.
Preferably, the system further comprises a retrieval client, configured to obtain a retrieval request, determine whether the retrieval request contains an audio fragment and, if so, extract the audio fragment from the retrieval request, calculate the index value of the audio fragment, and send the index value to the server;
the server further comprises an audio retrieval module, configured to look up in the index the audio data corresponding to the index value and return it to the retrieval client.
In the above method and system for constructing an index in audio retrieval, the audio acquisition device obtains the audio data, calculates its index value, and uploads the index value and the audio data to the server. The server then constructs the index from the index value and the audio data. The work of calculating index values is thereby moved from the server to the audio acquisition devices. For example, in a call center the audio acquisition device may be the terminal PC of a customer service agent, and each terminal PC can process the audio data it collects on the same day. When the capacity of the call center system grows, it can be expanded by adding terminal PCs without adding servers, so no extra cost is incurred. The above method therefore reduces the additional hardware cost of expanding an audio retrieval system and makes expansion easier.
[Description of Drawings]
Fig. 1 is a flowchart of a method for constructing an index in audio retrieval in one embodiment;
Fig. 2 is a flowchart of the step in which the audio acquisition device calculates the index value of the audio data in one embodiment;
Fig. 3 is a flowchart of the step in which the audio acquisition device calculates the index value of the segmented audio data according to the acoustic feature parameters and a preset acoustic model, language model, and pronunciation dictionary in one embodiment;
Fig. 4 is a flowchart of the audio retrieval steps in one embodiment;
Fig. 5 is a schematic structural diagram of a system for constructing an index in audio retrieval in one embodiment;
Fig. 6 is a schematic structural diagram of a system for constructing an index in audio retrieval in another embodiment.
[Detailed Description]
As shown in Fig. 1, in one embodiment, a method for constructing an index in audio retrieval comprises the following steps:
Step S102: the audio acquisition device obtains audio data.
The audio data may be speech, music, and so on. The audio acquisition device may obtain audio data by capturing the user's voice through an audio input device such as a microphone or from the sound card output buffer, or by reading an audio file.
Step S104: the audio acquisition device calculates the index value of the audio data, and sends the audio data and the index value of the audio data to the server.
In one embodiment, the audio acquisition device may be a terminal device with a certain amount of computing power, which can not only capture speech but also process the audio. Examples are retrieval clients such as the terminal PC of an operator in a call center machine room or a user's smartphone on a mobile network.
The audio acquisition device analyzes the features of the audio data to generate the index value corresponding to the audio data, and then sends the index value and the audio data to the server together. In one embodiment, the audio acquisition device sends the index value and the audio data to the server in a deferred manner: when it detects that the server is busy, it first caches the obtained audio data and the generated index value on the device in the form of a local index, and uploads this local index once the server load is low.
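The deferred sending described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class `AcquisitionDevice` and the callables `server_is_busy` and `upload` are hypothetical names, and how the device actually detects server load is not specified in the source.

```python
from collections import deque


class AcquisitionDevice:
    """Hypothetical device-side buffer that defers uploads while the server is busy."""

    def __init__(self, server_is_busy, upload):
        self.server_is_busy = server_is_busy  # callable: True while the server is busy (assumed)
        self.upload = upload                  # callable: receives a list of (index_value, audio)
        self.local_index = deque()            # cached entries that have not been uploaded yet

    def submit(self, index_value, audio):
        # Queue the new entry first, then try to flush, so upload order is preserved.
        self.local_index.append((index_value, audio))
        self.flush()

    def flush(self):
        # Upload the whole local index only when the server reports low load.
        if self.local_index and not self.server_is_busy():
            batch = list(self.local_index)
            self.local_index.clear()
            self.upload(batch)


# Usage: the server is busy for the first submission and idle for the second.
busy = [True]
received = []
device = AcquisitionDevice(lambda: busy[0], received.extend)
device.submit("idx-a", b"audio-a")  # cached locally, nothing uploaded
busy[0] = False
device.submit("idx-b", b"audio-b")  # both cached entries are uploaded together
```

Queuing before flushing keeps the entries in capture order, which may matter if the server assigns identifiers in arrival order.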
Step S106: the server constructs the index according to the received audio data and the index value of the audio data.
After receiving the audio data uploaded by the audio acquisition device and the corresponding index value, the server may first assign the audio data a global identifier; the constructed index may then include this global identifier. The index value may be associated with the global identifier as a key-value pair. All of the index information together constitutes the "global index".
In one embodiment, there may be a plurality of audio acquisition devices. The server may first filter out audio resources with identical index values, then construct the index according to the filtered audio data and the index value of the audio data, and store the audio data in the audio resource library.
In one embodiment, the audio acquisition device sends the index value and the audio data to the server in a deferred manner. The audio acquisition device stores the cached index values and audio data as a local index, while the index stored on the server is the global index. After receiving a local index sent by an audio acquisition device, the server filters out the part of the local index that duplicates the global index and adds the remaining part to the global index. Filtering out duplicate index entries reduces the storage load on the server.
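The server-side merge can be illustrated with a small sketch. It assumes the index is a plain dictionary from index value to global identifier and that global identifiers are consecutive integers; the names `GlobalIndex`, `merge`, and `audio_store` are illustrative and do not come from the source.

```python
import itertools


class GlobalIndex:
    """Hypothetical server-side index: index value -> global identifier."""

    def __init__(self):
        self._next_id = itertools.count(1)  # assumed scheme: consecutive integer identifiers
        self.index = {}                     # index_value -> global_id (key-value pairs)
        self.audio_store = {}               # global_id -> audio data (the audio resource library)

    def merge(self, local_index):
        """Merge (index_value, audio) pairs and return the identifiers that were added."""
        added = []
        for index_value, audio in local_index:
            if index_value in self.index:
                continue  # duplicate of an existing entry: filtered out
            global_id = next(self._next_id)
            self.index[index_value] = global_id
            self.audio_store[global_id] = audio
            added.append(global_id)
        return added


# Usage: two devices upload local indexes that overlap on "idx-b".
server = GlobalIndex()
first = server.merge([("idx-a", b"a"), ("idx-b", b"b")])
second = server.merge([("idx-b", b"b"), ("idx-c", b"c")])
```

Because duplicates are detected by index value alone, this sketch relies on identical audio producing identical index values on every device.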
In the above method for constructing an index in audio retrieval, the audio acquisition device obtains the audio data, calculates its index value, and uploads the index value and the audio data to the server. The server then constructs the index from the index value and the audio data. The work of calculating index values is thereby moved from the server to the audio acquisition devices. For example, in a call center the audio acquisition device may be the terminal PC of a customer service agent, and each terminal PC can process the audio data it collects on the same day. When the capacity of the call center system grows, it can be expanded by adding terminal PCs without adding servers, so no extra cost is incurred. The above method therefore reduces the additional hardware cost of expanding an audio retrieval system and makes expansion easier.
In one embodiment, as shown in Fig. 2, the step in which the audio acquisition device calculates the index value of the audio data may be specifically:
Step S202: the audio acquisition device preprocesses the audio data and extracts acoustic feature parameters.
Step S204: the audio acquisition device performs speaker segmentation and speech segmentation on the audio data.
Step S206: the audio acquisition device calculates the index value of the segmented audio data according to the acoustic feature parameters and a preset acoustic model, language model, and pronunciation dictionary.
In step S202, the audio acquisition device may preprocess the audio data by applying at least one of filtering, pre-emphasis, framing, windowing, and zero padding. After preprocessing, the audio data may be transcoded, segmented, and given simple labels that associate the speech with the corresponding customer and customer service information, and then stored in the client-side database of the audio acquisition device.
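Three of the preprocessing operations named above (pre-emphasis, framing with zero padding, and windowing) can be sketched in plain Python. The coefficient 0.97, the Hamming window, and the frame sizes are common choices assumed here for illustration; the source does not fix any of them.

```python
import math


def pre_emphasis(samples, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is passed through."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - alpha * samples[n - 1] for n in range(1, len(samples))]


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames, zero-padding the last frame if needed."""
    frames = []
    for start in range(0, len(samples), hop):
        frame = list(samples[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))  # zero padding for a short final frame
        frames.append(frame)
        if start + frame_len >= len(samples):
            break  # the signal has been fully covered
    return frames


def hamming(frame):
    """Multiply a frame by a Hamming window of the same length."""
    n = len(frame)
    if n == 1:
        return list(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))) for i, x in enumerate(frame)]


# Usage on a toy signal of nine samples, frame length 4, hop 2.
signal = [1.0] * 9
frames = [hamming(f) for f in frame_signal(pre_emphasis(signal), frame_len=4, hop=2)]
```

Each windowed frame would then be passed to a feature extractor; that step is outside this sketch.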
When extracting the feature parameters of the audio data, feature extraction may be performed on the framed speech. For example, the acoustic feature parameters of the audio data may be extracted with a conventional method such as MFCC (Mel-frequency cepstral coefficient) extraction.
In step S204, the audio acquisition device may detect silence in the audio data, segment the audio accordingly, and classify the segmented audio data by speaker category.
A silence period is a time period in a continuous audio signal during which the amplitude is below a threshold. Silence detection may be performed on the audio data using a preset silence period. Silence detection allows a long stretch of speech to be divided into several segments by sentence. For example, in a call center, the audio acquisition device first classifies the customer's speech by speaker category. A GMM (Gaussian mixture model) may be used to classify the speech against the models of several speakers in an existing model library, for instance as male voice, female voice, or neutral voice; alternatively, users may be numbered in advance and distinguished by user number.
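The silence-based segmentation can be sketched as a single pass over the samples. The function name `split_on_silence` and its parameters are hypothetical; the sketch covers only silence detection, not the GMM-based speaker classification, and it treats a run of at least `min_silence` consecutive low-amplitude samples as a silence period.

```python
def split_on_silence(samples, threshold, min_silence):
    """Return (start, end) sample ranges of the segments separated by silence.

    A silence period is a run of at least `min_silence` consecutive samples
    whose absolute amplitude is below `threshold`. `end` is exclusive.
    """
    segments = []
    seg_start = None  # start of the segment currently being collected
    silent_run = 0    # length of the current run of low-amplitude samples
    for i, x in enumerate(samples):
        if abs(x) < threshold:
            silent_run += 1
            if seg_start is not None and silent_run >= min_silence:
                # Close the segment just before the silent run began.
                segments.append((seg_start, i - silent_run + 1))
                seg_start = None
        else:
            if seg_start is None:
                seg_start = i
            silent_run = 0
    if seg_start is not None:
        # Close the final segment, dropping any trailing low-amplitude samples.
        segments.append((seg_start, len(samples) - silent_run))
    return segments


# Usage: the single quiet sample at position 4 is too short to split the first segment.
toy = [0, 0, 5, 6, 0, 7, 0, 0, 0, 8, 9, 0]
segments = split_on_silence(toy, threshold=1, min_silence=2)
```

In practice the threshold would be applied to frame energy rather than to individual samples, but the control flow is the same.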
In step S206, as shown in Fig. 3, the step in which the audio acquisition device calculates the index value of the segmented audio data according to the acoustic feature parameters and the preset acoustic model, language model, and pronunciation dictionary is specifically:
Step S302: generate a pinyin lattice by speech recognition decoding, according to the speech feature parameters of the segmented audio data, the pinyin language model, the acoustic model, and the first pronunciation dictionary.
The pinyin language model and the acoustic model corresponding to the speaker category of the segmented audio data may be selected from a pinyin language model library and an acoustic model library. For example, if the customer's voice is male, the pinyin language model and acoustic model for male voices are selected.
The pinyin language model is a statistical language model whose unit is the pinyin syllable. The first pronunciation dictionary describes the relationship between each unit (syllable) in the pinyin language model and the initials and finals in the acoustic model. The basic acoustic units, which may include initials and finals, phonemes, and syllables, are the basic modeling units of the acoustic model. Preferably, the acoustic model may be an HMM (hidden Markov model), and speech recognition decoding may be performed with the Viterbi algorithm to generate the pinyin lattice. Each node in the pinyin lattice represents a pinyin syllable and is labeled with the start or end time of that syllable; the connections between nodes are labeled with the acoustic probability and the linguistic probability of the syllable.
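As a reminder of how Viterbi decoding works on an HMM, here is a toy sketch. It returns only the single best state sequence, whereas the decoder described above keeps multiple hypotheses in a lattice; the two-syllable model, the observation labels "A" and "B", and all probabilities are made up for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations (log domain)."""
    # best[t][s] = (log probability of the best path ending in s at time t, previous state)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path score into state s.
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Toy model: states are pinyin syllables, observations are hypothetical acoustic labels.
states = ["ni", "hao"]
start_p = {"ni": 0.8, "hao": 0.2}
trans_p = {"ni": {"ni": 0.3, "hao": 0.7}, "hao": {"ni": 0.4, "hao": 0.6}}
emit_p = {"ni": {"A": 0.9, "B": 0.1}, "hao": {"A": 0.2, "B": 0.8}}
best_path = viterbi(["A", "B", "B"], states, start_p, trans_p, emit_p)
```

A lattice-producing decoder would keep several scored predecessors per node instead of only the maximum, together with the start and end times that the text says each node carries.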
Step S304: generate a word lattice according to the pinyin lattice, the word-based language model, and the second pronunciation dictionary.
The second pronunciation dictionary describes the relationship between each unit (Chinese word) in the word-based language model and the syllables in the pinyin language model (the nodes of the pinyin lattice generated by Viterbi decoding). Each node in the generated word lattice represents a Chinese word and is labeled with the start or end time of that word; the connections between nodes are labeled with the acoustic probability and the linguistic probability of the word. For a word with multiple pronunciations, the node must also be labeled with the sequence number of the pronunciation variant.
Further, a confidence score may be calculated for each word in the word lattice. The confidence score is the score of a word in the word lattice and may be calculated from information such as the acoustic model probability, the pinyin syllable probability, the word probability, the word duration, and the number of candidates in the word lattice. The confidence score can be used to estimate the accuracy of speech recognition. For example, a passage of speech that is ambiguous or contains polyphonic characters is easily recognized as several different Chinese words, and the confidence score indicates the estimated accuracy of each recognized word. The higher the confidence score, the less interference there was during recognition and the more reliable the recognized word.
Step S306: generate the index value of the segmented audio data according to the word lattice.
The word lattice may be used directly as the index value of the audio data; alternatively, a hash value of the word lattice may be calculated with a preset hash function and used as the index value of the audio data.
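The hash-based variant can be sketched as follows. The source does not name the hash function or the lattice serialization, so SHA-256 over a canonical JSON encoding is an assumption made here, as is the dictionary layout of the lattice (`nodes` and `edges` with illustrative field names).

```python
import hashlib
import json


def lattice_index_value(lattice):
    """Hash a word lattice into a fixed-length index value (assumed: SHA-256 over JSON)."""
    # sort_keys gives a canonical key order, so equal lattices serialize identically.
    canonical = json.dumps(lattice, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical lattice: nodes carry a word and its times, edges carry the two probabilities.
lattice_a = {
    "nodes": [
        {"word": "账单", "start": 0.0, "end": 0.42},
        {"word": "查询", "start": 0.42, "end": 0.9},
    ],
    "edges": [{"from": 0, "to": 1, "acoustic": -12.5, "language": -3.1}],
}
# Same content with a different key order; it hashes to the same index value.
lattice_b = {
    "edges": [{"to": 1, "from": 0, "language": -3.1, "acoustic": -12.5}],
    "nodes": [
        {"end": 0.42, "start": 0.0, "word": "账单"},
        {"end": 0.9, "word": "查询", "start": 0.42},
    ],
}
index_a = lattice_index_value(lattice_a)
index_b = lattice_index_value(lattice_b)
```

An exact hash matches only when two lattices are identical down to every time stamp and probability, which is why the retrieval client and the acquisition device need the same models and parameters.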
In one embodiment, after constructing the index according to the received audio data and the index value of the audio data, the server may also compress the audio data and the index to save storage space on the server.
In one embodiment, after constructing the index according to the received audio data and the index value of the audio data, the server may also store the index as an inverted index.
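An inverted index maps each term to the set of items that contain it. The sketch below assumes the terms are the words recognized in each recording and that recordings are identified by integer global identifiers; the source does not specify either detail, and `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of global identifiers whose audio contains it."""
    inverted = defaultdict(set)
    for global_id, words in documents.items():
        for word in words:
            inverted[word].add(global_id)
    return inverted


# Usage: two recordings that share the word "账单".
recognized = {1: ["账单", "查询"], 2: ["账单", "投诉"]}
inverted = build_inverted_index(recognized)
```

With this layout, a text keyword query becomes a single dictionary lookup instead of a scan over every recording.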
In one embodiment, as shown in Fig. 4, the method for constructing an index in audio retrieval further comprises the following audio retrieval steps:
Step S402: the retrieval client obtains a retrieval request.
Step S404: the retrieval client determines whether the retrieval request contains an audio fragment and, if so, extracts the audio fragment from the retrieval request, calculates the index value of the audio fragment, and sends the index value to the server.
Step S406: the server looks up in the index the audio data corresponding to the index value, and returns it to the retrieval client.
The method by which the retrieval client obtains the audio fragment and calculates its index value is the same as the method by which the audio acquisition device obtains the audio data and calculates its index value. This ensures that identical audio data yields identical index values on the retrieval client and on the audio acquisition device.
In one embodiment, after receiving the index value uploaded by the retrieval client, the server searches the index to obtain the global identifier of the audio data corresponding to the index value, then uses the global identifier to fetch the corresponding audio data from the audio resource library and returns it to the retrieval client.
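The two-stage lookup on the server can be sketched with two dictionaries. The names `retrieve`, `index`, and `audio_store` are illustrative, and returning `None` for a miss is an assumption; the source does not say what the server sends when nothing matches.

```python
def retrieve(index_value, index, audio_store):
    """Resolve index value -> global identifier -> audio data; return None on a miss."""
    global_id = index.get(index_value)
    if global_id is None:
        return None  # no entry in the index for this index value
    return audio_store.get(global_id)


# Usage with a toy index and audio resource library.
index = {"idx-a": 1, "idx-b": 2}
audio_store = {1: b"audio-a", 2: b"audio-b"}
hit = retrieve("idx-b", index, audio_store)
miss = retrieve("idx-z", index, audio_store)
```

Keeping the index separate from the audio resource library means the lookup touches the audio storage only after a match has been found.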
In one embodiment, the server also periodically synchronizes the pinyin language model, the acoustic model, the word-based language model, and the pronunciation dictionaries on the audio acquisition devices and retrieval clients. After synchronization, the audio acquisition devices and retrieval clients use the same algorithm and parameters when calculating the index value of audio data, which ensures that identical audio data yields identical index values on the retrieval client and on the audio acquisition device.
As shown in Fig. 5, in one embodiment, a system for constructing an index in audio retrieval comprises an audio acquisition device 100 and a server 200, wherein the audio acquisition device 100 comprises:
An audio acquisition module 102, configured to obtain audio data.
The audio data may be speech, music, and so on. The audio acquisition module 102 may obtain audio data by capturing the user's voice through an audio input device such as a microphone or from the sound card output buffer, or by reading an audio file.
An index value calculation module 104, configured to calculate the index value of the audio data and to send the audio data and the index value of the audio data to the server 200.
In one embodiment, the audio acquisition device 100 may be a terminal device with a certain amount of computing power, which can not only capture speech through the audio acquisition module 102 but also process the audio through the index value calculation module 104. Examples are retrieval clients such as the terminal PC of an operator in a call center machine room or a user's smartphone on a mobile network.
The index value calculation module 104 analyzes the features of the audio data to generate the index value corresponding to the audio data, and then sends the index value and the audio data to the server 200 together.
In one embodiment, the index value calculation module 104 sends the index value and the audio data to the server 200 in a deferred manner. When the index value calculation module 104 detects that the server 200 is busy, it first caches the obtained audio data and the generated index value on the audio acquisition device 100 in the form of a local index, and uploads this local index once the load on the server 200 is low.
The server 200 comprises:
An index construction module 202, configured to construct the index according to the received audio data and the index value of the audio data.
After receiving the audio data uploaded by the index value calculation module 104 and the corresponding index value, the server 200 may first assign the audio data a global identifier; the constructed index may then include this global identifier. The index value may be associated with the global identifier as a key-value pair.
In one embodiment, there may be a plurality of audio acquisition devices 100. The index construction module 202 may also be configured to filter out audio resources with identical index values, construct the index according to the filtered audio data and the index value of the audio data, and store the audio data in the audio resource library.
In one embodiment, the audio acquisition device sends the index value and the audio data to the server in a deferred manner. The audio acquisition device stores the cached index values and audio data as a local index, while the index stored on the server is the global index. After receiving a local index sent by an audio acquisition device, the server filters out the part of the local index that duplicates the global index and adds the remaining part to the global index. Filtering out duplicate index entries reduces the storage load on the server.
In the above system for constructing an index in audio retrieval, the audio acquisition device obtains the audio data, calculates its index value, and uploads the index value and the audio data to the server. The server then constructs the index from the index value and the audio data. The work of calculating index values is thereby moved from the server to the audio acquisition devices. For example, in a call center the audio acquisition device may be the terminal PC of a customer service agent, and each terminal PC can process the audio data it collects on the same day. When the capacity of the call center system grows, it can be expanded by adding terminal PCs without adding servers, so no extra cost is incurred. The above system therefore reduces the additional hardware cost of expanding an audio retrieval system and makes expansion easier.
In one embodiment, index value computing module 104 also can be used for voice data is carried out pre-service, extracts acoustical characteristic parameters; Voice data carried out the speaker is cut apart and voice segment; Calculate the index value of the voice data after the segmentation according to acoustical characteristic parameters, default acoustic model, language model and Pronounceable dictionary.
Index value computing module 104 also can be used for carrying out pre-service by at least a operation that voice data is carried out in filtering, pre-emphasis, branch frame, windowing, the zero padding.With voice data through after the pre-service, can be by code conversion, cut apart, simple marking associates voice corresponding client and customer service information, deposit the audio collecting device client database in and preserve.
Index value computing module 104 also can be used for carrying out feature extraction by the voice to minute frame and obtains characteristic parameter.For example, can by the extraction MFCC (Mel frequency cepstral coefficient) in the conventional art, method extract the acoustical characteristic parameters of voice data.
Index value computing module 104 also can be used for by mourning in silence in the voice data detected, and with audio parsing, and the voice data after the segmentation is classified according to speaker's classification.
Silence period is that the amplitude that occurs in the continuous sound signal is less than the time period of threshold value.Can by the default silence period to the voice data detection of mourning in silence.Detect and long section voice can be divided into multistage by statement by mourning in silence.For example, in the call center, audio collecting device is classified client's voice earlier according to speaker's classification.Can use GMM model (Gaussian Mixture Model, gauss hybrid models), classify according to the model of several speakers in the existing model bank, can classify according to male voice, schoolgirl, neutral sound, be used for distinguishing; Can also be numbered the user in advance, and adopt Customs Assigned Number to distinguish.
Further, the index value computing module 104 may also generate a pinyin lattice by speech recognition decoding, according to the speech feature parameters of the segmented audio data, a pinyin language model, an acoustic model, and a first pronunciation dictionary; generate a word lattice according to the pinyin lattice, a word-based language model, and a second pronunciation dictionary; and generate the index value of the segmented audio data according to the word lattice.
The pinyin language model and the acoustic model corresponding to the speaker category of the segmented audio data may be selected from a pinyin language model library and an acoustic model library. For example, if the customer's voice is a male voice, the pinyin language model and acoustic model corresponding to male voices are selected.
The pinyin language model is a statistical language model whose unit is the pinyin syllable. The first pronunciation dictionary represents the relation between each unit (syllable) of the pinyin language model and the initials and finals in the acoustic model. The basic acoustic units, which may include initials and finals, phonemes, and syllables, are the basic modeling units of the acoustic model. Preferably, the acoustic model may be an HMM (hidden Markov model), and speech recognition decoding may be performed with the Viterbi algorithm to generate the pinyin lattice. Each node in the pinyin lattice represents one pinyin syllable and is labeled with the start or end time of that syllable; each connection between nodes is labeled with the acoustic probability and the linguistic probability of the syllable.
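The Viterbi decoding mentioned above can be illustrated on a toy discrete HMM. This sketch is an assumption-laden simplification: it returns only the single most probable state sequence, whereas the text describes a lattice that keeps alternative syllables, and the states, observation symbols, and probabilities in the usage note are invented for illustration. All probabilities are assumed to be non-zero so that logarithms are defined.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_state_path, probability_of_that_path) for a discrete HMM."""
    # best[t][s] = (log probability of the best path ending in s at time t,
    #               previous state on that path)
    first = observations[0]
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None)
        for s in states
    }]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            score, back = max(
                (prev[p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), back)
        best.append(column)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path, math.exp(best[-1][last][0])
```

As a usage example, the hidden states could be two hypothetical syllables `"ma"` and `"na"`, and the observations quantized acoustic feature symbols such as `"f1"`, `"f2"`, `"f3"`; a real recognizer would use continuous features and far larger models.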
The second pronunciation dictionary represents the relation between each unit (Chinese word) of the word-based language model and the syllables of the pinyin language model (that is, the nodes of the pinyin lattice generated during Viterbi decoding). Each node in the generated word lattice represents one Chinese word and is labeled with the start or end time of that word; each connection between nodes is labeled with the acoustic probability and the linguistic probability of the word. For a polyphonic word, the node is additionally labeled with the sequence number of the word's pronunciation variant.
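The role of the second pronunciation dictionary can be sketched as a lookup from syllable sequences to candidate words. This is a hypothetical simplification: it expands a single timed syllable sequence rather than a full lattice, it ignores probabilities, and the function name, the `lexicon` structure, and the `max_len` limit are assumptions.

```python
def build_word_edges(syllables, lexicon, max_len=4):
    """Enumerate candidate words over a timed syllable sequence.

    syllables: list of (syllable, start_time, end_time) tuples.
    lexicon:   dict mapping a tuple of syllables to a list of words,
               standing in for the second pronunciation dictionary.
    Returns a list of (word, start_time, end_time) candidates.
    """
    edges = []
    for i in range(len(syllables)):
        last = min(i + max_len, len(syllables))
        for j in range(i + 1, last + 1):
            key = tuple(s for s, _, _ in syllables[i:j])
            for word in lexicon.get(key, []):
                # The word spans from the first syllable's start
                # to the last syllable's end.
                edges.append((word, syllables[i][1], syllables[j - 1][2]))
    return edges
```

For the syllables `bei` and `jing`, a lexicon containing both 北京 and 背景 under the key `("bei", "jing")` yields two competing candidates over the same time span, which is the kind of ambiguity the word-based language model is then used to resolve.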
Further, a confidence score may be calculated for each word in the word lattice. The confidence is the score of each word in the word lattice and may be calculated from information such as the acoustic model probability, the pinyin syllable probability, the word probability, the word duration, and the number of candidates in the word lattice. The confidence serves as an estimate of the accuracy of the speech recognition. For example, a segment of speech that is ambiguous or contains polyphonic characters is easily recognized as several different Chinese words; the confidence expresses the estimated accuracy of each recognized Chinese word. The higher the confidence, the less interference there was during recognition and the more reliable the recognized Chinese word.
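The text does not give a formula for the confidence. One common way to obtain such a score, shown here purely as an assumption and not as the patent's method, is to normalize the combined acoustic and language-model scores of the candidates that compete for the same time span, so that a word with many strong competitors receives a lower confidence. The function name and the input layout are hypothetical, and word duration and syllable probability are left out for brevity.

```python
import math

def candidate_confidences(candidates):
    """Normalize the scores of competing word candidates into confidences.

    candidates: list of (word, acoustic_logprob, lm_logprob) tuples for
                distinct words competing for the same time span.
    Returns a dict {word: confidence}; the confidences sum to 1.
    """
    if not candidates:
        return {}
    scores = [(word, ac + lm) for word, ac, lm in candidates]
    peak = max(score for _, score in scores)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [(word, math.exp(score - peak)) for word, score in scores]
    total = sum(weight for _, weight in weights)
    return {word: weight / total for word, weight in weights}
```

With a single candidate the confidence is 1.0; with two equally scored candidates each receives 0.5, which mirrors the statement that ambiguous speech yields less reliable words.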
In one embodiment, the index construction module 202 may also compress the audio data and the index, thereby saving storage space on server 200.
In one embodiment, the index construction module 202 may also store the constructed index in the form of an inverted index.
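An inverted index maps each index term to the list of items that contain it. The sketch below assumes, for illustration only, that the index terms are recognized words with timestamps taken from the word lattice and that each recording has an identifier; the class and method names are hypothetical.

```python
from collections import defaultdict

class InvertedIndex:
    """Maps each word to the recordings and time spans in which it occurs."""

    def __init__(self):
        self._postings = defaultdict(list)  # word -> [(audio_id, start, end)]

    def add(self, audio_id, words):
        """Index one recording; words is a list of (word, start, end) tuples."""
        for word, start, end in words:
            self._postings[word].append((audio_id, start, end))

    def lookup(self, word):
        """Return a copy of the posting list for word (empty if unknown)."""
        return list(self._postings.get(word, []))
```

Looking up a word returns every recording in which it was recognized together with its position, so retrieval does not need to scan the stored audio.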
In one embodiment, as shown in Figure 6, the system for constructing an index in audio retrieval further comprises a retrieval client 300, which is used to obtain a retrieval request. The retrieval client determines whether the retrieval request contains an audio fragment; if so, it extracts the audio fragment from the retrieval request, calculates the index value of the audio fragment, and sends the index value to server 200.
Server 200 further comprises:
An audio retrieval module 204, used to look up in the index, according to the index value, the audio data corresponding to that index value and to return the audio data to retrieval client 300.
Here, the method by which retrieval client 300 obtains the audio fragment and calculates its index value is the same as the method by which the index value computing module 104 obtains the audio data and calculates the index value of the audio data. This ensures that identical audio data yields identical index values on retrieval client 300 and in the index value computing module 104.
In one embodiment, after receiving the index value uploaded by retrieval client 300, server 200 searches the index to obtain the global identifier of the audio data corresponding to the index value, then obtains the audio data corresponding to that global identifier from the audio resource library and returns it to retrieval client 300.
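The two-step lookup described above (index value to global identifier, then global identifier to audio data) can be sketched with two in-memory dictionaries. This is an illustrative assumption rather than the patent's storage design: the class name, the identifier format, and the treatment of the index value as an opaque string are hypothetical. The `store` method also shows the filtering of audio with an identical index value that claim 3 describes.

```python
class AudioServer:
    """Toy server keeping an index and an audio resource library in memory."""

    def __init__(self):
        self._index = {}       # index value -> global identifier
        self._repository = {}  # global identifier -> raw audio bytes

    def store(self, index_value, audio):
        """Index and store audio; a repeated index value is filtered out."""
        if index_value in self._index:
            return self._index[index_value]
        global_id = "audio-%06d" % (len(self._repository) + 1)
        self._index[index_value] = global_id
        self._repository[global_id] = audio
        return global_id

    def retrieve(self, index_value):
        """Return the audio for an index value, or None if it is not indexed."""
        global_id = self._index.get(index_value)
        if global_id is None:
            return None
        return self._repository[global_id]
```

Because the client and the collecting device compute index values the same way, a fragment uploaded by the client resolves to the recording stored under the same index value.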
In one embodiment, server 200 also periodically synchronizes the pinyin language model, the acoustic model, the word-based language model, and the pronunciation dictionaries on audio collecting device 100 and retrieval client 300. After synchronization, audio collecting device 100 and retrieval client 300 can use the same algorithm and parameters when calculating the index value of audio data, which ensures that identical audio data yields identical index values on retrieval client 300 and audio collecting device 100.
The embodiments above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the claims. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the invention, and these all fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (14)

1. the method for index building in the audio retrieval may further comprise the steps:
Audio collecting device obtains voice data;
Audio collecting device calculates the index value of described voice data, and the index value of described voice data and described voice data is sent to server;
Server is according to the described voice data that receives and the index value index building of described voice data.
2. the method for index building in the audio retrieval according to claim 1 is characterized in that, described index comprises the overall identification corresponding with described voice data.
3. the method for index building in the audio retrieval according to claim 1 is characterized in that described audio collecting device has a plurality of;
Described server is specially according to the step of the index value index building of the described voice data that receives and described voice data:
Server filters out the identical audio resource of index value earlier, stores in the audio resource storehouse then according to the described voice data after filtering and the index value index building of described voice data, and with described voice data.
4. the method for index building in the audio retrieval according to claim 1 is characterized in that the step that described audio collecting device calculates the index value of described voice data is specially:
Audio collecting device carries out pre-service to described voice data, extracts acoustical characteristic parameters;
Audio collecting device carries out to described voice data that the speaker is cut apart and voice segment;
Audio collecting device calculates the index value of the voice data after the described segmentation according to described acoustical characteristic parameters, default acoustic model, language model and Pronounceable dictionary.
5. according to the method for index building in any described audio retrieval in the claim 4, it is characterized in that described audio collecting device carries out the step that the speaker cuts apart with voice segment to described voice data and also comprises:
Mourning in silence in the described voice data detected, with audio parsing, and the voice data after the segmentation classified according to speaker's classification.
6. the method for index building in the audio retrieval according to claim 4, it is characterized in that described audio collecting device is specially according to the step that described acoustical characteristic parameters, default acoustic model, language model and Pronounceable dictionary calculate the index value of the voice data after the described segmentation:
According to speech characteristic parameter, phonetic language model, acoustic model and first Pronounceable dictionary of the voice data after the described segmentation, generate the phonetic grid by speech recognition decoder;
According to described phonetic grid, the language model based on word, second Pronounceable dictionary generation word grid;
Generate the index value of the voice data after the described segmentation according to institute's predicate grid.
7. according to the method for index building in each described audio retrieval of claim 1 to 6, it is characterized in that described method also comprises:
The retrieval client is obtained retrieval request;
The retrieval client judges whether described retrieval request comprises audio fragment, if, then from described retrieval request, extract audio fragment and calculate the index value of this audio fragment, send to server then;
Server is searched in index and described index value corresponding audio data according to index value, and is handed down to the retrieval client.
8. A system for constructing an index in audio retrieval, characterized in that it comprises an audio collecting device and a server, the audio collecting device comprising:
an audio acquisition module, used to obtain audio data;
an index value computing module, used to calculate the index value of the audio data and to send the audio data and the index value of the audio data to the server;
the server comprising:
an index construction module, used to construct an index according to the received audio data and the index value of the audio data.
9. The system for constructing an index in audio retrieval according to claim 8, characterized in that the index comprises a global identifier corresponding to the audio data.
10. The system for constructing an index in audio retrieval according to claim 8, characterized in that there are a plurality of audio collecting devices;
the index construction module is further used to filter out audio resources having identical index values, to construct the index according to the filtered audio data and the index value of the audio data, and to store the audio data in an audio resource library.
11. The system for constructing an index in audio retrieval according to claim 8, characterized in that the index value computing module is further used to preprocess the audio data and extract acoustic feature parameters; to perform speaker segmentation and speech segmentation on the audio data; and to calculate the index value of the segmented audio data according to the acoustic feature parameters, a preset acoustic model, a language model, and a pronunciation dictionary.
12. The system for constructing an index in audio retrieval according to claim 11, characterized in that the index value computing module is further used to detect silence in the audio data, segment the audio accordingly, and classify the segmented audio data by speaker category.
13. The system for constructing an index in audio retrieval according to claim 11, characterized in that the index value computing module is further used to generate a pinyin lattice by speech recognition decoding, according to the speech feature parameters of the segmented audio data, a pinyin language model, an acoustic model, and a first pronunciation dictionary; to generate a word lattice according to the pinyin lattice, a word-based language model, and a second pronunciation dictionary; and to generate the index value of the segmented audio data according to the word lattice.
14. The system for constructing an index in audio retrieval according to any one of claims 8 to 13, characterized in that it further comprises a retrieval client, used to obtain a retrieval request, to determine whether the retrieval request contains an audio fragment and, if so, to extract the audio fragment from the retrieval request, calculate the index value of the audio fragment, and send the index value to the server;
the server further comprises an audio retrieval module, used to look up in the index, according to the index value, the audio data corresponding to the index value and to return the audio data to the retrieval client.
CN201210031534.1A 2012-02-13 2012-02-13 The method and system of index building in a kind of audio retrieval Expired - Fee Related CN103247316B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210031534.1A CN103247316B (en) 2012-02-13 2012-02-13 The method and system of index building in a kind of audio retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210031534.1A CN103247316B (en) 2012-02-13 2012-02-13 The method and system of index building in a kind of audio retrieval

Publications (2)

Publication Number Publication Date
CN103247316A true CN103247316A (en) 2013-08-14
CN103247316B CN103247316B (en) 2016-03-16

Family

ID=48926792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210031534.1A Expired - Fee Related CN103247316B (en) 2012-02-13 2012-02-13 The method and system of index building in a kind of audio retrieval

Country Status (1)

Country Link
CN (1) CN103247316B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978366A (en) * 2014-04-14 2015-10-14 深圳市北科瑞声科技有限公司 Voice data index building method and system based on mobile terminal
CN106782546A (en) * 2015-11-17 2017-05-31 深圳市北科瑞声科技有限公司 Audio recognition method and device
CN108564968A (en) * 2018-04-26 2018-09-21 广州势必可赢网络科技有限公司 Method and device for evaluating customer service
CN113536026A (en) * 2020-04-13 2021-10-22 阿里巴巴集团控股有限公司 Audio searching method, device and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW436745B (en) * 1999-01-16 2001-05-28 Tian Jung Nan Intelligent digital surveillance system
CN101281745A (en) * 2008-05-23 2008-10-08 深圳市北科瑞声科技有限公司 Interactive system for vehicle-mounted voice

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978366A (en) * 2014-04-14 2015-10-14 深圳市北科瑞声科技有限公司 Voice data index building method and system based on mobile terminal
CN106782546A (en) * 2015-11-17 2017-05-31 深圳市北科瑞声科技有限公司 Audio recognition method and device
CN108564968A (en) * 2018-04-26 2018-09-21 广州势必可赢网络科技有限公司 Method and device for evaluating customer service
CN113536026A (en) * 2020-04-13 2021-10-22 阿里巴巴集团控股有限公司 Audio searching method, device and equipment
CN113536026B (en) * 2020-04-13 2024-01-23 阿里巴巴集团控股有限公司 Audio searching method, device and equipment

Also Published As

Publication number Publication date
CN103247316B (en) 2016-03-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160316

Termination date: 20210213