CN103853749A - Mode-based audio retrieval method and system

Info

Publication number: CN103853749A
Application number: CN201210505562.2A
Authority: CN
Inventors: 张世磊; 涂旭东; 金锋; 金琴; 刘�文; 秦勇
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2012-11-30
Filing date: 2012-11-30
Publication date: 2014-06-11
Anticipated expiration: 2032-11-30
Also published as: CN103853749B

Abstract

The invention provides a mode-based audio retrieval method and system. The audio retrieval method includes marking a plurality of original audio data on the basis of modes to acquire the audio marking sequences of the original audio data; acquiring the audio marking sequences of target audio data; determining the matching degree of the target audio data and the original audio data according to preset matching rules on the basis of the audio marking sequences of the target audio data and the audio marking sequences of the original audio data; outputting the original audio data with the matching degree higher than a preset matching degree threshold as retrieval results. By using the method and system, audio marking and retrieving can be performed automatically and iteratively on the basis of background modes without manual marking, and accordingly more accurate and reasonable audio retrieval results are provided.

Description

Pattern-based audio retrieval method and system

Technical field

The present invention relates generally to the field of multimedia information retrieval and, in particular, to a pattern-based audio retrieval method and system.

Background technology

The widespread adoption of the Internet has driven the rapid development of multimedia information technology, and the amount of multimedia data available on the Internet is growing rapidly. For example, more than 48 hours of audio and video are uploaded to YouTube every minute. With data at this scale, browsing items one by one is no longer feasible, and indexing and retrieving the data becomes more challenging.

How to accurately find data files on a required subject in a database is one of the research focuses in the field of multimedia information retrieval. For example, a wedding planning company may wish to use a small number of wedding samples to find a large amount of material for producing a final wedding video. A radio producer, or the production team of a video website, may wish to search a large amount of data for program types of interest based on limited information, so as to speed up program production. In addition, users may wish to automatically label and archive their own multimedia databases in order to manage them more effectively.

Compared with video-based retrieval, audio-based retrieval is more widely applicable, for example in situations where only audio data is available (such as radio broadcasts). Audio contains a considerable amount of information that helps in understanding content, and audio files are usually smaller than video files. Therefore, where a video file has to be compressed until it is somewhat blurry, for example because of network upload limits, the audio can still be kept comparatively clear.

However, prior-art audio indexing and retrieval methods have many shortcomings. First, they require a large amount of manual labeling. For example, an audio website usually holds many files that are unlabeled or only minimally labeled. These files lack good descriptions and lack effective relevance-based recommendation links to other data. Staff can only manually label, and add recommendation links to, some well-known programs or frequently accessed files. Such audio indexing and retrieval methods can therefore be used only in specific fields and on limited data sets.

Second, existing audio indexing and retrieval methods model only the audio labels themselves, which can make indexing and retrieval results inaccurate. For example, the same sound of running water means entirely different things in a natural-stream background pattern and in a home-kitchen background pattern. Likewise, crowd noise differs among an entertainment program, a talk show, and a sports program. If a user inputs a recording of a flowing stream as a sample and wishes to retrieve similar material from a multimedia database, existing audio retrieval methods cannot distinguish data files containing running water in the natural-stream pattern from those containing it in the home-kitchen pattern. Clearly, when context is not considered, many audio retrieval results are inaccurate.

Third, existing audio retrieval methods usually adopt a single sequential strategy: the audio data is first segmented, and each segment is then classified. As a result, errors in an earlier step affect the later steps and accumulate in the final retrieval result, making it inaccurate or even entirely off target.

Therefore, there is a need for an audio retrieval method and system that run automatically, without manual labeling.

Further, there is a need for an audio retrieval method and system that are based on background patterns and can also take audio class similarity into account.

Further still, there is a need for an audio retrieval method and system that can automatically eliminate accumulated errors and thereby provide more accurate retrieval results.

Summary of the invention

One object of the present invention is to automatically label and model source audio data based on patterns, and to provide accurate audio retrieval results that take audio class similarity into account.

To this end, the audio retrieval method and system of the present invention label source audio data automatically through an iterative process that integrates segmentation and clustering. In each iteration, a decision tree based on background patterns is built and a segment labeling model is trained for each leaf node of the decision tree. Audio retrieval results are finally provided based on pattern comparison combined with audio class similarity.

According to a first aspect of the invention, a pattern-based audio retrieval method is provided, comprising: labeling a plurality of source audio data based on patterns, to obtain an audio label sequence for each source audio data; obtaining an audio label sequence of target audio data; determining, according to a predetermined matching rule, the matching degree between the target audio data and the source audio data, based on the audio label sequence of the target audio data and the audio label sequence of each source audio data; and outputting, as the retrieval result, the source audio data whose matching degree is higher than a predetermined matching degree threshold.
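The last two steps of the first aspect (scoring each source against the target, then keeping the sources above a threshold) can be sketched as follows. This is only an illustration under stated assumptions: the predetermined matching rule is not specified at this point in the text, so the sketch substitutes the similarity ratio from Python's standard difflib module as a stand-in, and the file names, label sequences, and threshold value are hypothetical.

```python
from difflib import SequenceMatcher


def matching_degree(target_labels, source_labels):
    # Stand-in matching rule (an assumption, not the rule of the invention):
    # similarity ratio in [0, 1] between the two label sequences.
    return SequenceMatcher(None, target_labels, source_labels).ratio()


def retrieve(target_labels, source_label_sequences, threshold):
    # Keep every source whose matching degree exceeds the threshold,
    # ordered from best to worst match.
    results = []
    for name, labels in source_label_sequences.items():
        degree = matching_degree(target_labels, labels)
        if degree > threshold:
            results.append((name, degree))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical label sequences for three source recordings.
sources = {
    "river.wav": ["a", "b", "c", "b"],
    "kitchen.wav": ["d", "b", "e", "e"],
    "stadium.wav": ["f", "g", "f", "g"],
}
hits = retrieve(["a", "b", "c"], sources, threshold=0.5)
```

Only the thresholding structure is taken from the text. Any other sequence-comparison rule could be dropped into matching_degree without changing retrieve.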

In one embodiment, labeling the plurality of source audio data based on patterns comprises performing the following operations for each source audio data: (a) dividing the source audio data to obtain a plurality of segments; (b) determining, based on the obtained segments, an audio class sequence for the source audio data by using a clustering algorithm; (c) building a decision tree based on patterns, according to the audio class sequences determined for the plurality of source audio data; (d) training a segment labeling model for each leaf node of the decision tree; (e) using the trained segment labeling models to obtain the audio label sequence of the source audio data and to adjust the division of that source audio data; and (f) repeating operations (b) to (e) while a predetermined iteration condition is met.

According to a second aspect of the invention, a pattern-based audio retrieval system is provided, comprising: a labeling device configured to label a plurality of source audio data based on patterns, to obtain an audio label sequence for each source audio data; a target acquisition device configured to obtain an audio label sequence of target audio data; a matching degree determination device configured to determine, according to a predetermined matching rule, the matching degree between the target audio data and the source audio data, based on the audio label sequence of the target audio data obtained by the target acquisition device and the audio label sequence of each source audio data obtained by the labeling device; and a retrieval output device configured to output, as the retrieval result, the source audio data whose matching degree, as determined by the matching degree determination device, is higher than a predetermined matching degree threshold.

In one embodiment, the labeling device comprises: a division device configured to divide each source audio data to obtain a plurality of segments; a clustering device configured to determine, based on the obtained segments, an audio class sequence for each source audio data by using a clustering algorithm; a decision tree building device configured to build a decision tree based on patterns, according to the audio class sequences determined by the clustering device for the plurality of source audio data; a model training device configured to train a segment labeling model for each leaf node of the decision tree built by the decision tree building device; a segment adjustment device configured to use the segment labeling models trained by the model training device to obtain the audio label sequence of each source audio data and to adjust the division of that source audio data; and an iteration condition judgment device configured to judge whether a predetermined iteration condition is met.

With the method and system of the present invention, audio retrieval can be performed automatically, without manual labeling.

With the method and system of the present invention, audio classes can be labeled iteratively based on background patterns, thereby providing more accurate and reasonable audio retrieval results.

With the method and system of the present invention, audio retrieval can take audio class similarity into account in combination with background patterns.

Brief description of the drawings

The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the disclosure, taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.

Fig. 1 shows a block diagram of an exemplary computer system/server suitable for implementing embodiments of the present invention.

Fig. 2 illustrates a general flowchart of a pattern-based audio retrieval method according to an embodiment of the present invention.

Fig. 3 schematically shows an example of an audio class sequence.

Fig. 4 illustrates a flowchart of a process for performing pattern-based audio class labeling on source audio data according to an embodiment of the present invention.

Fig. 5 schematically shows an example of clustering processing.

Fig. 6 illustrates a flowchart of a process for building a decision tree based on patterns according to an embodiment of the present invention.

Fig. 7 schematically shows an example of decision tree building.

Fig. 8 illustrates a flowchart of a process for determining the matching degree between target audio data and source audio data according to an embodiment of the present invention.

Fig. 9 shows a functional block diagram of a pattern-based audio retrieval system according to an embodiment of the present invention.

Detailed description

Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Those skilled in the art will appreciate that the present invention may be implemented as a system, a method, or a computer program product. Accordingly, the disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.), or an embodiment combining hardware and software, all of which are generally referred to herein as a "circuit", "module", or "system". In addition, in some embodiments, the invention may take the form of a computer program product embodied in one or more computer-readable media that contain computer-readable program code.

Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus, or device.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device.

Program code contained on a computer-readable medium may be transmitted using any suitable medium, including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the above.

Computer program code for carrying out operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).

The present invention is described below with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the computer or other programmable data processing apparatus, create means for implementing the functions/operations specified in the blocks of the flowcharts and/or block diagrams.

These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means that implement the functions/operations specified in the blocks of the flowcharts and/or block diagrams.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices, to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other devices, so as to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide processes for implementing the functions/operations specified in the blocks of the flowcharts and/or block diagrams.

Fig. 1 shows a block diagram of an exemplary computer system/server 12 suitable for implementing embodiments of the present invention. The computer system/server 12 shown in Fig. 1 is only an example and should not impose any limitation on the functionality or scope of use of embodiments of the invention.

As shown in Fig. 1, computer system/server 12 takes the form of a general-purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that connects the various system components (including system memory 28 and processing unit 16).

Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

Computer system/server 12 typically includes a variety of computer-system-readable media. These media may be any available media accessible by computer system/server 12, and include both volatile and non-volatile media, and both removable and non-removable media.

System memory 28 may include computer-system-readable media in the form of volatile memory, such as random-access memory (RAM) 30 and/or cache memory 32. Computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 may be provided for reading from and writing to a non-removable, non-volatile magnetic medium (not shown in Fig. 1 and commonly called a "hard disk drive"). Although not shown in Fig. 1, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disc drive for reading from and writing to a removable, non-volatile optical disc (e.g., a CD-ROM, DVD-ROM, or other optical medium), may also be provided. In such cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set of (e.g., at least one) program modules that are configured to carry out the functions of the various embodiments of the invention.

A program/utility 40 having a set of (at least one) program modules 42 may be stored in, for example, memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a networking environment. Program modules 42 generally carry out the functions and/or methods of the embodiments described in the invention.

Computer system/server 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with computer system/server 12, and/or with any device (such as a network card, a modem, etc.) that enables computer system/server 12 to communicate with one or more other computing devices. Such communication can take place through input/output (I/O) interfaces 22. In addition, computer system/server 12 can communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through network adapter 20. As shown, network adapter 20 communicates with the other modules of computer system/server 12 over bus 18. It should be understood that, although not shown, other hardware and/or software modules may be used in conjunction with computer system/server 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

As mentioned above, the audio retrieval method and system of the present invention label source audio data automatically through an iterative process that integrates segmentation and clustering. In each iteration, a decision tree based on background patterns is built and a segment labeling model is trained for each leaf node of the decision tree. Audio retrieval results are finally provided based on pattern comparison combined with audio class similarity.

Embodiments of the invention are described in detail below with reference to Figs. 2 to 9. Fig. 2 illustrates a general flowchart of a pattern-based audio retrieval method 200 according to an embodiment of the invention. First, pattern-based audio class labeling is performed on a plurality of source audio data, for example those contained in an audio database, to obtain an audio label sequence for each source audio data (step 202).

It should be noted that an "audio class" as used herein refers to a category of audio. Ideally, an "audio class" would be the category of the event to which a piece of audio relates, such as a gunshot, running water, cheering, or birdsong. In general, however, an "audio class" does not necessarily correspond strictly to an event category; it may simply be the result of a particular audio processing algorithm (for example, a clustering algorithm) and may carry no semantic meaning. In the present invention, accurate audio labeling and retrieval can be performed without knowing which event category each audio class specifically represents, and the audio labeling and retrieval method of the invention therefore runs automatically and without supervision.

Audio data consists of multiple continuous or discontinuous audio segments, so an "audio class sequence" as used herein refers to a series of audio classes over time; it records the audio classes that occur in order in the audio data, together with their corresponding durations. Fig. 3 shows an example of an audio class sequence under ideal conditions. A "background pattern" or "pattern" as used herein refers to the environment or setting to which the audio data relates, such as a natural stream, a home kitchen, a station, an entertainment program, a talk show, or a sports program.
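To make the notion of an audio class sequence concrete, the following minimal sketch represents one as an ordered list of (audio class, duration) pairs. The per-frame class labels, the frame length of 0.5 seconds, and the function name are assumptions made for illustration and do not come from the text.

```python
from itertools import groupby


def to_class_sequence(frame_classes, frame_seconds):
    # Collapse consecutive frames of the same class into one
    # (audio class, duration in seconds) entry, preserving time order.
    return [
        (label, sum(1 for _ in run) * frame_seconds)
        for label, run in groupby(frame_classes)
    ]


# Hypothetical per-frame classes for one recording, 0.5 s per frame.
frames = ["a", "a", "b", "b", "b", "a"]
class_sequence = to_class_sequence(frames, 0.5)
```

The same class may appear more than once in a sequence, since the sequence records occurrences in order rather than a set of classes.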

Fig. 4 illustrates in detail a process 400 that is one specific implementation of step 202. In process 400, source audio data is labeled automatically through an iterative process that integrates segmentation and clustering; in each iteration, a decision tree based on background patterns is built and a segment labeling model is trained for each leaf node of the decision tree.

Process 400 may begin at step 402. In step 402, each of the plurality of source audio data is divided to obtain a plurality of segments. In one embodiment, the source audio data may be divided at the silences it contains. In another embodiment, the source audio data may be divided using audio windows of a predetermined duration. In another embodiment, the source audio data may be divided evenly in time. In yet another embodiment, the source audio data may be divided using any combination of silence-based division, audio-window division, and even division in time.
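Two of the division strategies named for step 402 can be sketched as follows. This is a rough illustration, assuming the audio is available as a list of sample amplitudes; the amplitude threshold, the minimum silence length, and both function names are hypothetical choices rather than values given in the text.

```python
def split_by_window(num_samples, window):
    # Fixed-duration windows: (start, end) index pairs, end exclusive.
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


def split_by_silence(samples, threshold, min_silence):
    # A segment ends once at least min_silence consecutive samples
    # stay below the amplitude threshold; leading and trailing
    # silence is excluded from the segments.
    segments = []
    start = None
    silent_run = 0
    for i, value in enumerate(samples):
        if abs(value) >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence:
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        segments.append((start, len(samples) - silent_run))
    return segments


samples = [0, 0, 5, 6, 0, 7, 0, 0, 0, 4, 4, 0]
by_silence = split_by_silence(samples, threshold=1, min_silence=2)
by_window = split_by_window(len(samples), window=5)
```

A single quiet sample (index 4 above) does not split a segment, because it is shorter than min_silence. Either result only needs to be a rough starting point, since the division is adjusted in later iterations.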

It should be noted that the division of the source audio data obtained in step 402 may be rather rough. Through the subsequent iterations of clustering, decision tree building, and model training, together with the Viterbi algorithm, increasingly accurate divisions can be obtained.

Then, in step 404, based on the segments obtained by the division in step 402, a clustering algorithm is used to determine the audio class sequence of each source audio data. In one example, a Gaussian mixture model (GMM) is built from audio features extracted from the obtained segments. Once the model has been determined, the distances between audio classes can be determined. Then, based on the constructed GMM, the clustering algorithm clusters the segments level by level, using specific audio features (for example, time-domain or frequency-domain audio features) and the audio class distances, and finally determines the audio class sequence of the source audio data.

Depending on the clustering algorithm and a predetermined clustering criterion, the clustering may stop at a desired clustering level. In this example, the variables at the level where clustering stops are defined as "audio classes", and the variables at each level below it are defined as "audio subclasses". Accordingly, a series of audio classes arranged in chronological order forms an "audio class sequence". As noted above, it should be understood that the audio classes and audio subclasses obtained in step 404 may carry no semantic meaning.

Fig. 5 shows an example of clustering. Each point in L1 represents a GMM variable built from the audio features extracted from the audio segments, and L2, L3, ..., Ln represent the clustering levels obtained by the clustering algorithm from specific time-domain or frequency-domain audio features and audio class distances. Each point in Ln (for example, a, b, c, d, e, etc.) is defined as an audio class, and each point in L2 to Ln-1 can be regarded as an audio subclass of the audio data.
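The level-by-level clustering of Fig. 5 can be illustrated with a much simplified bottom-up procedure. The sketch below is an assumption-laden stand-in: instead of building GMMs it represents each segment by a single feature vector, it uses the Euclidean distance between cluster centroids as the class distance, and it stops when the closest pair of clusters is farther apart than a hypothetical stop_distance.

```python
import math


def centroid(points):
    # Mean feature vector of a group of segments.
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def agglomerate(features, stop_distance):
    # Start with one cluster per segment (level L1) and repeatedly merge
    # the two closest clusters, one level per merge, until the closest
    # pair is farther apart than stop_distance (level Ln).
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                cx = centroid([features[i] for i in clusters[x]])
                cy = centroid([features[i] for i in clusters[y]])
                d = math.dist(cx, cy)
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > stop_distance:
            break
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical two-dimensional features for five segments in time order.
features = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [9.0, 0.0]]
clusters = agglomerate(features, stop_distance=1.0)

# Name the final clusters a, b, c, ... and read off the class per segment.
names = {i: chr(ord("a") + k) for k, c in enumerate(clusters) for i in c}
segment_classes = [names[i] for i in range(len(features))]
```

The resulting class names are arbitrary identifiers, which is consistent with the remark that audio classes need not carry any semantic meaning.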

Next, in step 406, a decision tree is built based on patterns, according to the audio class sequences determined in step 404 for the plurality of source audio data. Fig. 6 shows a process 600 that is one specific implementation of the pattern-based decision tree building of step 406. First, at step 602, each audio class in the audio class sequences determined in step 404 (for example, a, b, c, d, e, ... at level Ln in Fig. 5) is defined as a root node of a decision tree.

Then, in step 604, a pattern question set is built based on the context, within the audio class sequences, of the audio class defined as the root node. The pattern question set may be built according to a predetermined rule, for example so as to maximize the discrimination between branches. In one example, the context of an audio class may refer to the audio classes that precede and follow it in the audio class sequence. In another example, the context of an audio class may refer to one or more audio subclasses obtained for that audio class during the clustering in step 404. The context of an audio class can, to some extent, reflect its background pattern. Take, for example, an audio class related to a train whistle. If the preceding audio class in the sequence is related to a public-address announcement and the following audio class is related to noisy voices, the background pattern is probably a railway station. If, however, the preceding audio class is related to gunshots and the following audio class is related to cheering, the background pattern is probably a film scene, such as one from "Railway Guerrillas".
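The first notion of context described above (the audio classes immediately before and after a given class) can be collected with a short helper. This is a sketch only; the function name, the use of None at the ends of a sequence, and the example sequence are assumptions introduced for illustration.

```python
def collect_contexts(class_sequence, target):
    # For every occurrence of the target class, record the pair
    # (previous class, next class); None marks a missing neighbour.
    found = []
    for i, label in enumerate(class_sequence):
        if label == target:
            previous = class_sequence[i - 1] if i > 0 else None
            following = (
                class_sequence[i + 1] if i + 1 < len(class_sequence) else None
            )
            found.append((previous, following))
    return found


# Hypothetical audio class sequence of one source recording.
sequence = ["a", "b", "c", "d", "b", "a"]
b_contexts = collect_contexts(sequence, "b")
```

Pooling these pairs over all source recordings gives the material from which questions such as "is the previous audio class a?" can be formed.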

Finally, in step 606, the constructed pattern question set is used to branch the audio classes in the audio class sequences, thereby building the leaf nodes of the decision tree. A "leaf node" of the decision tree, as used herein, is a node that has no child nodes. That is, any node that has child nodes is defined as a "root node". It should be noted that the decision tree may be branched down to a predetermined node level; for example, building of the decision tree may end when the number of audio labels contained in each leaf node falls below a predetermined threshold.

Fig. 7 shows an example of decision tree building, in which audio class b is, for example, one of the audio classes in the audio class sequence obtained by clustering in the example of Fig. 5. Suppose that the audio class sequences obtained by clustering for the plurality of source audio data contain four groups that include audio class b, namely (a-b+c), (a-b+e), (d-b+a), and (d-b+c), as shown in Fig. 7. Here the symbol "-" marks the audio class that precedes b in the sequence, and the symbol "+" marks the audio class that follows b. That is, (a-b+c) indicates that the audio class preceding b in the sequence is a and the audio class following b is c.

Using the context-based question set, audio class b is branched downward step by step until leaf nodes such as b1, b2, b3, and b4 are reached. For example, the question "does the context contain audio class a?" may be selected first to branch audio class b; this separates out (d-b+c), which is defined as leaf node b1. The question "is the preceding audio class a?" may then be selected for further branching; this separates out (d-b+a), which is defined as leaf node b2. The question "is the following audio class c?" may then be selected for further branching; this distinguishes (a-b+e) from (a-b+c), which are defined as leaf nodes b3 and b4, respectively. This completes the building of the decision tree.
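The branching of audio class b in the Fig. 7 example can be reproduced with a small sketch. The four contexts and the three questions are taken from the example above; the tuple representation, the function name split_into_leaves, and the convention of visiting the "no" branch before the "yes" branch are assumptions made so that the leaves come out in the order b1 to b4.

```python
# Contexts of audio class b as (previous class, next class) pairs,
# e.g. (a-b+c) becomes ("a", "c").
contexts_of_b = [("a", "c"), ("a", "e"), ("d", "a"), ("d", "c")]

# The three questions of the example, asked in this order.
questions = [
    lambda prev, nxt: "a" in (prev, nxt),  # does the context contain a?
    lambda prev, nxt: prev == "a",         # is the preceding class a?
    lambda prev, nxt: nxt == "c",          # is the following class c?
]


def split_into_leaves(group, remaining_questions):
    # Stop when a node holds a single context or no question is left.
    if len(group) <= 1 or not remaining_questions:
        return [group]
    ask, rest = remaining_questions[0], remaining_questions[1:]
    no_branch = [c for c in group if not ask(*c)]
    yes_branch = [c for c in group if ask(*c)]
    leaves = []
    for branch in (no_branch, yes_branch):
        if branch:
            leaves.extend(split_into_leaves(branch, rest))
    return leaves


leaves = split_into_leaves(contexts_of_b, questions)
named_leaves = {f"b{k + 1}": leaf for k, leaf in enumerate(leaves)}
```

In this sketch the question order is fixed by hand. Choosing the questions automatically, for example so as to maximize the discrimination between branches, is left out.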

Returning to Fig. 4, next, in step 408, a segment labeling model is trained for each leaf node of the decision tree. In one example, the segment labeling model may include a hidden Markov model (HMM) and a duration model. Then, the trained segment labeling models are used to obtain the audio label sequence of each source audio data and to adjust the division of that source audio data (step 410). It should be noted that an "audio label sequence" as used herein is related to, but different from, an audio class sequence: it does not correspond to the event categories to which the audio relates, and is merely the result of some audio processing algorithm (for example, the Viterbi algorithm), produced for use in the subsequent matching processing. In one embodiment of the invention, step 410 may be implemented by the following operations: first, the segment labeling models trained in step 408 are used to determine the audio class distances of the source audio data; then, based on the trained segment labeling models, Viterbi decoding is performed using the audio features extracted from the source audio data and the determined audio class distances; finally, according to the Viterbi decoding result, the audio label sequence of the source audio data is obtained and the division of the source audio data is adjusted.
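The Viterbi decoding used in step 410 can be sketched for a toy discrete HMM as follows. This is a generic textbook formulation under stated assumptions, not the trained segment labeling model of the invention: the two states stand for leaf-node labels b1 and b2, the observations are hypothetical quantized features "lo" and "hi", all probabilities are invented, and the duration model and audio class distances are omitted.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    # Log-domain Viterbi: score[t][s] is the best log-probability of any
    # state path that ends in state s after observation t.
    score = [{
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }]
    back = [{}]
    for t in range(1, len(observations)):
        score.append({})
        back.append({})
        for s in states:
            prev, best = max(
                ((p, score[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            score[t][s] = best + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


states = ["b1", "b2"]
start_p = {"b1": 0.5, "b2": 0.5}
trans_p = {"b1": {"b1": 0.9, "b2": 0.1}, "b2": {"b1": 0.1, "b2": 0.9}}
emit_p = {"b1": {"lo": 0.8, "hi": 0.2}, "b2": {"lo": 0.2, "hi": 0.8}}

labels = viterbi(["lo", "lo", "hi", "hi", "hi"], states, start_p, trans_p, emit_p)

# Frames where the decoded label changes give the adjusted boundaries.
boundaries = [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]
```

The decoded label per frame plays the role of the audio label sequence, and the positions where the label changes are one way to read off an adjusted division of the audio.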

Next, the method proceeds to decision step 412, which determines whether a predetermined iteration condition is satisfied. In one example, the predetermined iteration condition can comprise: the amount of adjustment to the division of the source audio data is not less than a predetermined segmentation difference, and/or the number of iterations is less than a predetermined iteration count threshold.

If step 412 determines that iteration is needed, method 400 returns to step 404 to perform the clustering, the decision tree construction and the segment marking model training based on the segments readjusted in step 410. If step 412 determines that the iteration can end, the audio marking sequences obtained for the audio data are output in step 414.
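The control flow of steps 404 to 412 can be sketched as a simple loop. In this Python sketch the two example criteria are combined with a logical "and", which is only one of the combinations the text allows. The callable `refine` is a hypothetical stand-in for one pass of clustering, tree building, model training and re-segmentation; it is assumed to return the new segment boundaries together with a number measuring how much the division changed.

```python
from typing import Callable, List, Tuple

Boundaries = List[int]


def should_iterate(adjustment: float, iteration: int,
                   min_adjustment: float, max_iterations: int) -> bool:
    """Iterate again while the division still changes enough and the budget is not used up."""
    return adjustment >= min_adjustment and iteration < max_iterations


def run_marking_loop(boundaries: Boundaries,
                     refine: Callable[[Boundaries], Tuple[Boundaries, float]],
                     min_adjustment: float,
                     max_iterations: int) -> Tuple[Boundaries, int]:
    """Repeat the refinement pass until the iteration condition no longer holds."""
    iteration = 0
    while True:
        boundaries, adjustment = refine(boundaries)
        iteration += 1
        if not should_iterate(adjustment, iteration, min_adjustment, max_iterations):
            return boundaries, iteration
```

The function returns the final boundaries and the number of passes that were run, so a caller can tell whether the loop stopped because the division converged or because the iteration limit was reached.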

In one embodiment of the invention, before the audio data is divided in step 402, it can also be determined whether the source audio data is speech data (step 416). The source audio data contained in the audio database may be either speech data or non-speech data. The speech/non-speech distinction can be made using the support vector machine (SVM) method well known in the art. Accurately distinguishing speech from non-speech benefits the subsequent segmentation, clustering, decision tree construction and model training steps.

Returning now to method 200 of Fig. 2, after the audio marking sequence of each piece of source audio data has been obtained in step 202, method 200 proceeds to step 204. In step 204, the audio marking sequence of the target audio data is obtained. In one embodiment of the invention, Viterbi decoding can be performed on the target audio data based on, for example, the segment marking model trained in step 408 of Fig. 4, to obtain the audio marking sequence of the target audio data.

Next, in step 206, the matching degree between the target audio data and the source audio data is determined according to a predetermined matching rule, based on the audio marking sequence of the target audio data obtained in step 204 and the audio marking sequence of each piece of source audio data obtained in step 202.

Fig. 8 shows a specific implementation 800 of the processing in step 206 that determines the matching degree between the target audio data and the source audio data, in which both the similarity between audio classes and the matching of background patterns are considered in order to retrieve and rank the source audio data relevant to the target audio data.

First, in step 802, the audio class distances between the audio classes relevant to the target audio data and to the source audio data are determined. The audio class distances can be determined, for example, based on the segment marking model trained in step 408 of Fig. 4. Then, in step 804, a sequence matching score is calculated based on the audio class distances determined in step 802, by comparing the audio marking sequence of the target audio data with the audio marking sequence of the source audio data. In one example, the dynamic time warping (DTW) algorithm can be used, with the audio class distances as weights, to calculate the similarity between the two audio marking sequences, i.e., the sequence matching score.
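As an illustration of step 804, the Python sketch below runs a textbook DTW over two label sequences and uses a class distance table as the local cost. Several points are assumptions rather than details from the text: the distance table is taken as given, unknown pairs fall back to a default distance, and the accumulated DTW distance is turned into a score with 1 / (1 + distance). The names `make_class_distance`, `dtw_distance` and `sequence_match_score` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def make_class_distance(table: Dict[Tuple[str, str], float],
                        default: float = 1.0) -> Callable[[str, str], float]:
    """Build a symmetric class distance lookup; identical classes have distance 0."""
    def distance(x: str, y: str) -> float:
        if x == y:
            return 0.0
        return table.get((x, y), table.get((y, x), default))
    return distance


def dtw_distance(target: List[str], source: List[str],
                 class_distance: Callable[[str, str], float]) -> float:
    """Accumulated DTW cost between two non-empty label sequences."""
    n, m = len(target), len(source)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = class_distance(target[i - 1], source[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def sequence_match_score(target: List[str], source: List[str],
                         class_distance: Callable[[str, str], float]) -> float:
    """Map the DTW distance to a score in (0, 1]; higher means more similar."""
    return 1.0 / (1.0 + dtw_distance(target, source, class_distance))
```

Because DTW lets one label align with a run of repeated labels, two sequences that differ only in how long each label lasts receive a distance of zero and therefore the maximum score.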

Then, in step 806, a count matching score is calculated by counting the number of occurrences of each audio class in the audio class sequences of the target audio data and the source audio data. For example, the number of times each audio class occurs within a specific period can be counted. Calculating the count matching score helps to find similar background patterns. Finally, in step 808, the sequence matching score calculated in step 804 and the count matching score calculated in step 806 are combined using their respective weight values to determine the matching degree between the target audio data and the source audio data. It should be noted that the respective weight values of the sequence matching score and the count matching score can be determined according to actual needs or empirically. In one example, only one of sequence matching and count matching is considered. For example, the matching degree between the target audio data and the source audio data can be determined based on the sequence matching score alone.
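Steps 806 and 808 can be illustrated with a short Python sketch. The text does not say how the per-class counts are compared, so the histogram overlap used here is purely an assumption, as are the default weights and the names `count_match_score` and `matching_degree`.

```python
from collections import Counter
from typing import List


def count_match_score(target_classes: List[str], source_classes: List[str]) -> float:
    """Overlap of the per-class occurrence counts, normalised to the range [0, 1]."""
    target_counts = Counter(target_classes)
    source_counts = Counter(source_classes)
    # Counter intersection keeps, per class, the smaller of the two counts.
    overlap = sum((target_counts & source_counts).values())
    total = max(sum(target_counts.values()), sum(source_counts.values()))
    return overlap / total if total else 0.0


def matching_degree(sequence_score: float, count_score: float,
                    w_sequence: float = 0.5, w_count: float = 0.5) -> float:
    """Weighted combination of the sequence matching and count matching scores."""
    return w_sequence * sequence_score + w_count * count_score
```

Setting `w_count` to 0 corresponds to the variant described above in which the matching degree is determined from the sequence matching score alone.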

Returning to Fig. 2, after the matching degree between the target audio data and the source audio data has been determined in step 206, method 200 proceeds to step 208, in which the source audio data whose matching degree is higher than a predetermined matching degree threshold is output as the retrieval result. Method 200 then ends. In some embodiments, after the retrieval result has been determined, the source audio data can also be added to the audio database for further training of, for example, the segment marking model in step 408 of Fig. 4.
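Step 208 amounts to a threshold filter over the computed matching degrees. The following Python sketch assumes the matching degrees are held in a dictionary keyed by a recording identifier and additionally sorts the hits from best to worst; the sorting and the name `retrieve` are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def retrieve(matching_degrees: Dict[str, float], threshold: float) -> List[Tuple[str, float]]:
    """Return the recordings whose matching degree exceeds the threshold, best first."""
    hits = [(name, degree) for name, degree in matching_degrees.items() if degree > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


results = retrieve({"clip1": 0.9, "clip2": 0.4, "clip3": 0.7}, threshold=0.5)
print(results)  # [('clip1', 0.9), ('clip3', 0.7)]
```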

Fig. 9 shows a functional block diagram of a pattern-based audio retrieval system 900 according to an embodiment of the present invention. The functional modules of audio retrieval system 900 can be implemented by hardware, software, or a combination of hardware and software that realizes the principles of the invention. Those skilled in the art will understand that the functional modules depicted in Fig. 9 can be combined or divided into sub-modules to realize the principles of the invention described above. Therefore, the description herein supports any possible combination, division or further definition of the functional modules described herein.

Audio retrieval system 900 can automatically and iteratively perform audio class marking and retrieval based on background patterns without manual marking, thereby providing more accurate and reasonable audio retrieval results. Audio retrieval system 900 can comprise marking device 902, target acquisition device 904, matching degree determining device 906 and retrieval output device 908.

Marking device 902 is configured to mark, based on patterns, multiple pieces of source audio data, for example those contained in an audio database, to obtain the audio marking sequence of each piece of source audio data. In one embodiment, marking device 902 can comprise division device 912, clustering device 914, decision tree construction device 916, model training device 918, segment adjustment device 920 and iteration condition judging device 922. Division device 912 is configured to divide each piece of source audio data to obtain multiple segments. In one example, division device 912 can divide the source audio data by any one or a combination of the following: dividing according to silence in the source audio data; dividing the source audio data by audio windows of a predetermined duration; and dividing the source audio data evenly by time. In one embodiment, division device 912 comprises a speech recognition device configured to determine whether the source audio data is speech data, and a division execution device configured to divide the source audio data into multiple segments based on the result determined by the speech recognition device.
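The three division strategies named above can be sketched as follows. This Python sketch works on frame indices rather than raw audio, and the silence-based variant assumes that a per-frame energy value and an energy threshold are available; neither detail comes from the text, and the function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def split_on_silence(energies: List[float], threshold: float) -> List[Segment]:
    """Keep runs of frames whose energy is above the threshold; silence separates segments."""
    segments: List[Segment] = []
    start = None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i
        elif energy <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments


def split_by_window(n_frames: int, window: int) -> List[Segment]:
    """Cut the recording into windows of a fixed length; the last one may be shorter."""
    return [(s, min(s + window, n_frames)) for s in range(0, n_frames, window)]


def split_evenly(n_frames: int, parts: int) -> List[Segment]:
    """Cut the recording into a fixed number of segments of nearly equal length."""
    bounds = [i * n_frames // parts for i in range(parts + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(parts) if bounds[i] < bounds[i + 1]]
```

Since these initial segments are later readjusted by the decoding step, a coarse initial division is presumably sufficient as a starting point.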

Clustering device 914 can be configured to determine the audio class sequence of each piece of source audio data using a clustering algorithm, based on the obtained segments. In one example, clustering device 914 comprises: a first clustering sub-device configured to build a Gaussian mixture model (GMM) from audio features extracted from the obtained segments; and a second clustering sub-device configured to determine the audio class sequence of the source audio data, based on the GMM built by the first clustering sub-device, using a clustering algorithm based on specific audio features and audio class distances.

Decision tree construction device 916 can be configured to build a decision tree based on patterns, according to the audio class sequences determined by clustering device 914 for the multiple pieces of source audio data. In one example, decision tree construction device 916 comprises: a first decision tree construction sub-device configured to define an audio class in the audio class sequence determined by clustering device 914 as the root node of the decision tree; a second decision tree construction sub-device configured to build a pattern question set based on the context, in the audio class sequence, of the audio class defined as the root node by the first decision tree construction sub-device; and a third decision tree construction sub-device configured to branch the audio classes in the determined audio class sequence based on the constructed pattern question set, thereby building the leaf nodes of the decision tree.

Model training device 918 can be configured to train a segment marking model for each leaf node of the decision tree built by decision tree construction device 916. In one example, the segment marking model comprises an HMM and a duration model.

Segment adjustment device 920 can be configured to use the segment marking model trained by model training device 918 to obtain the audio marking sequence of each piece of source audio data and to adjust the division of that source audio data. In one example, segment adjustment device 920 comprises: a first segment adjustment sub-device configured to determine the audio class distances of the source audio data using the segment marking model trained by model training device 918; a second segment adjustment sub-device configured to perform Viterbi decoding, based on the trained segment marking model, using the audio features extracted from the source audio data and the audio class distances determined by the first segment adjustment sub-device; and a third segment adjustment sub-device configured to obtain the audio marking sequence of the source audio data and adjust the division of the source audio data according to the Viterbi decoding result obtained by the second segment adjustment sub-device.

Iteration condition judging device 922 can be configured to judge whether a predetermined iteration condition is satisfied. In one example, the predetermined iteration condition can comprise: the amount of adjustment to the division of the source audio data is not less than a predetermined segmentation difference, and/or the number of iterations is less than a predetermined iteration count threshold.

Target acquisition device 904 can be configured to obtain the audio marking sequence of the target audio data. In one embodiment, target acquisition device 904 can comprise a device configured to perform Viterbi decoding on the target audio data, based on the segment marking model trained by model training device 918, to obtain the audio marking sequence of the target audio data.

Matching degree determining device 906 can be configured to determine the matching degree between the target audio data and the source audio data according to a predetermined matching rule, based on the audio marking sequence of the target audio data obtained by target acquisition device 904 and the audio marking sequence of each piece of source audio data in the audio database obtained by marking device 902.

In one embodiment, matching degree determining device 906 comprises: an audio class similarity determining device configured to determine the audio class distances between the audio classes relevant to the target audio data and to the source audio data; a sequence comparison device configured to calculate a sequence matching score, based on the audio class distances determined by the audio class similarity determining device, by comparing the audio marking sequence of the target audio data with the audio marking sequence of the source audio data; a count comparison device configured to calculate a count matching score by counting the number of occurrences of each audio class in the audio class sequences of the target audio data and the source audio data; and a matching degree calculation device configured to combine, using their respective weight values, the sequence matching score calculated by the sequence comparison device and the count matching score calculated by the count comparison device, to calculate the matching degree between the target audio data and the source audio data.

Retrieval output device 908 can be configured to output, as the retrieval result, the source audio data in the audio database whose matching degree, as determined by matching degree determining device 906, is higher than the predetermined matching degree threshold.

The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram can represent a module, program segment or portion of code that comprises one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks can occur in an order different from that noted in the drawings. For example, two consecutive blocks can in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

Various embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application or their technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A pattern-based audio retrieval method, comprising:

Marking multiple pieces of source audio data based on patterns, to obtain the audio marking sequence of each piece of source audio data;

Obtaining the audio marking sequence of target audio data;

Determining the matching degree between the target audio data and the source audio data according to a predetermined matching rule, based on the audio marking sequence of the target audio data and the audio marking sequence of each piece of source audio data; and

Outputting the source audio data whose matching degree is higher than a predetermined matching degree threshold, as the retrieval result.

2. The method according to claim 1, wherein marking multiple pieces of source audio data based on patterns comprises:

(a) dividing each piece of source audio data to obtain multiple segments;

(b) determining the audio class sequence of each piece of source audio data using a clustering algorithm, based on the obtained segments;

(c) building a decision tree based on patterns, according to the audio class sequences determined for the multiple pieces of source audio data;

(d) training a segment marking model for each leaf node of the decision tree;

(e) using the trained segment marking model, obtaining the audio marking sequence of each piece of source audio data and adjusting the division of that source audio data; and

(f) repeating operations (b) to (e) when a predetermined iteration condition is satisfied.

3. The method according to claim 2, wherein dividing each piece of source audio data comprises any one or more of the following:

Dividing according to silence in the source audio data;

Dividing the source audio data by audio windows of a predetermined duration; and

Dividing the source audio data evenly by time.

4. The method according to claim 2, wherein determining the audio class sequence of each piece of source audio data using a clustering algorithm, based on the obtained segments, comprises:

Building a Gaussian mixture model (GMM) from audio features extracted from the obtained segments; and

Based on the constructed GMM, determining the audio class sequence of the source audio data using a clustering algorithm based on specific audio features and audio class distances.

5. The method according to claim 2, wherein building a decision tree based on patterns, according to the audio class sequences determined for the multiple pieces of source audio data, comprises:

Defining an audio class in the determined audio class sequence as the root node of the decision tree;

Building a pattern question set based on the context, in the audio class sequence, of the audio class defined as the root node; and

Branching the audio classes in the determined audio class sequence based on the constructed pattern question set, thereby building the leaf nodes of the decision tree.

6. The method according to claim 4, wherein training a segment marking model for each leaf node of the decision tree comprises:

Training a hidden Markov model (HMM) and a duration model for each leaf node of the decision tree.

7. The method according to claim 2, wherein using the trained segment marking model to obtain the audio marking sequence of the source audio data and adjust the division of the source audio data comprises:

Determining the audio class distances of the source audio data using the trained segment marking model;

Performing Viterbi decoding, based on the trained segment marking model, using the audio features extracted from the source audio data and the determined audio class distances; and

According to the Viterbi decoding result, obtaining the audio marking sequence of the source audio data and adjusting the division of the source audio data.

8. The method according to claim 2, wherein dividing the source audio data to obtain multiple segments comprises:

Determining whether the source audio data is speech data; and

Dividing the source audio data to obtain multiple segments based on the result of the determination.

9. The method according to claim 2, wherein the predetermined iteration condition comprises any one or more of the following:

The amount of adjustment to the division of the source audio data is not less than a predetermined segmentation difference; and

The number of iterations is less than a predetermined iteration count threshold.

10. The method according to claim 2, wherein obtaining the audio marking sequence of the target audio data comprises:

Performing Viterbi decoding on the target audio data based on the trained segment marking model, to obtain the audio marking sequence of the target audio data.

11. The method according to any one of claims 2 to 10, wherein determining the matching degree between the target audio data and the source audio data according to the predetermined matching rule comprises:

Determining the audio class distances between the audio classes relevant to the target audio data and to the source audio data;

Calculating a sequence matching score based on the determined audio class distances, by comparing the audio marking sequence of the target audio data with the audio marking sequence of the source audio data;

Calculating a count matching score by counting the number of occurrences of each audio class in the audio class sequences of the target audio data and the source audio data; and

Combining the calculated sequence matching score and count matching score using their respective weight values, to calculate the matching degree between the target audio data and the source audio data.

12. A pattern-based audio retrieval system, comprising:

A marking device configured to mark multiple pieces of source audio data based on patterns, to obtain the audio marking sequence of each piece of source audio data;

A target acquisition device configured to obtain the audio marking sequence of target audio data;

A matching degree determining device configured to determine the matching degree between the target audio data and the source audio data according to a predetermined matching rule, based on the audio marking sequence of the target audio data obtained by the target acquisition device and the audio marking sequence of each piece of source audio data obtained by the marking device; and

A retrieval output device configured to output, as the retrieval result, the source audio data whose matching degree, as determined by the matching degree determining device, is higher than a predetermined matching degree threshold.

13. The system according to claim 12, wherein the marking device comprises:

A division device configured to divide each piece of source audio data to obtain multiple segments;

A clustering device configured to determine the audio class sequence of each piece of source audio data using a clustering algorithm, based on the obtained segments;

A decision tree construction device configured to build a decision tree based on patterns, according to the audio class sequences determined by the clustering device for the multiple pieces of source audio data;

A model training device configured to train a segment marking model for each leaf node of the decision tree built by the decision tree construction device;

A segment adjustment device configured to use the segment marking model trained by the model training device to obtain the audio marking sequence of each piece of source audio data and to adjust the division of that source audio data; and

An iteration condition judging device configured to judge whether a predetermined iteration condition is satisfied.

14. The system according to claim 13, wherein the division device divides each piece of source audio data by any one or more of the following:

Dividing according to silence in the source audio data;

Dividing the source audio data by audio windows of a predetermined duration; and

Dividing the source audio data evenly by time.

15. The system according to claim 13, wherein the clustering device comprises:

A first clustering sub-device configured to build a Gaussian mixture model (GMM) from audio features extracted from the obtained segments; and

A second clustering sub-device configured to determine the audio class sequence of the source audio data, based on the GMM built by the first clustering sub-device, using a clustering algorithm based on specific audio features and audio class distances.

16. The system according to claim 13, wherein the decision tree construction device comprises:

A first decision tree construction sub-device configured to define an audio class in the audio class sequence determined by the clustering device as the root node of the decision tree;

A second decision tree construction sub-device configured to build a pattern question set based on the context, in the audio class sequence, of the audio class defined as the root node by the first decision tree construction sub-device; and

A third decision tree construction sub-device configured to branch the audio classes in the determined audio class sequence based on the constructed pattern question set, thereby building the leaf nodes of the decision tree.

17. The system according to claim 15, wherein the model training device comprises: a device configured to train a hidden Markov model (HMM) and a duration model for each leaf node of the decision tree.

18. The system according to claim 13, wherein the segment adjustment device comprises:

A first segment adjustment sub-device configured to determine the audio class distances of the source audio data using the segment marking model trained by the model training device;

A second segment adjustment sub-device configured to perform Viterbi decoding, based on the trained segment marking model, using the audio features extracted from the source audio data and the audio class distances determined by the first segment adjustment sub-device; and

A third segment adjustment sub-device configured to obtain the audio marking sequence of the source audio data and adjust the division of the source audio data according to the Viterbi decoding result obtained by the second segment adjustment sub-device.

19. The system according to claim 13, wherein the division device comprises:

A speech recognition device configured to determine whether the source audio data is speech data; and

A division execution device configured to divide the source audio data to obtain multiple segments based on the result determined by the speech recognition device.

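20. The system according to claim 13, wherein the predetermined iteration condition comprises any one or more of the following:

The amount of adjustment to the division of the source audio data is not less than a predetermined segmentation difference; and

The number of iterations is less than a predetermined iteration count threshold.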

21. The system according to claim 13, wherein the target acquisition device comprises:

A device configured to perform Viterbi decoding on the target audio data based on the trained segment marking model, to obtain the audio marking sequence of the target audio data.

22. The system according to any one of claims 13 to 21, wherein the matching degree determining device comprises:

An audio class similarity determining device configured to determine the audio class distances between the audio classes relevant to the target audio data and to the source audio data;

A sequence comparison device configured to calculate a sequence matching score, based on the audio class distances determined by the audio class similarity determining device, by comparing the audio marking sequence of the target audio data with the audio marking sequence of the source audio data;

A count comparison device configured to calculate a count matching score by counting the number of occurrences of each audio class in the audio class sequences of the target audio data and the source audio data; and

A matching degree calculation device configured to combine, using their respective weight values, the sequence matching score calculated by the sequence comparison device and the count matching score calculated by the count comparison device, to calculate the matching degree between the target audio data and the source audio data.