CN103853749B - Mode-based audio retrieval method and system - Google Patents
- Publication number
- CN103853749B (application CN201210505562.2A)
- Authority
- CN
- China
- Prior art keywords
- audio
- audio data
- source
- sequence
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/683—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
Abstract
The invention provides a mode-based audio retrieval method and system. The audio retrieval method includes: marking a plurality of original audio data on the basis of modes to acquire an audio marking sequence for each original audio data; acquiring the audio marking sequence of target audio data; determining the matching degree between the target audio data and each original audio data according to preset matching rules, on the basis of the audio marking sequence of the target audio data and the audio marking sequences of the original audio data; and outputting, as retrieval results, the original audio data whose matching degree is higher than a preset matching degree threshold. With the method and system, audio marking and retrieval can be performed automatically and iteratively on the basis of background modes without manual marking, thereby providing more accurate and reasonable audio retrieval results.
Description
Technical field
The present invention relates generally to the field of multimedia information retrieval, and in particular to a mode-based audio retrieval method and system.
Background technology
The wide availability of the Internet has driven the rapid development of multimedia information technology. The amount of multimedia data obtainable from the Internet is growing rapidly. For example, as much as 48 hours of audio and video content is uploaded to YouTube every minute. Such massive volumes of data make one-by-one browsing impossible and pose ever greater challenges for data indexing and retrieval.
How to correctly find data files on a desired subject in an information repository is one of the research focuses in the field of multimedia information retrieval. For example, a wedding planning company may want to search massive amounts of material, based on a small number of wedding ceremony samples, to produce a final wedding video. A radio station producer or a video website production team may wish to find program categories of interest in massive data sets based on limited information, to support rapid program production. In addition, users may want to automatically tag and archive their personal multimedia collections so as to manage them more effectively.
Compared with video-based retrieval, audio-based retrieval has a wider range of applications; in some cases only audio data is available (for example, radio broadcasts). Audio carries a considerable amount of information that helps in understanding content, and audio files are generally smaller than video files. Therefore, even when a video file must be compressed to the point of being slightly blurred because of limited network upload capacity, its audio can remain comparatively clear.
However, prior-art audio indexing and retrieval methods have many drawbacks. First, existing audio indexing and retrieval methods require a large amount of manual labeling. An audio website, for example, typically holds a large number of unlabeled or only superficially labeled files, with no good descriptions of these files and no effective relevance links to other data for recommendation. Staff can only manually label and cross-link the small number of well-known or frequently visited files. Such audio indexing and retrieval methods are therefore usable only in specific domains and on limited sets of data samples.
Second, existing audio indexing and retrieval methods model the audio labels in isolation, which can make indexing and retrieval results inaccurate. For example, the same sound of running water means something entirely different against a natural-river background mode than against a home-kitchen background mode. Likewise, the roar of a crowd differs between an entertainment show, a talk show, and a sports broadcast. If a user supplies a recording of a babbling stream as a sample and wants to retrieve similar material from a multimedia database, an existing audio retrieval method will indiscriminately return data files containing running-water sounds from both the natural-river mode and the home-kitchen mode. Clearly, when context is not considered, many audio retrieval results are inaccurate.
Third, existing audio retrieval methods generally adopt a single sequential strategy: first segment the audio data, then classify and identify each segment. As a result, errors in earlier steps affect the execution of later steps and gradually accumulate into the final retrieval result, making it inaccurate or even completely off target.
Accordingly, there is a need for an audio retrieval method and system that run automatically without manual labeling.
Further, there is a need for an audio retrieval method and system that are based on background modes and can take audio class similarity into account.
Further still, there is a need for an audio retrieval method and system that can automatically eliminate accumulated errors and thereby provide more accurate retrieval results.
Summary of the invention
An object of the present invention is to perform mode-based labeling and modeling of source audio data automatically, and to provide accurate audio retrieval results that take audio class similarity into account.
To this end, the audio retrieval method and system of the present invention label source audio data automatically through an iterative process that integrates segmentation with clustering: in each iteration, a decision tree based on background modes is built and a segment labeling model is trained for each leaf node of the decision tree; finally, audio retrieval results are produced by model comparison combined with audio class similarity.
According to a first aspect of the present invention, there is provided a mode-based audio retrieval method, comprising: labeling a plurality of source audio data based on modes, to obtain an audio label sequence for each source audio data; obtaining an audio label sequence for target audio data; determining a matching degree between the target audio data and each source audio data according to a predetermined matching rule, based on the audio label sequence of the target audio data and the audio label sequences of the source audio data; and outputting, as retrieval results, the source audio data whose matching degree is higher than a predetermined matching degree threshold.
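The patent leaves the predetermined matching rule open. As one illustration only, the matching degree could be a normalized longest-common-subsequence score between the two label sequences; the rule, the threshold, and all label names below are hypothetical, not the patent's actual rule.

```python
def match_degree(target_labels, source_labels):
    """Hypothetical matching rule: length of the longest common
    subsequence of the two label sequences, normalized by the
    length of the target sequence."""
    m, n = len(target_labels), len(source_labels)
    # classic LCS dynamic program
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if target_labels[i - 1] == source_labels[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / m if m else 0.0

def retrieve(target, sources, threshold):
    """Output every source whose matching degree exceeds the threshold."""
    return [name for name, labels in sources.items()
            if match_degree(target, labels) > threshold]

sources = {
    "river_walk":  ["water", "birds", "water", "wind"],
    "kitchen_tap": ["speech", "water", "clatter"],
}
results = retrieve(["water", "birds", "wind"], sources, 0.8)
```

Only `river_walk` clears the 0.8 threshold here, since its label sequence contains the whole target sequence in order.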
In one embodiment, labeling a plurality of source audio data based on modes comprises performing the following operations for each source audio data: (a) dividing the source audio data to obtain a plurality of segments; (b) determining an audio class sequence for the source audio data from the obtained segments using a clustering algorithm; (c) building a mode-based decision tree from the audio class sequences determined for the plurality of source audio data; (d) training a segment labeling model for each leaf node of the decision tree; (e) using the trained segment labeling models to obtain the audio label sequence of the source audio data and to adjust the division of the source audio data; and (f) repeating operations (b) through (e) while a predetermined iteration condition is met.
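Operations (a) through (f) amount to the control flow sketched below. Every stand-in here is invented for illustration (fixed-window division, a mean-rounding "clustering", a one-leaf-per-class "tree", a nearest-leaf "decoder"); the sketch only shows how the steps chain and iterate, not the actual algorithms.

```python
def divide(source):
    # (a) hypothetical coarse division: fixed windows of 2 samples
    return [source[i:i + 2] for i in range(0, len(source), 2)]

def cluster(segments):
    # (b) toy "clustering": the class of a segment is its rounded mean
    return [round(sum(s) / len(s)) for s in segments]

def build_decision_tree(classes):
    # (c) toy "tree": one leaf per distinct class
    return sorted(set(classes))

def train_model(leaf, segments):
    # (d) toy segment labeling "model": just remember the leaf id
    return leaf

def decode_and_adjust(models, segments):
    # (e) toy decoding: label each segment with the nearest leaf;
    # the division is left unchanged in this sketch
    labels = [min(models, key=lambda leaf: abs(leaf - sum(s) / len(s)))
              for s in segments]
    return labels, segments

def label_source_audio(source, max_iters=3):
    """Skeleton of operations (a)-(f) for one source audio data."""
    segments = divide(source)                     # (a)
    labels = None
    for _ in range(max_iters):                    # (f) iteration condition
        classes = cluster(segments)               # (b)
        tree = build_decision_tree(classes)       # (c)
        models = {leaf: train_model(leaf, segments) for leaf in tree}  # (d)
        labels, segments = decode_and_adjust(models, segments)         # (e)
    return labels

labels = label_source_audio([0, 0, 10, 10, 0, 0])
```

In the real method, step (e) feeds an improved division back into step (b), which is what lets the iteration refine the initially coarse segmentation.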
According to a second aspect of the present invention, there is provided a mode-based audio retrieval system, comprising: a labeling device configured to label a plurality of source audio data based on modes, to obtain an audio label sequence for each source audio data; a target acquisition device configured to obtain an audio label sequence for target audio data; a matching degree determination device configured to determine a matching degree between the target audio data and each source audio data according to a predetermined matching rule, based on the audio label sequence of the target audio data obtained by the target acquisition device and the audio label sequences of the source audio data obtained by the labeling device; and a retrieval output device configured to output, as retrieval results, the source audio data whose matching degree, as determined by the matching degree determination device, is higher than a predetermined matching degree threshold.
In one embodiment, the labeling device includes: a division device configured to divide each source audio data to obtain a plurality of segments; a clustering device configured to determine the audio class sequence of each source audio data from the obtained segments using a clustering algorithm; a decision tree building device configured to build a mode-based decision tree from the audio class sequences determined by the clustering device for the plurality of source audio data; a model training device configured to train a segment labeling model for each leaf node of the decision tree built by the decision tree building device; a division adjustment device configured to use the segment labeling models trained by the model training device to obtain the audio label sequence of each source audio data and to adjust the division of that source audio data; and an iteration condition judgment device configured to judge whether a predetermined iteration condition is met.
With the method and system of the present invention, audio retrieval can be performed automatically without manual labeling.
With the method and system of the present invention, audio class labeling can be carried out iteratively based on background modes, thereby providing more accurate and reasonable audio retrieval results.
With the method and system of the present invention, audio class similarity can be taken into account, in combination with background modes, when performing audio retrieval.
Description of the drawings
The above and other objects, features, and advantages of the disclosure will become more apparent from the following detailed description of exemplary embodiments of the disclosure, taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 shows a block diagram of an exemplary computer system/server suitable for implementing embodiments of the present invention.
Fig. 2 is a general flowchart illustrating a mode-based audio retrieval method according to an embodiment of the present invention.
Fig. 3 schematically shows an example of an audio class sequence.
Fig. 4 is a flowchart illustrating a process for mode-based audio class labeling of source audio data according to an embodiment of the present invention.
Fig. 5 schematically shows an example of the clustering process.
Fig. 6 is a flowchart illustrating a process for building a mode-based decision tree according to an embodiment of the present invention.
Fig. 7 schematically shows an example of the decision tree building process.
Fig. 8 is a flowchart illustrating a process for determining the matching degree between target audio data and source audio data according to an embodiment of the present invention.
Fig. 9 shows a functional block diagram of a mode-based audio retrieval system according to an embodiment of the present invention.
Detailed description of embodiments
Preferred embodiments of the disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that the present invention may be implemented as a system, a method, or a computer program product. Accordingly, the disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.), or an embodiment combining hardware and software, generally referred to herein as a "circuit", "module", or "system". Furthermore, in some embodiments, the present invention may take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied thereon.
Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The present invention is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the computer or other programmable data processing apparatus, create means for implementing the functions/operations specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the functions/operations specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed thereon so as to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/operations specified in the flowchart and/or block diagram block or blocks.
Fig. 1 shows a block diagram of an exemplary computer system/server 12 suitable for implementing embodiments of the present invention. The computer system/server 12 shown in Fig. 1 is only an example, and should not impose any limitation on the functionality or scope of use of embodiments of the present invention.
As shown in Fig. 1, computer system/server 12 takes the form of a general-purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components (including the system memory 28 and the processing unit 16).
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer-system-readable media. Such media may be any available media that are accessible by computer system/server 12, and include both volatile and non-volatile media, and both removable and non-removable media.
System memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 may be provided for reading from and writing to non-removable, non-volatile magnetic media (not shown in Fig. 1, commonly called a "hard disk drive"). Although not shown in Fig. 1, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from and writing to a removable, non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media), may be provided. In such instances, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40, having a set (at least one) of program modules 42, may be stored, for example, in memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a networking environment. Program modules 42 generally carry out the functions and/or methods of the embodiments described in the invention.
Computer system/server 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with computer system/server 12, and/or with any devices (such as a network card, a modem, etc.) that enable computer system/server 12 to communicate with one or more other computing devices. Such communication can occur via input/output (I/O) interfaces 22. Moreover, computer system/server 12 can communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via network adapter 20. As depicted, network adapter 20 communicates with the other modules of computer system/server 12 via bus 18. It should be understood that, although not shown, other hardware and/or software modules could be used in conjunction with computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.
As mentioned above, the audio retrieval method and system of the present invention label source audio data automatically through an iterative process that integrates segmentation with clustering: in each iteration, a decision tree based on background modes is built and a segment labeling model is trained for each leaf node of the decision tree; finally, audio retrieval results are produced by model comparison combined with audio class similarity.
Embodiments of the present invention are described in detail below with reference to Figs. 2 to 9. Fig. 2 is a general flowchart of a mode-based audio retrieval method 200 according to an embodiment of the present invention. First, a plurality of source audio data, for example those contained in an audio database, are subjected to mode-based audio class labeling, to obtain an audio class sequence for each source audio data (step 202).
It should be noted that an "audio class" as used herein refers to a category of audio. Ideally, an "audio class" would be the event category of a piece of audio, such as a gunshot, a babbling stream, cheering, or screaming. In general, however, an "audio class" does not necessarily correspond exactly to the event category of the audio; it may merely be the result of running a particular audio processing algorithm (for example, a clustering algorithm) and may carry no semantic meaning. The present invention can perform accurate audio labeling and retrieval without knowing which specific event category each audio class represents, so the audio classification and retrieval method of the invention runs automatically, without supervision.
Audio data consists of many continuous or discrete audio segments, so an "audio class sequence" as used herein refers to a time-ordered sequence of audio classes; it describes the audio classes occurring in sequence in the audio data and their corresponding durations. Fig. 3 shows an example of an ideal audio class sequence. A "background mode" or "mode" as used herein refers to the environmental setting of the audio data, such as a natural river, a home kitchen, a railway station, an entertainment show, a talk show, or a sports broadcast.
Fig. 4 illustrates one detailed implementation 400 of step 202, in which source audio data are labeled automatically through an iterative process integrating segmentation with clustering; in each iteration, a decision tree based on background modes is built and a segment labeling model is trained for each leaf node of the decision tree.
Process 400 may start at step 402. In step 402, each of the plurality of source audio data is divided, to obtain a plurality of segments. In one embodiment, the division may be based on silence in the source audio data. In another embodiment, the source audio data may be divided using audio windows of a predetermined duration. In yet another embodiment, the source audio data may be divided evenly in time. In a further embodiment, any combination of silence-based division, audio-window division, and even division in time may be used to divide the source audio data.
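The silence-based and fixed-window divisions described above could be sketched as follows on a toy amplitude sequence. The silence threshold, minimum segment length, and window size are arbitrary assumptions, not values from the patent.

```python
def divide_by_silence(samples, silence_thresh=0.02, min_len=3):
    """Hypothetical silence-based division: cut wherever the
    absolute amplitude drops below silence_thresh, keeping only
    segments of at least min_len samples."""
    segments, current = [], []
    for x in samples:
        if abs(x) < silence_thresh:
            if len(current) >= min_len:
                segments.append(current)
            current = []
        else:
            current.append(x)
    if len(current) >= min_len:
        segments.append(current)
    return segments

def divide_by_window(samples, window=4):
    """Fixed-duration audio-window division."""
    return [samples[i:i + window] for i in range(0, len(samples), window)]

signal = [0.5, 0.6, 0.4, 0.0, 0.0, 0.7, 0.8, 0.9, 0.0]
by_silence = divide_by_silence(signal)
by_window = divide_by_window(signal)
```

As the next paragraph notes, this first division may be coarse; the iteration is what refines it.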
It should be noted that the division of the source audio data in step 402 may be relatively coarse. Through the subsequent iterations of clustering, decision tree building, and model training, and by applying the Viterbi algorithm, increasingly accurate divisions can be obtained.
Then, in step 404, an audio class sequence for each source audio data is determined using a clustering algorithm, based on the plurality of segments obtained by the division in step 402. In one example, a Gaussian mixture model (GMM) is built using audio features extracted from the obtained segments. Once the model is determined, the distance between audio classes can be determined. Then, based on the constructed GMMs, hierarchical clustering is performed by the clustering algorithm using specific audio features (for example, time-domain or frequency-domain features) and the audio class distances, and the audio class sequence of the source audio data is finally determined.
According to the clustering algorithm and a predetermined clustering criterion, the clustering process can be stopped at a desired cluster level. In this example, the variables at the level at which clustering stops are defined as "audio classes", and the variables at each level below it are defined as "audio subclasses". Accordingly, a series of sequentially arranged audio classes constitutes an "audio class sequence". As mentioned above, it should be understood that the audio classes and audio subclasses obtained in step 404 may carry no semantic meaning.
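Hierarchical clustering that stops at a desired level can be illustrated with a toy bottom-up merge on one-dimensional segment features. The GMM-based audio class distance is replaced here by a simple distance between cluster means, which is an assumption for readability, not the patent's model.

```python
def agglomerate(points, target_classes):
    """Toy agglomerative clustering: repeatedly merge the two
    closest clusters until the desired level (number of audio
    classes) is reached. Intermediate levels would correspond to
    'audio subclasses' in the text."""
    clusters = [[p] for p in points]
    while len(clusters) > target_classes:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                mi = sum(clusters[i]) / len(clusters[i])
                mj = sum(clusters[j]) / len(clusters[j])
                d = abs(mi - mj)   # stand-in for the GMM class distance
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# one scalar feature per segment (e.g. a per-segment energy value)
features = [0.1, 0.2, 5.0, 5.1, 9.8]
classes = agglomerate(features, 3)
```

Stopping at `target_classes=3` plays the role of the predetermined clustering criterion that fixes the "audio class" level.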
Fig. 5 shows an example of the clustering process, in which each point in level L1 represents a GMM variable built from audio features extracted from multiple audio segments, and L2, L3, ..., Ln represent the audio clustering levels obtained by the clustering algorithm based on specific time-domain or frequency-domain audio features and audio class distances. Each point in Ln (for example, a, b, c, d, e, etc.) is defined as an audio class, while each point in L2 through Ln-1 is regarded as an audio subclass of the audio data.
Next, in step 406, a mode-based decision tree is built from the audio class sequences determined in step 404 for the plurality of source audio data. Fig. 6 shows one detailed implementation 600 of the mode-based decision tree building of step 406. First, at step 602, each audio class in the audio class sequences determined in step 404 (for example, a, b, c, d, e at level Ln in Fig. 5) is defined as a root node of the decision tree.
Then, in step 604, a mode question set is built based on the context, within the audio class sequence, of each audio class defined as a root node. The mode question set may be built according to a predetermined rule, for example so as to maximize the discriminability of the branches. In one example, the context of an audio class refers to the audio classes before and after it in the audio class sequence. In another example, the context of an audio class refers to one or more audio subclasses obtained for that audio class in the clustering process of step 404. The context of an audio class can, to some extent, reflect its background mode. For example, consider an audio class related to a train whistle: if the previous audio class in the sequence is related to broadcast announcements and the next audio class is related to the noise of a crowd, the background mode is likely a railway station; but if the previous audio class is related to gunshots and the next to cheering, it is more likely a movie scene mode such as "Railway Guerrilla".
Finally, in step 606, the audio classes in the audio class sequences are branched using the constructed mode question set, thereby building the leaf nodes of the decision tree. A "leaf node of the decision tree" as used herein refers to a node in the decision tree that has no child nodes; conversely, any node that has child nodes is defined as a "root node". It should be noted that the decision tree can be branched down to a predetermined node level; for example, the building of the decision tree may be terminated when the number of audio labels contained in each leaf node falls below a predetermined threshold.
Fig. 7 shows an example of the decision tree construction process, where audio class b is, for example, an audio class in the audio class sequence obtained by the clustering process in the example of Fig. 5. Assume that the audio class sequences obtained for the multiple source audio data by clustering contain four groups that include audio class b, as shown in Fig. 7: (a-b+c), (a-b+e), (d-b+a) and (d-b+c), where the symbol "-" denotes the audio class preceding b in the sequence and the symbol "+" denotes the audio class following b in the sequence. That is, (a-b+c) indicates that the audio class preceding b in the sequence is a and the audio class following b is c.
Using the context-based question set, audio class b is progressively branched downward into leaf nodes such as b1, b2, b3 and b4. For example, "does the context contain audio class a?" can first be selected as the question for branching audio class b, splitting off (d-b+c), which is defined as leaf node b1. Then, "is the previous audio class a?" can be selected as the question for further branching, splitting off (d-b+a), which is defined as leaf node b2. Next, "is the next audio class c?" can be selected as the question for further branching, distinguishing (a-b+e) from (a-b+c), which are defined as leaf nodes b3 and b4 respectively. At this point, the construction of the decision tree is complete.
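The branching walk-through above can be sketched in code. The questions and leaf assignments follow the Fig. 7 example; the `split` helper and the (prev, next) data layout are assumptions of this sketch, not the patent's implementation.

```python
# Illustrative sketch of the Fig. 7 branching: splitting the contexts of
# audio class b with yes/no context questions until each leaf holds one group.

contexts = [("a", "c"), ("a", "e"), ("d", "a"), ("d", "c")]  # (prev, next) of b

def split(items, question):
    """Partition items into (yes, no) lists according to a context question."""
    yes = [it for it in items if question(it)]
    no = [it for it in items if not question(it)]
    return yes, no

# Q1: does the context contain audio class a?  The "no" branch becomes b1.
with_a, b1 = split(contexts, lambda c: "a" in c)
# Q2: is the previous audio class a?  The "no" branch becomes b2.
prev_a, b2 = split(with_a, lambda c: c[0] == "a")
# Q3: is the next audio class c?  This separates b4 from b3.
b4, b3 = split(prev_a, lambda c: c[1] == "c")
```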
Returning to Fig. 4, next, in step 408, a segment marking model is trained for each leaf node on the decision tree. In one example, the segment marking model can include a hidden Markov model (HMM) and a duration model. Then, using the trained segment marking models, the audio marking sequence of each source audio data is obtained and the division of that source audio data is adjusted (step 410). Note that the "audio marking sequence" here is related to but distinct from the audio class sequence: it does not correspond to the event categories involved in the audio, but is merely the result of certain audio processing algorithms (for example, the Viterbi algorithm), serving the subsequent matching processing. In one embodiment of the invention, step 410 can be realized by the following operations: first, using the segment marking models trained in step 408, the audio class distances of the source audio data are determined; then, based on the trained segment marking models, Viterbi decoding is performed using the audio features extracted from the source audio data and the determined audio class distances; finally, according to the Viterbi decoding result, the audio marking sequence of the source audio data is obtained and the division of the source audio data is adjusted.
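As a rough illustration of the Viterbi-based marking in step 410, the following minimal decoder treats audio classes as states, per-segment log-likelihoods as emission scores, and pairwise transition scores standing in for the trained HMM/duration models and audio class distances. All names and values are invented for this sketch.

```python
# Minimal Viterbi sketch over audio segments: pick the best-scoring class
# path given per-segment emission scores and pairwise transition scores.

def viterbi(emission_logp, transition_logp, classes):
    """emission_logp: list of {class: log-score} dicts, one per segment."""
    best = {c: emission_logp[0][c] for c in classes}
    back = []
    for frame in emission_logp[1:]:
        ptr, nxt = {}, {}
        for c in classes:
            prev_c = max(classes, key=lambda p: best[p] + transition_logp[(p, c)])
            ptr[c] = prev_c
            nxt[c] = best[prev_c] + transition_logp[(prev_c, c)] + frame[c]
        back.append(ptr)
        best = nxt
    last = max(classes, key=lambda c: best[c])
    path = [last]
    for ptr in reversed(back):  # backtrack through the stored pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

classes = ["speech", "music"]
# Staying in the same class is cheap; switching is penalized.
trans = {(p, c): (-0.1 if p == c else -2.0) for p in classes for c in classes}
emissions = [{"speech": -0.2, "music": -3.0},
             {"speech": -0.3, "music": -2.5},
             {"speech": -2.8, "music": -0.2}]
path = viterbi(emissions, trans, classes)
```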
Next, the method proceeds to decision step 412, where it is determined whether a predetermined iteration condition is met. In one example, the predetermined iteration condition can include: the adjustment amount to the division of the source audio data is not less than a predetermined segment difference, and/or the number of iterations is less than a predetermined iteration count threshold.
If it is determined in step 412 that further iteration is needed, method 400 returns to step 404 to perform the clustering processing, decision tree construction processing and segment marking model training based on the segments readjusted in step 410. If it is determined in step 412 that the iteration can be exited, the audio marking sequences of the obtained audio data are output in step 414.
In one embodiment of the invention, before the audio data is divided in step 402, it may also be determined whether the source audio data is speech data (step 416). The source audio data contained in the audio database may be speech data or non-speech data. A support vector machine (SVM) method well known in the art can be used to distinguish speech from non-speech. Accurately distinguishing speech from non-speech aids the subsequent segmentation, clustering, decision tree construction and model training steps.
Returning now to method 200 of Fig. 2, after the audio marking sequence of each source audio data is obtained in step 202, method 200 proceeds to step 204. In step 204, the audio marking sequence of the target audio data is obtained. In one embodiment of the invention, Viterbi decoding can be performed on the target audio data based on, for example, the segment marking models trained in step 408 of Fig. 4, to obtain the audio marking sequence of the target audio data.
Next, in step 206, based on the audio marking sequence of the target audio data obtained in step 204 and the audio marking sequence of each source audio data obtained in step 202, the matching degree between the target audio data and the source audio data is determined according to a predetermined matching rule.
Fig. 8 shows one concrete implementation 800 of determining the matching degree between the target audio data and the source audio data in step 206, in which both the similarity between audio classes and the background mode are considered during matching, so as to retrieve and rank the source audio data related to the target audio data.
First, in step 802, the audio class distances between the audio classes related to the target audio data and those of the source audio data are determined. For example, the audio class distances can be determined based on the segment marking models trained in step 408 of Fig. 4. Then, in step 804, a sequence matching score is calculated by comparing the audio marking sequence of the target audio data with the audio marking sequence of the source audio data, based on the audio class distances determined in step 802. In one example, a dynamic time warping (DTW) algorithm can be used, with the audio class distances as weights, to calculate the similarity between the audio marking sequence of the target audio data and that of the source audio data, i.e., the sequence matching score.
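A plain DTW implementation of the step 804 comparison might look as follows, with a class-distance table as the local cost. The distance values in the usage lines are invented for illustration.

```python
# Hedged sketch of the DTW sequence comparison: aligning two audio marking
# sequences with a class-distance matrix as the local cost.

def dtw_score(seq_a, seq_b, class_dist):
    """Classic dynamic time warping; lower total cost = better match."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = class_dist[(seq_a[i - 1], seq_b[j - 1])]
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Toy 0/1 distance table over four invented classes.
dist = {(x, y): 0.0 if x == y else 1.0 for x in "abcd" for y in "abcd"}
same = dtw_score(list("abc"), list("abbc"), dist)  # repeated classes align freely
diff = dtw_score(list("abc"), list("abd"), dist)   # one substitution
```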
Then, in step 806, a count matching score is calculated by counting the number of each audio class in the audio class sequences of the target audio data and the source audio data. For example, the number of times each audio class occurs within a specific time period can be counted. Calculating the count matching score helps find similar background modes. Finally, in step 808, the sequence matching score calculated in step 804 and the count matching score calculated in step 806 are combined with their respective weight values, thereby determining the matching degree between the target audio data and the source audio data. Note that the weight values corresponding to the sequence matching score and the count matching score can be determined according to actual needs or based on empirical values. In one example, only one of sequence matching and count matching may be considered; for example, the matching degree between the target audio data and the source audio data can be determined based only on the sequence matching score.
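A minimal sketch of the count matching and weighted combination of steps 806-808, assuming histogram overlap as the count similarity and example weight values; the patent leaves both the similarity definition and the weights to empirical tuning.

```python
from collections import Counter

# Illustrative combination of the two scores: a count matching score from
# audio class histograms plus a sequence matching score, mixed with weights.
# Both the similarity definition and the weights are assumptions.

def count_match(seq_a, seq_b):
    """Histogram overlap of audio class counts, normalized to [0, 1]."""
    ca, cb = Counter(seq_a), Counter(seq_b)
    overlap = sum(min(ca[c], cb[c]) for c in ca)
    return overlap / max(len(seq_a), len(seq_b))

def matching_degree(seq_score, cnt_score, w_seq=0.6, w_cnt=0.4):
    """Weighted combination; weights would be tuned empirically."""
    return w_seq * seq_score + w_cnt * cnt_score
```

Setting `w_cnt=0` recovers the variant mentioned above in which only the sequence matching score is considered.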
Returning to Fig. 2, after the matching degree between the target audio data and the source audio data is determined in step 206, method 200 proceeds to step 208, i.e., the source audio data whose matching degree is higher than a predetermined matching degree threshold is output as the retrieval result. At this point, method 200 ends. In some embodiments, after the retrieval result is determined, the source audio data can also be added to the audio database to further train the segment marking models, for example in step 408 of Fig. 4.
Fig. 9 shows a functional block diagram of a mode-based audio retrieval system 900 according to an embodiment of the present invention. The functional modules of audio retrieval system 900 can be realized by hardware, software, or a combination of hardware and software implementing the principles of the invention. Those skilled in the art will understand that the functional modules depicted in Fig. 9 can be combined or divided into sub-modules so as to realize the principles of the invention described above. Accordingly, the description herein supports any possible combination, division, or further definition of the functional modules described herein.
Audio retrieval system 900 can automatically and iteratively perform audio class marking and retrieval based on background modes without manual marking, thereby providing more accurate and reasonable audio retrieval results. Audio retrieval system 900 can include a marking device 902, a target acquisition device 904, a matching degree determining device 906, and a search and output device 908.
Marking device 902 is configured to mark, based on modes, multiple source audio data contained, for example, in an audio database, to obtain the audio marking sequence of each source audio data. In one embodiment, marking device 902 can include a dividing device 912, a clustering device 914, a decision tree construction device 916, a model training device 918, a segment adjusting device 920 and an iteration condition judging device 922. Dividing device 912 is configured to divide each source audio data to obtain multiple segments. In one example, dividing device 912 can divide the source audio data by any one or any combination of the following: dividing according to silence in the source audio data; dividing the source audio data according to audio windows of a predetermined duration; and dividing the source audio data evenly over time. In one embodiment, dividing device 912 includes a speech recognition device configured to determine whether the source audio data is speech data, and a division executing device configured to divide the source audio data based on the result determined by the speech recognition device, to obtain multiple segments.
Clustering device 914 can be configured to determine the audio class sequence of each source audio data using a clustering algorithm, based on the obtained multiple segments. In one example, clustering device 914 includes: a first clustering sub-device, configured to build a GMM using audio features extracted from the obtained multiple segments; and a second clustering sub-device, configured to determine the audio class sequence of the source audio data based on the GMM built by the first clustering sub-device, using a clustering algorithm based on specific audio features and audio class distances.
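As a self-contained stand-in for the clustering performed by clustering device 914, the toy one-dimensional k-means below turns per-segment features into an audio class sequence; the patent instead builds a GMM and clusters using audio class distances, so this is only an analogy.

```python
# Toy stand-in for clustering device 914: k-means over one-dimensional
# segment features, producing an audio class label per segment.

def kmeans_1d(features, centers, iters=10):
    centers = list(centers)
    for _ in range(iters):
        # Assign each segment feature to its nearest center (its "audio class").
        labels = [min(range(len(centers)), key=lambda k: abs(f - centers[k]))
                  for f in features]
        # Recompute each center as the mean of its assigned features.
        for k in range(len(centers)):
            members = [f for f, lab in zip(features, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels, centers

features = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]   # e.g. per-segment energies
labels, centers = kmeans_1d(features, [0.0, 1.0])
# labels is the audio class sequence for the six segments
```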
Decision tree construction device 916 can be configured to construct a decision tree based on modes, according to the audio class sequences determined by clustering device 914 for the multiple source audio data. In one example, decision tree construction device 916 includes: a first decision tree construction sub-device, configured to define the audio classes in the audio class sequences determined by clustering device 914 as root nodes of the decision tree; a second decision tree construction sub-device, configured to form a mode question set based on the context, in the audio class sequence, of the audio classes defined as root nodes by the first decision tree construction sub-device; and a third decision tree construction sub-device, configured to branch the audio classes in the determined audio class sequences based on the constructed mode question set, thereby building the leaf nodes of the decision tree.
Model training device 918 can be configured to train a segment marking model for each leaf node on the decision tree built by decision tree construction device 916. In one example, the segment marking model is, for example, an HMM together with a duration model.
Segment adjusting device 920 can be configured to use the segment marking models trained by model training device 918 to obtain the audio marking sequence of each source audio data and adjust the division of that source audio data. In one example, segment adjusting device 920 includes: a first segment adjusting sub-device, configured to determine the audio class distances of the source audio data using the segment marking models trained by model training device 918; a second segment adjusting sub-device, configured to perform Viterbi decoding based on the trained segment marking models, using the audio features extracted from the source audio data and the audio class distances determined by the first segment adjusting sub-device; and a third segment adjusting sub-device, configured to obtain the audio marking sequence of the source audio data according to the Viterbi decoding result obtained by the second segment adjusting sub-device, and to adjust the division of the source audio data.
Iteration condition judging device 922 can be configured to judge whether the predetermined iteration condition is met. In one example, the predetermined iteration condition can include: the adjustment amount to the division of the source audio data is not less than a predetermined segment difference, and/or the number of iterations is less than a predetermined iteration count threshold.
Target acquisition device 904 can be configured to obtain the audio marking sequence of the target audio data. In one embodiment, target acquisition device 904 can be configured as a device that performs Viterbi decoding on the target audio data based on the segment marking models trained by model training device 918, to obtain the audio marking sequence of the target audio data.
Matching degree determining device 906 can be configured to determine the matching degree between the target audio data and the source audio data according to a predetermined matching rule, based on the audio marking sequence of the target audio data obtained by target acquisition device 904 and the audio marking sequence of each source audio data in the audio database obtained by marking device 902.
In one embodiment, matching degree determining device 906 includes: an audio class similarity determining device, configured to determine the audio class distances between the audio classes related to the target audio data and those of the source audio data; a sequence comparing device, configured to calculate a sequence matching score by comparing the audio marking sequence of the target audio data with that of the source audio data, based on the audio class distances determined by the audio class similarity determining device; a count comparing device, configured to calculate a count matching score by counting the number of each audio class in the audio class sequences of the target audio data and the source audio data; and a matching degree calculating device, configured to calculate the matching degree between the target audio data and the source audio data by combining, with respective weight values, the sequence matching score calculated by the sequence comparing device and the count matching score calculated by the count comparing device.
Search and output device 908 can be configured to output, as the retrieval result, the source audio data in the audio database whose matching degree, as determined by matching degree determining device 906, is higher than the predetermined matching degree threshold.
Using the method and system of the present invention, audio retrieval can be performed automatically without manual marking.
Using the method and system of the present invention, audio class marking can be performed iteratively based on background modes, thereby providing more accurate and reasonable audio retrieval results.
Using the method and system of the present invention, audio class similarity can be taken into account and audio retrieval can be performed in combination with background modes.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions and operations of the systems, methods and computer program products of multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by combinations of special-purpose hardware and computer instructions.
Various embodiments of the present invention have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application or technological improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (20)
1. A mode-based audio retrieval method, comprising:
marking multiple source audio data based on modes, to obtain an audio marking sequence of each source audio data;
obtaining an audio marking sequence of target audio data;
determining a matching degree between the target audio data and the source audio data according to a predetermined matching rule, based on the audio marking sequence of the target audio data and the audio marking sequence of each source audio data; and
outputting source audio data whose matching degree is higher than a predetermined matching degree threshold, as a retrieval result;
wherein marking the multiple source audio data based on modes comprises:
(a) dividing each source audio data to obtain multiple segments;
(b) determining an audio class sequence of each source audio data using a clustering algorithm, based on the obtained multiple segments;
(c) constructing a decision tree based on modes, according to the audio class sequences determined for the multiple source audio data;
(d) training a segment marking model for each leaf node on the decision tree;
(e) obtaining the audio marking sequence of each source audio data and adjusting the division of the source audio data, using the trained segment marking models; and
(f) repeating operations (b) to (e) when a predetermined iteration condition is met.
2. The method according to claim 1, wherein dividing each source audio data comprises any one or more of the following:
dividing according to silence in the source audio data;
dividing the source audio data according to audio windows of a predetermined duration; and
dividing the source audio data evenly over time.
3. The method according to claim 1, wherein determining the audio class sequence of each source audio data using a clustering algorithm based on the obtained multiple segments comprises:
building a Gaussian mixture model (GMM) using audio features extracted from the obtained multiple segments; and
determining the audio class sequence of the source audio data based on the constructed GMM, using a clustering algorithm based on specific audio features and audio class distances.
4. The method according to claim 1, wherein constructing a decision tree based on modes according to the audio class sequences determined for the multiple source audio data comprises:
defining the audio classes in the determined audio class sequences as root nodes of the decision tree;
forming a mode question set based on the context, in the audio class sequence, of the audio classes defined as root nodes; and
branching the audio classes in the determined audio class sequences based on the constructed mode question set, thereby building the leaf nodes of the decision tree.
5. The method according to claim 3, wherein training a segment marking model for each leaf node on the decision tree comprises:
training a hidden Markov model (HMM) and a duration model for each leaf node on the decision tree.
6. The method according to claim 1, wherein obtaining the audio marking sequence of the source audio data and adjusting the division of the source audio data using the trained segment marking models comprises:
determining audio class distances of the source audio data using the trained segment marking models;
performing Viterbi decoding based on the trained segment marking models, using the audio features extracted from the source audio data and the determined audio class distances; and
obtaining the audio marking sequence of the source audio data according to the Viterbi decoding result, and adjusting the division of the source audio data.
7. The method according to claim 1, wherein dividing the source audio data to obtain multiple segments comprises:
determining whether the source audio data is speech data; and
dividing the source audio data based on the result of the determination, to obtain multiple segments.
8. The method according to claim 1, wherein the predetermined iteration condition comprises any one or more of the following:
the adjustment amount to the division of the source audio data is not less than a predetermined segment difference; and
the number of iterations is less than a predetermined iteration count threshold.
9. The method according to claim 1, wherein obtaining the audio marking sequence of the target audio data comprises:
performing Viterbi decoding on the target audio data based on the trained segment marking models, to obtain the audio marking sequence of the target audio data.
10. The method according to any one of claims 1 to 9, wherein determining the matching degree between the target audio data and the source audio data according to the predetermined matching rule comprises:
determining audio class distances between the audio classes related to the target audio data and those of the source audio data;
calculating a sequence matching score by comparing the audio marking sequence of the target audio data with the audio marking sequence of the source audio data, based on the determined audio class distances;
calculating a count matching score by counting the number of each audio class in the audio class sequences of the target audio data and the source audio data; and
calculating the matching degree between the target audio data and the source audio data by combining the calculated sequence matching score and count matching score with respective weight values.
11. A mode-based audio retrieval system, comprising:
a marking device, configured to mark multiple source audio data based on modes, to obtain an audio marking sequence of each source audio data;
a target acquisition device, configured to obtain an audio marking sequence of target audio data;
a matching degree determining device, configured to determine a matching degree between the target audio data and the source audio data according to a predetermined matching rule, based on the audio marking sequence of the target audio data obtained by the target acquisition device and the audio marking sequence of each source audio data obtained by the marking device; and
a search and output device, configured to output, as a retrieval result, source audio data whose matching degree, as determined by the matching degree determining device, is higher than a predetermined matching degree threshold;
wherein the marking device comprises:
a dividing device, configured to divide each source audio data to obtain multiple segments;
a clustering device, configured to determine an audio class sequence of each source audio data using a clustering algorithm, based on the obtained multiple segments;
a decision tree construction device, configured to construct a decision tree based on modes, according to the audio class sequences determined by the clustering device for the multiple source audio data;
a model training device, configured to train a segment marking model for each leaf node on the decision tree built by the decision tree construction device;
a segment adjusting device, configured to obtain the audio marking sequence of each source audio data and adjust the division of the source audio data, using the segment marking models trained by the model training device; and
an iteration condition judging device, configured to judge whether a predetermined iteration condition is met.
12. The system according to claim 11, wherein the dividing device divides each source audio data by any one or more of the following:
dividing according to silence in the source audio data;
dividing the source audio data according to audio windows of a predetermined duration; and
dividing the source audio data evenly over time.
13. The system according to claim 11, wherein the clustering device comprises:
a first clustering sub-device, configured to build a Gaussian mixture model (GMM) using audio features extracted from the obtained multiple segments; and
a second clustering sub-device, configured to determine the audio class sequence of the source audio data based on the GMM built by the first clustering sub-device, using a clustering algorithm based on specific audio features and audio class distances.
14. The system according to claim 11, wherein the decision tree construction device comprises:
a first decision tree construction sub-device, configured to define the audio classes in the audio class sequences determined by the clustering device as root nodes of the decision tree;
a second decision tree construction sub-device, configured to form a mode question set based on the context, in the audio class sequence, of the audio classes defined as root nodes by the first decision tree construction sub-device; and
a third decision tree construction sub-device, configured to branch the audio classes in the determined audio class sequences based on the constructed mode question set, thereby building the leaf nodes of the decision tree.
15. The system according to claim 13, wherein the model training device comprises: a device configured to train a hidden Markov model (HMM) and a duration model for each leaf node on the decision tree.
16. The system according to claim 11, wherein the segment adjusting device comprises:
a first segment adjusting sub-device, configured to determine audio class distances of the source audio data using the segment marking models trained by the model training device;
a second segment adjusting sub-device, configured to perform Viterbi decoding based on the trained segment marking models, using the audio features extracted from the source audio data and the audio class distances determined by the first segment adjusting sub-device; and
a third segment adjusting sub-device, configured to obtain the audio marking sequence of the source audio data according to the Viterbi decoding result obtained by the second segment adjusting sub-device, and to adjust the division of the source audio data.
17. The system according to claim 11, wherein the dividing device comprises:
a speech recognition device, configured to determine whether the source audio data is speech data; and
a division executing device, configured to divide the source audio data based on the result determined by the speech recognition device, to obtain multiple segments.
18. The system according to claim 11, wherein the predetermined iteration condition comprises any one or more of the following:
the adjustment amount to the division of the source audio data is not less than a predetermined segment difference; and
the number of iterations is less than a predetermined iteration count threshold.
19. The system according to claim 11, wherein the target acquisition device comprises: a device configured to perform Viterbi decoding on the target audio data based on the trained segment marking models, to obtain the audio marking sequence of the target audio data.
20. The system according to any one of claims 11 to 19, wherein the matching degree determining device comprises:
an audio class similarity determining device, configured to determine audio class distances between the audio classes related to the target audio data and those of the source audio data;
a sequence comparing device, configured to calculate a sequence matching score by comparing the audio marking sequence of the target audio data with the audio marking sequence of the source audio data, based on the audio class distances determined by the audio class similarity determining device;
a count comparing device, configured to calculate a count matching score by counting the number of each audio class in the audio class sequences of the target audio data and the source audio data; and
a matching degree calculating device, configured to calculate the matching degree between the target audio data and the source audio data by combining, with respective weight values, the sequence matching score calculated by the sequence comparing device and the count matching score calculated by the count comparing device.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210505562.2A CN103853749B (en) | 2012-11-30 | 2012-11-30 | Mode-based audio retrieval method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210505562.2A CN103853749B (en) | 2012-11-30 | 2012-11-30 | Mode-based audio retrieval method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103853749A CN103853749A (en) | 2014-06-11 |
CN103853749B true CN103853749B (en) | 2017-04-26 |
Family
ID=50861416
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210505562.2A Active CN103853749B (en) | 2012-11-30 | 2012-11-30 | Mode-based audio retrieval method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103853749B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105335466A (en) * | 2015-09-25 | 2016-02-17 | 百度在线网络技术(北京)有限公司 | Audio data retrieval method and apparatus |
CN107293308B (en) * | 2016-04-01 | 2019-06-07 | 腾讯科技(深圳)有限公司 | A kind of audio-frequency processing method and device |
CN105955699A (en) * | 2016-06-14 | 2016-09-21 | 珠海格力电器股份有限公司 | Remote control equipment position determining method and device and terminal equipment |
CN107665240A (en) * | 2017-09-01 | 2018-02-06 | 北京雷石天地电子技术有限公司 | audio file clustering method and device |
CN110472190A (en) * | 2018-05-09 | 2019-11-19 | 北京京东尚科信息技术有限公司 | The method and apparatus for filling ordered sequence |
CN109965764A (en) * | 2019-04-18 | 2019-07-05 | 科大讯飞股份有限公司 | Closestool control method and closestool |
CN110399521B (en) * | 2019-06-21 | 2023-06-06 | 平安科技(深圳)有限公司 | Music retrieval method, system, computer device and computer readable storage medium |
CN110688414B (en) * | 2019-09-29 | 2022-07-22 | 京东方科技集团股份有限公司 | Method and device for processing time series data and computer readable storage medium |
CN111460215B (en) * | 2020-03-30 | 2021-08-24 | 腾讯科技(深圳)有限公司 | Audio data processing method and device, computer equipment and storage medium |
CN112015942B (en) * | 2020-08-28 | 2024-08-30 | 上海掌门科技有限公司 | Audio processing method and device |
CN113947855A (en) * | 2021-09-18 | 2022-01-18 | 中标慧安信息技术股份有限公司 | Intelligent building personnel safety alarm system based on voice recognition |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI220483B (en) * | 2002-10-17 | 2004-08-21 | Inst Information Industry | Creation method of search database for audio/video information and song search system |
WO2008126262A1 (en) * | 2007-03-30 | 2008-10-23 | Pioneer Corporation | Content explanation apparatus and method |
CN101364222A (en) * | 2008-09-02 | 2009-02-11 | 浙江大学 | Two-stage audio search method |
CN101477798A (en) * | 2009-02-17 | 2009-07-08 | 北京邮电大学 | Method for analyzing and extracting audio data of set scene |
CN102033927A (en) * | 2010-12-15 | 2011-04-27 | 哈尔滨工业大学 | Rapid audio searching method based on GPU (Graphic Processing Unit) |
2012
- 2012-11-30 CN CN201210505562.2A patent/CN103853749B/en active Active
Non-Patent Citations (1)
Title |
---|
Research on Audio Retrieval Technology Based on Feature Similarity; Pan Junlan; China Master's Theses Full-text Database, Information Science and Technology; 2011-12-15; Section 4.1.2 on page 18, pages 21-25 *
Also Published As
Publication number | Publication date |
---|---|
CN103853749A (en) | 2014-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103853749B (en) | Mode-based audio retrieval method and system | |
US10671666B2 (en) | Pattern based audio searching method and system | |
US10133538B2 (en) | Semi-supervised speaker diarization | |
CN110557589B (en) | System and method for integrating recorded content | |
CN111243602B (en) | Voiceprint recognition method based on gender, nationality and emotion information | |
CN110473566A (en) | Audio separation method, device, electronic equipment and computer readable storage medium | |
CN107785018B (en) | Multi-round interaction semantic understanding method and device | |
CN108255840B (en) | Song recommendation method and system | |
CN103500579B (en) | Audio recognition method, Apparatus and system | |
CN110189757A (en) | A kind of giant panda individual discrimination method, equipment and computer readable storage medium | |
CN106887225A (en) | Acoustic feature extracting method, device and terminal device based on convolutional neural networks | |
CN109859772A (en) | Emotion identification method, apparatus and computer readable storage medium | |
CN105224581B (en) | The method and apparatus of picture are presented when playing music | |
CN107679031B (en) | Advertisement and blog identification method based on stacking noise reduction self-coding machine | |
CN105741835A (en) | Audio information processing method and terminal | |
CN111462774B (en) | Music emotion credible classification method based on deep learning | |
Rumagit et al. | Model comparison in speech emotion recognition for Indonesian language | |
CN104240719A (en) | Feature extraction method and classification method for audios and related devices | |
CN109002529A (en) | Audio search method and device | |
Vrysis et al. | Mobile audio intelligence: From real time segmentation to crowd sourced semantics | |
JP2015001695A (en) | Voice recognition device, and voice recognition method and program | |
CN104239372B (en) | A kind of audio data classification method and device | |
CN106708890A (en) | Intelligent high fault-tolerant video identification system based on multimoding fusion and identification method thereof | |
CN110019556A (en) | A kind of topic news acquisition methods, device and its equipment | |
CN110708619B (en) | Word vector training method and device for intelligent equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |