GB2469499A - Labelling an audio file in an audio mining system and training a classifier to compensate for false alarm behaviour. - Google Patents


Info

Publication number
GB2469499A
Authority
GB
United Kingdom
Prior art keywords
audio
classifier
audio file
text
term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0906551A
Other versions
GB0906551D0 (en)
Inventor
Keith Michael Ponting
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aurix Ltd
Original Assignee
Aurix Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aurix Ltd filed Critical Aurix Ltd
Priority to GB0906551A priority Critical patent/GB2469499A/en
Publication of GB0906551D0 publication Critical patent/GB0906551D0/en
Publication of GB2469499A publication Critical patent/GB2469499A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/685 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using automatically derived transcript of audio data, e.g. lyrics
    • G06F17/27
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197 Probabilistic grammars, e.g. word n-grams
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting

Abstract

A computer based method of training a classifier to label an audio file in an audio mining system adapted to detect target words in the audio file, the method comprising analyzing one or more labelled text files to generate an output data set representative of the labelled text file, training a classifier based on the output data set, and modifying the classifier to compensate for false alarm behaviour from an audio miner, thereby to enable labelling of an audio file. The output data set comprises an N-gram data set.

Description

Text based training for keyword spotting in an audio classifier
Field of invention
The invention relates, but not exclusively, to a method of training an audio classifier using text data and utilising the audio classifier in conjunction with an audio miner to label audio data.
Background to the invention
Classification of audio data is well known, in particular in the field of spoken document retrieval, and there have been many efforts within the speech community in the area of spoken document search.
"Topic spotting" is often used to classify audio data. Topic spotting is based on the principle that a document can be described as belonging to one or more categories and the task of topic spotting is to determine which category or categories apply. This enables the data to be indexed and labelled according to the content of the audio file.
It is known to use a Large-Vocabulary Continuous Speech Recognition (LVCSR) classifier to determine the content of the audio. LVCSRs are heavily dependent on the inputted terms for the classification. In particular, the inability to recognise potentially significant terms or groups of terms that have not been inputted may result in misclassification of an audio file.
Additionally, it is often impractical to obtain sufficient quantities of labelled audio material relating to the particular field of interest. When such information is unavailable, a user would be required to collect and transcribe the contents of similar audio files, which may not be readily available; this is also potentially computationally and resource intensive.
Audio topic spotting research has historically largely been driven by the DARPA topic detection and tracking competitions (TDT) based on broadcast news material.
Van Mulbregt et al. describes Dragon Systems' approach to that task, incorporating automatic segmentation by topic as well as topic tracking. The speech recognition system had a 30% word error rate, but even so the performance of the speech-based system was very similar to the text-based performance obtained on the closed captions for the material given the correct story boundaries. Where the story boundaries had to be detected automatically, the speech system performed significantly worse.
Known problems with TDT include:
- most of the speech recognisers had been trained and optimised for the same domain (Broadcast News), so presented an optimistic picture of performance when compared to other domains;
- the "topics" specified are at a very fine level;
- as is common with speech recognition, the availability of suitable databases is a known limiting factor.
Myers et al. describes work based on the Switchboard corpus and its topic labellings, noting in passing that the first five speaker turns should be removed because the artificial nature of that database means that those turns often explicitly name the topic in question.
Therefore, to mitigate at least some of these problems in the prior art there is provided a method of training an audio based classifier that is able to utilise readily available resources to identify key terms with which to identify and classify audio data.
The World Wide Web makes the collection of labelled text material a relatively simple task. The applicant has beneficially realised that despite the inherent incompatibility between the two disparate forms of data such a resource of labelled text material may be used to train an audio topic spotter to label audio material. This is in addition to the known benefit that collecting such data allows the generation of appropriate search terms for a "phonetic" audio mining system or the extension of the vocabulary and language model of an LVCSR system to include the terms most useful for topic discrimination.
There is according to a first aspect of the invention provided a computer based method of training a classifier to label an audio file in an audio mining system adapted to detect target words in the audio file, the method comprising: analysing one or more labelled text files to generate an output data set representative of the labelled text file, training a classifier based on the output data set, and modifying the classifier to compensate for false alarm behaviour from an audio miner, thereby to enable labelling of an audio file.
Preferably wherein the output data set comprises an N-gram data set. More preferably wherein the analysis step uses a fast correlation based filter (FCBF) technique to optimise the N-grams for relevant and non-redundant N-grams; preferably using a relaxed or modified filtering algorithm. Even more preferably wherein the relaxed FCBF technique comprises the steps of: computing a "symmetrical uncertainty" as a measure of correlation between term and class for each candidate term; ordering the terms by class uncertainty; truncating the list at a given threshold γ; and repeatedly: a) taking the most informative term in the list as a new "predominant" term; b) retaining a term if its correlation with the predominant term is less than β times its correlation with class; c) adding the predominant term to the set of retained terms; until no further retained terms remain with class uncertainty greater than γ.
Preferably wherein γ is taken as the uncertainty of the term at position N/log_e(N) in the list, where N is the number of items in the list and β = 30.
Preferably wherein the classifier is based on a support vector machine. Preferably wherein the support vector machine is modified in accordance with an expected false alarm rate obtained from known characteristics of the audio mining system and the selected output data set. More preferably wherein the support vector machine is modified by adjusting thresholds.
According to a second aspect of the invention there is also provided a method of labelling an audio file comprising the steps of: using an audio miner to search the audio file for one or more search terms, identifying hits within the audio file, or portions thereof, containing one of the search terms with a reasonable level of probability, passing details of the hits of search terms found within the audio file to a text classifier, wherein the text classifier is adapted to enable labelling of the audio file based on classification of the hits.
Preferably wherein the text classifier is trained using labelled text files.
According to yet another aspect of the invention there is provided an audio mining system adapted to label an audio file, the system comprising a processor having a text classifier function and an audio mining function, an input for an audio file in communication with the processor thereby to enable analysis of the audio file by the audio miner function to identify hits within the audio file of one or more predetermined search terms, wherein the hits are communicated to the text classifier which is adapted to analyse the hits and to generate one or more category labels for the audio file, and preferably wherein the audio file is stored in a memory storage with the associated label or labels.
Preferably wherein the processor is adapted to train the text classifier by analysing one or more labelled text files to generate an output data set representative of the labelled text file, training a classifier based on the output data set, and modifying the classifier to compensate for false alarm behaviour from an audio miner, thereby to enable labelling of an audio file.
Further aims and aspects of the invention will be apparent from the appended claims.
It should be noted that reference herein to "audio" and "text", in terms of data files, input material, classification, handling and/or storage or use howsoever, refers to the primary form of the data being representative of either an audio file capable of sonic reproduction straightforwardly from the source form, or a text data file capable of visual reproduction straightforwardly from the source form.
Brief description of the drawings
An embodiment of the invention is now described, by way of example only, with reference to the accompanying drawing in which: Figure 1 is a flow diagram of the process of training an audio classifier with text data according to an aspect of the invention; Figure 2 is a flow diagram of the process of classifying audio with a trained audio classifier according to an aspect of the invention; and Figure 3 is a schematic representation of the apparatus.
Detailed description of an embodiment
Figure 1 describes the overall process according to an aspect of the invention.
The system according to an aspect of the invention implements the steps of:
1. Obtaining enough text data labelled according to the desired categories;
2. Extracting all possible N-gram search terms;
3. Reducing the set of search terms using a combination of:
   a. a minimum search term length (a heuristic constraint, in the preferred embodiment 5 phonemes);
   b. a relaxed version of fast correlation based filter (FCBF) selection (although other filtering algorithms may be used in further embodiments);
4. Training a Support Vector Machine (SVM) classifier based on the occurrences of the reduced set of search terms in the labelled material;
5. Modifying the SVM classifier to account for the noise introduced by false alarms;
6. Using the classifier to generate labels for spoken audio material.
There is shown in Figure 1 the step of obtaining the labelled text data at step S102, extracting an N-gram dataset at step S104, filtering the dataset with a modified FCBF algorithm at step S106, creating a reduced N-gram dataset from the FCBF algorithm at step S108, training an SVM text classifier using the reduced N-gram dataset at step S110, calculating an expected false alarm rate at step S112, modifying the SVM thresholds at step S114, inputting the unlabelled audio data at step S116 and classifying the unlabelled audio data with the modified SVM at step S118.
At step S102 the labelled text data is obtained. Text data has the advantage of being readily available from a number of different resources, e.g. the Internet. The text data that is inputted will have been previously classified according to the content of the text. Such classification is preferably a hierarchical classification which allows for the categorisation of the input text from a generalised category, e.g. automobiles, travel etc., to more specific sub-categories, e.g. SUVs, Paris etc. These sub-categories are also called nodes. From the inputted text, keywords and terms are extracted and associated with the labels. The most specific sub-categories represent the finest possible classification, and are called "leaf nodes"; any other nodes represent possible decisions to be made and are called "decision nodes".
At step S104 N-grams are extracted from the dataset. Any given text based document may potentially contain thousands of N-grams. Within the overall goal of training on the text data it is clear that word sequences not present in the text data cannot (easily) be included in the training process. Therefore the philosophy of the N-gram approach is to consider all reasonable sequences of words in the training data, applying filtering operations to reduce that set to a manageable size.
The process of extracting the N-grams comprises the following steps (a sketch is given below):
1. apart from special cases such as hyphens and within-word apostrophes, break up the text at punctuation and at any item containing a digit or other non-word character;
2. for each of the sequences of words so generated, form all possible N-grams up to some pre-determined limit on N;
3. discard any N-grams which:
   - begin or end with a stop word (in the preferred embodiment, N-gram-internal "stop" words are allowed, as the aim is to produce search terms which people might speak);
   - do not occur more than a specified number of times (pruning);
   - have a minimum path length of fewer than a specified number of phonemes (preferably five, though in further embodiments this may be varied).
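By way of illustration only, the extraction steps above might be sketched in Python as follows. The stop-word list here is a small illustrative subset, and phoneme_length is a hypothetical stand-in for the pronunciation-dictionary lookup a real phonetic system would use to enforce the minimum path length.

```python
import re
from collections import Counter

# Illustrative stop-word subset; a real system would use a fuller list.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}

def phoneme_length(ngram: str) -> int:
    # Hypothetical stand-in: a real phonetic system would look each word
    # up in a pronunciation dictionary; here we crudely approximate the
    # phoneme count from the letter count.
    return sum(max(1, len(word) - 1) for word in ngram.split())

def extract_ngrams(text: str, max_n: int = 3, min_count: int = 2,
                   min_phonemes: int = 5) -> Counter:
    # Step 1: break the text at punctuation and at any token containing a
    # digit; hyphens and within-word apostrophes are kept intact.
    segments = re.split(r"[^\w\s'-]|\b\w*\d\w*\b", text.lower())
    counts = Counter()
    for segment in segments:
        words = segment.split()
        # Step 2: form all N-grams up to max_n from each word sequence.
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Step 3: discard N-grams beginning or ending with a stop
                # word (internal stop words are allowed).
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                counts[" ".join(gram)] += 1
    # Step 3 continued: prune rare N-grams and those whose (approximate)
    # phoneme length falls below the minimum.
    return Counter({g: c for g, c in counts.items()
                    if c >= min_count and phoneme_length(g) >= min_phonemes})
```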
Therefore at step S104 an N-gram dataset is created that contains a large number of N-grams extracted from the inputted text data. However, no consideration has yet been made regarding the relevancy of the extracted N-grams to the text categories.
Therefore, at step S106 the N-gram dataset is filtered using a modified FCBF algorithm. This algorithm both ranks the terms based on the information they provide for the task of topic classification and reduces the size of highly correlated sets of terms which actually add no further information.
For each node in the category hierarchy (e.g. SUV) the terms that are associated with the category are expected to show a level of correlation which would not be expected across different categories, e.g. the terms used to describe an SUV would be very different to those used to describe, say, Paris. The N-gram dataset for each node is filtered by performing the following steps (see the sketch after this list):
1. Compute the "symmetrical uncertainty" as a measure of correlation between term and class for each candidate term;
2. Order the terms by class uncertainty (from the most informative to the least);
3. Truncate the list at a given threshold γ, removing all terms no more informative than that threshold. In the preferred embodiment the value of γ is taken as the uncertainty of the term at position N/log_e(N) in the list;
4. Repeatedly:
   a) take the first term (remaining) in the list as the new "predominant" term;
   b) retain a term if its correlation with the predominant term is less than β times its correlation with class;
   c) add the predominant term to the set of retained terms.
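As a minimal sketch of the relaxed selection loop, assuming the symmetrical-uncertainty values of equations (1), (8) and (9) below have already been computed (the helper names su_class and su_pair are illustrative, not taken from the patent):

```python
import math

def relaxed_fcbf(su_class, su_pair, beta=30.0):
    """Relaxed FCBF term selection.

    su_class -- dict mapping each candidate term to its symmetrical
                uncertainty with the class variable
    su_pair  -- function (term_a, term_b) -> symmetrical uncertainty
                between the two terms' indicator variables
    """
    # Steps 1-2: rank candidate terms by class uncertainty,
    # most informative first.
    ranked = sorted(su_class, key=su_class.get, reverse=True)
    # Step 3: truncate at gamma, the uncertainty of the term at
    # position N / ln(N).
    n = len(ranked)
    cut = min(n - 1, int(n / math.log(n))) if n > 1 else 0
    gamma = su_class[ranked[cut]]
    ranked = [t for t in ranked if su_class[t] > gamma]
    retained = []
    # Step 4: repeatedly promote the most informative remaining term.
    while ranked:
        predominant = ranked.pop(0)      # 4a) new predominant term
        retained.append(predominant)     # 4c) add it to the retained set
        # 4b) keep a term only if its correlation with the predominant
        # term is less than beta times its correlation with the class.
        ranked = [t for t in ranked
                  if su_pair(predominant, t) < beta * su_class[t]]
    return retained
```

With beta = 1 the loop reduces to the standard FCBF algorithm, as noted later in the text.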
From Yu and Liu the symmetrical uncertainty is calculated as:

$$SU(X,Y) = \frac{2\,I(X;Y)}{H(X) + H(Y)} \qquad (1)$$

where H is the entropy estimated from the sample and I is the mutual information between the two variables (X and Y).
The entropy of a variable is given as:

$$H(X) = -\sum_i P(X = a_i)\,\log_2 P(X = a_i) \qquad (2)$$

However, when two variables are involved it is reasonable to ask the questions: given a known observation on Y, say b_k, what can be said about the uncertainty in X? This is given by the entropy of the posterior distribution:

$$H(X \mid Y = b_k) = -\sum_i P(X = a_i \mid Y = b_k)\,\log_2 P(X = a_i \mid Y = b_k) \qquad (3)$$

If it is planned to observe Y, what is the expected uncertainty that will remain about X after the observation has been taken? This is given by the conditional entropy:

$$H(X \mid Y) = \sum_k P(Y = b_k)\,H(X \mid Y = b_k) = -\sum_{i,k} P(X = a_i, Y = b_k)\,\log_2 P(X = a_i \mid Y = b_k) \qquad (4)$$

Joint and conditional entropies satisfy the following chain rule:

$$H(X,Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y) \qquad (5)$$

The mutual information will have the form:

$$I(X;Y) = H(X) - H(X \mid Y) \qquad (6)$$

It satisfies I(X;Y) = I(Y;X), I(X;Y) ≥ 0, and has the alternative symmetric form:

$$I(X;Y) = H(X) + H(Y) - H(X,Y) \qquad (7)$$

The calculation of the symmetrical uncertainty involves the explicit calculation of the entropy between pairs of variables, which is non-trivial. Further, the computation of probabilities for word occurrence is also complicated by "caching" effects, particularly for nouns.
Therefore, in the preferred embodiments the invention computes entropies as follows: it uses sample entropies in which probabilities are replaced by observed frequencies; it replaces the observed counts by a simple indicator variable V_t for the presence or absence of term t in a document; and it ignores any effect of document length on the presence/absence probability.
Therefore the mutual information becomes:

$$I(C; V_t) = \sum_k \left[ \frac{n_{tk}}{N}\,\log_2 \frac{n_{tk}\,N}{n_t\,n_k} + \frac{n_k - n_{tk}}{N}\,\log_2 \frac{(n_k - n_{tk})\,N}{(N - n_t)\,n_k} \right] \qquad (8)$$

where: C denotes the random class variable; n_k the number of documents in class k; n_t the number of documents containing term t; n_{tk} the number of documents in class k containing term t; and N = Σ_k n_k the total number of documents.
Similarly, the mutual information between two terms t and r is given by:

$$I(V_t; V_r) = \frac{n_{rt}}{N}\log_2\frac{n_{rt}\,N}{n_t\,n_r} + \frac{n_t - n_{rt}}{N}\log_2\frac{(n_t - n_{rt})\,N}{n_t\,(N - n_r)} + \frac{n_r - n_{rt}}{N}\log_2\frac{(n_r - n_{rt})\,N}{(N - n_t)\,n_r} + \frac{N - n_t - n_r + n_{rt}}{N}\log_2\frac{(N - n_t - n_r + n_{rt})\,N}{(N - n_t)(N - n_r)} \qquad (9)$$

where n_{rt} denotes the number of documents containing both terms t and r.
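Under the presence/absence simplifications above, equations (1), (8) and (9) reduce to arithmetic on document counts. The following Python sketch (function names are illustrative) computes them directly:

```python
import math

def _h_indicator(n_t, N):
    """Entropy H(V_t) of the presence/absence indicator for term t."""
    p = n_t / N
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def mi_term_class(n_k, n_tk, n_t, N):
    """I(C; V_t) from document counts, equation (8).
    n_k, n_tk -- parallel sequences of per-class document counts and
                 per-class counts of documents containing term t."""
    mi = 0.0
    for nk, ntk in zip(n_k, n_tk):
        if ntk > 0:
            mi += (ntk / N) * math.log2(ntk * N / (n_t * nk))
        if nk - ntk > 0:
            mi += ((nk - ntk) / N) * math.log2(
                (nk - ntk) * N / ((N - n_t) * nk))
    return mi

def mi_term_term(n_t, n_r, n_rt, N):
    """I(V_t; V_r) from document counts, equation (9)."""
    # Each cell of the 2x2 presence/absence table: (joint count,
    # marginal count for t, marginal count for r).
    cells = [(n_rt, n_t, n_r),
             (n_t - n_rt, n_t, N - n_r),
             (n_r - n_rt, N - n_t, n_r),
             (N - n_t - n_r + n_rt, N - n_t, N - n_r)]
    return sum((c / N) * math.log2(c * N / (mt * mr))
               for c, mt, mr in cells if c > 0)

def su_term_term(n_t, n_r, n_rt, N):
    """Symmetrical uncertainty of equation (1) for two term indicators."""
    denom = _h_indicator(n_t, N) + _h_indicator(n_r, N)
    return 2.0 * mi_term_term(n_t, n_r, n_rt, N) / denom if denom > 0 else 0.0
```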
-10 -The "fast" part of FCBF refers to the fact that the estimation of mutual entropy in equation 9 is between all pairs of terms is expensive. The FCBF algorithm significantly reduces the number of pairs for which that computation is required.
The uncertainty calculated using equation (1) for each candidate term is used to order the terms and to eliminate those candidate terms that have less than the minimum amount of mutual information between term and document class. The low mutual information means that they are unlikely to be key terms that would describe the category of the text document.
Once the terms that have little or no mutual information are removed, the remaining terms are then correlated with the new predominant term at steps 4a), b) and c). These iterative steps identify new predominant terms which are highly likely to provide additional key words or key terms for classifying the text.
It must be noted that at step 4b) the standard FCBF algorithm has a value of β = 1. However, the application of such an algorithm to data is found to return a low number of retained terms, and therefore a relaxed version, with relaxation factor β, is used. It is found experimentally that a value of β = 30 provides optimal results.
The retained terms therefore form the reduced N-gram set at step S108, which by definition will also comprise the most informative but uncorrelated N-gram terms.
From steps S104 and S106 the number of N-grams is greatly reduced and the remaining terms are those which may be considered to be keywords.
At step S110 an SVM classifier is trained using the reduced N-gram dataset of step S108. It is known to use SVM techniques to classify text data; for example, Joachims (1998) details the implementation of an SVM classifier for text. Such techniques are applied here to the text N-gram dataset.
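The patent does not prescribe a particular SVM implementation. As a sketch only, the classifier for one decision node could be trained with scikit-learn's LinearSVC on observation vectors normalised to unit Euclidean length, anticipating equations (10) and (11) below:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_node_svm(counts: np.ndarray, labels: np.ndarray) -> LinearSVC:
    """Train the linear SVM for one decision node.

    counts -- array of shape (documents, retained N-grams) holding the
              weighted occurrence counts from the labelled text
    labels -- the category label of each document at this node
    """
    # Normalise each observation vector to unit Euclidean length,
    # as in equation (10) below.
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    feats = counts / np.maximum(norms, 1e-12)
    svm = LinearSVC()   # linear kernel, matching equation (11)
    svm.fit(feats, labels)
    return svm
```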
However, the techniques which are implemented at step S110 are idealised for text-based classification. In a text system the identification of a match between an input term and a search term is relatively trivial and may be determined with a very high degree of certainty.
Additionally, the "length" of a document is reasonably well-defined. However, in audio mining such assumptions are no longer true and accordingly this must be considered when utilising an SVM classifiers for audio data.
With a phonetic audio mining system, any search term is allowed to match any portion of speech, which potentially leads to a large number of false alarms. This problem is mitigated by the presence of confidence scores, which it is hoped will be low for false alarms and high for true hits.
In terms of a support vector machine, the feature vector counts will be corrupted by the presence of some number of false alarms, which effectively dilutes the information given by each occurrence of each search term. In the preferred embodiment, the method for compensating for the false alarms is to compensate for the "noise" in the dataset by calculating the expected false-alarm rate and modifying the SVM threshold accordingly. At step S112 the expected false alarm rate is calculated, and it is compensated for in the SVM at step S114.
The standard SVM text topic spotting process normalises observation vectors to unit Euclidean length, as described at step S110. That is, for a set of weighted counts f, the final feature vector is:

$$\hat{\mathbf{f}} = \frac{\mathbf{f}}{\lVert \mathbf{f} \rVert} \qquad (10)$$

For each SVM decision, that vector is then combined with the kernel function k operating on the feature vector and each support vector x_i. The weighted combination is compared with a threshold ρ; classification then uses the sign of that difference, which can be simplified for a linear kernel as:

$$\sum_i \alpha_i\,k(\mathbf{x}_i, \hat{\mathbf{f}}) - \rho = \mathbf{v} \cdot \hat{\mathbf{f}} - \rho \qquad (11)$$

where v does not depend on the test feature vector. If the weighted counts f are corrupted by noise,

$$\tilde{\mathbf{f}} = \mathbf{f} + \mathbf{n} \qquad (12)$$

then that will affect the decisions taken by the support vector machine. In particular, as the noise level becomes large, the normalised noisy vector will tend to a constant depending only on the weights applied to the raw counts.
Direct compensation for noise is complicated by the normalisation step in equation (10). That normalisation has several purposes: avoiding a dependency of feature values on document length; bringing the range of values within [0, 1] to reduce arithmetic issues; and constraining distances to look like angles in high-dimensional space, so that, for example, a long document comprising many repeats of the same text would be treated as identical to a document containing only one repetition of that text.
To compensate for the noise the applicant has beneficially realised that replacing the denominator of equation (10) by a scaled audio duration overcomes the dependency on document length. Thus the modified noisy vector is:

$$\frac{\mathbf{f} + \mathbf{n}}{s\,d} \qquad (13)$$

where d is the audio duration and s is a linear scaling factor to help address the remaining purposes. It was found experimentally that s = 3 provided optimum performance on the initial data set; this may change in subsequent implementations.
Given that approximation, the linear SVM classification now depends on:

$$\frac{\mathbf{f} \cdot \mathbf{v}}{s\,d} \gtrless \rho' \qquad (14)$$

where

$$\rho' = \rho - \frac{\mathbf{n} \cdot \mathbf{v}}{s\,d} \qquad (15)$$

depends only on the noise level and the SVM, and not on the true (unobservable) feature counts f. Note that for multi-way classification using the "one-against-one" system there are N(N−1)/2 values of ρ for an N-class decision. It is found experimentally that modification of the value of ρ based on a tuning set of audio data results in improvements over pure text training. However, those modifications still required a separate tuning set, and tuning operations were required separately for each SVM.
This demonstrates that compensation may be achieved by modifying the support vector machine thresholds. The invention provides a method of computing the noise.
The ideal way to achieve this is to compute or estimate n and then compute ρ' using equation (15).
However, it may be more convenient to solve equation (13) for f, subtracting the estimate of n from a scaled version of each noisy feature vector. This approach avoids any need to manipulate the SVM. The value for n can be determined experimentally, and is found to lie between 0.06 and 1 false alarms per hour. These modified vectors are utilised in the SVM classifier and have the advantage of taking into account the noise or errors introduced by the audio mining that are not present when analysing text data.
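A minimal sketch of both compensation routes follows, assuming the per-term false-alarm rate is converted to an expected noise count by multiplying by the audio duration (an assumption consistent with equation (13)):

```python
import numpy as np

def adjusted_threshold(v, rho, fa_per_hour, d_hours, s=3.0):
    """Compute rho' = rho - (n . v)/(s d), equation (15).

    v           -- linear SVM weight vector (one weight per search term)
    rho         -- the SVM threshold trained on text
    fa_per_hour -- expected false alarms per hour for each search term
                   (the text reports roughly 0.06-1 per hour)
    d_hours     -- duration of the audio document in hours
    """
    n = np.asarray(fa_per_hour) * d_hours   # expected false-alarm counts
    return rho - float(np.dot(n, v)) / (s * d_hours)

def denoised_feature(noisy_counts, fa_per_hour, d_hours, s=3.0):
    """The alternative described above: solve equation (13) for f by
    subtracting the noise estimate from the noisy counts, so that the
    SVM itself need not be modified."""
    n = np.asarray(fa_per_hour) * d_hours
    return (np.asarray(noisy_counts) - n) / (s * d_hours)
```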
At step S116 unlabelled audio data is inputted into the audio miner so that it may be labelled. In the preferred embodiment the audio miner is an Aurix (RTM) audio miner program as described in European Patent Application No. 06250588.8, "Methods and apparatus relating to searching of spoken audio data". The method of audio mining to identify words or phrases from phonemes is as described in the aforementioned patent application.
At step S118 the modified SVM classifier is used to classify the audio data. The modified SVM will have identified a number of N-grams for a number of different decision nodes. The process of identifying N-grams for each text input is repeated over several tens or hundreds of input documents, with preferably several tens of documents for each node or label. These N-grams are the search terms used to determine the content of the audio data and label the audio file. For each decision node in the category hierarchy, the training process at steps S102-S110 will have identified the most relevant and useful set of N-grams, typically some thousands of terms per node. Those sets of N-grams are the search terms processed by the audio mining engine to produce hit counts and thus the feature vector for the unlabelled audio data. That feature vector is then processed by the SVM to determine the most appropriate sub-category label or labels. Some SVM implementations are such that "no clear winner" is a possible decision; in such cases either no label or multiple likely labels may be attached to the audio data.
The audio data is mined by the audio miner and analysed by the classifier to determine the audio content of the file and to assign a tag at step S118. This process is described in more detail in Figure 2.
Figure 2 shows the process of classifying the audio data with the modified SVM classifier. This process assumes that the steps of Figure 1 have been performed thereby creating a text trained audio classifier.
There is shown the step of inputting the unlabelled audio into the audio miner program at step S202, the audio miner program identifying words at step S204, comparing the identified words to the list of search terms at step S206, deciding if the identified word(s) are on the list of search terms at step S208, adding one to the count for the particular term at step S210, comparing the next term at step S212, feeding the positive hits into the classifier at step S214 and determining and labelling the audio data at step S216.
For a phonetic audio miner, this process can be simplified, as such a system can be configured to search only for the desired list of search terms, so that each search hit found with confidence above the specified threshold causes 1 to be added to the corresponding count, omitting steps S206 and S208. In an alternative implementation the increment to the count may be a floating point number, preferably between 0 and 1, depending on the confidence score, so that low confidence hits contribute little or nothing to the counts.
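As an illustrative sketch of the two counting schemes (the (term, confidence) pair format is assumed for illustration, not taken from the Aurix API):

```python
from collections import defaultdict

def accumulate_hits(hits, threshold=0.5, soft=False):
    """Turn audio-miner output into per-term counts.

    hits -- iterable of (search_term, confidence) pairs
    soft -- if True, use the alternative floating-point scheme
    """
    counts = defaultdict(float)
    for term, confidence in hits:
        if soft:
            # Alternative scheme: add a float in [0, 1] so that
            # low-confidence hits contribute little or nothing.
            counts[term] += max(0.0, min(1.0, confidence))
        elif confidence >= threshold:
            counts[term] += 1.0   # hard count above the threshold
    return counts
```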
The unlabelled audio data is fed into the audio miner at step S202, and the audio miner program identifies the phonemes of the audio in the file and creates a searchable data file. This data file is searched for the list of terms / N-grams as determined by the modified SVM at step S114 in Figure 1. Depending on the knowledge of the content of the audio data, the number of N-grams searched for in the audio data can be varied, e.g. if it is known that the audio file is related to travel then only the N-grams extracted from nodes that relate to travel would be searched, and not those that relate to, say, sport.
The searching for the relevant N-grams occurs at step S208. The method of searching the searchable audio data file for the relevant list of N-gram terms is preferably a dynamic programming method as described in European Patent Application No. 06250588.8.
At step S208 the decision as to whether a hit for a particular N-gram is found in the searchable audio file is made. The order of the N-gram may be varied (i.e. N = 1, 2, ..., n), corresponding to the N-gram comprising one or more words. The audio miner returns a likelihood of a match for a particular word or sequence of words, so that a decision of a match based on the determined likelihood can be made. If a match occurs for a particular N-gram, a count for the number of "hits" or matches to that N-gram is set up if necessary and one is added to the count. Therefore, the invention determines the number of times the N-gram is present in the audio file.
Once one has been added to the number of hits for the N-gram, or there is no match, the next term is compared at step S212. This process is repeated until all N-gram terms (which may comprise several thousands of N-grams) have been searched for in the searchable data file.
It is important to note that a given N-gram may relate to one or more categories, e.g. if bigrams (i.e. N = 2) are used, the term "Birmingham car" may be found to relate to, say, tourism, but it may also relate to automobiles. Therefore, the presence of a particular N-gram does not automatically allow for the classification and labelling of the audio data. Additionally, it is found that not all detected N-grams will be relevant to the audio data. For example, a passing reference to a particular town may occur in an audio file and would be picked up as relating to the decision node for the town, but the town may not be relevant to the overall content of the file.
At step S214 the information regarding the N-grams and the number of times they are present in the audio file (as determined at step S210) is fed back into the text classifier. For a particular decision node the most relevant N-grams will have been determined during the classifier training process by steps S104-S108. The counts for those N-grams are then assembled into a feature vector and processed by the SVM associated with that decision node, resulting in a decision for that node. In a complex hierarchy, the decision may correspond to a further set of choices, and the process of forming the feature vector and making the decision is iterated until a leaf node is reached or there is insufficient information to make a clear decision. In an alternative implementation the searching process is interleaved with that decision process, choosing alternative sets of search terms to correspond to each decision node encountered.
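A sketch of that iterated decision process follows; the node attributes used here are illustrative, since the patent does not fix a data structure for the hierarchy:

```python
def label_audio(root, counts):
    """Iterate the node-by-node SVM decisions down the category hierarchy.

    root   -- the top decision node; each node is assumed to expose
              .terms (its retained N-grams), .svm_decide(feature_vector)
              returning a child key or None when there is no clear winner,
              .children (dict of child nodes) and .label
    counts -- per-term hit counts from the audio miner
    """
    node, labels = root, []
    while node.children:                       # stop at a leaf node
        # Assemble the feature vector from this node's retained N-grams.
        feature = [counts.get(t, 0.0) for t in node.terms]
        choice = node.svm_decide(feature)
        if choice is None:                     # "no clear winner"
            break
        node = node.children[choice]
        labels.append(node.label)
    return labels
```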
The audio file is then labelled with the appropriate label at step S216. This label is preferably in the form of metadata associated with the file.
Figure 3 is a schematic of the apparatus used according to an aspect of the present invention. There is shown the audio channel 10, a system such as a computer 20 comprising an audio miner 22, a processor 24, an audio classifier 26, and storage 28.
Of course, the functionality of the audio classifier 26 and the audio miner 22 might be achieved substantially by the processor 24 itself, depending on the configuration of the system 20. Various text inputs 30 are also shown.
Audio data is sent to the audio miner 22 via the audio channel 10. The audio channel may be a form of audio input from an external device e.g. a SCART cable or Phono plug etc., or a link to a digital file e.g. an mp3 stored on some form of memory.
The audio miner 22, processor 24, audio classifier 26, and storage 28 are held in a computer 20. The computer 20 is a device known in the art, e.g. a desktop device, though in further embodiments it may be a network of computers connected to a central server, the Internet, etc. The processor 24 and storage 28 are known in the art. The audio miner 22 is preferably an Aurix audio miner, though other audio miners may be used. It is found that the equations and techniques presented in the specification are particularly suited to phonetic audio miners.
The audio classifier 26 preferably comprises the modified SVM classifier as described in detail with reference to Figure 1. In further embodiments the classifier may be other classifiers e.g. a standard SVM classifier.

Claims (12)

  1. A computer based method of training a classifier to label an audio file in an audio mining system adapted to detect target words in the audio file, the method comprising: analysing one or more labelled text files to generate an output data set representative of the labelled text file, training a classifier based on the output data set, and modifying the classifier to compensate for false alarm behaviour from an audio miner, thereby to enable labelling of an audio file.
  2. A computer based method according to claim 1 wherein the output data set comprises an N-gram data set.
  3. A computer based method according to claim 2 wherein the analysis step uses a fast correlation based filter (FCBF) technique to optimise the N-grams for relevant and non-redundant N-grams; preferably using a relaxed or modified filtering algorithm.
  4. A computer based method according to claim 3 wherein the relaxed FCBF technique comprises the steps of: computing a "symmetrical uncertainty" as a measure of correlation between term and class for each candidate term; ordering the terms by class uncertainty; truncating the list at a given threshold γ; and repeatedly: a) taking the most informative term in the list as a new "predominant" term; b) retaining a term if its correlation with the predominant term is less than β times its correlation with class, and c) adding the predominant term to the set of retained terms, until no further retained terms remain with class uncertainty greater than γ.
  5. A computer based method according to claim 4 wherein γ is taken as the uncertainty of the term at position N/log_e(N) in the list, where N is the number of items in the list and β = 30.
  6. A computer based method according to any preceding claim wherein the classifier is based on a support vector machine.
  7. A computer based method according to claim 6 wherein the support vector machine is modified in accordance with an expected false alarm rate obtained from known characteristics of the audio mining system and the selected output data set.
  8. A computer based method according to claim 7 wherein the support vector machine is modified by adjusting thresholds.
  9. A method of labelling an audio file comprising the steps of: using an audio miner to search the audio file for one or more search terms, identifying hits within the audio file or portions thereof containing with a reasonable level of probability one of the search terms, passing details of the hits of search terms found within the audio file to a text classifier, wherein the text classifier is adapted to enable labelling of the audio file based on counts of the hits.
  10. A method according to claim 9 wherein the text classifier is trained using labelled text files.
  11. An audio mining system adapted to label an audio file, the system comprising a processor having a text classifier function and an audio mining function, an input for an audio file in communication with the processor thereby to enable analysis of the audio file by the audio miner function to identify hits within the audio file of one or more predetermined search terms, wherein the hits are communicated to the text classifier which is adapted to analyse the hits and to generate a term for labelling the audio file, and preferably wherein the audio file is stored in a memory storage with the associated label.
  12. An audio mining system according to claim 10, wherein the processor is adapted to train the text classifier by analysing one or more labelled text files to generate an output data set representative of the labelled text file, training a classifier based on the output data set, and modifying the classifier to compensate for false alarm behaviour from an audio miner, thereby to enable labelling of an audio file.
GB0906551A 2009-04-16 2009-04-16 Labelling an audio file in an audio mining system and training a classifier to compensate for false alarm behaviour. Withdrawn GB2469499A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0906551A GB2469499A (en) 2009-04-16 2009-04-16 Labelling an audio file in an audio mining system and training a classifier to compensate for false alarm behaviour.

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0906551A GB2469499A (en) 2009-04-16 2009-04-16 Labelling an audio file in an audio mining system and training a classifier to compensate for false alarm behaviour.

Publications (2)

Publication Number Publication Date
GB0906551D0 GB0906551D0 (en) 2009-05-20
GB2469499A true GB2469499A (en) 2010-10-20

Family

ID=40750699

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0906551A Withdrawn GB2469499A (en) 2009-04-16 2009-04-16 Labelling an audio file in an audio mining system and training a classifier to compensate for false alarm behaviour.

Country Status (1)

Country Link
GB (1) GB2469499A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239372A (en) * 2013-06-24 2014-12-24 浙江大华技术股份有限公司 Method and device for audio data classification

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0590925A1 (en) * 1992-09-29 1994-04-06 International Business Machines Corporation Method of speech modelling and a speech recognizer
EP1484745A1 (en) * 2003-06-03 2004-12-08 Microsoft Corporation Discriminative training of language models for text and speech classification
US20060100876A1 (en) * 2004-06-08 2006-05-11 Makoto Nishizaki Speech recognition apparatus and speech recognition method
US7292982B1 (en) * 2003-05-29 2007-11-06 At&T Corp. Active labeling for spoken language understanding
WO2008109665A1 (en) * 2007-03-08 2008-09-12 Nec Laboratories America. Inc. Fast semantic extraction using a neural network architecture
US20080249762A1 (en) * 2007-04-05 2008-10-09 Microsoft Corporation Categorization of documents using part-of-speech smoothing
US20090144277A1 (en) * 2007-12-03 2009-06-04 Microsoft Corporation Electronic table of contents entry classification and labeling scheme

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0590925A1 (en) * 1992-09-29 1994-04-06 International Business Machines Corporation Method of speech modelling and a speech recognizer
US7292982B1 (en) * 2003-05-29 2007-11-06 At&T Corp. Active labeling for spoken language understanding
EP1484745A1 (en) * 2003-06-03 2004-12-08 Microsoft Corporation Discriminative training of language models for text and speech classification
US20060100876A1 (en) * 2004-06-08 2006-05-11 Makoto Nishizaki Speech recognition apparatus and speech recognition method
WO2008109665A1 (en) * 2007-03-08 2008-09-12 Nec Laboratories America. Inc. Fast semantic extraction using a neural network architecture
US20080249762A1 (en) * 2007-04-05 2008-10-09 Microsoft Corporation Categorization of documents using part-of-speech smoothing
US20090144277A1 (en) * 2007-12-03 2009-06-04 Microsoft Corporation Electronic table of contents entry classification and labeling scheme

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239372A (en) * 2013-06-24 2014-12-24 浙江大华技术股份有限公司 Method and device for audio data classification
CN104239372B (en) * 2013-06-24 2017-09-12 浙江大华技术股份有限公司 A kind of audio data classification method and device

Also Published As

Publication number Publication date
GB0906551D0 (en)

Similar Documents

Publication Publication Date Title
CN106156204B (en) Text label extraction method and device
KR100388344B1 (en) Method and apparatus for retrieving audio information using content and speaker information
Tur et al. Combining active and semi-supervised learning for spoken language understanding
US11182435B2 (en) Model generation device, text search device, model generation method, text search method, data structure, and program
EP1462950A1 (en) Method of analysis of a text corpus
US9087297B1 (en) Accurate video concept recognition via classifier combination
US8533223B2 (en) Disambiguation and tagging of entities
US8150822B2 (en) On-line iterative multistage search engine with text categorization and supervised learning
JP2004005600A (en) Method and system for indexing and retrieving document stored in database
CN113268995B (en) Chinese academy keyword extraction method, device and storage medium
CN108228541B (en) Method and device for generating document abstract
JP2004133880A (en) Method for constructing dynamic vocabulary for speech recognizer used in database for indexed document
Prabowo et al. Hierarchical multi-label classification to identify hate speech and abusive language on Indonesian twitter
KR20160149050A (en) Apparatus and method for selecting a pure play company by using text mining
Hillard et al. Learning weighted entity lists from web click logs for spoken language understanding
CN115510500A (en) Sensitive analysis method and system for text content
JP2010118050A (en) System and method for automatically searching patent literature
CN112307364A (en) Character representation-oriented news text place extraction method
CN110348497A (en) A kind of document representation method based on the building of WT-GloVe term vector
GB2469499A (en) Labelling an audio file in an audio mining system and training a classifier to compensate for false alarm behaviour.
Ikonomakis et al. Text classification: a recent overview
GB2572320A (en) Hate speech detection system for online media content
Mirylenka et al. Linking IT product records
Dufour et al. Automatic error region detection and characterization in lvcsr transcriptions of tv news shows
Popova et al. Automatic stop list generation for clustering recognition results of call center recordings

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)