GB2457897A - Audio File Management, Search and Indexing Method and System - Google Patents

Audio File Management, Search and Indexing Method and System

Info

Publication number
GB2457897A
Authority
GB
United Kingdom
Prior art keywords
word
sound
audio
speech
indexing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0803554A
Other versions
GB0803554D0 (en)
Inventor
Felix Flomen
Ami Moyal
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
N S C NATURAL SPEECH COMM Ltd
Original Assignee
N S C NATURAL SPEECH COMM Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by N S C NATURAL SPEECH COMM Ltd filed Critical N S C NATURAL SPEECH COMM Ltd
Priority to GB0803554A priority Critical patent/GB2457897A/en
Publication of GB0803554D0 publication Critical patent/GB0803554D0/en
Publication of GB2457897A publication Critical patent/GB2457897A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F17/3074
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/10Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method of indexing a sound stream, the method comprising: a) generating a matrix of costs ('costs' are confidence ratings associated with basic sound units) from an analysis of the basic sound units that make up the sound stream (the basic sound units comprising sound utterances that people use to vocalize words which are also called 'canonical sound units' and can be phonetic units e.g. phones, diphones, syllables, fenenes or fenones); b) providing at least one word to be located in the sound stream; c) transcribing the at least one word to a string of the basic sounds utterances; d) using the string and the costs to identify the at least one word in the sound stream; e) if identified, indexing the at least one word in a data base that is textually searchable to determine whether the at least one word is present in the sound stream; and f) repeating b)-e) for another at least one word ignoring costs that identified a previous at least one word. The basic sound units may be language independent, spanning a large number of languages. The basic sound units may be non-speech audio events. A set of audio features may be determined and used to identify basic sound units in the sound stream for example by identifying the audio or cultural environment of the recording.

Description

AUDIO FILE MANAGEMENT AND INDEXING METHOD AND SYSTEM
FIELD
Embodiments of the invention relate to managing and indexing audio files.
BACKGROUND
The large and growing global population and number of organizations generate enormous quantities of written, visual and audio data daily and, with the rapid development of easy-to-use data recording, transmission, processing and storage technology, more and more of this data is being archived. Not only are conventional sources, such as TV stations, content providers, teachers, scientists, institutes, and public figures participating in generating data, but "ordinary" citizens are generating enormous quantities of written and multimedia data, "User Generated Content" (UGC), of all sorts. And much of this data is of interest not only to the people who produce it and want to generate income by exposing it, but also to an increasingly broad international population of people, who for various reasons want to access it.
For example, not more than a decade or so ago only a relatively small portion of the world population had access to technology and had the inclination to use it to generate multimedia data. Today, home computers and various audio and video recorders have whetted the appetite of the general global community to participate in a frenzy of multimedia data generation and voyeurism. Many millions of average citizens of the global village are producing and archiving pictures, movies, songs, lectures, speeches, newscasts and anime, and making the material they produce available to the world public, as is readily attested to by Google, Youtube, Facebook and the Internet formats of any of the news agencies.
An even greater population of information consumers is using various technologies to search the generated data for information relevant to its needs and downloading written, audio and visual material via the Internet for work and entertainment. However, as the quantity of archived data increases, the task of locating relevant information using conventional technologies generally becomes more complex and can be more time consuming and prone to error.
In particular, searching a large audio file, such as, for example, a compendium of political speeches, newscasts, audio surveillance material, or vocal music, for spoken material, hereinafter also "target material" or "target speech", relevant to a given work or entertainment activity can be complex and results of a search unreliable. The task generally involves processing the audio file to identify speech therein, transcribing the identified speech into computer readable symbolic representation and comparing the transcribed speech to a computer readable representation of the target material to find a match to it in the transcribed speech. Often, target material to be identified in the audio file must be identified against a background of non-speech events. Difficulty in finding a match to the target material may also be compounded because the audio file and/or the target material contains speech in different languages or because speech in a same language in the audio file is stylized by different accents and subculture usage and syntax.
US Patent 6,185,527 describes "facilitating reliable information retrieval, also referred to as "word spotting" in long unstructured audio streams" by analyzing the content of audio streams "to identify content specific, application specific, genre specific clear speech boundaries..." US Patent 6,526,380 describes a "huge vocabulary speech recognition system for recognizing a sequence of spoken words" that comprises a plurality of different speech "recognizers", each associated with a different context associated speech model for recognizing speech. The different contexts "may include health, entertainment, computer, arts, business, education, ...". The sequence of spoken words is converted to a sequence of feature vectors, with each feature vector associated with a frame of the speech. A speech recognizer matches the feature vectors against "an inventory of speech recognition units" to determine a sequence of words the speech comprises. "The speech recognition units may represent whole words or sub-word units such as phones, diphones or syllables, as well as derivative units such as fenenes or fenones". A controller determines which sequence of words produced by the recognizers is a most likely word sequence that the spoken words contain.
SUMMARY OF THE INVENTION
An aspect of some embodiments of the invention relates to providing a method and apparatus for indexing and managing an audio file to facilitate retrieval of data from the audio file.
According to an aspect of some embodiments of the invention, the method is substantially language independent and provides indexing and retrieval facilities for speech content of the audio file for a relatively large number of languages in which the speech content may be vocalized.
According to an aspect of some embodiments of the invention, the method is substantially speaker independent and provides indexing and retrieval facilities of speech content for the audio file for a relatively large number of human conditions, such as gender, age, educational level and emotional state, that may affect the way in which the content is vocalized.
According to an aspect of some embodiments of the invention, the method provides indexing and retrieval facilities of speech content for the audio file for different acoustic and/or cultural environments in which speech may be vocalized. Examples of different acoustic environments are the acoustic environments generated by sounds in a factory or a running car, sounds at the beach, sounds of a country evening and sounds of a busy city street. Examples of different cultural environments are, for example, a lecture hall, a cocktail party and a political rally. A same person in each of the aforementioned different cultural environments may very well use a same language differently, use different subsets of the language and/or utter the same words differently. A cultural environment may also refer to a person's level of education and/or a dialect that influences the way in which words are vocalized.
An aspect of some embodiments of the invention relates to providing a "textually searchable" database, hereinafter an "index dictionary", of relevant words that "indexes" the audio file to facilitate finding and retrieving data from the audio file by presenting a query in textual form, i.e. as text, to the data base.
According to an aspect of some embodiments of the invention, the index dictionary is dynamically adjustable to increase the size of its vocabulary and grows naturally with use.
An aspect of some embodiments of the invention relates to providing a relatively efficient method for processing an audio file to increase the size of its associated index dictionary. In an embodiment of the invention, portions of the audio file that have been identified as a vocalization of a particular word or words with a relatively high confidence level are generally ignored when processing the audio file for a new word or words to be added to the dictionary.
In an embodiment of the invention, an audio stream comprised in an audio file is processed to generate a feature vector for each of a plurality of frames, "audio frames", of the audio stream. Feature vectors are associable with various levels of confidence to different basic sounds, hereinafter referred to as "canonical sound units" (CSUs), which comprise sound utterances that humans make to vocalize words and optionally sounds that are characteristic of non-speech acoustic events. The number and character of the feature vector components and associable CSUs are such that the CSUs span a gamut of utterances sufficient to identify words in a relatively large number of languages. The CSUs may, for example, comprise a set of phones sufficient to identify a relatively large number of words in any of a relatively large number of different languages. The various CSUs are optionally stored in a "CSU library".
The feature vectors optionally comprise feature vector components that are indicative of factors of an acoustic and/or cultural environment that "color" an uttered CSU in an audio stream. Color factors indicated by feature vector components do not change an identity of an uttered CSU, but modify or "mask" its vocalization and generally modify confidence levels with which a given feature vector of a given audio frame may be associated with the different CSUs. Feature vector components responsive to color factors are optionally responsive to an acoustic background, such as noise or music, against which speech is vocalized. Optionally, "color feature vector components" are responsive to cultural, sub-cultural and/or emotional environments such as, for example, level of education of a speaker, dialect and/or emotional state. Cultural, sub-cultural and/or emotional environments are generically referred to as cultural environments.
The number and character of the feature vector components and associable CSUs are, optionally, also sufficiently large and varied so that a relatively large number of non-speech events may be identified. Non-speech acoustic events may by way of example include a sound of an airplane, a dog bark, thunder or a dial tone.
In accordance with an embodiment of the invention, a value, hereinafter a "cost", for the feature vector of each audio frame and each of a plurality of CSUs in the CSU library is generated. The cost associated with a CSU for a given feature vector is, or may be used to indicate, a confidence level with which the feature vector may be considered to be generated responsive to utterance of the CSU. Optionally, the costs are stored in a memory as components of a matrix, hereinafter referred to as a CSU cost matrix.
In accordance with an embodiment of the invention, the plurality of CSUs for which costs are stored for individual audio frames is relatively large so that context dependent and context independent searches noted below for CSU string matches can provide relatively reliable results.
In an embodiment of the invention, to determine whether a given word not previously searched for in the audio file or a portion thereof is present in the audio file, the given word is transcribed to a string of CSUs, hereinafter a "target CSU string". The CSU cost matrix of the audio file is then searched to locate a string of CSUs in the matrix that matches the target string.
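By way of a non-limiting illustration, the following Python sketch shows one way a word might be transcribed to a target CSU string using a pronunciation lexicon. The toy lexicon and phone labels are assumptions made purely for illustration; a practical system could equally derive the transcription from a grapheme-to-phoneme model.

```python
# Minimal sketch: transcribing a query word into a target CSU string.
# TOY_LEXICON is a hypothetical example lexicon, not part of the described system.

TOY_LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "audio":  ["ao", "d", "iy", "ow"],
}

def transcribe_to_csu_string(word, lexicon=TOY_LEXICON):
    """Return the target CSU string (a list of CSU labels) for a word."""
    try:
        return lexicon[word.lower()]
    except KeyError:
        raise ValueError(f"No CSU transcription available for '{word}'")

print(transcribe_to_csu_string("speech"))  # ['s', 'p', 'iy', 'ch']
```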
Any of various methods known in the art may be used to search for a match, hereinafter a CSU match, to the target CSU string. For example, a Viterbi search of the CSU matrix may be conducted to search for a CSU match. Optionally, for a given search, the CSU matrix is filtered so that only costs satisfying a particular constraint are used in performing the search.
For example, only costs greater than or less than (depending upon how the costs are defined) a particular threshold may be involved in a particular search.
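A minimal sketch of such threshold filtering is shown below, assuming lower costs indicate higher confidence (the patent allows either convention) and using randomly generated costs purely for illustration.

```python
import numpy as np

# Sketch: constrain a search to cost-matrix entries that satisfy a threshold.
rng = np.random.default_rng(0)
csu_costs = rng.random((40, 500))       # M=40 CSUs x N=500 audio frames (toy values)
threshold = 0.3

# Entries above the threshold are masked out; a subsequent string search
# would only consider the remaining (low-cost, high-confidence) entries.
masked = np.where(csu_costs <= threshold, csu_costs, np.inf)
print(np.isfinite(masked).mean())       # fraction of entries retained
```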
Searches may be context independent or context dependent. Context independent searches are substantially independent of statistical correlation between sequential utterances of CSUs. Context dependent searches are responsive to correlation between sequential utterances of CSU that might for example be dependent upon language, culture or subculture. Since, in accordance with an embodiment of the invention, the CSU cost matrix stores costs for a plurality of different CSUs for each audio frame, relatively reliable context dependent and context independent searches can be made of the cost matrix.
If at least one match, i.e. CSU match, for the CSU target string is found, for each location of the at least one CSU match in the audio file, the location and a textual representation of the given word are stored in an index dictionary. Optionally, data in addition to the location at which the CSU match for the word is found is stored in the index dictionary.
For example for a given CSU match, a confidence level may be stored for the match and/or an indication of an acoustic and/or cultural environment in which the audio frame associated with the given CSU match is found may be stored for the match. The indication of the acoustic and/or cultural environment is optionally generated responsive to a characteristic of the feature vectors of the audio frames corresponding to the given CSU match and/or a context dependent search. If on the other hand, a match for the CSU target string is not found, a textual representation of the word is optionally stored in the index dictionary with data indicating that the word is not found in the audio file.
In an embodiment of the invention, if a search determines that a region in the audio file provides a CSU match to a target CSU string, the region in the audio file and a region in the CSU cost matrix associated with the audio file are ignored in subsequent searches for other words. For convenience of presentation, a search for a CSU match to a target CSU string is referred to as an "audio search". A word and data associated with the word that result from a CSU search, which are stored in the index dictionary, are referred to as an index record for the word.
In accordance with an embodiment of the invention, the index dictionary is stored in a computer memory or in any suitable computer readable medium and may be accessed by a user via a user interface of a suitably configured system to perform textual searches of the audio file and retrieve information therefrom. (A textual search refers to searching text stored in a memory responsive to a query formatted in text.) A system comprising software and/or hardware configured to practice an embodiment of the present invention is referred to for convenience of presentation as an "Audio File Indexing, Retrieval and Management" (AFIRM) system. AFIRM comprises any of various combinations of hardware, software and communication devices known in the art suitable for interfacing with a user and performing textual and/or audio searches in accordance with an embodiment of the invention.
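By way of a non-limiting illustration, the sketch below shows one possible shape for an index record and for marking matched regions of the cost matrix so that subsequent audio searches ignore them. The field names and data structures are illustrative assumptions, not a required schema.

```python
import numpy as np

index_dictionary = {}          # word -> list of index records
ignored_frames = set()         # frame indices already claimed by spotted words

def index_match(word, start_frame, end_frame, confidence, environment=None):
    """Store an index record for a spotted word and mark its frames as used."""
    record = {
        "start_frame": start_frame,
        "end_frame": end_frame,
        "confidence": confidence,
        "environment": environment,   # e.g. "noisy street", "lecture"
    }
    index_dictionary.setdefault(word, []).append(record)
    ignored_frames.update(range(start_frame, end_frame + 1))

def mask_indexed_regions(csu_costs):
    """Return a copy of the cost matrix with already-indexed frames removed from play."""
    masked = csu_costs.copy()
    masked[:, sorted(ignored_frames)] = np.inf    # assuming lower cost = better
    return masked

index_match("budget", start_frame=120, end_frame=131, confidence=0.82, environment="newscast")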
To perform a textual search of the audio file for a word, in accordance with an embodiment of the invention, the user enters a query comprising the word to the AFIRM interface. A processor optionally comprised in AFIRM performs a textual search of the index dictionary to locate an index record for the word responsive to the query. If an index record for the word is found, AFIRM generates a response to the user query that provides the user with data for the word comprised in the index record. For example, the response optionally provides the user with locations in the audio file where the word is located and with information that enables the user to access the locations and download material from the audio file associated with the locations. If, on the other hand, an index record for the word is not found, in accordance with an embodiment of the invention, AFIRM responds to the user notifying him or her that the textual search did not find any matches, performs an audio search for the word and updates the index dictionary with results of the audio search. Optionally, AFIRM offers the user an option of receiving results of the audio search automatically, for example by e-mail, after it is performed or of accessing AFIRM later to again attempt a textual search for the word.
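A minimal sketch of this query flow follows, assuming a hypothetical audio_search() callable that performs the CSU-string spotting step and returns (start frame, end frame, confidence) hits.

```python
# Sketch: try the textual index first; otherwise fall back to an audio search
# and update the index dictionary with whatever is (or is not) found.

def handle_query(word, index_dictionary, audio_search):
    records = index_dictionary.get(word)
    if records is not None:
        return {"source": "index", "records": records}

    hits = audio_search(word)                      # potentially slow step
    if hits:
        index_dictionary[word] = [
            {"start_frame": s, "end_frame": e, "confidence": c}
            for (s, e, c) in hits
        ]
    else:
        index_dictionary[word] = []                # remember the negative result too
    return {"source": "audio_search", "records": index_dictionary[word]}

print(handle_query("budget", {}, audio_search=lambda w: [(120, 131, 0.82)]))
```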
By performing audio searches and updating an index dictionary responsive to user queries, in accordance with an embodiment of the invention, the index dictionary expands responsive to user interest and knowledge, naturally and dynamically. With use, the index dictionary and AFIRM become more efficient at providing useful information rapidly and efficiently.
There is therefore provided in accordance with an embodiment of the invention, a method of indexing a sound stream, the method comprising: a) generating a matrix of costs responsive to the sound stream for a plurality of basic sounds units comprising sound utterances that people use to vocalize words; b) providing at least one word to be located in the sound stream; c) transcribing the at least one word to a string of the basic sounds utterances; d) using the string and the costs to identify the at least one word in the sound stream; e) if identified, indexing the at least one word in a data base that is textually searchable to determine whether the at least one word is present in the sound stream; and f) repeating b)-e) for another at least one word ignoring costs that identified a previous at least one word.
Optionally, the plurality of basic sound units comprises language independent sound utterances. Optionally, the language independent sound utterances span a relatively large number of languages.
In some embodiments of the invention, the plurality of basic sound units comprises basic sound units that are used to generate non-speech audio events. Optionally, the method comprises generating a matrix of costs responsive to the sound stream for the plurality of basic sounds units used to generate non-speech audio events.
Optionally, the method comprises: a) providing in text format at least one non-speech audio event to be located in the sound stream; b) transcribing the at least one non-speech audio event to a string of the basic sounds units; c) using the string and the costs to identify the at least one non-speech audio event in the sound stream; d) if identified, indexing the at least one non-speech audio event in a data base that is textually searchable to determine whether the at least one non-speech audio event is present in the sound stream; and e) repeating a)-d) for another at least one non-speech audio event ignoring costs that identified a previous at least one non-speech audio event.
In some embodiments of the invention, generating a matrix comprises providing the plurality of basic sound units.
In some embodiments of the invention, the method comprises determining a set of audio features useable to identify basic sound units in the sound stream. Optionally, determining a set of audio features comprises determining audio features useable to identify an audio environment in which speech is vocalized. Optionally, indexing the at least one word comprises indexing data indicative of the audio environment in which the at least one word is vocalized.
In some embodiments of the invention, determining a set of audio features comprises determining audio features useable to determine a cultural environment that characterizes vocalized speech. Optionally, indexing the at least one word comprises indexing data indicative of a cultural environment that characterizes vocalization of the at least one word.
A method of searching for at least one word in a sound stream comprising: providing a textually searchable data base responsive to the sound stream by indexing in accordance with an embodiment of the invention; querying the data base using a textual transcription of the at least one word; if the at least one word is indexed, generating a response providing index data; and if the at least one word is not indexed: transcribing the at least one word to a string of the CSUs; using the string and the costs to identify the at least one word in the sound stream; and if identified, indexing the at least one word in the data base.
Optionally, generating a response comprises generating a textual response. Additionally or alternatively, generating a response comprises generating an audio response. Optionally, generating an audio response comprises sounding the at least one word from at least one location in the sound stream at which the at least one word is located.
BRIEF DESCRIPTION OF FIGURES
Non-limiting examples of embodiments of the invention are described below with reference to figures attached hereto that are listed following this paragraph. Identical structures, elements or parts that appear in more than one figure are generally labeled with a same numeral in all the figures in which they appear. Dimensions of components and features shown in the figures are chosen for convenience and clarity of presentation and are not necessarily shown to scale.
Fig. 1 schematically illustrates an AFIRM system generating a CSU matrix of an audio file, in accordance with an embodiment of the invention; and Fig. 2 schematically illustrates the AFIRM system responding to a user query, in accordance with an embodiment of the invention.
DETAILED DESCRIPTION
Fig. 1 schematically shows a portion of an AFIRM system 20, indicated by a dashed boundary, comprising a feature extraction engine 22 and a canonical sound unit, i.e. CSU, transcription engine 24, in accordance with an embodiment of the invention. Additional components comprised in AFIRM 20, other than those shown in Fig. 1, are shown in Fig. 2.
In Fig. 1, AFIRM 20 is schematically shown processing an audio file, schematically indicated by a dashed boundary 50, to generate a canonical sound unit (CSU) matrix 70 for the file. Audio file 50 is representative of any file for which it is desired to provide search and retrieval functions. The audio file may for example be a compendium of political speeches, newscasts, audio surveillance material, vocal music, or an audio portion of any of various multimedia files.
A portion of an audio stream comprised in audio file 50 is represented by a waveform 52 shown as a function of time relative to a time t0, measured along a time axis 54. The audio stream is routed by AFIRM 20 to feature extraction engine 22, which partitions the audio stream into audio frames 56 indicated by a curly bracket 56. Audio frames 56 have start times t1, t2, ... tN, indicated along time axis 54 by registration lines 57 separated by a time interval 58 referred to as a "frame step". Optionally, the frame step is substantially constant.
Optionally, frames 56 are characterized by duration, a "frame size", that is longer than frame step 58 so that adjacent frames overlap partially. By way of numerical example, frame step 58 is typically between about 10 and 20 ms and frame size is between about 16 and about 30 ms.
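By way of a non-limiting numerical illustration, the following Python sketch partitions a sampled audio stream into overlapping frames using a 10 ms frame step and a 25 ms frame size, values within the ranges quoted above; the 16 kHz sample rate and the random test signal are assumptions for the example only.

```python
import numpy as np

def frame_audio(samples, sample_rate=16000, step_ms=10, size_ms=25):
    """Split a 1-D sample array into overlapping frames (frame size > frame step)."""
    step = int(sample_rate * step_ms / 1000)      # samples per frame step
    size = int(sample_rate * size_ms / 1000)      # samples per frame
    starts = range(0, max(len(samples) - size, 0) + 1, step)
    return np.stack([samples[s:s + size] for s in starts])

audio = np.random.default_rng(1).standard_normal(16000)   # 1 s of toy audio
frames = frame_audio(audio)
print(frames.shape)   # (98, 400): ~98 overlapping 25 ms frames at a 10 ms step
```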
For each audio frame 56, feature extraction engine 22 generates a feature vector responsive to sound vocalized or otherwise produced during the audio frame. The feature vector is useable to determine for each of a plurality of canonical sound units, i.e. CSUs, a "cost" that is indicative of and/or can be used to determine a measure of confidence at which the sound made during the audio frame can be identified with the CSU. The CSUs comprise basic sound utterances that are used to vocalize speech, and optionally sound units that may be used to identify non-speech sounds such as sounds made by various types of machinery, vehicles, the weather, animals and sound patterns characteristic of particular environments, such as, for example, the beach, a restaurant or a symphony hall.
Optionally, the CSUs associated with vocalizing speech are sufficient in number and character so that they span a gamut of utterances sufficient to identify a relatively large number of words in each of a relatively large number of different languages and optionally comprise phones. Optionally, the CSUs comprise sounds classified and represented in the International Phonetic Alphabet (IPA). In some embodiments of the invention, the CSUs comprise sounds that are defined by processing feature vectors of an audio file, such as audio file 50, to determine clusters of the feature vectors. Each cluster of feature vectors defines a different representative feature vector, e.g. an average, of the cluster of vectors and a different CSU, which is associated with the cluster and its representative vector.
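A minimal sketch of deriving CSUs from feature-vector clusters follows, using a plain k-means loop purely for illustration; the number of CSUs, the feature dimension and the random data are assumptions, and each cluster centre plays the role of a representative feature vector.

```python
import numpy as np

def kmeans_csus(feature_vectors, num_csus=8, iterations=20, seed=0):
    """Cluster feature vectors; each cluster average serves as one CSU's RF_m."""
    rng = np.random.default_rng(seed)
    centres = feature_vectors[rng.choice(len(feature_vectors), num_csus, replace=False)]
    for _ in range(iterations):
        # assign each feature vector to its nearest representative vector
        d = np.linalg.norm(feature_vectors[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for m in range(num_csus):
            members = feature_vectors[labels == m]
            if len(members):
                centres[m] = members.mean(axis=0)   # cluster average = representative vector
    return centres

features = np.random.default_rng(2).standard_normal((1000, 13))
representative_vectors = kmeans_csus(features)
print(representative_vectors.shape)   # (8, 13): one representative vector per CSU
```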
Optionally, the feature vectors for audio frames 56 are stored as columns 62 in a feature vector matrix 64, with the feature vector corresponding to a given audio frame and the column in which it is stored labeled with a column index corresponding to the start time tn of the frame. Each feature vector 62 optionally comprises a sufficient number of different feature vector components to provide costs for the plurality of CSUs used to identify speech vocalizations and optionally non-speech sounds. The feature vectors have "J" components identified by alphanumerics F1, F2, ... FJ. Rows of feature matrix 64 are labeled with the identifiers of the feature vector components, and an n-th audio frame has values for feature vector components F1, F2, ... FJ equal to F1(tn), F2(tn), ... FJ(tn) respectively, where tn is the start time of the n-th audio frame. For convenience of presentation, values F1(tn), F2(tn), ... FJ(tn) are indicated only for n = 0, 1-6 and an N-th audio frame.
As noted above, feature vectors 62 are associable with various degrees of confidence, indicated by costs, to different CSUs. For clearly enunciated speech vocalized in an acoustic environment with no background noise and little or substantially no cultural environment coloring, a probability distribution function that describes probabilities that a given feature vector is associated with a given CSU is relatively narrow and peaked at a single CSU. As a result, for such speech, a CSU feature vector is associable with a relatively high confidence level with a single CSU.
Optionally components of feature vectors 62 are sufficient in number and character so that a given feature vector 62 may generally be associated with and indicate an acoustic and/or cultural environment that "color" an uttered CSU in audio stream 52. Acoustic and/or cultural environments indicated by feature vector components do not generally change an identity of an uttered CSU, but modify or "mask" its vocalization. The feature vector components are responsive to acoustic and/or cultural factors that generally modify confidence levels with which a feature vector 62, and therefore audio content of a given audio frame 56, may be associated with the different CSUs. The feature vector components are responsive, by way of example, to an acoustic background, such as noise or music, in which speech is vocalized, its peculiar cultural and/or sub-cultural background, for example level of education of a speaker, and/or human conditions, such as various states of emotion, that may affect the way in which speech is uttered.
Feature vectors 62 optionally comprise components that are functions of different frequency components of sound in their respective frames. For example, each of a plurality of components of a feature vector 62 may have a value equal to a Fourier coefficient of a discrete Fourier transform of sound in the frame. Optionally, components of the feature vector are coefficients of an energy spectral density of the sound. In some embodiments of the invention, a feature vector comprises binary valued components that indicate presence or absence of vocal features such as class features, laryngeal features, manner features, and place features. Feature vectors optionally comprise components such as bandwidth, number of zero crossings of amplitude, spectral concentration, and/or presence of harmonics of an audio waveform. US Patent 6,185,527 referenced above describes various features for indicating audio environment that may be used in the practice of the invention.
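By way of a non-limiting illustration, the sketch below computes a few of the feature components mentioned above, magnitude (Fourier) coefficients, an energy spectral density and a zero-crossing count, for a single audio frame. Which components a deployed system uses is a design choice; these are illustrative only.

```python
import numpy as np

def frame_features(frame, num_fft_bins=16):
    """Return a small illustrative feature vector for one audio frame."""
    spectrum = np.fft.rfft(frame)
    magnitudes = np.abs(spectrum)[:num_fft_bins]          # Fourier magnitude coefficients
    energy_density = (magnitudes ** 2) / len(frame)       # crude energy spectral density
    zero_crossings = np.count_nonzero(np.diff(np.signbit(frame).astype(np.int8)))
    return np.concatenate([magnitudes, energy_density, [zero_crossings]])

frame = np.random.default_rng(3).standard_normal(400)     # one 25 ms frame at 16 kHz
print(frame_features(frame).shape)                        # (33,)
```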
Feature vectors 62 in feature matrix 64 are processed by canonical sound unit transcription engine 24 to provide, for each audio frame 56 and each of a plurality of "M" CSUs, a cost indicative of how probable it is that the sound vocalized in the frame is the CSU. Optionally, the costs for a given frame are stored in a memory, optionally as a vector in a column 72 in CSU matrix 70, for which each row in the matrix is labeled with an alphanumeric label indicating a different CSU. The alphanumeric row labels are CSU-1, CSU-2, ... CSU-M, where the number following the dash corresponds to the row of the CSU. For each audio frame 56 in audio stream 52, the value in a given row of the column in which CSU vector 72 for the frame is stored is the cost for the audio frame of the CSU that labels the row. The cost for CSU-m (m-th row) for a given n-th audio frame 56 that begins at time tn is indicated by Cm(tn) in the column in which the CSU vector for the frame is stored. Costs Cm(tn) are schematically indicated for simplicity only for the first seven columns (t0-t6) and an N-th column of CSU matrix 70.
Any of various methods known in the art may be used to determine costs in CSU matrix 70. For example, let the feature vector for a given audio frame 56 that begins at time tn be represented by FV(tn), and let a representative feature vector that is identified with an m-th CSU be represented by RFm. Then, optionally, a cost Cm(tn) of CSU-m for the frame is equal to a sum of the squares of the components of a difference vector RFm - FV(tn).
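A minimal sketch of this cost computation over a whole feature matrix follows, with random feature vectors and representative vectors standing in for real data.

```python
import numpy as np

def csu_cost_matrix(feature_matrix, representative_vectors):
    """feature_matrix: (num_frames, J); representative_vectors: (M, J).
    Returns an (M, num_frames) matrix of costs C_m(t_n) = sum of squared
    components of RF_m - FV(t_n)."""
    diff = representative_vectors[:, None, :] - feature_matrix[None, :, :]
    return np.sum(diff ** 2, axis=2)

features = np.random.default_rng(4).standard_normal((500, 13))     # N=500 frames, J=13
rfs = np.random.default_rng(5).standard_normal((40, 13))           # M=40 CSUs
costs = csu_cost_matrix(features, rfs)
print(costs.shape)    # (40, 500): rows are CSUs, columns are audio frames
```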
Each cost vector 72 optionally comprises, in addition to costs Cm(tn), costs (not shown), determined responsive to the feature vector 62 associated with the cost vector, that are responsive to and indicative of the acoustic and/or cultural environments in which the CSUs of the cost vectors are produced. For example, costs additional to the costs Cm(tn) may include a cost indicative of whether speech is being vocalized inside a room or in the outdoors, in a factory, in the country, in a running car, or in a symphony hall during intermission.
Fig. 2 schematically illustrates AFIRM 20 responding to a user query 100 submitted in text format for retrieval of information with respect to audio file 50 shown in Fig. 1, in accordance with an embodiment of the invention.
As shown in Fig. 2, AFIRM 20 comprises an index dictionary 26 for the audio file, a text search engine 28 and a suitable user interface, optionally a computer terminal 30, for communicating with AFIRM 20. Index dictionary 26 comprises index records (not shown), each record identified with a word and comprising information relevant to the word and audio file 50. By way of example, an index record of a word that is known to be in audio file 50 comprises locations of the word in the audio file and data associated with the locations, such as, optionally, a frequency of appearance in the audio file and acoustic and/or cultural environments for locations at which the word is located. Index dictionary 26 also, optionally, comprises index records of words that are known not to be found in audio file 50, which comprise data indicating that the word does not appear in the audio file.
Query 100 optionally comprises a request to locate a word in audio file 50, and the word is entered to AFIRM 20 optionally via a keyboard 31 comprised in computer terminal 30. In accordance with an embodiment of the invention, AFIRM 20 enables the user to characterize his or her request, such as to require that the word be uttered in the context of a particular acoustic and/or cultural environment. For example, the user may require that the word be uttered by an educated person, or by a person in an excited emotional state or in a didactic situation. The user query may request that the word be uttered by a particular person or that the word be uttered in any of a number of different languages.
Once entered, query 100 is directed by AFIRM 20 at communication block arrow 29 to text search engine 28. The text search engine performs a textual search, as indicated by bi-directional communication block arrow 27, of index dictionary 26 for the word and at communication block arrow 32 provides the user with a result 102 of the search. If index dictionary 26 comprises an index record for the word, the result comprises information comprised in the index record relevant to query 100. For example, if the index record indicates that the word appears in audio file 50, the result may comprise locations in the audio file at which the word was uttered within the context of the acoustic and/or cultural environment that was required by query 100. Result 102 may comprise information not directly requested in the query or information tangential to the query. For example, result 102 optionally comprises information as to the frequency of appearance of the word in audio file 50 or locations in which the word appears in acoustic or cultural contexts other than those requested by the query. If the index record indicates that the word is not found in audio file 50, result 102 informs the user of that circumstance and optionally provides additional information, such as a synonym that might be searched for in index dictionary 26 or a last time that the word was requested.
If no record for the word exists in index dictionary 26, result 102 informs the user that no record exists. For such a situation, in accordance with an embodiment of the invention, AFIRM 20 in general performs a search, i.e. an audio search, of audio file 50 to determine whether the audio file contains the word and optionally informs the user that an audio search will be undertaken. Whereas, in general a textual search, such as a textual search of index dictionary 26 can be performed relatively rapidly to provide a response to a query, an audio search can require an extended period of time. Therefore, AFIRM 20 optionally alerts the user that an audio search will possibly require a relatively long period of time and advises the user accordingly, optionally suggesting that the user attempt to search for the word at a suitable later time and/or offering to e-mail a result of the audio search when it is completed.
As indicated at condition diamond 32, to perform the audio search for the "recordless" word, text search engine 28 transmits the word to a CSU transcription engine 34, which transcribes the word to a CSU target string 104. Target CSU string 104 is transmitted to a word spotting engine 36, which searches for matching CSU strings in CSU matrix 70. Any of various algorithms known in the art, for example a dynamic programming algorithm such as a Viterbi algorithm, for finding a match to a symbol string may be used to process data in the CSU matrix to find a CSU match to target string 104.
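By way of a non-limiting illustration, the sketch below uses a simple dynamic programming recursion as a stand-in for the Viterbi-style search mentioned above: each CSU of the target string after the first may span one or more consecutive frames, and the lowest accumulated cost alignment is kept. Practical systems would typically use per-frame log-probability costs and some form of length normalization, which are omitted here for brevity.

```python
import numpy as np

def spot_string(costs, target_rows):
    """costs: (M, N) CSU cost matrix; target_rows: row indices of the target CSU string.
    Returns (end_frame, total_cost) of the lowest-cost alignment found."""
    _, N = costs.shape
    K = len(target_rows)
    dp = np.full((K, N), np.inf)
    dp[0, :] = costs[target_rows[0], :]              # the first CSU may start at any frame
    for k in range(1, K):
        for n in range(1, N):
            stay = dp[k, n - 1]                      # CSU k continues into frame n
            advance = dp[k - 1, n - 1]               # CSU k begins at frame n
            dp[k, n] = costs[target_rows[k], n] + min(stay, advance)
    end = int(np.argmin(dp[-1]))
    return end, float(dp[-1, end])

rng = np.random.default_rng(6)
costs = rng.random((40, 500))
# rows 39, 1, 2, 0 correspond to the (CSU-M)(CSU-2)(CSU-3)(CSU-1) example below, with M = 40
print(spot_string(costs, target_rows=[39, 1, 2, 0]))
```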
By way of example, assume that target CSU string 104 is a four element CSU target string (CSU-M)(CSU-2)(CSU-3)(CSU-1) and that a match for the string is found in CSU matrix 70 in a sequence of CSU vectors 72 indicated by a curly bracket 106 and having start times t1, t2, t3 and t4. Costs for the components in the vectors of the CSU match are indicated by shading. A spotting result 108 for the spotted word is transmitted by word spotting engine 36 to an indexing engine 38. The indexing engine generates a record (not shown) for the spotted word that includes data noting the location, indicated by curly bracket 106, where the word was spotted and optionally additional data that may include information regarding the acoustic and/or cultural environment of the location and/or a confidence level for the spotting of the word.
In accordance with an embodiment of the invention, indexing engine 38 updates index dictionary 26 with the record and in addition updates the CSU matrix by generating data that indicates that region 106 of the matrix has been identified with a word having a record in index dictionary 26 and should in general be ignored in future audio searches. By way of example, an updated version 71 of CSU matrix 70 is shown in which vectors in region 106 are schematically indicated as corresponding to a word having a record in index dictionary 26 by having their components deleted.
In the description and claims of the present application, each of the verbs, "comprise" "include" and "have", and conjugates thereof, are used to indicate that the object or objects of the verb are not necessarily an exhaustive listing of members, components, elements or parts of the subject or subjects of the verb.
The invention has been described with reference to embodiments thereof that are provided by way of example and are not intended to limit the scope of the invention. The described embodiments comprise different features, not all of which are required in all embodiments of the invention. Some embodiments of the invention utilize only some of the features or possible combinations of the features. Variations of embodiments of the described invention and embodiments of the invention comprising different combinations of features than those noted in the described embodiments will occur to persons of the art. The scope of the invention is limited only by the following claims.

Claims (17)

  1. A method of indexing a sound stream, the method comprising: a) generating a matrix of costs responsive to the sound stream for a plurality of basic sounds units comprising sound utterances that people use to vocalize words; b) providing at least one word to be located in the sound stream; c) transcribing the at least one word to a string of the basic sounds utterances; d) using the string and the costs to identify the at least one word in the sound stream; e) if identified, indexing the at least one word in a data base that is textually searchable to determine whether the at least one word is present in the sound stream; and f) repeating b)-e) for another at least one word ignoring costs that identified a previous at least one word.
  2. A method according to claim 1 wherein the plurality of basic sound units comprises language independent sound utterances.
  3. A method according to claim 2 wherein the language independent sound utterances span a relatively large number of languages.
  4. A method according to any of claims 1-3 wherein the plurality of basic sound units comprises basic sound units that are used to generate non-speech audio events.
  5. A method according to claim 4 and comprising generating a matrix of costs responsive to the sound stream for the plurality of basic sounds units used to generate non-speech audio events.
  6. A method according to claim 5 comprising: a) providing in text format at least one non-speech audio event to be located in the sound stream; b) transcribing the at least one non-speech audio event to a string of the basic sounds units; c) using the string and the costs to identify the at least one non-speech audio event in the sound stream; d) if identified, indexing the at least one non-speech audio event in a data base that is textually searchable to determine whether the at least one non-speech audio event is present in the sound stream; and e) repeating a)-d) for another at least one non-speech audio event ignoring costs that identified a previous at least one non-speech audio event.
  7. A method according to any of claims 1-6 wherein generating a matrix comprises providing the plurality of basic sound units.
  8. A method according to any of claims 1-7 and comprising determining a set of audio features useable to identify basic sound units in the sound stream.
  9. A method according to claim 8 wherein determining a set of audio features comprises determining audio features useable to identify an audio environment in which speech is vocalized.
  10. A method according to claim 9 wherein indexing the at least one word comprises indexing data indicative of the audio environment in which the at least one word is vocalized.
  11. A method according to any of claims 8-10 wherein determining a set of audio features comprises determining audio features useable to determine a cultural environment that characterizes vocalized speech.
  12. A method according to claim 11 wherein indexing the at least one word comprises indexing data indicative of a cultural environment that characterizes vocalization of the at least one word.
  13. A method of searching for at least one word in a sound stream comprising: providing a textually searchable data base responsive to the sound stream by indexing in accordance with any of claims 1-12; querying the data base using a textual transcription of the at least one word; if the at least one word is indexed, generating a response providing index data; and if the at least one word is not indexed: transcribing the at least one word to a string of the CSUs; using the string and the costs to identify the at least one word in the sound stream; and if identified, indexing the at least one word in the data base.
  14. A method according to claim 13 wherein generating a response comprises generating a textual response.
  15. A method according to claim 13 or claim 14 wherein generating a response comprises generating an audio response.
  16. A method according to claim 15 wherein generating an audio response comprises sounding the at least one word from at least one location in the sound stream at which the at least one word is located.
  17. A method substantially as hereinbefore described with reference to the accompanying drawings.
GB0803554A 2008-02-27 2008-02-27 Audio File Management, Search and Indexing Method and System Withdrawn GB2457897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0803554A GB2457897A (en) 2008-02-27 2008-02-27 Audio File Management, Search and Indexing Method and System

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0803554A GB2457897A (en) 2008-02-27 2008-02-27 Audio File Management, Search and Indexing Method and System

Publications (2)

Publication Number Publication Date
GB0803554D0 GB0803554D0 (en) 2008-04-02
GB2457897A true GB2457897A (en) 2009-09-02

Family

ID=39284634

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0803554A Withdrawn GB2457897A (en) 2008-02-27 2008-02-27 Audio File Management, Search and Indexing Method and System

Country Status (1)

Country Link
GB (1) GB2457897A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000008634A1 (en) * 1998-08-07 2000-02-17 Fonix Corporation Methods and apparatus for phoneme estimation using neural networks
US6185527B1 (en) * 1999-01-19 2001-02-06 International Business Machines Corporation System and method for automatic audio content analysis for word spotting, indexing, classification and retrieval
US6526380B1 (en) * 1999-03-26 2003-02-25 Koninklijke Philips Electronics N.V. Speech recognition system having parallel large vocabulary recognition engines
EP1688915A1 (en) * 2005-02-05 2006-08-09 Aurix Limited Methods and apparatus relating to searching of spoken audio data
US20070179784A1 (en) * 2006-02-02 2007-08-02 Queensland University Of Technology Dynamic match lattice spotting for indexing speech content

Also Published As

Publication number Publication date
GB0803554D0 (en) 2008-04-02

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)