USH2189H1 - SQL enhancements to support text queries on speech recognition results of audio data - Google Patents

Info

Publication number
USH2189H1
USH2189H1
Authority
US
United States
Prior art keywords
speech recognition
format
speech
results
recognition result
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/361,571
Inventor
Vishal Rao
Rajiv Chopra
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oracle International Corp
Original Assignee
Oracle International Corp
Application filed by Oracle International Corp
Priority to US10/361,571
Assigned to ORACLE INTERNATIONAL CORPORATION (assignment of assignors interest; see document for details). Assignors: CHOPRA, RAJIV; RAO, VISHAL
Application granted
Publication of USH2189H1
Status: Abandoned

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60: Information retrieval of audio data
    • G06F 16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683: Retrieval using metadata automatically derived from the content
    • G06F 16/685: Retrieval using an automatically derived transcript of audio data, e.g. lyrics

Definitions

  • OPERATOR SpeechScore. Signature: SpeechScore(reference_label IN NUMBER) RETURN NUMBER; Description: Use the SpeechScore operator in a SELECT statement to return the score values produced by SpeechContains in a SpeechIndexing query. Parameters: reference_label, an integer that refers to the corresponding invocation of SpeechContains. If there are multiple invocations of SpeechContains in the same query, this parameter is used to maintain the reference.
  • The SpeechScore operator can be used in a SELECT, ORDER BY, or GROUP BY clause. Returns: this operator returns a NUMBER.
  • OPERATOR SpeechConfidenceTimestamp. Signature: SpeechConfidenceTimestamp(reference_label IN NUMBER) RETURN ordsys.ORDConfidenceTimestampTable; Description: Use the SpeechConfidenceTimestamp operator in a SELECT statement to return the collection of confidence and timestamp pairs produced by SpeechContains in a SpeechIndexing query. Parameters: reference_label, an integer that refers to the corresponding invocation of SpeechContains.
  • SpeechConfidenceTimestamp can be used in a SELECT clause. Returns: this operator returns a table of type ordsys.ORDConfidenceTimestampTable (defined below).
  • INDEXTYPE ORDSpeechIndex. Description: This indextype allows a user to create a speech index on a CLOB column that contains the results of SpeechMining. Parameters: parameter_string, which can be used to pass Oracle Text preferences to the underlying Oracle Text index. Note that datastore preferences are disallowed.
  • The types below are used to retrieve speech recognition confidence and timestamp values from the query into PL/SQL variables.
  • TYPE ORDConfidenceTimestampTuple: CREATE TYPE ORDConfidenceTimestampTuple AS OBJECT (confidence NUMBER, timestamp NUMBER);
  • TYPE ORDConfidenceTimestampTable: CREATE TYPE ORDConfidenceTimestampTable AS TABLE OF ORDConfidenceTimestampTuple;
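  • As a brief illustration of the last point, the following is a minimal PL/SQL sketch of reading the collection returned by SpeechConfidenceTimestamp into a variable of the types above. It reuses the audionews table and SpeechMining_result column from the SpeechContains example in Appendix A below; restricting the fetch to a single matched row is an assumption made for brevity.

    DECLARE
      pairs ordsys.ORDConfidenceTimestampTable;
    BEGIN
      -- Fetch the confidence/timestamp collection for one matched row.
      SELECT ordsys.SpeechConfidenceTimestamp(1)
        INTO pairs
        FROM audionews
       WHERE ordsys.SpeechContains(SpeechMining_result, 'oracle', 1) > 0
         AND ROWNUM = 1;  -- assume a single matched row for this sketch

      -- Walk the collection and print each pair.
      FOR i IN 1 .. pairs.COUNT LOOP
        DBMS_OUTPUT.PUT_LINE('confidence=' || pairs(i).confidence ||
                             ', timestamp=' || pairs(i).timestamp);
      END LOOP;
    END;
    /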

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A system, method, computer program product, and application program interface for indexing data relating to results of speech recognition in a database management system provides the capability to perform simple and efficient searches on audio speech data with reduced development effort. An application program interface for indexing data relating to results of speech recognition in a database management system comprises an indextype operable to support text queries on speech recognition results, an interface operable to provide interaction with an index of the indextype, and a format adapter interface through which the index creation activity invokes a format adapter to extract the relevant information from a proprietary speech recognition result format.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
The benefit under 35 U.S.C. § 119(e) of provisional application Ser. No. 60/419,520, filed October 21, 2002, is hereby claimed.
FIELD OF THE INVENTION
The present invention relates to a system, method, computer program product, and application program interface for indexing data relating to results of speech recognition in a database management system.
BACKGROUND OF THE INVENTION
Speech recognition technology provides the capability to design computer systems that can recognize spoken words. Speech recognition systems accept audio speech data, which are digitized audio speech signals, and output textual information. A number of speech recognition systems are available on the market. The most powerful can recognize thousands of words. However, they generally require an extended training session during which the computer system becomes accustomed to a particular voice and accent. Such systems are said to be speaker dependent. More recently, speech recognition systems have been developed that can recognize speech without being trained using a particular voice and accent. Such systems may recognize the speech of most or any speakers, and are said to be speaker independent.
Audio speech data may be treated like any other data and stored and organized in a database. In the case of textual or numeric data, searches may be readily performed on the data by a database management system for the database. However, unlike textual or numeric data, there is no simple and efficient way to search audio speech data. Prior systems required that developers who wished to search audio speech data develop complex software procedures in order to perform the searching. For example, to perform a typical search, a user will want to know which audio or video assets satisfy given text query search criteria and the time offsets within each matched media asset where matches occurred, and the user may want to know the speech recognition confidence of each match. Conventionally, this required development of software to perform several iterations of extracting the relevant text, time offset, and confidence data from the speech recognition results, building appropriate B-tree indices on this extracted data, and associating time offsets and confidence values with their corresponding text data. In addition, procedures would have to be developed that would use the index and search through the text data for matched rows, and then search through the matched rows for time offsets into the media asset where matches occurred.
What is needed is a technique by which simple and efficient searches may be performed on audio speech data and which provides reduced development effort.
SUMMARY OF THE INVENTION
The present invention provides the capability to perform simple and efficient searches on audio speech data with reduced development effort. According to one embodiment of the present invention, an application program interface for indexing data relating to results of speech recognition in a database management system comprises an indextype operable to support text queries on speech recognition results, an interface operable to provide interaction with an index of the indextype, and a format adapter interface operable to invoke a format adapter for converting speech recognition results having a first format to a second format.
The format adapter may be operable to parse the speech recognition results in the first format, extract from the speech recognition results text data representing the recognized speech, information relating to a confidence in each speech recognition result, and timestamp information indicating a location of each portion of a speech recognition result, and generate speech recognition results in the second format using the extracted text data representing the recognized speech, information relating to a confidence in each speech recognition result, and timestamp information indicating a location of each portion of a speech recognition result.
The indextype may comprise the text data representing the recognized speech, the information relating to a confidence in each speech recognition result, and the timestamp information indicating a location of each portion of a speech recognition result. The interface may be operable to provide interaction comprising performing a query of the text data representing the recognized speech. The query of the text data representing the recognized speech relates to the confidence information and/or the timestamp information. The results of the query may indicate time offsets within each matched media asset where matches occurred and speech recognition confidence of each match occurrence within a matched media asset.
BRIEF DESCRIPTION OF THE DRAWINGS
The details of the present invention, both as to its structure and operation, can best be understood by referring to the accompanying drawings, in which like reference numbers and designations refer to like elements.
FIG. 1 is an exemplary dataflow diagram of speech indexing processing performed in the present invention.
FIG. 2 is a block diagram of an exemplary implementation of a database management system, in which the present invention may be implemented.
FIG. 3 is an exemplary flow diagram of a process of operation of the present invention.
FIG. 4 is an exemplary format of a data table that may be used in the present invention.
FIG. 5 is an exemplary code sample of how an application would invoke speech recognition on a particular row and populate the result column.
FIG. 6 is an example of an SQL command to build an index on the result column.
FIG. 7 is an example of an SQL command to create and pass preferences as arguments to index creation.
FIG. 8 is an example of a simple query on the data table, which makes use of the index.
FIG. 9 is an example of a query that retrieves confidence and timestamps for each occurrence within a matched audio asset row.
FIG. 10 is an example of an interface to a format adapter shown in FIG. 1, which is a proprietary format understanding procedure that extracts the information required for creating an index of the required indextype from a proprietary audio processing result format of a speech recognition engine.
FIG. 11 is an algorithmic description of an exemplary implementation of the proprietary format understanding procedure.
DETAILED DESCRIPTION OF THE INVENTION
An exemplary dataflow diagram of speech indexing processing performed in the present invention is shown in FIG. 1. Included in FIG. 1 are database management system (DBMS) 102, speech recognition engine 104, and speech query requestor 106. Speech query requestor 106 may be any database client, tool, or application that wants to issue text queries on audio speech data.
Database management system (DBMS) 102 provides the capability to store, organize, modify, and extract information from one or more databases included in DBMS 102. From a technical standpoint, DBMSs can differ widely. The terms relational, network, flat, and hierarchical all refer to the way a DBMS organizes information internally. The internal organization can affect how quickly and flexibly you can extract information.
Each database included in DBMS 102 includes a collection of information organized in such a way that computer software can select and retrieve desired pieces of data. Traditional databases are organized by fields, records, and files. A field is a single piece of information; a record is one complete set of fields; and a file is a collection of records. An alternative concept in database design is known as Hypertext. In a Hypertext database, any object, whether it be a piece of text, a picture, or a film, can be linked to any other object. Hypertext databases are particularly useful for organizing large amounts of disparate information, but they are not designed for numerical analysis.
Typically, a database includes not only data, but also low-level database management functions, which perform accesses to the database and store or retrieve data from the database. Such functions are often termed queries and are performed by using a database query language, such as Structured Query Language (SQL). SQL is a standardized query language for requesting information from a database. Historically, SQL has been a popular query language for database management systems running on minicomputers and mainframes. Increasingly, however, SQL is being supported by personal computer database systems because it supports distributed databases (databases that are spread out over several computer systems). This enables several users on a local-area network to access the same database simultaneously.
Most full-scale database systems are relational database systems. Small database systems, however, use other designs that provide less flexibility in posing queries. Relational databases are powerful because they require few assumptions about how data is related or how it will be extracted from the database. As a result, the same database can be viewed in many different ways. An important feature of relational systems is that a single database can be spread across several tables. This differs from flat-file databases, in which each database is self-contained in a single table.
DBMS 102 may also include one or more database applications, which are software that implements a particular set of functions that utilize one or more databases. Examples of database applications include:
    • computerized library systems
    • automated teller machines
    • flight reservation systems
    • computerized parts inventory systems
Typically, a database application includes data entry functions and data reporting functions. Data entry functions provide the capability to enter data into a database. Data entry may be performed manually, by data entry personnel, automatically, by data entry processing software that receives data from connected sources of data, or by a combination of manual and automated data entry techniques. Data reporting functions provide the capability to select and retrieve data from a database and to process and format that data for other uses. Typically, retrieved data is used to display information to a user, but retrieved data may also be used for other functions, such as account settlement, automated ordering, numerical machine control, etc.
DBMS 102 includes speech enhancements 108, format adapter 110, data table 112 and speech indexing processing 114. Speech enhancements 108 are extensions to the standard query language of DBMS 102. For example, where DBMS 102 uses SQL, speech enhancements include extensions to the command set of SQL, an indextype, and its associated operators and types to empower applications with sophisticated text querying capabilities on audio data.
Speech recognition engine 104 provides speech recognition processing functionality to DBMS 102. Speech recognition engine 104 is typically configured as a server communicatively connected to DBMS 102. Preferably, speech recognition engine 104 provides large vocabulary continuous speech recognition (LVCSR) services to DBMS 102. Essentially, speech recognition engine 104 receives data that represents digitized speech, processes the data to recognize the speech, and outputs text data that represents the speech, which is the speech recognition result. The speech recognition results are placed in the CLOB (Character Large Object) result column in data table 112 by the procedure that invoked the speech recognition processing. This procedure places the result in a CLOB column in data table 112 next to the audio data. When a Create Index command is issued on this CLOB column, speech indexing processing 114 is invoked, which in turn invokes format adapter 110. Typically, the speech recognition result is arranged in a proprietary format. Format adapter 110 adapts the format of the speech recognition result generated by speech recognition engine 104 to the format used for speech indexing. Format adapter 110 parses the speech recognition result and extracts the required information. In particular, format adapter 110 extracts text, confidence, and timestamp tuples from each speech recognition result.
Speech indexing processing 114 receives the text, confidence, and timestamp tuples extracted from the proprietary format of each speech recognition result by format adapter 110, stores the extracted information in its own internal data structures, and creates an index of the required indextype based on the extracted data. When an index of the required indextype is created or updated, speech indexing processing 114 is invoked for each new or updated row in data table 112. The row data, which are extracted from the speech recognition results, along with a table name and key to the original row in the indexed table, are provided as parameters. The routine must process the speech recognition result to extract <text, timestamp, confidence> tuples and insert this data, along with some additional computed data (character offset and sequence number) and the key supplied as a parameter to the procedure, into a table that is part of the index's internal data structures. Speech indexing processing 114 then inserts the extracted tuples of information into index data structures that are stored independently from the table upon which the index is built.
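The patent does not publish the layout of these internal index structures. The DDL below is a hypothetical sketch of a table that could hold the extracted tuples, with one column per item enumerated above; all names here are assumptions.

-- Hypothetical shape of an index-internal table: one row per extracted
-- <text, timestamp, confidence> tuple, plus the computed character offset
-- and sequence number and the key of the indexed row.
CREATE TABLE speech_index_data (
  base_table_key  ROWID,          -- key to the original row in the indexed table
  seq_number      NUMBER,         -- order of the tuple within the result
  char_offset     NUMBER,         -- character offset of the text in the transcript
  word_text       VARCHAR2(4000), -- recognized text
  time_stamp      NUMBER,         -- time offset into the media asset
  confidence      NUMBER          -- speech recognition confidence
);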
An example of an interface to format adapter 110 and speech indexing processing 114 is shown in FIG. 10. An example of an implementation of format adapter 110 is shown in FIG. 11.
A block diagram of an exemplary implementation of a DBMS 102, in which the present invention may be implemented, is shown in FIG. 2. DBMS 102 is typically a programmed general-purpose computer system, such as a personal computer, workstation, server system, minicomputer, or mainframe computer. DBMS 102 includes one or more processors (CPUs) 202A-202N, input/output circuitry 204, network adapter 206, and memory 208. CPUs 202A-202N execute program instructions in order to carry out the functions of the present invention. Typically, CPUs 202A-202N are one or more microprocessors, such as an INTEL PENTIUM® processor. FIG. 2 illustrates an embodiment in which DBMS 102 is implemented as a single multi-processor computer system, in which multiple processors 202A-202N share system resources, such as memory 208, input/output circuitry 204, and network adapter 206. However, the present invention also contemplates embodiments in which DBMS 102 is implemented as a plurality of networked computer systems, which may be single-processor computer systems, multi-processor computer systems, or a mix thereof.
Input/output circuitry 204 provides the capability to input data to, or output data from, DBMS 102. For example, input/output circuitry may include input devices, such as keyboards, mice, touchpads, trackballs, scanners, etc., output devices, such as video adapters, monitors, printers, etc., and input/output devices, such as, modems, etc. Network adapter 206 interfaces DBMS 102 with network 210. Network 210 may include one or more standard local area networks (LAN) or wide area networks (WAN), such as Ethernet, Token Ring, the Internet, or a private or proprietary LAN/WAN.
Memory 208 stores program instructions that are executed by, and data that are used and processed by, CPU 202 to perform the functions of DBMS 102. Memory 208 may include electronic memory devices, such as random-access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), flash memory, etc., and electromechanical memory, such as magnetic disk drives, tape drives, optical disk drives, etc., which may use an integrated drive electronics (IDE) interface, or a variation or enhancement thereof, such as enhanced IDE (EIDE) or ultra direct memory access (UDMA), or a small computer system interface (SCSI) based interface, or a variation or enhancement thereof, such as fast-SCSI, wide-SCSI, fast and wide-SCSI, etc., or a fibre channel-arbitrated loop (FC-AL) interface.
In the example shown in FIG. 2, memory 208 includes database management routines 212, database 214, and operating system 216. Database management routines 212 include software routines that provide the database management functionality of DBMS 102. Database management routines 212 include SQL interface with speech enhancements 108, format adapter 110, and speech indexing processing 114. SQL interface 108 accepts database queries using the SQL database query language, converts the queries to a series of database access commands, calls database processing routines to perform the series of database access commands, and returns the results of the query to the source of the query. For example, in an embodiment in which DBMS 102 is a proprietary DBMS, such as the ORACLE® DBMS, SQL interface 108 may support one or more particular versions of SQL or extensions to SQL, such as the ORACLE® PL/SQL extension to SQL. Speech enhancements 108 are extensions to the standard query language of DBMS 102. For example, where DBMS 102 uses SQL, speech enhancements include extensions to the command set of SQL, an indextype, and its associated operators and types to empower applications with sophisticated text querying capabilities on audio data.
Format adapter 110 processes the speech recognition result from speech recognition engine 104. Typically, the speech recognition result is arranged in a proprietary format. Format adapter 110 adapts the format of the speech recognition result generated by speech recognition engine 104 to the format used for speech indexing. Format adapter 110 parses the speech recognition result and extracts the required information. In particular, format adapter 110 extracts text, confidence, and timestamp tuples from each speech recognition result.
Speech indexing processing 114 receives the text, confidence, and timestamp tuples extracted from the proprietary format of each speech recognition result by format adapter 110, stores the extracted information in its own internal data structures, and creates an index of the required indextype based on the extracted data. When an index of the required indextype is created or updated, speech indexing processing 114 is invoked for each new or updated row in data table 112. The row data, which are extracted from the speech recognition results, along with a table name and key to the original row in the indexed table, are provided as parameters. The routine must process the speech recognition result to extract <text, timestamp, confidence> tuples and insert this data, along with some additional computed data (character offset and sequence number) and the key supplied as a parameter to the procedure, into a table that is part of the index's internal data structures. Speech indexing processing 114 then inserts the extracted tuples of information into index data structures that are stored independently from the table upon which the index is built. Database 214 includes a collection of information organized in such a way that computer software can select, store, and retrieve desired pieces of data. Typically, database 214 includes a plurality of data tables, such as data table 112. Data table 112 is arranged to store audio speech data that has been or is to be processed by speech recognition engine 104, shown in FIG. 1, as well as the speech recognition processing results output by speech recognition engine 104. Preferably, indexing information is kept in internal data structures, not in the same data table that stores the media data and speech recognition results. Typically, a user of the system would store media assets in data table 112.
In addition, as shown in FIG. 2, the present invention contemplates implementation on a system or systems that provide multi-processor, multi-tasking, multi-process, and/or multi-thread computing, as well as implementation on systems that provide only single processor, single thread computing. Multi-processor computing involves performing computing using more than one processor. Multi-tasking computing involves performing computing using more than one operating system task. A task is an operating system concept that refers to the combination of a program being executed and bookkeeping information used by the operating system. Whenever a program is executed, the operating system creates a new task for it. The task is like an envelope for the program in that it identifies the program with a task number and attaches other bookkeeping information to it. Many operating systems, including UNIX®, OS/2®, and WINDOWS®, are capable of running many tasks at the same time and are called multitasking operating systems. Multi-tasking is the ability of an operating system to execute more than one executable at the same time. Each executable is running in its own address space, meaning that the executables have no way to share any of their memory. This has advantages, because it is impossible for any program to damage the execution of any of the other programs running on the system. However, the programs have no way to exchange any information except through the operating system (or by reading files stored on the file system). Multi-process computing is similar to multi-tasking computing, as the terms task and process are often used interchangeably, although some operating systems make a distinction between the two.
An exemplary flow diagram of a typical process 300 of operation of a database management system incorporating the present invention is shown in FIG. 3. It is best viewed in conjunction with FIG. 1. Process 300 begins with step 302, in which media content is uploaded into a database table in DBMS 102, such as data table 112. In particular, media content includes audio speech data, which are digitized audio speech signals. In step 304, a speech recognition processing requestor 106, such as an application, that wants to process audio data with speech recognition engine 104 invokes the appropriate speech recognition processing. This causes speech recognition engine 104, which is waiting for speech processing requests, to receive a request for speech recognition. The received requests are processed by interface 116 of speech recognition engine 104. Speech recognition engine 104 processes the speech data in this request in order to recognize the speech and generate text data representing the recognized speech.
In step 306, format adapter 110 adapts the format of the speech recognition result generated by speech recognition engine 104 to the format used for speech indexing. Format adapter 110 parses the speech recognition result and extracts the required information. In particular, format adapter 110 extracts text, confidence, and timestamp tuples from each speech recognition result. Then, speech indexing processing 114 receives the text, confidence, and timestamp tuples extracted from the proprietary format of each speech recognition result by format adapter 110, inserts the extracted data into data table 112, and creates an index of the required indextype based on the inserted data. In one embodiment, shown in FIG. 1, the extracted text, confidence, and timestamp tuples from format adapter 110 are passed directly to speech indexing processing 114 for index creation. In other embodiments, the extracted text, confidence, and timestamp tuples from format adapter 110 may be stored before being passed to speech indexing processing 114 for index creation. When an index of the required indextype is created or updated, speech indexing processing 114 is invoked for each new or updated row in data table 112. The row data, which are extracted from the speech recognition results, along with a table name and key to the original row in the indexed table, are provided as parameters. This routine must process the data to extract <text, timestamp, confidence> tuples and insert them into data table 112. Speech indexing processing 114 then inserts the extracted tuples of information into index data structures associated with the index.
In step 308, speech query requestor 106 generates a query on the text data included in data table 112 and transmits the query to DBMS 102. The generated query utilizes speech enhancements 108 to the query language used by DBMS 102. In step 310, DBMS 102 performs the query by accessing data table 112 and, using the index, retrieves the specified information and returns the results of the query to speech query requestor 106.
Following is an exemplary description of a sample usage scenario that will demonstrate the power and ease of use of the speech indexing functionality provided by the present invention.
Imagine a scenario in which a customer wants to do the following:
    • 1. Upload media content including audio into DBMS 102.
    • 2. Process the audio by sending data to a previously started speech recognition engine 104 and store the results in DBMS 102.
    • 3. Create an index on the speech recognition results that will allow for sophisticated text querying capabilities.
    • 4. Query the data to retrieve matched rows along with time offset and speech recognition confidence pairs for each occurrence within a matched row.
The customer stores media content including audio in data table 112 in DBMS 102. An exemplary format of data table 112 is shown in FIG. 4. Data table 112 includes id column 402, audio data column 404, and result column 406. For each row of data in data table 112, id column 402 includes a unique identifier of the data in the row, audio data column 404 includes the actual audio data, and result column 406 includes the speech recognition results. An example 500 of how an application would invoke speech recognition on a particular row and populate the result column 406 is shown in FIG. 5.
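FIG. 4 and FIG. 5 are not reproduced here. The following is a minimal sketch of both, assuming a table named media_table, BLOB storage for the audio, and a hypothetical wrapper my_speech_pkg.process_speech around speech recognition engine 104; none of these names appear in the patent.

-- In the spirit of FIG. 4: the id / audio data / result layout of data table 112.
CREATE TABLE media_table (
  id     NUMBER PRIMARY KEY,  -- unique identifier of the row (column 402)
  audio  BLOB,                -- the actual audio data (column 404)
  result CLOB                 -- speech recognition results (column 406)
);

-- In the spirit of FIG. 5: invoke recognition on one row and populate the
-- result column. my_speech_pkg.process_speech is a hypothetical call into
-- speech recognition engine 104, not an API named in the patent.
DECLARE
  transcript CLOB;
BEGIN
  SELECT my_speech_pkg.process_speech(audio)
    INTO transcript
    FROM media_table
   WHERE id = 1;

  UPDATE media_table SET result = transcript WHERE id = 1;
  COMMIT;
END;
/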
After the application has processed each audio asset in data table 112 and populated result column 406, it is ready to build an index on the result column, for example, using an SQL command 600, such as that shown in FIG. 6. For enhanced text queries that need to take into account customized preferences, such as lexer and wordlist preferences, the application can create the preferences and pass them as arguments to index creation, for example, using an SQL command 700, such as that shown in FIG. 7.
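FIGS. 6 and 7 are not reproduced here; based on the ORDSpeechIndex indextype documented in Appendix A, the index creation commands would plausibly resemble the following sketch. The index names, and the preference names my_lexer and my_wordlist, are assumptions.
-- Basic index creation on the speech recognition results (cf. FIG. 6)
CREATE INDEX speech_idx ON audionews(SpeechMining_result)
  INDEXTYPE IS ordsys.ORDSpeechIndex;

-- Alternative form with customized Oracle Text preferences (cf. FIG. 7)
CREATE INDEX speech_idx_prefs ON audionews(SpeechMining_result)
  INDEXTYPE IS ordsys.ORDSpeechIndex
  PARAMETERS ('lexer my_lexer wordlist my_wordlist');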
An example 800 of a simple query on data table 112 is shown in FIG. 8. An example 900 of a more sophisticated query that retrieves confidence and timestamps for each occurrence within a matched audio asset row is shown in FIG. 9. In this example, the SpeechContains operator matches those rows that satisfy the input query, while the ancillary operator SpeechConfidenceTimestamp returns the corresponding collection of confidence/timestamp pairs for each returned row.
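FIGS. 8 and 9 are not reproduced here; using the operators defined in Appendix A, the two queries would plausibly take the following form. This is a sketch consistent with the Appendix A examples, not the figures themselves.
-- Simple query (cf. FIG. 8): rows whose recognized speech contains 'oracle'
SELECT title
  FROM audionews
 WHERE ordsys.SpeechContains(SpeechMining_result, 'oracle', 1) > 0;

-- Sophisticated query (cf. FIG. 9): also retrieve the confidence and
-- timestamp pairs for each match occurrence within each matched row
SELECT title,
       ordsys.SpeechConfidenceTimestamp(1)
  FROM audionews
 WHERE ordsys.SpeechContains(SpeechMining_result, 'oracle', 1) > 0;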
Format adapter 110 must be provided to adapt the format of the speech recognition result generated by speech recognition engine 104 to the format used for speech indexing. The formatting procedure must extract the information required for creating an index of the required indextype from the proprietary audio processing result format of speech recognition engine 104. In one embodiment, when an index of the required indextype is created or updated, format adapter 110 is invoked for each new or updated row in the indexed table. The row data, which are the processing results of SpeechMining, are provided as parameters, along with a table name and a key to the original row in the indexed table. This routine must process the data to extract <text, timestamp, confidence> tuples. An example of the interface 1000 to format adapter 110 is shown in FIG. 10. An example of the processing 1100 performed by format adapter 110 is shown in FIG. 11.
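FIGS. 10 and 11 are not reproduced here; a minimal sketch of what the format adapter interface might look like as a PL/SQL procedure follows, assuming the parameters described above. All names in this sketch are assumptions.
-- Hypothetical format adapter interface: invoked once per new or
-- updated row, it parses the engine's proprietary result format and
-- extracts the <text, timestamp, confidence> tuples for indexing.
CREATE OR REPLACE PROCEDURE format_adapter (
  table_name IN VARCHAR2,  -- name of the indexed table
  row_key    IN ROWID,     -- key to the original row in the indexed table
  raw_result IN CLOB       -- proprietary speech recognition result
) AS
BEGIN
  -- Parse raw_result, extract <text, timestamp, confidence> tuples,
  -- and hand them to speech indexing processing 114.
  NULL;  -- placeholder; the parsing depends on the engine's format
END;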
APPENDIX A
OPERATOR: SpeechContains
Signature
SpeechContains(indexed_column CLOB,
query_string VARCHAR2,
[reference_label NUMBER])
RETURN NUMBER;
Description
Use the SpeechContains operator in the WHERE clause of a SELECT
statement to specify the query expression for a SpeechIndexing query.
SpeechContains returns a relevance score for every row selected. You
obtain this score with the SpeechScore operator. Additionally,
SpeechConfidenceTimestamp returns tuples of speech recognition
confidences and time offsets for the matches in the selected row.
Parameters
indexed_column: Specify the CLOB column to be searched on. This
column must have an ordsys.ORDSpeechIndex index associated with it.
query_string: Specify the query that defines your search in
indexed_column. Oracle Text query operators can be used in
this query string.
reference_label: Optionally specify the label that associates the
SpeechScore and SpeechConfidenceTimestamp generated by the
SpeechContains operator.
Returns
For each row selected, SpeechContains returns a number between 0 and
100 that indicates how relevant the document row is to the query.
The number 0 means that Oracle found no matches in the row.
Example
The following example searches for all documents in the
SpeechMining_result column that contain the word ‘oracle’. The score for
each row is selected with the SpeechScore operator using a label of 1:
SELECT ordsys.SpeechScore(1), title
FROM audionews
WHERE ordsys.SpeechContains(SpeechMining_result,
‘oracle’, 1) > 0;
OPERATOR: SpeechScore
Signature
SpeechScore(reference_label IN NUMBER) RETURN NUMBER;
Description
Use the SpeechScore operator in a SELECT statement to return the score
values produced by SpeechContains in a SpeechIndexing query.
Parameters
reference_label: An integer that refers to the corresponding invocation
of SpeechContains. If there are multiple invocations of SpeechContains in
the same query, this parameter is used to maintain the reference.
Notes
The SpeechScore operator can be used in a SELECT, ORDER BY, or
GROUP BY clause.
Returns
This operator returns a NUMBER.
Example
See the example for SpeechContains.
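In addition, the following sketch (reusing the audionews table from the SpeechContains example, and not taken from the patent itself) shows SpeechScore in an ORDER BY clause to rank matched rows by relevance:
SELECT title, ordsys.SpeechScore(1)
  FROM audionews
 WHERE ordsys.SpeechContains(SpeechMining_result, 'oracle', 1) > 0
 ORDER BY ordsys.SpeechScore(1) DESC;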
OPERATOR: SpeechConfidenceTimestamp
Signature
SpeechConfidenceTimestamp(reference_label IN NUMBER)
RETURN ordsys.ORDConfidenceTimestampTable;
Description
Use the SpeechConfidenceTimestamp operator in a SELECT
statement to return a collection of confidence and timestamp
pairs produced by SpeechContains in a SpeechIndexing query.
Parameters
reference_label: An integer that refers to the corresponding invocation
of SpeechContains. If there are multiple invocations of SpeechContains in
the same query, this parameter is used to maintain the reference.
Notes
The SpeechConfidenceTimestamp operator can be used in a SELECT
clause.
Returns
This operator returns a table of type
ordsys.ORDConfidenceTimestampTable (defined below).
INDEXTYPE: ORDSpeechIndex
Description
This indextype allows a user to create an audio index
on a CLOB column that contains the results of SpeechMining.
Parameters
parameter_string: Can be used to pass in Oracle Text preferences to the
underlying Oracle Text index. Note that datastore preferences
are disallowed.
The types below are used to retrieve speech recognition confidence and
timestamp values from the query into PL/SQL variables.
OBJECT ORDConfidenceTimestampTuple
CREATE TYPE ORDConfidenceTimestampTuple
AS OBJECT (confidence NUMBER, timestamp NUMBER);
OBJECT ORDConfidenceTimestampTable
CREATE TYPE ORDConfidenceTimestampTable
AS TABLE OF ORDConfidenceTimestampTuple;
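As noted above, these types allow confidence and timestamp values to be retrieved from a query into PL/SQL variables; the following is a minimal sketch of such retrieval, consistent with the earlier examples. The id column and literal row value are assumptions.
DECLARE
  hits ordsys.ORDConfidenceTimestampTable;
BEGIN
  -- Retrieve the confidence/timestamp pairs for one matched row
  SELECT ordsys.SpeechConfidenceTimestamp(1)
    INTO hits
    FROM audionews
   WHERE ordsys.SpeechContains(SpeechMining_result, 'oracle', 1) > 0
     AND id = 1;
  -- Walk the collection of tuples returned for that row
  FOR i IN 1 .. hits.COUNT LOOP
    DBMS_OUTPUT.PUT_LINE('confidence: ' || hits(i).confidence ||
                         ', timestamp: ' || hits(i).timestamp);
  END LOOP;
END;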

Claims (27)

1. A method for indexing data relating to results of speech recognition in a database management system, comprising the steps of:
receiving speech recognition results at the database management system, the speech recognition results having a first format;
converting the first format of the speech recognition results to a second format; and
generating an index of the speech recognition results in the database management system.
2. The method of claim 1, wherein the converting step comprises the steps of:
parsing the speech recognition results in the first format;
extracting from the speech recognition results text data representing the recognized speech, information relating to a confidence in each speech recognition result, and timestamp information indicating a location of each portion of a speech recognition result; and
generating speech recognition results in the second format using the extracted text data representing the recognized speech, information relating to a confidence in each speech recognition result, and timestamp information indicating a location of each portion of a speech recognition result.
3. The method of claim 2, wherein the second format is a standardized format.
4. The method of claim 3, wherein the first format is a proprietary format.
5. The method of claim 2, wherein the generating step comprises the steps of:
generating an index using the extracted speech recognition results, including the text data representing the recognized speech, the information relating to a confidence in each speech recognition result, and the timestamp information indicating a location of each portion of a speech recognition result in the database management system; and
storing the extracted information.
6. The method of claim 5, wherein the second format is a standardized format.
7. The method of claim 6, wherein the first format is a proprietary format.
8. A system for indexing data relating to results of speech recognition in a database management system comprising:
a processor operable to execute computer program instructions;
a memory operable to store computer program instructions executable by the processor; and
computer program instructions stored in the memory and executable to perform the steps of:
receiving speech recognition results at the database management system, the speech recognition results having a first format;
converting the first format of the speech recognition results to a second format; and
generating an index of the speech recognition results in the database management system.
9. The system of claim 8, wherein the converting step comprises the steps of:
parsing the speech recognition results in the first format;
extracting from the speech recognition results text data representing the recognized speech, information relating to a confidence in each speech recognition result, and timestamp information indicating a location of each portion of a speech recognition result; and
generating speech recognition results in the second format using the extracted text data representing the recognized speech, information relating to a confidence in each speech recognition result, and timestamp information indicating a location of each portion of a speech recognition result.
10. The system of claim 9, wherein the second format is a standardized format.
11. The system of claim 10, wherein the first format is a proprietary format.
12. The system of claim 9, wherein the generating step comprises the steps of:
generating an index using the extracted speech recognition results, including the text data representing the recognized speech, the information relating to a confidence in each speech recognition result, and the timestamp information indicating a location of each portion of a speech recognition result in the database management system; and
storing the extracted information.
13. The system of claim 12, wherein the second format is a standardized format.
14. The system of claim 13, wherein the first format is a proprietary format.
15. A computer program product for indexing data relating to results of speech recognition in a database management system comprising:
a computer readable medium;
computer program instructions, recorded on the computer readable medium, executable by a processor, for performing the steps of
receiving speech recognition results at the database management system, the speech recognition results having a first format;
converting the first format of the speech recognition results to a second format; and
generating an index of the speech recognition results in the database management system.
16. The computer program product of claim 15, wherein the converting step comprises the steps of:
parsing the speech recognition results in the first format;
extracting from the speech recognition results text data representing the recognized speech, information relating to a confidence in each speech recognition result, and timestamp information indicating a location of each portion of a speech recognition result; and
generating speech recognition results in the second format using the extracted text data representing the recognized speech, information relating to a confidence in each speech recognition result, and timestamp information indicating a location of each portion of a speech recognition result.
17. The computer program product of claim 16, wherein the second format is a standardized format.
18. The computer program product of claim 17, wherein the first format is a proprietary format.
19. The computer program product of claim 16, wherein the generating step comprises the steps of:
generating an index using the extracted speech recognition results, including the text data representing the recognized speech, the information relating to a confidence in each speech recognition result, and the timestamp information indicating a location of each portion of a speech recognition result in the database management system; and
storing the extracted information.
20. The computer program product of claim 19, wherein the second format is a standardized format.
21. The computer program product of claim 20, wherein the first format is a proprietary format.
22. An application program interface for indexing data relating to results of speech recognition in a database management system comprising:
an indextype operable to support text queries on speech recognition results;
an interface operable to provide interaction with an index of the indextype; and
a format adapter interface operable to invoke a format adapter for converting speech recognition results having a first format to a second format.
23. The application program interface of claim 22, wherein the format adapter is operable to parse the speech recognition results in the first format, extract from the speech recognition results text data representing the recognized speech, information relating to a confidence in each speech recognition result, and timestamp information indicating a location of each portion of a speech recognition result, and generate speech recognition results in the second format using the extracted text data representing the recognized speech, information relating to a confidence in each speech recognition result, and timestamp information indicating a location of each portion of a speech recognition result.
24. The application program interface of claim 23, wherein the indextype comprises the text data representing the recognized speech, the information relating to a confidence in each speech recognition result, and the timestamp information indicating a location of each portion of a speech recognition result in the database management system.
25. The application program interface of claim 24, wherein the interface is operable to provide interaction comprising performing a query of the text data representing the recognized speech.
26. The application program interface of claim 25, wherein the query of the text data representing the recognized speech relates to the confidence information and/or the timestamp information.
27. The application program interface of claim 26, wherein results of the query indicate time offsets within each matched media asset where matches occurred and speech recognition confidence of each match occurrence within a matched media asset.
US10/361,571 2002-10-21 2003-02-11 SQL enhancements to support text queries on speech recognition results of audio data Abandoned USH2189H1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/361,571 USH2189H1 (en) 2002-10-21 2003-02-11 SQL enhancements to support text queries on speech recognition results of audio data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US41952002P 2002-10-21 2002-10-21
US10/361,571 USH2189H1 (en) 2002-10-21 2003-02-11 SQL enhancements to support text queries on speech recognition results of audio data

Publications (1)

Publication Number Publication Date
USH2189H1 true USH2189H1 (en) 2007-05-01

Family

ID=37991614

Family Applications (1)

Application Number Title Priority Date Filing Date
US10/361,571 Abandoned USH2189H1 (en) 2002-10-21 2003-02-11 SQL enhancements to support text queries on speech recognition results of audio data

Country Status (1)

Country Link
US (1) USH2189H1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6233544B1 (en) * 1996-06-14 2001-05-15 At&T Corp Method and apparatus for language translation
US6601073B1 (en) * 2000-03-22 2003-07-29 Navigation Technologies Corp. Deductive database architecture for geographic data

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060111902A1 (en) * 2004-11-22 2006-05-25 Bravobrava L.L.C. System and method for assisting language learning
US8272874B2 (en) * 2004-11-22 2012-09-25 Bravobrava L.L.C. System and method for assisting language learning
US7272558B1 (en) * 2006-12-01 2007-09-18 Coveo Solutions Inc. Speech recognition training method for audio and video file indexing on a search engine
US10956433B2 (en) * 2013-07-15 2021-03-23 Microsoft Technology Licensing, Llc Performing an operation relative to tabular data based upon voice input
US20190164551A1 (en) * 2017-11-28 2019-05-30 Toyota Jidosha Kabushiki Kaisha Response sentence generation apparatus, method and program, and voice interaction system
US10861458B2 (en) * 2017-11-28 2020-12-08 Toyota Jidosha Kabushiki Kaisha Response sentence generation apparatus, method and program, and voice interaction system

Legal Events

Date Code Title Description
AS Assignment

Owner name: ORACLE INTERNATIONAL CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:RAO, VISHAL;CHOPRA, RAJIV;REEL/FRAME:013768/0717

Effective date: 20030205

STCF Information on status: patent grant

Free format text: PATENTED CASE