MXPA97002705A - Method and apparatus for creating a searchable digital video library and a system and method for using such a library - Google Patents

Method and apparatus for creating a searchable digital video library and a system and method for using such a library

Info

Publication number
MXPA97002705A
MXPA/A/1997/002705A MX9702705A MXPA97002705A
Authority
MX
Mexico
Prior art keywords
video
audio data
data
rules
transcribed
Prior art date
Application number
MXPA/A/1997/002705A
Other languages
Spanish (es)
Other versions
MX9702705A (en)
Inventor
Michael L Mauldin
Original Assignee
Carnegie Mellon University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US08/324,076 (US5835667A)
Application filed by Carnegie Mellon University
Publication of MX9702705A
Publication of MXPA97002705A

Abstract

The present invention relates to a method for creating a digital library that is independent of the existing audio data and video images, comprising the steps of: transcribing the audio data and marking the transcribed audio data with a first set of time stamps; indexing the transcribed audio data; digitizing the video data and marking the digitized video data with a second set of time stamps related to the first set of time stamps; segmenting the digitized video data into video paragraphs according to a set of rules based on scene characterization of the video images and on processing of the audio data; and storing the indexed audio data and the segmented, digitized video data with their respective sets of time stamps to form the digital library, which can be accessed through the indexed audio data without returning to the existing audio and video data and images.

Description

METHOD AND APPARATUS FOR CREATING A SEARCHABLE DIGITAL VIDEO LIBRARY AND A SYSTEM AND METHOD FOR USING SUCH A LIBRARY

DESCRIPTION OF THE INVENTION

The present invention is directed generally to a digital video library system and, more particularly, to a system that integrates speech recognition, image recognition and language understanding to create, index and search digital video libraries.

Vast digital libraries will soon become available on the nation's Information Superhighway as a result of emerging multimedia technologies. These libraries will have a profound impact on the conduct of business, professional and personal activities. However, because of the sheer volume of information available, it is not enough simply to store information and replay that information at a later date. That, in essence, is the concept of commercial video-on-demand services, and it is relatively simple. New technology is needed to create, organize and search these vast data libraries and then to retrieve and reuse them effectively.

Currently, although much of television broadcasting is closed captioned, the vast majority of the nation's video and film archives are not. Because of this, any type of digital video library must employ some form of audio transcription. Many sources of error and variability arise naturally in the context of audio transcription. For example, broadcast video productions, whether documentary-style interviews or theatrical productions, must record the speech of multiple people speaking in different locations. This results in speech signals of varying quality with differing signal-to-noise properties. Compounding the problem are the effects of different speaker orientations and the particular reverberation characteristics of the room. Furthermore, when the tabletop microphones, lapel microphones and directional boom microphones traditionally used in broadcast video production are used as sources for audio transcription, the variability arising from differences in microphone characteristics and differences in signal-to-noise ratios can significantly degrade performance. Additionally, in a typical video interview, people often speak colloquially. This implies that many of the words are clipped or poorly articulated. The lexical descriptions of pronunciations used in conventional dictation systems, where careful articulation is the norm, will not work well for such spontaneous speech. Also, unlike the Wall Street Journal dictation models, in which the domain limits the size and nature of the vocabulary likely to be used in sentences, audio transcripts of broadcast video generally tend not to have such restrictions. Consequently, there are many problems and challenges presented by the audio portion of raw videotape footage which must be addressed by any digital library system.

There are also problems and challenges presented by the video portion of the raw videotape footage. For example, to store video effectively in digital format, in such a way that it is usable, the video must be segmented. Traditional segmentation methods involve counting frames before and after a reference time. This type of content-independent segmentation can result in segments that are either incomplete or contain two or more concepts or scenes. Therefore, any digital library system must be able to segment the video into useful, understandable segments based on content.
In addition to the problems associated with creating a digital video library, there are also problems with effectively accessing the library. The two standard measures of performance in information retrieval are recall and precision. Recall is the proportion of relevant documents that are actually retrieved, and precision is the proportion of retrieved documents that are actually relevant. These two measures may be traded off one for the other, and the goal of information retrieval is to maximize both.

Text search typically involves searching for keywords or, in some circumstances, using limited natural language inferences. Current retrieval technology works well on textual material from newspapers, electronic archives and other sources of grammatically correct and properly spelled written content. In addition, natural language queries allow direct description by the user of the desired subject matter. However, the video retrieval task, based on searching transcripts that contain a finite set of errors, challenges the state of the art. Even full understanding of a perfect transcription of the audio would be too complicated for current natural language technology.

When the mode of communication, such as multimedia, has intrinsic temporal rates associated with it, such as audio or video, searching becomes increasingly difficult. For example, it takes 1000 hours to review 1000 hours of video. Detailed indexing of the video can help that process. However, users often wish to browse the video in a manner similar to flipping through the pages of a book. Unfortunately, the mechanisms for doing so today are inadequate. Scanning by jumping a set number of frames may skip the target information completely. Conversely, accelerating the playback of motion video to twenty (20) times normal speed presents information at an incomprehensible rate. Even if users could comprehend such accelerated playback, it would still take six minutes to scan two hours of videotape. A two-second scene would be presented in only one-tenth of a second.

Similar to the problems with video search, there is an analogous problem with audio search, only more acute. Rapid audio playback during a scan is impractical. Beyond one and a half (1.5) to two (2) times normal speed, the audio becomes incomprehensible because faster playback rates shift frequencies into inaudible ranges. Although digital signal processing techniques are helpful in reducing frequency shifts, at high playback rates these techniques sound very much like the sound bytes of an analog videodisc scan.

As one can imagine, the problem is further complicated in a multimedia scenario. The integration of text, audio and video thus presents many obstacles which must be overcome. There are approximately one hundred and fifty (150) spoken words per minute in an average interview video. That translates to approximately nine thousand (9000) words for an hour of video, or roughly fifteen pages of text. A person browsing the text may be able to find relevant sections relatively quickly. However, if a specific topic contained in a videotaped lecture is sought, the search problem is acute. Even if a high playback rate of three (3) to four (4) times normal speed were comprehensible, continuous playback of audio and video is a totally unacceptable search mechanism.
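As a concrete illustration of the recall and precision measures discussed above, a minimal sketch follows; the document identifiers and numbers are hypothetical, chosen only for illustration:

```python
def recall_and_precision(retrieved, relevant):
    """Compute recall and precision for one query.

    retrieved: set of segment ids returned by the search engine
    relevant:  set of segment ids judged relevant for the query
    """
    hits = retrieved & relevant                    # relevant segments actually retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical example: 3 of the 4 relevant segments were retrieved (recall 0.75),
# and 3 of the 5 retrieved segments were relevant (precision 0.6).
print(recall_and_precision({"v1", "v2", "v3", "v7", "v9"}, {"v1", "v2", "v3", "v4"}))
```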
Assuming that the target information is located halfway through a one-hour video file, it would still take approximately seven (7) to ten (10) minutes to find. Compounding this, in emerging fields such as digital libraries and multimedia, it is not surprising that most applications today have failed to take full advantage of the information bandwidth, much less the capabilities of a digital video and audio multimedia environment. Designs today typically employ a VCR/Video-Phone view of multimedia. In this simplistic model, video and audio can be played, paused, their windows positioned on the screen and possibly manipulated in other ways, such as by displaying a graphic synchronized to a temporal point in the multimedia object. This is the paradigm of traditional analog interactive video, developed almost two decades ago. In contrast with interactive video, a much more appropriate term for this is "interrupted video".

Today's interrupted video paradigm views multimedia objects more as text with a temporal dimension. The differences between motion video and other media, such as text and even images, are attributed to the fact that time is a parameter of video and audio. However, in the hands of a user, every medium has a temporal nature. It takes time to read (process) a text document or even an image. In traditional media, each user absorbs the information at his or her own rate. One may even assimilate visual information holistically, that is, reach an understanding of complex information almost simultaneously.
However, to convey almost any meaning at all, audio and video must be played at a constant rate, the rate at which they were recorded. Although a user might accept video and audio played back at 1.5 times normal speed for a brief time, it is unlikely that users would accept long periods at such playback rates. In fact, studies show that there is surprisingly significant sensitivity to altered playback fidelity. Even if users did accept accelerated playback, the rate of information transfer would still be controlled mainly by the system.

Although the data types of video and audio are constant rate and continuous time, the information contained in them is not. In fact, the granularity of the information content is such that a half hour of video may easily have a hundred semantically separate pieces. The pieces may be linguistic or visual in nature. They may range from sentences to paragraphs and from images to scenes. Understanding the information contained in video is essential to successfully implementing a digital video library. Returning a full half hour of video when only one minute is relevant is much worse than returning a complete book when only one chapter is needed. With a book, electronic or paper, tables of contents, indices, skimming and reading rates allow users to quickly find the portions they need. Because the time to scan a video cannot be dramatically shorter than the real time of the video, a digital video library must give users just the material they need. Understanding the content of the video information makes it possible not only to find the relevant material but also to present that information in useful ways.

Tools have been created to facilitate audio browsing which present graphical representations of the audio waveform to the user to aid in identifying locations of interest. However, studies have shown that these techniques are useful only for audio segments under three minutes in length. When searching for a specific piece of information in hours of audio or video, other search mechanisms are required. For example, in previous research at Carnegie Mellon University, the assignee of the present invention, a multidimensional model of multimedia objects including text, images, digital audio and digital video was developed. With this model, developed during the Advanced Learning Technologies Project (the "ALT Project"), variable-granularity knowledge about the domain, content, image structure and appropriate use of the multimedia object is embedded with the object. Based on a history of current interactions (inputs and outputs), the system makes a judgment about what to display and how to display it. Techniques using such associated abstract representations have been proposed as mechanisms to facilitate searches of large digital video and audio spaces. The ALT Project is described in Stevens, Next Generation Network and Operating System Requirements for Continuous Time Media, Springer Verlag, 1992, which is incorporated herein by reference.

Also, simply searching for and viewing video clips from digital video libraries, while useful, is not enough. Once users identify the video objects of interest, they must be able to manipulate, organize and reuse the video. Demonstrations abound in which students create video documents by associating video clips with text.
Although such demonstrations are positive steps, the reuse of video must be more than simply editing a selection and attaching it to text. Although some excellent tools are commercially available for editing digital video, there are currently no tools available to intelligently assist in the creative design and use of video based on cinematic knowledge. One reason for the scarcity of such tools is the intrinsic, constant-rate, temporal aspect of video. Another is the complexity involved in understanding the interplay of scene, framing, camera angle and transition. Therefore, there is a need to incorporate into any digital video editor intelligence with respect to cinematic knowledge. This would make possible context-sensitive assistance in the reuse of video and its composition in new ways.

The present invention is directed to a method and apparatus for creating a searchable digital video library and to a system and method for using such a library which overcome the many obstacles found in the prior art. The method includes the steps of transcribing the audio data, marking the transcribed audio data with a first set of time stamps and indexing the transcribed audio data. The steps of digitizing the video data and marking the digitized video data with a second set of time stamps related to the first set of time stamps are performed before segmenting the digitized video data into paragraphs in accordance with a set of rules. The method further includes the step of storing the indexed audio data and the digitized video data with their respective sets of time stamps. The method may also include the step of passing the transcribed audio data through a natural language interpreter before indexing the transcribed audio data. The natural language interpreter updates the set of rules. The method may be practiced in such a way that the digital library is created automatically.

The invention is also directed to an apparatus for creating a digital library from audio data and video images. The apparatus includes means for transcribing the audio data and marking the transcribed audio data with a first set of time stamps, means for indexing the transcribed audio data, means for digitizing the video data and marking the digitized video data with a second set of time stamps related to the first set of time stamps, means for storing a set of rules, and means for segmenting the digitized video data into paragraphs in accordance with the stored set of rules. Additionally, means for storing the indexed audio data and the digitized video data with their respective sets of time stamps are provided. The apparatus additionally includes a natural language interpreter for processing the transcribed audio data before the audio data is indexed and for updating the set of rules.

The present invention is also directed to a method and apparatus which use natural language techniques to formulate queries used to retrieve information from the digital library. The query method may be implemented stand-alone or in a networked environment. It is an object of the present invention to provide a system that includes a large, on-line, digital video library which allows full-content, knowledge-based search and retrieval via desktop computers and data communication networks. It is another object of the present invention to develop a method for creating and organizing the digital video library.
It is yet another object of the invention to develop techniques for effectively searching and retrieving portions of the digital video library in view of the unique demands presented by multimedia systems. It is a feature of the present invention that speech, natural language and image understanding technologies are integrated for the creation and exploration of the digital library. It is another feature of the present invention that a high-quality speech recognition function is provided. Yet another feature of the present invention is that a natural language understanding system is provided for a full-text search and retrieval system. It is still another feature of the invention that image understanding functions are provided for segmenting video sequences. Finally, it is another feature that the system is adaptable to various network architectures.

The advantages of the present invention are numerous. The digital video library system provides full-content search of, and retrieval from, an on-line database. Speech recognition functions provide a user-friendly human interface. Image understanding functions provide meaningful video segmentation based on context and not just time. Multimodal search techniques provide a comprehensive and accurate search. The various network architectures support multiple users and increase the efficiency of the search. Finally, the ability to access unedited video allows further exploitation of information. These and other advantages and benefits will become apparent from the Detailed Description of the Preferred Embodiment which follows.

BRIEF DESCRIPTION OF THE DRAWINGS

The various objects, advantages and novel features of the present invention will be described, by way of example only, in the following detailed description when taken in conjunction with the accompanying drawings, in which: FIGURE 1 is a block diagram illustrating an overview of the method for creating a searchable digital video library and of a system for using or searching it in accordance with the teachings of the present invention; FIGURE 2 is a flow diagram illustrating the processing flow used for the creation of the digital video database; FIGURE 3A is a flow diagram illustrating an implementation of the audio transcription function illustrated in FIGURE 2; FIGURE 3B is a flow diagram illustrating an implementation of the natural language interpretation function illustrated in FIGURE 2; FIGURE 4 is a schematic diagram illustrating an implementation of the data and network structure of the present invention; FIGURE 5 is a schematic diagram illustrating an implementation of an on-line digital video library communication structure; and FIGURE 6 is an example of the integration of the various techniques involved in video segmentation. In an appendix hereto, FIGURE A-1 is an example of a computer screen showing icons presented in response to a search request; and FIGURE A-2 is an example of the formation of video paragraphs as defined in the present invention.

With reference to FIGURE 1, an overview of a digital video library system, generally referred to by the number 10, constructed in accordance with the present invention is shown. Like reference numerals will be used among the various figures to denote like elements. In FIGURE 1, the digital video library system 10 is shown to have two portions 12, 14. The off-line portion 12 involves the creation of a digital library 36. The on-line portion 14 includes the functions used in searching the digital library 36.
As used herein, the term "digital video library system" refers to the complete system, while the term "digital library" refers to the database 36 created by the off-line portion 12.

The off-line portion 12 receives raw video material 16 comprising audio data 18 and video data 20. The raw video material 16 may include audio-video from any one or more of a variety of sources. It is preferable that the raw video material 16 incorporate not only television footage 22, but also the raw source materials, shown generally as extra footage 24, from which the television footage 22 was derived. Such extra footage 24 enriches the digital library 36 significantly, so that the raw video material 16 can serve as reference resources and for uses other than those originally intended. The extra footage 24 also enlarges the amount of raw video material 16 significantly. For example, typical source footage runs fifty (50) to one hundred (100) times longer than the corresponding broadcast television footage 22. As another example, an interview with Arthur C. Clarke for the "Space Age" series, described in detail in the Operational Summary below, resulted in two minutes of airtime even though more than four hours of videotape were created during the interview. Finally, new video footage 26 not created for television broadcast may also be included. The raw material may also include pure text, audio only, or video only.

The audio data 18 is subjected to the speech and language interpretation function 28 and the speech and language indexing function 30, each of which will be described in greater detail herein. The video data 20 is subjected to the video segmentation function 32 and the video compression function 34, which will also be described in greater detail herein. The resulting digital library 36 includes indexed, text transcripts of the audio data 38 and segmented, compressed audio/video data 40. The digital library may also include indexed text and segmented, compressed video data. The digital library 36 is the output of the off-line portion 12 of the digital video library system 10. It is the digital library 36 which is used by the on-line portion 14 and which, in a commercial environment, is accessed or otherwise made available to users.

Turning now to the on-line portion 14 of the digital video library system 10, the digital library 36 is made available to a user's workstation 42. The workstation 42 preferably recognizes both spoken commands and textual natural language queries, either of which will invoke the natural language search function 129. By means of an interactive video segmentation function 46, video segments 48 are retrieved from the digital library 36. The video segments 48 can be viewed at the workstation 42 and selectively stored for future use.

The reader will understand that the off-line portion 12 of the system 10 may be implemented in software and run on a 150 MIPS DEC Alpha workstation or another similar machine to automatically generate the digital library 36. Once the digital library 36 is created in accordance with the teachings of the present invention, it may be stored on any conventional storage medium. The on-line portion 14 of the system 10 may be implemented in software and run on various machines having access to the digital library 36 through various network configurations, as described below.
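As a purely illustrative aside, the off-line data flow just described, time-stamped transcription feeding an index that is stored alongside video paragraph boundaries, can be summarized in the following minimal sketch. All structure and function names here are hypothetical and are not part of the patented system:

```python
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class LibraryEntry:
    """One item of a hypothetical digital library: indexed transcript plus segmented video."""
    transcript: list                                        # (word, start_seconds) pairs, the first set of time stamps
    index: dict = field(default_factory=dict)               # term -> list of time stamps (inverted index)
    video_paragraphs: list = field(default_factory=list)    # (start, end) boundaries, the related second set of time stamps

def build_entry(transcript, paragraph_boundaries):
    """Index the time-stamped transcript and attach the related video paragraph boundaries."""
    index = defaultdict(list)
    for word, t in transcript:
        index[word.lower()].append(t)     # each occurrence keeps its time stamp for later retrieval
    return LibraryEntry(transcript, dict(index), paragraph_boundaries)

# Hypothetical two-word transcript aligned with one video paragraph.
entry = build_entry([("space", 12.4), ("telescope", 13.1)], [(10.0, 42.5)])
print(entry.index["space"])               # -> [12.4]
```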
Alternatively, the "in-line" portion can be implemented in a single placement mode, although the networked environment would allow much greater access to the digital library 36. Creation of the Digital Library Content is transported in both narrative (voice and language) ) and image. Only through the collaborative interaction of the technology of understanding the image, voice and natural language can the present invention, the population, segment, index and search automatically of various video collections with satisfactory memory and precision. The approach compensates only for interpretation and search problems in complete and ambiguous error data environments. The understanding of the image plays a critical role in the organization, search and reuse of digital video. The digital video library system 10 must automatically annotate the digital video by understanding the voice and language, as well as by the use of other textual data that has been associated with the video. Spoken words or sentences should be attached to their associated boxes. The search in the traditional database by keywords, where the images are only a reference, but not directly searched, is not appropriate or useful for the digital library system 10. On the contrary, the digital video by itself must be segmented, researched, manipulated and presented by similarity coupling, parallel presentation and context size, while retaining the image content. The integration of speech recognition technologies, natural language processing and image understanding allows a digital library 36 to be created, which supports the intelligent search of large collections of digital video and audio. Audio Transcription and Time Marking of Function 27 With reference to FIGURE 2, it is noted that the speech and language interpretation function 28 of FIGURE 1 is implemented by an audio transcription and time marking function 27 and the natural language interpretation function 29. The audio transcription portion of the audio transcript and the time stamp function 27 operates in a digitized version of the audio data 18 using known techniques in automated speech recognition to transcribe narrations and dialogs automatically. For example, the Sphinx-II voice recognition system can preferably be used. The Sphinx-II system is a continuous voice recognizer, independent of the speaker, of large vocabulary developed at Carnegie Mellon University. The Sphinx-II system currently uses a vocabulary of approximately 20,000 words to recognize connected spoken pronunciations from many different speakers. The Sphinx-II speech recognition system is described in greater detail in Huang, The SPHINX-II Speech Recognition System, An Overview, Computer and Speech Language, (1993) which is incorporated herein by reference. However, as will be appreciated by those skilled in the art, other methods of transcription can be employed, including human transcription or, in the case of programs with closed titles, using only the titles of the programs as they are. The transcription generated by the audio transcription portion of the function 27 does not need to be observed by the users and can be hidden from them. Improvements in the speed of error can be anticipated since many of the video images useful for educational applications will typically be of high audio quality and will be narrated by trained professionals which facilitates minor error transcripts. However, because the anticipated size of the video libraries, a larger vocabulary is anticipated. 
By itself, the vocabulary of the video library may tend to degrade recognition rates and increase errors. In response, several innovative techniques have been developed and are exploited to reduce errors in the audio transcription function. The use of program-specific information, such as topic-based vocabularies and preferred word lists, is employed by the audio transcription portion of function 27. Word hypotheses are improved using known adaptive "long-distance" language models. In addition, multi-pass recognition processing is performed in such a way that multiple-sentence contexts may be considered.

Additionally, the transcript will be time-stamped by function 27 using any known technique for applying a time stamp. The audio time stamps will be aligned with the time stamps associated with the processed video for subsequent retrieval, as discussed below. It is expected that the digital video library system 10 will tolerate error rates greater than those that would be required to produce a human-readable transcript. Also, on-line texts and closed captions, where available, may preferably be used to provide base vocabularies for recognition and searchable texts.

In a preferred embodiment, the audio transcription portion of function 27 generally processes an utterance in four known steps, as illustrated in FIGURE 3A. The first step, represented by block 52, is a forward time-synchronous pass using between-word semi-continuous acoustic models with phone-dependent codebooks and a bigram language model. The forward time-synchronous pass function 52 produces a set of possible word occurrences, with each word occurrence having one start time and multiple possible end times. A backward time-synchronous pass function 54 using the same system configuration is then performed. The result of the backward time-synchronous pass function 54 is multiple possible start times for each end time predicted in the forward time-synchronous pass 52. In step 56, an approximate A* algorithm is used to generate the set of N-best hypotheses from the results of the forward time-synchronous pass 52 and the backward time-synchronous pass 54. Any of a number of language models can be applied in step 56; it is preferred that the default be a trigram language model. This approximate A* algorithm is not guaranteed to produce the best-scoring hypothesis first. Finally, in step 58, the best-scoring hypothesis is selected from among the N-best list produced. The best-scoring hypothesis is output from step 58 as the output of the audio transcription function 27. The time-stamped transcripts thus generated are passed on to the natural language interpretation function 29 described below.

The audio transcription portion of function 27 can address many naturally occurring sources of error and variability. For example, with respect to the problem posed by multiple signal-to-noise ratios, the audio transcription function uses signal adaptation techniques, including preprocessing and early detection of the signals, which automatically correct for such variability. With respect to the problem caused by multiple unknown microphones, the audio transcription function can use dynamic microphone adaptation techniques to reduce the error without having to retrain for the new microphones.
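As an illustrative aside on the four-step recognition flow described above, the final selection from an N-best list (step 58), here combined with a toy stand-in for the language model of step 56, might be sketched as follows. The scores, vocabulary and weighting are hypothetical; this is not the Sphinx-II implementation:

```python
# Hypothetical N-best list: (acoustic_score, word_sequence) pairs, as the approximate
# A* step described above might produce. Scores are illustrative log-probabilities.
n_best = [
    (-412.7, ["the", "space", "telescope", "was", "launched"]),
    (-415.2, ["a", "space", "telescope", "was", "launched"]),
    (-431.9, ["the", "space", "tell", "us", "cope", "was", "launched"]),
]

# Toy unigram "language model" standing in for the trigram model mentioned above.
word_logprob = {"the": -1.0, "a": -1.5, "space": -3.0, "telescope": -4.0,
                "was": -1.2, "launched": -4.5}

def rescore(hyp, lm_weight=5.0):
    """Combine the acoustic score with a (toy) language model score for one hypothesis."""
    acoustic, words = hyp
    lm = sum(word_logprob.get(w, -8.0) for w in words)   # unknown words are penalized
    return acoustic + lm_weight * lm

# Step 58 in spirit: select the best-scoring hypothesis from the N-best list.
best = max(n_best, key=rescore)[1]
print(" ".join(best))   # -> "the space telescope was launched"
```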
With respect to the problems associated with colloquially spoken words, currently the only known technique is manual adaptation of the vocabulary by skilled linguists. The audio transcription portion of function 27 can employ known expert-system techniques to formulate a task domain based on the knowledge of such linguists, so that automatic pronunciation learning can be carried out. With respect to the problems associated with expanded vocabularies, research on long-distance language models indicates that a twenty (20) to thirty (30) percent improvement in accuracy can be realized by dynamic adaptation of the vocabulary based on words recently observed in prior utterances. In addition, most broadcast video programs have significant descriptive text available. These include preliminary descriptions of the program design called treatments, working scripts, abstracts describing the program, and titles. In combination, these resources provide valuable additions to the dictionaries used by the audio transcription function. Because the creation portion 12 of the digital video library system 10 typically operates off line, processing time can be traded for greater accuracy, thereby allowing the use of larger, continuously expanding dictionaries and more computationally intensive language models. It is estimated that the error rates achievable by the techniques herein, even with the increased vocabulary requirements, will approach twelve (12) to fifteen (15) percent and, with advances in computing technology, search technology and speech processing techniques, five (5) to six (6) percent.

Natural Language Interpretation 29

Natural language processing is used in two parts of the digital video library system 10: in the off-line portion 12 to create a final transcript, which is used in the creation of the indexed text transcript of the audio 38, and in the on-line portion 14 for the formulation of natural language search queries 129. Although retrieval research typically focuses on newspapers, electronic archives and other sources of "clean" documents, natural language queries, in contrast to complex query languages, allow direct description of the desired material. The natural language interpretation function 29 performs several known subfunctions. The first is called "summary creation" 150 in FIGURE 3B, in which, by analyzing the words in the audio track for each visual paragraph (the concept of a "visual paragraph" is described in the section entitled Content-Based Image Understanding, below), the subject area and theme of the narrative for that video paragraph are determined. Summary formation can be used to generate titles or summaries of each video paragraph or segment for use in creating icons, tables of contents, or indexing. The second subfunction is defined as "identification" 152, in which, using data extraction techniques known in the art, the names of persons, places, companies, organizations and other entities mentioned in the soundtrack can be determined. This will allow the user to find all references to a particular entity with a single query. The third subfunction is transcript correction 154.
Using semantic and syntactic constraints, combined with a phonetic knowledge base, which may, for example, be the Sphinx-II dictionary or an analogous dictionary of another audio transcription function, recognition of certain errors and correction of those errors is achieved. In this way, the transcript correction function 154 is capable of automatically generating final transcripts of the audio with word recognition errors corrected.

The natural language interpretation functions 29, 129 are based on known techniques and may, for example, apply statistical techniques or expert systems. For example, the natural language interpretation function 29 is embodied in the Scout system developed at Carnegie Mellon University. Other natural language interpreters or processors are known in the art and may be used for this purpose. The Scout system is a full-text information storage and retrieval system that also serves as a testbed for data extraction and information retrieval technology. The natural language interpretation function 29 may also be applied to the transcripts generated by the audio transcription and time stamping function 27 to identify keywords. Because processing at this point occurs off line, the natural language interpretation function 29 has the advantage of more processing time, which promotes understanding and allows correction of transcription errors.

The natural language interpretation function 29 addresses several deficiencies in the art. First, the natural language interpretation function 29 augments pattern matching and parsing to retrieve and correct errors in the spoken string. Using the phonetic similarity measures produced by the audio transcription portion of function 27, a ranked measure of string similarity is used to retrieve and rank partial matches. A baseline measurement system has been designed to address the inadequacy of current retrieval algorithms; this is the first documentation of retrieval algorithm performance on transcribed video. A test collection of queries and relevant video segments from the digital library 36 was created. Using manual methods, the relevant set of video segments 48 was established from the digital library 36. The test collection was then used to evaluate the retrieval performance of existing retrieval algorithms in terms of recall and precision.

The results of the baseline performance tests can be used to improve the natural language interpretation function 29 by elaborating the current sets of patterns, rules, grammars and vocabularies to cover the additional complexity of spoken language, using large, data-driven grammars. To provide efficient implementation and a high development rate, regular-expression approximations to the context-free grammars typically used for natural language are employed. By extending this technique to an automatically recognized audio track, acceptable levels of recall and precision in video scene retrieval are realized. The results of the baseline performance tests can also be used to improve the audio transcription portion of function 27, such that the basic pattern matching and parsing algorithms are more robust and work in spite of lower-level recognition errors, using a minimum divergence criterion for choosing between ambiguous interpretations of the spoken words.
For example, the CMU SCOUT text retrieval system uses a partial match algorithm to recognize misspelled words in texts. The existing algorithm has been extended to match in the phonetic as well as the textual space. For example, in one of the training videotapes, in an interview with Arthur Clarke, Clarke uses the phrase "self-fulfilling prophecies". In early prototypes of the digital video library system 10, due to the limited vocabulary of the audio transcription portion of function 27, the audio transcription portion of function 27 produced the term "self-fulfilling profit seize".
To maintain high recall performance, video segments must be retrieved in spite of such mistranscriptions. A natural language query is converted into phonetic space as follows:

Query: P R AA1 F AH0 S IY0 Z ("prophecies")
Data: P R AA1 F AH0 T S IY1 Z ("profit seize")

which differ only by one insertion (T) and one change in stress (IY0 to IY1). Such a technique allows the retrieval of "self-fulfilling prophecies" and its phonetic equivalent, "self-fulfilling profit seize". Boolean and vector-space models of information retrieval have been applied to the digital video library system 10. A test collection to measure recall and precision and to establish a baseline performance level is also provided for evaluating the digital video library system 10. Users are given options to order the returned set of "hits" and to limit the size of the hit set as well.
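As a purely illustrative sketch of this kind of phonetic partial matching, the following computes a simple edit distance over phone strings; it is a generic technique shown for clarity, not the actual SCOUT algorithm:

```python
def phone_edit_distance(a, b):
    """Minimal edit distance (insertions, deletions, substitutions) between two phone sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1]

query = "P R AA1 F AH0 S IY0 Z".split()      # "prophecies"
data  = "P R AA1 F AH0 T S IY1 Z".split()    # "profit seize", as mistranscribed
print(phone_edit_distance(query, data))       # -> 2: one insertion plus one stress change
```

A small distance such as this would let a retrieval system rank the mistranscribed segment as a near match for the query.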
As illustrated in FIGURE 2, the use of the natural language interpretation function 29 extends to the paragraph formation function 33 for the video data 20. A set of rules 37 is created and updated by the natural language interpretation function 29. These rules 37 are applied by the paragraph formation function 33, which will be described in greater detail below. Also, automatic summarization of the retrieved material, to build a module that assembles video segments into a single user-oriented video sequence, is provided by the natural language interpreter 29.

Speech and Language Indexing 30

Continuing with reference to FIGURES 1 and 2, the speech and language indexing function 30 is applied to the final transcript produced by the natural language interpretation function 29. The indexing function 30 uses techniques generally known in the art. For example, an inverted index is created containing each term and a list of all the locations where that term is used. Pointers, that is, the time stamps, to each occurrence of the term are provided for retrieval. The speech and language indexing function 30 is also useful in providing a video skimming capability. The ability to skim video is the subject of a United States patent application entitled "System and Method for Skimming Digital Audio/Video Data", filed concurrently herewith in the name of Mauldin et al. ("Mauldin et al.") and incorporated herein by reference. Both the present application and the Mauldin et al. application are owned by the same entity. The end result of the processing flow for the audio data 18 is the indexed text transcript 38, which is stored in the digital library 36 for future use.

Content-Based Image Understanding

With reference to FIGURES 1 and 2, the video data 20 is processed in parallel and, in certain circumstances noted herein, in interaction with the processing of the audio data 18 described above. The first stage is generally referred to herein as content-based video segmentation, shown as the dotted-line box 32 in FIGURE 2, which is made up of up to three functions. The first function, performed in step 31, is the digitization of the video data 20. The digitization function 31 is performed by means of techniques known to those skilled in the art. The second function is the paragraph formation function 33. The use of the paragraph formation function 33 avoids the time-consuming, conventional procedure of reviewing a video file frame by frame around an index entry point.

To identify paragraph boundaries, the paragraph formation function 33 locates the beginning and end points of each shot, scene, conversation, or the like by applying machine vision methods that interpret image sequences. The paragraph formation function 33 is able to track objects, even across camera motions, to determine the boundaries of a video paragraph. The resulting paragraphing or segmentation process is faster, more precise and more easily controlled than any previous manual method. Each paragraph can be reasonably abstracted by a "representative frame", as is known, and is thus treated as a unit for context sizing or for a content-based image search.

At least a portion of this task is performed by content-independent methods that detect large "image changes", for example, "key frame" detection by changes in the Discrete Cosine Transform ("DCT") (compression) coefficients. It is preferred, however, to use content-based video paragraph formation methods, because the end user is interested in the content or objects to be retrieved, not simply image retrieval. The video object consists of the image content together with the textual content and the audio text transcripts, the combination of which specifies the object. Textual information is useful for quickly filtering video segments to locate potential segments of interest. A subsequent visual query referring to the image content is then preferred. For example, queries such as "Find video with similar scenery", "Find the same scene with different camera motion" and "Find video with the same person" are important considerations from the user's perspective. Some of these queries can be handled by content-independent methods, such as histogram comparisons. Current efforts in image databases are, in fact, based mostly on such indirect statistical image methods. They fail to exploit the language information associated with the images or to deal with three-dimensional events.

Multiple methods are used, either separately or in combination, by the paragraph formation function 33. The first method is the use of comprehensive image statistics for segmentation and indexing. This initial segmentation can be performed by monitoring coding coefficients, such as DCT coefficients, and detecting rapid changes in them. This analysis also allows identification of the key frame or frames of each video paragraph; the key frame is usually at the beginning of the visual sentence and is relatively static. Once the video paragraph is identified, image features such as color and shape are extracted and defined as attributes. A comprehensive set of image statistics, such as color histograms and Kalman filtering (edge detection), is created. Although these are "indirect statistics" of image content, they have proven useful for quickly comparing and categorizing images and will be used at retrieval time.

The concurrent use of image, speech and natural language information is preferred. In addition to image properties, cues such as speaker changes, audio synchronization and/or background music, and changes in the content of the spoken words can be used for reliable segmentation. FIGURE 6 illustrates how previously identified information can be used to increase segmentation reliability.
As seen in FIGURE 6, the coincidence of the histogram change, the scene change information and the audio information combines to increase reliability in determining video paragraph boundaries. FIGURE A-2 is an example in which keywords are used to locate items of interest and image statistics (motion) are then used to select representative frames of the video paragraph. In this example, the words "toy" and "kinex" have been used as the keywords. The initial and closing frames have similar color and texture properties. The structural and temporal relationships between video segments can also be extracted and indexed.

The next method integrated for determining video paragraph boundaries is two-dimensional camera and object motion. With this method, visual segmentation is based on interpreting and following smooth camera motions such as zooming, panning and forward camera motion. Examples include the inspection of a large panoramic scene, the focusing of a viewer's attention on a small area within a large scene, or a moving camera mounted on a vehicle such as a boat or an airplane.

A more important class of video segment is defined not by the motion of the camera but by the motion or action of the objects being viewed. For example, in an interview segment, once the interviewer or interviewee has been located by speech recognition, the user may wish to see the entire segment containing the interview with that same person. This can be done by searching forward or backward in the video sequence to locate the frame in which this person appears in or disappears from the scene. It is also preferred to incorporate techniques for tracking objects with a high number of degrees of freedom, such as the human hand (27 degrees of freedom), based on "deformable templates" and the Extended Kalman Filtering method. Such a technique provides the video database with a tool for tracking and classifying the motions of highly articulated objects.

Segmentation of video by the appearance of a particular object or a combination of objects, known to those skilled in the art as "object presence", is also a powerful tool, and it is preferred to include methods for doing so. Although this is difficult for a general three-dimensional object in arbitrary location and orientation, the KL transform technique has proven able to detect a particular class of objects. Among object presence, human content is the most important and most common case of object presence detection.

Finally, the techniques discussed so far are applicable to two-dimensional scenes, but video mostly represents three-dimensional shape and motion. Adding a three-dimensional understanding capability to the paragraph formation function 33 greatly expands the capabilities of the video segmentation function 32. The "factorization" approach, pioneered at Carnegie Mellon University, is used, in which, for each frame in an image sequence, an "interest point" operator finds numerous corner points and other points in the image that lend themselves to unambiguous frame-to-frame matching. All the coordinates of these interest points, in all frames of the video sequence, are placed into a large data array.
Based on a linear algebraic theorem, it has been proven that this array, whose rank is always equal to or less than three, can be decomposed into shape and motion information, that is, Observations = Shape x Motion.

Other rules 37 generated by the natural language interpretation function 29 can be useful for content-based paragraph formation. For example, the keywords "football" and "scoreboard" can be used to identify scenes in a football game segmented by the display of the scoreboard. It should be understood by those skilled in the art that any of these methods may be employed in the paragraph formation function 33, either separately or in combination with other methods, to meet the requirements of particular applications. In addition, the present invention also provides a time-based segmentation capability.

After the paragraph formation function 33 is completed, icons are generated by the icon generation function 35. The icons are a combination of text and video, either still or moving, created for subsequent presentation to a user conducting a search. The visual icons are preferably representative of a video paragraph or of multiple contiguous video paragraphs relating to the same subject matter. Examples of icons retrieved in a search are shown in FIGURE A-1.

Both iconic (still) and miconic (moving) representations of video information can easily be misinterpreted by a user. For example, a search for video sequences related to the transportation of goods during the early 1800s may return twenty (20) relevant items. If the first twenty (20) seconds of several of the sequences are "talking head" introductions, the icons and micons provide no significant visual cue about the content of the video; the information after the introduction may or may not be of interest to the user. However, intelligent moving icons, imicons, overcome some of these limitations. Image segmentation technology creates short sequences that map more closely to the visual information contained in a video stream. Several frames from each new scene are used to create the imicon. This technique allows the inclusion of all the important image information in the video and the elimination of redundant data. See Mauldin et al. For a video containing only one scene with little motion, a micon may be the appropriate representation. If the video data contains a single scene with considerable motion content, or multiple scenes, the imicon is preferred for displaying the visual content.

To create the imicon, the optimal number of frames needed to represent a scene, the optimal frame rate and the requisite number of scenes necessary to represent the video are determined. The heuristics for imicon creation are data dependent and take into account such factors as the number of unique scenes needed to represent a video fragment; the effect of camera motion and object motion on the selection of images to represent each scene; and the best display rate for the images. Because the human visual system is adept at quickly finding a desired piece of information, the simultaneous presentation of intelligently created moving icons lets the user act as a filter, choosing the material of greatest interest.
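As a purely illustrative aside on the image-statistics approach to paragraph boundary detection discussed above, a simple color-histogram difference detector might be sketched as follows. The frame data and threshold are hypothetical, and this generic technique stands in for, rather than reproduces, the patented method:

```python
import numpy as np

def histogram(frame, bins=16):
    """Coarse gray-level histogram of one frame (frame: 2-D array of 0-255 values)."""
    h, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return h / max(h.sum(), 1)

def paragraph_boundaries(frames, threshold=0.4):
    """Flag a candidate boundary wherever consecutive frame histograms differ sharply.
    In the system described above, such image evidence would be combined with audio
    and keyword cues (FIGURE 6) before declaring a video paragraph boundary."""
    boundaries = []
    prev = histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = histogram(frame)
        if np.abs(cur - prev).sum() > threshold:   # large change in image statistics
            boundaries.append(i)
        prev = cur
    return boundaries

# Hypothetical clip: 10 dark frames followed by 10 bright frames -> one boundary at frame 10.
clip = [np.full((48, 64), 30, dtype=np.uint8)] * 10 + [np.full((48, 64), 220, dtype=np.uint8)] * 10
print(paragraph_boundaries(clip))   # -> [10]
```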
It is preferred that processing continue with the video compression function 34, although the video compression function 34 may occur at various points within FIGURE 2. The video compression function 34 can use any commercially available video compression format, for example the Intel DVI(R) compression format, thereby requiring only 10 Mbytes per minute of source video to achieve VHS-quality playback, i.e., 256 x 260 pixels. Other compression techniques may also be employed, for example MPEG or MPEG-II. Using such compression techniques, it is anticipated that just one terabyte of storage will hold more than 1000 hours of segmented, compressed video 40.

Interactive User Stations 42 of the Digital Library

The interactive user stations 42, see FIGURE 1, are preferably instrumented to keep a complete history of each session. This includes all the original digitized speech of the session, the associated text as recognized by the audio transcription portion of function 27, the queries generated by the natural language processing function 129, the video objects returned, the compositions created by users, and a log of all user interactions. In essence, a station 42 will be able to replay a full session, allowing both comprehensive statistical studies and detailed individual protocol analyses.

An initial query may be textual, entered either through the keyboard or the mouse, or spoken words entered through a microphone at the workstation 42 and recognized by the on-line portion 14 of the system 10. Subsequent refinements of the query, or new, related queries, may relate to visual attributes such as "find me scenes with similar visual backgrounds". The natural language processing function 129, exemplified by the Scout program, is used to process a query in much the same way that the natural language processing function 29 is used to process the transcribed audio. The interactive user stations 42 include the option of adjusting the duration and information content of the retrieved segments to adjust the rate of information delivery, as well as adjusting the playback rate of the media. When a search yields many hits, the system 10 will simultaneously display icons and imicons (intelligently chosen full-motion sequences) along with the text summary, which is defined as a parallel presentation. Functionality is provided to allow the user to extract subsequences from the delivered segments and reuse them for other purposes in various forms and applications. Each will be described in more detail below.

The interactive user station 42 allows the user to adjust the "size" (duration) of the retrieved video/audio segments for playback. Here, size may be a length of time, but more likely will be summarized fragments where the complexity or type of information determines the measure. The appropriate metaphors to be used when the user adjusts the summarized size are chosen based on empirical studies. For example, it is well known that higher-production-value video has more shot changes per minute than, for example, a videotaped lecture. Although visually richer, it may be linguistically less dense. The unique balance of linguistic and visual information density appropriate for different types of video information is selected. The interactive user station 42 also allows the user to interactively control the playback rate of a given retrieved segment, at the expense of informational and perceptual quality. The formation of video paragraphs aids this purpose.
Knowing where scenes begin and end, high-speed scans of digital video segments 48 can be performed by presenting rapid scene representations. This method is an improvement over skipping a set number of frames, since scene changes often reflect changes in the organization of the video, much like sections in a book. Empirical studies can be used to determine the scene presentation rate that best supports user searches, and the differences, if any, between image selection for optimal skimming and image selection for imicon creation.

Once users identify the video objects of interest, they need to be able to manipulate, organize and reuse the video. Even the simple task of editing is far from trivial. To reuse video effectively, the user needs to combine text, images, video and audio in new and creative ways. Tools can be developed for the user's workstation 42 to provide expert assistance embodying cinematic knowledge, combining the output of the content-based video segmentation function 32 with the language interpretation function 28 to create a semantic understanding of the video. For example, contrasting a high-quality, visually rich edited presentation with a selection from a college lecture on the same material may be inappropriate. However, creating a composition in which the lecture material is available to those interested, but not automatically presented, can create a richer learning environment. With a deeper understanding of the video materials, it is possible to assist reuse more intelligently.

Data and Network Architecture

Central to providing continuous media from remote storage systems is the ability to sustain sufficient data rates from the file system and across the network to provide pleasing audio and video fidelity, in terms of frame rate, size and resolution, in playback for the receiving user. The ability to continuously transmit thirty (30) frames per second of full-color, full-screen, television-quality images, even to a single user, is limited by network bandwidth and allocation. At current compression ratios producing 10 Mbytes/minute of video, a minimum dedicated link of 1.3 Mbit/s would be required to deliver continuous video. Such rates cannot commonly be achieved across the Internet. The ability to deliver the same video material simultaneously to many users is further limited by disk transfer rates.

With reference to FIGURE 4, a preferred network architecture is shown, generally referred to by the number 80. There is a digital video/audio archive 82 with a hierarchically cached file system, with all the digitized data at the top "media-server" node 84 and caches of the most recently accessed media at the "site-server" nodes 88, 90, 92. It is preferred that the top media-server node 84 have a capacity of one (1) terabyte and that each of the site-server nodes 88, 90 and 92 have a capacity of forty (40) to fifty (50) gigabytes. The top media-server node 84 is preferably implemented as a multi-threaded user-level process on a UNIX system, with a fixed-priority policy scheduler, which communicates the continuous media data over standard network connections. The site-server nodes 88, 90, 92 are placed on a local area network with the local end user's interactive workstation 42.
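A quick, illustrative check of the bandwidth figure cited above, 10 Mbytes of compressed video per minute implying roughly a 1.3 Mbit/s dedicated link:

```python
# 10 Mbytes of compressed video per minute, expressed as a sustained bit rate.
mbytes_per_minute = 10
bits_per_minute = mbytes_per_minute * 8 * 1_000_000   # using decimal megabytes
mbits_per_second = bits_per_minute / 60 / 1_000_000
print(f"{mbits_per_second:.2f} Mbit/s")                # -> about 1.33 Mbit/s
```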
The searchable portions of the digital library 36, i.e. the transcripts and auxiliary indices, reside on the top media-server node 84 and are replicated at each site. This allows CPU-intensive searches to be performed locally, with the media then served either from the local cache on the site-servers 88, 90, 92 or from the top media-server node 84. The local interactive user workstation 42 may be either a buffering display station, a display plus a search engine, or the latter plus a media cache 98 with a capacity of approximately 2 gigabytes, depending on its size and class of operation. Caching strategies will be implemented on top of standard file system implementations, for example Transarc's Andrew File System (AFS) and the industry-standard OSF Distributed File System (DFS). The concentration of viewing strongly influences the architecture of the system and is thus application dependent; where and how to cache depends on the locality of viewing. The typically severe network data requirements of video-on-demand systems are relaxed in the library system implementation because (1) most retrieved sequences are anticipated to be short (less than two minutes), (2) many will be supplied from the on-premises site-server nodes 88, 90, 92, and (3) data presentation is always made from a buffer on the user's local disk, typically 1-2 gigabytes in early system deployments. The compression techniques currently used reduce the data requirements to approximately 10 Mbytes per minute of video. The digital video library system 10 is architecture independent, such that commercial video-structured file systems for continuous media delivery and video-on-demand, which solve the problems of achieving sufficient server efficiency, including the use of disk striping across disk arrays to allow continuous delivery of the same material to large numbers of simultaneous viewers, may be incorporated when available. An archive 82 of one (1) to ten (10) terabytes is representative of the anticipated commercial environments. The server network 80 can transmit to other sites by means of the commercially available Switched Multi-megabit Data Service (SMDS) 99, currently at economically priced data rates (1.17 Mbps). Frame relay services (not shown), from 56 Kbps to 1.5 Mbps, are also provided for distant satellite sites. Communication interfaces that connect the local Ethernet of the interactive user workstation 42 to the SMDS clouds 99 are in place. A key element of the online digital library is the communications network, shown schematically as 100 in FIGURE 5, through which the media servers 109 and the satellite (user) nodes 110 are interconnected. Traditional modem-based access over voice-grade telephone lines is not suitable for this multimedia application. The network 100 preferably has the following characteristics. First, the communication is preferably transparent to the user; special-purpose hardware and software support is preferably kept to a minimum in both the server and satellite nodes. Second, the communication services must preferably be cost-effective, which implies that link capacity (bandwidth) is scalable to accommodate the needs of a given node. Server nodes 107, for example, require the highest bandwidth because they are shared among a number of satellite nodes 110. Finally, the deployment of a conventional communications network should be avoided.
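The link-rate figures in this section can likewise be checked with simple arithmetic: 10 Mbytes of compressed video per minute corresponds to a sustained rate of roughly 1.33 Mbit/s, consistent with the minimum of about 1.3 Mbit/s stated earlier and close to the 1.17 Mbps SMDS access rate quoted above, a constraint eased by the local buffering and caching described in this section. The sketch below is a reader's illustration, not part of the patent.

# Reader's sketch: sustained link rate implied by the compression figure in
# the text (10 Mbytes of compressed video per minute). Decimal units assumed.
# Not part of the patented system.

def required_mbit_per_s(mbytes_per_minute: float = 10.0) -> float:
    """Sustained rate (Mbit/s) needed to stream the video continuously."""
    bits_per_minute = mbytes_per_minute * 1_000_000 * 8
    return bits_per_minute / 60 / 1_000_000

rate = required_mbit_per_s()
print(f"required: {rate:.2f} Mbit/s")    # about 1.33 Mbit/s
print("SMDS access line: 1.17 Mbit/s")   # figure quoted in the text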
The most cost-effective and timely solution will be built on communication services that are already available or in field trial. A switched Wide Area Network (WAN) topology for the network 100, ideally suited to the online digital video library, has been developed.
The WAN topology used is shown in FIGURE 5. Two key elements of the communication network are (1) the use of Central Office Local Area Networks (CO-LAN) 102 to provide non-switched data services to workstations over digital subscriber circuit technology 105, and (2) the use of a Switched Multi-megabit Data Service (SMDS) "cloud" 104 to interconnect the CO-LANs 102 and the high-bandwidth server nodes 107. The high-bandwidth server nodes 107 are connected directly to the SMDS cloud 104 by means of a 1.17 Mbit/s access line 108. The SMDS infrastructure provides for higher-bandwidth connections (from 4 Mbit/s to 34 Mbit/s) should they be required.
OPERATIONAL SUMMARY
The following example explains the processing of the present invention in the context of a hypothetical search. It is assumed that the digital library 36 has already been created by the offline portion 12. A student begins by talking to the monitor: "I've got something to put together on culture and satellites. What is there?" Transparently to the user, the user workstation 42 has already performed highly accurate, speaker-independent, continuous speech recognition on the query. The online portion 14 of the digital library system 10 then applies the sophisticated natural language processing functions 129 to understand the query and translate it into retrieval commands that locate relevant portions of the segmented, compressed video 40. The segmented, compressed video 40 is searched using the associated indexed text transcripts 38. The selection is further refined by means of the scene sizing developed by the image understanding technology 32. Appearing on the screen are several icons, some showing moving fragments of the video content, followed by text forming extended titles/abstracts of the information contained in the video (see Figure A-2). In making this possible, image processing helped select representative still images for the icons and scene sequences for the intelligent moving icons. The audio transcription functions 27 created the transcripts, which are used by the natural language function 29 to summarize and abstract the selections. Through either a mouse click or a spoken command, the student requests the second icon. The screen fills with video of Arthur Clarke describing how he never tried to patent communications satellites, even though he was the first to describe them. Next, the student requests the third icon and sees villages in India that are using satellite dishes to watch educational programming. Querying again, the student brings Arthur Clarke back.
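The retrieval flow in this example (spoken query, speech recognition, natural language processing, search over the indexed transcripts, and return of video paragraphs with their icons and imicons) can be summarized in outline form. The sketch below is purely illustrative; every function and field name is a hypothetical stand-in for the numbered functions in the text (e.g. 129, 38, 40), not an actual interface of the described system.

# Reader's outline of the hypothetical search above: a spoken query is
# recognized, interpreted into retrieval terms, matched against the indexed
# transcripts, and answered with video paragraphs plus their icons/imicons.
# All names are hypothetical stand-ins, not an actual API of the system.

from dataclasses import dataclass
from typing import List

@dataclass
class VideoParagraph:
    start_time: float     # seconds, from the library's time stamps
    end_time: float
    transcript: str       # transcribed audio aligned to this paragraph
    icon: str             # representative still image
    imicon: str           # short full-motion sequence

def recognize_speech(audio: bytes) -> str:
    """Stand-in for speaker-independent continuous speech recognition."""
    return "I've got something to put together on culture and satellites"

def parse_query(text: str) -> List[str]:
    """Stand-in for natural language processing: extract retrieval terms."""
    stopwords = {"i've", "got", "something", "to", "put", "together", "on", "and"}
    return [w for w in text.lower().split() if w not in stopwords]

def search(terms: List[str], library: List[VideoParagraph]) -> List[VideoParagraph]:
    """Rank video paragraphs by how many query terms appear in their transcript."""
    scored = [(sum(t in p.transcript.lower() for t in terms), p) for p in library]
    return [p for score, p in sorted(scored, key=lambda sp: -sp[0]) if score > 0]

library = [
    VideoParagraph(0.0, 95.0, "Arthur Clarke on communications satellites",
                   "clarke.jpg", "clarke.mpg"),
    VideoParagraph(95.0, 180.0, "Village schools receive satellite broadcasts",
                   "village.jpg", "village.mpg"),
]
hits = search(parse_query(recognize_speech(b"")), library)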
Now, speaking directly to Clarke, the student asks whether he has any thoughts on how his invention has shaped the world. Clarke, speaking from his office, begins to talk about his childhood in England and how different the world is now. Using a skimming control, the student finds one segment particularly relevant for inclusion in the multimedia composition. Giving the student such functionality requires, in addition to search and retrieval, image understanding to intelligently create scenes and the ability to skim them. The skimming function is described in Mauldin et al. The next day the student gives his teacher access to his project. More than a simple presentation of a few video fragments, the student has created a video laboratory that can be explored and whose structure is itself an indicator of the student's understanding. Helping this student to succeed are tools for building multimedia objects, including assistance with the language of cinema, the appropriate use of video, and the structuring of the composition. Behind the scenes, the system has created a profile of how the video was used and distributes that information to the library's accounting. Although the present invention has been described in conjunction with its preferred embodiments, it will be understood that variations and changes in the details of the present invention as described and illustrated herein may be made by those skilled in the art without departing from the spirit, principle and scope of the present invention. Accordingly, it is expressly intended that all such equivalents, variations and changes which fall within the principle and scope of the present invention as described herein and as defined in the claims be encompassed thereby.

Claims (34)

CLAIMS
1. A method for creating a digital library, independent of the existing audio data and video images, comprising the steps of: transcribing the audio data and marking the transcribed audio data with a first set of time stamps; indexing the transcribed audio data; digitizing the video data and marking the digitized video data with a second set of time stamps related to the first set of time stamps; and segmenting the digitized video data into video paragraphs according to a set of rules, characterized in that the set of rules is based on the characterization of the scene of the video images and the processing of the audio data, the indexed audio data and the segmented, digitized video data being stored with their respective sets of time stamps to form the digital library, which can be accessed through the indexed audio data without returning to the existing audio data and video images.
2. The method according to claim 1, characterized in that it additionally comprises the step of passing the transcribed audio data through a natural language interpreter before indexing the transcribed audio data.
3. The method according to claim 2, characterized in that the natural language interpreter updates the set of rules.
4. An apparatus for creating a digital library, independent of the existing audio data and video images, comprising: means for transcribing the audio data and marking the transcribed audio data with a first set of time stamps; means for indexing the transcribed audio data; means for digitizing the video data and marking the digitized video data with a second set of time stamps related to the first set of time stamps; means for storing a set of rules; means for segmenting the digitized video data into video paragraphs according to the stored set of rules, characterized in that the rules are based on the characterization of the scene of the video images and the processing of the audio data; and means for storing the indexed audio data and the segmented, digitized video data with their respective sets of time stamps to form the digital library, which can be accessed through the indexed audio data without returning to the existing audio data and video images.
5. The apparatus according to claim 4, characterized in that it further comprises natural language interpreter means for processing the transcribed audio data before the data are indexed.
6. The apparatus according to claim 5, characterized in that the natural language interpreter means updates the set of rules.
7. The method according to claim 1, characterized in that it additionally comprises the step of generating a set of icons after segmenting the digitized video data into video paragraphs according to the set of rules, each icon being representative of the video contents of the video paragraph to which it corresponds.
8. The method according to claim 7, characterized in that the set of icons is a set of intelligent moving icons.
9. The method according to claim 8, characterized in that the set of intelligent moving icons is generated using data-dependent heuristics.
10. The method according to claim 1, characterized in that it additionally comprises the step of compressing the digitized video data before storing the indexed audio data and the digitized video data with their respective sets of time stamps.
11. The method according to claim 1, wherein the step of transcribing the audio data and marking the transcribed audio data with a first set of time stamps is characterized in that it includes the steps of: producing a set of possible word occurrences, each word occurrence having a start time and a plurality of possible end times; producing a plurality of possible start times for each of the end times; generating a set of N-best hypotheses for the audio data; and selecting a best-ranked hypothesis from the set of N-best hypotheses to produce the transcribed audio data.
12. The method according to claim 11, characterized in that the set of possible word occurrences is produced using a forward time-synchronous pass function.
13. The method according to claim 11, characterized in that the plurality of possible start times is produced using a reverse time-synchronous pass function.
14. The method according to claim 2, wherein the step of passing the transcribed audio data through the natural language interpreter before indexing the transcribed audio data is characterized in that it includes the steps of: summarizing the transcribed audio data; marking the transcribed audio data using data extraction techniques; and correcting the marked, transcribed audio data using semantic and syntactic constraints and a phonetic knowledge base.
15. The method according to claim 1, characterized in that the digitized video data are segmented into video paragraphs using comprehensive image statistics rules.
16. The method according to claim 1, characterized in that the digitized video data are segmented into video paragraphs using camera movement rules.
17. The method according to claim 1, characterized in that the digitized video data are segmented into video paragraphs using object movement rules.
18. The method according to claim 1, characterized in that the digitized video data are segmented into video paragraphs using deformable templates and filtering rules.
19. The method according to claim 1, characterized in that the digitized video data are segmented into video paragraphs using object presence rules.
20. The method according to claim 1, characterized in that the digitized video data are segmented into video paragraphs using rules of three-dimensional understanding.
21. The apparatus according to claim 4, characterized in that it additionally comprises means for generating a set of icons after the digitized video data are segmented into video paragraphs according to the set of rules, each icon being representative of the video contents of the video paragraph to which it corresponds.
22. The apparatus according to claim 21, characterized in that the set of icons is a set of intelligent moving icons.
23. The apparatus according to claim 22, characterized in that the means for generating the set of intelligent moving icons utilizes data-dependent heuristics.
24. The apparatus according to claim 4, characterized in that it additionally comprises means for compressing the digitized video data before the indexed audio data and the digitized video data are stored with their respective sets of time stamps.
25. The apparatus according to claim 4, characterized in that the means for transcribing the audio data and marking the transcribed audio data with a first set of time stamps comprises: means for producing a set of possible word occurrences, each word occurrence having a start time and a plurality of possible end times; means for producing a plurality of possible start times for each of the end times; means for generating a set of N-best hypotheses for the audio data; and means for selecting a best-ranked hypothesis from the set of N-best hypotheses to produce the transcribed audio data.
26. The apparatus according to claim 25, characterized in that the means for producing the set of possible word occurrences uses a forward time-synchronous pass function.
27. The apparatus according to claim 25, characterized in that the means for producing the plurality of possible start times uses a reverse time-synchronous pass function.
28. The apparatus according to claim 5, characterized in that the means for passing the transcribed audio data through a natural language interpreter before indexing the transcribed audio data comprises: means for summarizing the transcribed audio data; means for marking the transcribed audio data using data extraction techniques; and means for correcting the marked, transcribed audio data using semantic and syntactic constraints and a phonetic knowledge base.
29. The apparatus according to claim 4, characterized in that the means for segmenting the digitized video data into video paragraphs uses comprehensive image statistics rules.
30. The apparatus according to claim 4, characterized in that the means for segmenting the digitized video data into video paragraphs uses camera movement rules.
31. The apparatus according to claim 4, characterized in that the means for segmenting the digitized video data into video paragraphs uses object movement rules.
32. The apparatus according to claim 4, characterized in that the means for segmenting the digitized video data into video paragraphs uses deformable templates and filtering rules.
33. The apparatus according to claim 4, characterized in that the means for segmenting the digitized video data into video paragraphs uses object presence rules.
34. The apparatus according to claim 4, characterized in that the means for segmenting the digitized video data into video paragraphs uses rules of three-dimensional understanding.
MXPA/A/1997/002705A 1994-10-14 1997-04-14 Method and apparatus to create a researchable digital digital library and a system and method to use that bibliot MXPA97002705A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US08324076 1994-10-14
US08/324,076 US5835667A (en) 1994-10-14 1994-10-14 Method and apparatus for creating a searchable digital video library and a system and method of using such a library
PCT/US1995/013573 WO1996012239A1 (en) 1994-10-14 1995-10-12 Method and apparatus for creating a searchable digital video library and a system and method of using such a library

Publications (2)

Publication Number Publication Date
MX9702705A MX9702705A (en) 1997-10-31
MXPA97002705A true MXPA97002705A (en) 1998-07-03


Similar Documents

Publication Publication Date Title
CA2202539C (en) Method and apparatus for creating a searchable digital video library and a system and method of using such a library
US5664227A (en) System and method for skimming digital audio/video data
Wactlar et al. Intelligent access to digital video: Informedia project
Hauptmann et al. Informedia: News-on-demand multimedia information acquisition and retrieval
EP1692629B1 (en) System & method for integrative analysis of intrinsic and extrinsic audio-visual data
Hauptmann Lessons for the future from a decade of informedia video analysis research
US20020051077A1 (en) Videoabstracts: a system for generating video summaries
MXPA97002675A (en) System and method for examining digital audio / video data
Haubold et al. Augmented segmentation and visualization for presentation videos
Pickering et al. ANSES: Summarisation of news video
Christel et al. Techniques for the creation and exploration of digital video libraries
Bouamrane et al. Meeting browsing: State-of-the-art review
Toklu et al. Videoabstract: a hybrid approach to generate semantically meaningful video summaries
Repp et al. Segmentation and annotation of audiovisual recordings based on automated speech recognition
MXPA97002705A (en) Method and apparatus to create a researchable digital digital library and a system and method to use that bibliot
Hauptmann et al. Informedia news-on-demand: Using speech recognition to create a digital video library
Rajarathinam et al. Analysis on video retrieval using speech and text for content-based information
Papageorgiou et al. Multimedia Indexing and Retrieval Using Natural Language, Speech and Image Processing Methods
Wactlar et al. Automated video indexing of very large video libraries
Chaptini Intelligent segmentation of lecture videos
Smith lnformedia Project
Papageorgiou et al. Retrieving video segments based on combined text, speech and image processing
Haubold Semantic Multi-modal Analysis, Structuring, and Visualization for Candid Personal Interaction Videos
Pagar et al. Multimedia based information retrieval approach for lecture video indexing based on video segmentation and Optical Character Recognition
de Jong Disclosure of non-scripted video content: InDiCo and M4/AMI