WO1997009683A1 - Authoring system for multimedia information including sound information - Google Patents

Authoring system for multimedia information including sound information

Info

Publication number
WO1997009683A1
WO1997009683A1 (PCT/JP1995/001746)
Authority
WO
Grant status
Application
Prior art keywords
information
means
voice
image
speech
Prior art date
Application number
PCT/JP1995/001746
Other languages
French (fr)
Japanese (ja)
Inventor
Hideaki Kikuchi
Nobuo Hataoka
Toshiyuki Aritsuka
Original Assignee
Hitachi, Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor ; File system structures therefor
    • G06F17/30017Multimedia data retrieval; Retrieval of more than one type of audiovisual media

Abstract

An authoring system by which retrieval of a moving picture or sound information from video information including sound information is facilitated on a portable information terminal such as a PDA (Personal Digital Assistant) or notebook computer, or on a multimedia terminal such as a personal computer or workstation. The authoring system is provided with at least: retrieval key inputting means through which a retrieval key such as a keyword or an attribute value is inputted; retrieval result outputting means which outputs the retrieved sound information or moving picture; multimedia information retrieving means which retrieves multimedia information including sound information and moving picture information; and index generating means which generates, for multimedia information including sound information, indexes representing the correspondences between the sound information and the moving picture information. A desired moving picture or sound information can thus be readily retrieved from the other, corresponding information.

Description

Authoring system for multimedia information including sound information

Technical Field

The present invention relates to an authoring scheme that makes it possible, on a portable information terminal such as a PDA (Personal Digital Assistant) or notebook computer, or on a multimedia terminal such as a personal computer or workstation, to easily extract scenes of different speakers from video that includes sound information.

BACKGROUND ART

In conventional video authoring methods, when a scene of a specific person is extracted from video, the person image is extracted from the image frames, but no correspondence between the person image and the voice is established, so the extracted scene section does not necessarily match that person's speech section. Techniques have been considered that hold image features and audio features for each person in advance, identify the person image from the image features and the speaker from the audio features, and thereby associate the face image and the voice of the same person; in practice, however, it is impossible to prepare image and audio features for every person, so the feasibility of this approach is low.

In the prior art, it is therefore difficult to automatically extract a scene that includes both a person image and the speech section corresponding to it.

An object of the present invention is to provide a system in which, when a person is specified in an image with a mouse, or a person's name is input from a keyboard, a scene including both the image appearance section and the speech section of that person can be extracted automatically.

Disclosure of the Invention

To solve the above problems, the multimedia information authoring system of the present invention comprises at least: search key input means for inputting a search key such as a keyword or an attribute value; search result output means for outputting retrieved audio information or a moving picture as a search result; multimedia information retrieval means for retrieving multimedia information including audio information and moving picture information; and index creation means for creating, for multimedia information including audio information, an index indicating the correspondence between the audio information and the moving picture. A desired moving picture or desired sound information can thus be retrieved easily from the other, corresponding information.

The index creation means comprises voice activity detection means for detecting speech sections in the sound information included in the multimedia information, and audio index creation means for creating an audio index based on the speech sections. This makes it possible to easily obtain the moving picture corresponding to a speech section, or the speech of the section corresponding to a moving picture.

By further providing index display means for displaying the index of the multimedia information on a display, the authoring of multimedia information can be performed visually.

By specifying a speech segment on the index displayed by the index display means, the voice or moving picture of that segment can be retrieved. Likewise, by specifying an arbitrary image in the moving picture, the index created by the index creation means is used to retrieve the voice or moving picture of the speech section corresponding to the specified image.

By using a position input means such as a mouse to specify a range of the desired multimedia information, and then designating an arbitrary position in another window with the same position input means, reference information pointing to the multimedia information can be attached at that position; it is thus also possible to configure a hyperlink-based authoring system for multimedia information.

The index creation means may comprise: voice activity detection means for detecting speech sections in the sound information contained in the multimedia information; speaker identification means for identifying, for every speech section detected by the voice activity detection means, the speaker of the voice in that section; and audio index creation means for creating an audio index based on the speaker names and the speech sections. This makes it possible to easily obtain the moving picture corresponding to all speech sections of the same speaker, or the speech of all sections of the same speaker corresponding to a moving picture.

By using a character input means such as a keyboard to specify a person's name, the voice or moving picture of that person's speech sections can be retrieved.

The multimedia information retrieval means may comprise: lip recognition means which detects the motion of the lips from a person image in the moving picture and identifies the phonemes corresponding to that motion; speech recognition means which recognizes the audio information in the moving picture against phoneme standard patterns; image-speech collation means which compares and collates the phoneme identification result output from the lip recognition means with the speech recognition result output from the speech recognition means; and scene extraction means which extracts the moving picture of the speech segment judged to match by the image-speech collation means. It is thus possible to easily obtain the person image corresponding to the speech of a speech segment, or the speech of the segment corresponding to a person image.

The multimedia information retrieval means may further comprise person image extraction means which extracts the person image present at a position in the moving picture input via a position input means such as a mouse. By specifying a person image in the moving picture with the position input means, the voice or moving picture of that person's speech sections can be retrieved automatically.

Further, the multimedia information retrieval client-server system of the present invention comprises: a multimedia information display client (hereinafter, client) having voice transmission request means for transmitting a voice transmission request protocol, voice search means for searching the received voice, and moving picture transmission request means for transmitting a moving picture transmission request protocol; and a multimedia information search server (hereinafter, server) having information acquisition means which, on receiving a voice transmission request protocol, acquires the multimedia information specified in the protocol, voice extraction means for extracting the voice from the multimedia information, voice transmission means for transmitting the voice, and moving picture transmission means for transmitting the moving picture. The server further has scene extraction means which, after receiving the moving picture transmission request protocol, extracts the moving picture of the section of the multimedia information specified in the protocol. Only the information of the desired section is therefore communicated, instead of the entire multimedia information.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is an overall block diagram of the multimedia information authoring method; Figure 2 is a configuration example of the index creation means; Figure 3 is another configuration example of the index creation means; Figure 4 is a configuration example of the multimedia information retrieval means; Figure 5 is another configuration example of the multimedia information retrieval means; Figure 6 is a configuration example of the search result output means; Figure 7 is an example of a screen display of the present invention; Figure 8 is another example of a screen display of the present invention; Figure 9 is a configuration example of the multimedia information retrieval client-server system; Figure 10 is another configuration example of the multimedia information retrieval client-server system; and Figure 11 is a screen display example of the present invention.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, an embodiment will be described in detail with reference to the drawings. In the following, multimedia information means information including at least voice and moving pictures. The multimedia terminal is described here as a portable information terminal having browsing and editing functions for multimedia information. However, the present invention is not limited to portable information terminals: it can be applied to multimedia terminals such as personal computers and workstations, and generally to multimedia information devices having a video storage function, such as home video decks with editing functions, video decks for English-conversation study, and videophones with video message recording.

Figure 1 is a block diagram of the multimedia information authoring method of the present invention.

In Figure 1, search key input means 101 is a means by which the user inputs a keyword, a position, or the like as a search key in order to find the object to be edited. Multimedia information retrieval means 102 is a means for retrieving the voice or moving picture of an arbitrary section of the multimedia information. Search result output means 103 is a means for outputting the result of multimedia information retrieval means 102 for presentation to the user. Index creation means 104 is a means for creating, for the multimedia information, an index showing the correspondence between the sound information and the moving picture.

More specifically, for the audio included in multimedia information 105, the index records the speech sections in which voice is present, together with the speaker name corresponding to the voice of each section. For the moving picture, segmentation is performed according to some rule, for example segmenting the display intervals of each person appearing in the picture. The index may be created by index creation means 104 automatically when the multimedia information is stored, or at an arbitrary time at the user's initiative; in the following, it is assumed to be created at the user's initiative.

The user first inputs a search key for the object to be edited, using search key input means 101. The search key may be, for example, a character string, or a partial image or section of a still image; search key input means 101 can input any of these keys singly or in combination. Next, multimedia information retrieval means 102 uses the search key entered at search key input means 101 to retrieve, against multimedia information index 106, the sections of multimedia information whose index matches the search key. Finally, search result output means 103 displays the index of the multimedia information on the display and outputs the voice or moving picture retrieved by multimedia information retrieval means 102: audio to a speaker or headphones, moving pictures to a display.
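Although the specification describes these components only at the block level, the index lookup can be pictured with a minimal sketch. The following Python fragment is illustrative only: the record layout, names, and sample values are assumptions, not part of the specification. It models multimedia information index 106 as a list of speech-section records and reduces retrieval means 102 to a filter over those records.

    # Illustrative sketch only: models the multimedia information index 106
    # as a list of speech-section records and search as a filter over them.
    from dataclasses import dataclass

    @dataclass
    class IndexEntry:
        start: float        # section start time (seconds)
        end: float          # section end time (seconds)
        speaker: str        # speaker name associated with the section

    def search_index(index, speaker_key):
        """Return all speech sections whose speaker matches the search key."""
        return [e for e in index if e.speaker == speaker_key]

    index = [IndexEntry(0.0, 4.2, "Kikuchi"), IndexEntry(4.2, 9.8, "Hataoka")]
    print(search_index(index, "Kikuchi"))   # -> sections to play back or display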

For example, when a search key indicating particular sections of the audio contained in the multimedia information is entered at search key input means 101, multimedia information retrieval means 102 retrieves the corresponding voice or moving picture using multimedia information index 106, created in advance by index creation means 104 on the basis of the speaker-specific speech segments. The retrieved voice or moving picture is then output by search result output means 103 as audio output or moving picture display.

Figure 2 shows a configuration example of index creation means 104 of the present invention. In Figure 2, an audio index 204 is created as multimedia information index 106. For this purpose, index creation means 104 comprises speech segment detection means 201 and audio index creation means 202.

Speech segment detection means 201 is a means for detecting human speech sections in the sound information included in the stored multimedia information 203. As a method of detecting speech sections in acoustic information, there is, for example, a method that checks whether the short-time power stays above a fixed threshold for a predetermined time or longer (see "Digital Speech Processing", Tokai University Press, "8.2 Detection of Speech Sections", p. 153). Audio index creation means 202 creates index information from the speech sections detected by speech segment detection means 201. The index created by audio index creation means 202 consists, for example, of the start time, end time, and length of each detected speech section.
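A minimal sketch of this detection step, assuming short-time power computed over fixed-length frames with an arbitrary threshold and minimum duration (all parameter values and names are illustrative, not from the specification):

    # Illustrative sketch of speech-section detection by short-time power
    # thresholding: a section counts as speech if its power stays above a
    # fixed threshold for at least a minimum duration.
    import numpy as np

    def detect_speech_sections(samples, rate, frame=0.02, thresh=1e-3, min_dur=0.2):
        hop = int(frame * rate)
        n = len(samples) // hop
        power = np.array([np.mean(samples[i*hop:(i+1)*hop] ** 2) for i in range(n)])
        active = power > thresh
        sections, start = [], None
        for i, a in enumerate(active):
            if a and start is None:
                start = i
            elif not a and start is not None:
                if (i - start) * frame >= min_dur:    # keep long-enough runs only
                    sections.append((start * frame, i * frame))
                start = None
        if start is not None and (n - start) * frame >= min_dur:
            sections.append((start * frame, n * frame))
        return sections   # list of (start_time, end_time) pairs

    # usage: sections = detect_speech_sections(np.asarray(pcm, float), 16000)

The start, end, and length of each returned pair correspond to the fields of the audio index described above.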

By creating an index based on speech sections in this way, it becomes possible to easily obtain the moving picture corresponding to the voice of a speech section, or conversely the speech of the section corresponding to a moving picture.

Figure 3 is another configuration example of index creation means 104 of the present invention. In Figure 3, speaker identification means 801 is a means which collates a voice against speaker voice standard patterns to identify whether the voice is that of a specified speaker. As a method of speaker identification, there is, for example, a method that extracts features from the speech waveform, computes the distance or similarity to the standard patterns of each registered speaker stored in advance, and makes a recognition decision based on that value (see "Digital Speech Processing", Tokai University Press, "9.3 Configuration of Speaker Recognition Systems", p. 196).

In Figure 3, speech segment detection means 201 first detects human speech segments in the acoustic information of the stored multimedia information 203. Speaker identification means 801 then performs speaker identification on the voice of each detected segment, based on the speech reference patterns 802, so that the speaker name applicable to the voice of each section is obtained. Audio index creation means 202 then associates the speaker name with each speech section and creates index 204 as the multimedia information index. The index created by audio index creation means 202 consists, for example, of the start time, end time, and length of each detected speech section together with the speaker name.
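The speaker labeling step can be sketched as a nearest-pattern decision, assuming each registered speaker is represented by a single stored feature vector; the feature extraction itself is stubbed out and all names here are illustrative assumptions:

    # Illustrative sketch of speaker identification by distance to stored
    # reference patterns (802): the nearest registered pattern labels the
    # section, and index entries pair each section with that speaker name.
    import numpy as np

    reference_patterns = {          # speaker name -> stored feature vector (802)
        "Kikuchi": np.array([1.0, 0.2, 0.5]),
        "Hataoka": np.array([0.1, 0.9, 0.4]),
    }

    def identify_speaker(section_features):
        """Return the registered speaker whose pattern is closest in distance."""
        return min(reference_patterns,
                   key=lambda n: np.linalg.norm(reference_patterns[n] - section_features))

    def build_audio_index(sections_with_features):
        # sections_with_features: list of ((start, end), feature_vector)
        return [{"start": s, "end": e, "length": e - s,
                 "speaker": identify_speaker(f)}
                for (s, e), f in sections_with_features]

    # usage: build_audio_index([((0.0, 4.2), np.array([0.9, 0.3, 0.5]))])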

By creating an index based on the speaker names of the speech segments in this way, it becomes possible to easily obtain the moving picture corresponding to the voice of a speech section, or conversely the speech of the section corresponding to a moving picture.

Figure 4 shows a configuration example of multimedia information retrieval means 102 of the present invention. In Figure 4, lip recognition means 1501 recognizes the movement of the lips from the person's face image extracted from the input picture, and outputs the phonemes corresponding to that motion. As a method of recognizing phonemes from lip movement, there is, for example, a method that first extracts the two-dimensional lip shape by image processing and then identifies phonemes by applying a neural network to that data (see "Nonverbal Interface", Ohmsha, "Recognition of Lip Movement", p. 149). Speech recognition means 1506 is a means for performing speech recognition on the voice information. As a method of recognizing input speech, there is, for example, a means that compares the input speech with phoneme standard patterns in each short section, outputs the closest phoneme as the recognition result, and further compares the phoneme sequence with a word speech dictionary (see "Digital Speech Processing", Tokai University Press, "8.6 Word Recognition in Phoneme Units", p. 167). Image-speech collation means 1502 is a means for collating the phoneme sequence corresponding to the lip movement of a person image with the input speech. Scene extraction means 1503 is a means for cutting out the picture of a specified section.

In Figure 4, lip recognition means 1501 first recognizes lip movement by matching feature quantities such as mouth shape and mouth area against standard patterns 1504, and outputs a phoneme sequence as the lip recognition result. Next, speech recognition means 1506 outputs, as the recognition result, the phoneme sequence obtained by similarity calculation between the spectrum of the speech in each phoneme section and the phoneme standard pattern dictionary 1507. Image-speech collation means 1502 then compares and collates the phoneme sequence output by lip recognition means 1501 with the phoneme sequence output by speech recognition means 1506. This makes it possible to associate the lip movement in the person image with the speech sections before and after it. Finally, scene extraction means 1503 extracts from the whole picture the picture of the speech segment associated with the person image.
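One way to picture the collation is as a best-match search over candidate speech sections using edit distance between the two phoneme sequences. The specification only requires comparison and collation of the sequences, so the distance criterion below is an assumption, as are all names:

    # Illustrative sketch of image-speech collation: the lip-derived phoneme
    # sequence is compared with the phoneme sequence recognized in each
    # candidate speech section, and the closest section (smallest edit
    # distance) is associated with the person image.
    def edit_distance(a, b):
        d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
             for i in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                              d[i-1][j-1] + (a[i-1] != b[j-1]))
        return d[len(a)][len(b)]

    def collate(lip_phonemes, speech_sections):
        # speech_sections: list of ((start, end), phoneme_sequence)
        return min(speech_sections,
                   key=lambda s: edit_distance(lip_phonemes, s[1]))[0]

    print(collate(["k", "o", "n"],
                  [((0, 2), ["a", "o"]), ((2, 5), ["k", "o", "n"])]))  # -> (2, 5)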

With the above processing, for a person image designated with a position input means such as a pen, a picture section including the speech section can be extracted from the input picture. It also becomes easy to obtain the person image corresponding to all speech segments of the same speaker, or the voice of all speech sections of the speaker corresponding to a person image.

Figure 5 is a diagram showing another configuration example of multimedia information retrieval means 102 of the present invention. In Figure 5, person image extraction means 1901 automatically detects the presence of a person in the input picture and detects the person's face. As a method of automatically detecting the presence of a person in an input picture and then detecting the face, there is, for example, a method of collating pyramid images obtained by sampling the image at multiple resolutions (see "Digital Signal Processing Handbook", IEICE, "4.3.3 Recognition of Persons", p. 401). Lip recognition means 1902 recognizes the movement of the lips from the person's face image extracted from the input picture, and outputs the phonemes corresponding to that motion. Speech recognition means 1907 is a means for performing speech recognition on the voice information. Image-speech collation means 1903 is a means for collating the phoneme sequence corresponding to the lip movement of a person image with the input speech. Scene extraction means 1904 is a means for cutting out the picture of the designated section.

In Figure 5, based on position coordinates on the picture plane input with position input means 101, person image extraction means 1901 first detects the presence of a person image in the region near the input coordinates in the input picture, and extracts the person's face image. When one person image is detected in the input picture, it becomes the designated image; when several person images are detected, the person image containing, or nearest to, the coordinate point input by position input means 101 becomes the designated image. For the face image extracted by person image extraction means 1901, lip recognition means 1902 then recognizes lip movement by matching feature quantities such as mouth shape and mouth area against standard patterns 1905, and outputs a phoneme sequence as the lip recognition result. Next, speech recognition means 1907 outputs, as the speech recognition result, the phoneme sequence obtained by similarity calculation between the speech spectrum in each phoneme section and the phoneme standard pattern dictionary 1908.

Image-speech collation means 1903 then compares and collates the phoneme sequence output by lip recognition means 1902 with the phoneme sequence output by speech recognition means 1907. This makes it possible to associate the lip movement in the person image with the speech sections before and after it. Finally, scene extraction means 1904 extracts from the whole picture the moving picture of the speech segment associated with the person image.

With the above processing, for a person image designated with a position input means such as a pen, a picture section including the speech section can be extracted from the input picture.

Figure 6 is a diagram showing a block configuration example for performing index display in the multimedia information authoring system of the present invention. In Figure 6, index creation means 303 corresponds to index creation means 104 in Figure 1. Index display means 301 is a means for visualizing the multimedia information index and displaying it on the display. In Figure 6, multimedia information index 304, created by index creation means 303, is first visualized by index display means 301 and displayed on display 302. For an index created from speech sections, one possible display method is a two-dimensional coordinate system with time on the horizontal axis, in which the start time, end time, or section length of each speech segment is drawn as a bar. Alternatively, for an index in which the speech is segmented by speaker, the bars can be placed in separate rows per speaker.
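Such a display can be sketched with an off-the-shelf plotting library standing in for index display means 301 and display 302; the data and layout below are illustrative assumptions, not the patent's display method:

    # Illustrative sketch of the index display: each speaker gets a row, and
    # each speech section is drawn as a bar along a horizontal time axis.
    import matplotlib.pyplot as plt

    index = [(0.0, 4.2, "Kikuchi"), (4.2, 9.8, "Hataoka"), (11.0, 14.5, "Kikuchi")]
    speakers = sorted({s for _, _, s in index})

    fig, ax = plt.subplots()
    for row, name in enumerate(speakers):
        spans = [(start, end - start) for start, end, s in index if s == name]
        ax.broken_barh(spans, (row - 0.3, 0.6))   # one bar per speech section
    ax.set_yticks(range(len(speakers)))
    ax.set_yticklabels(speakers)
    ax.set_xlabel("time (s)")
    plt.show()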

Note that, specifically, search result output means 103 in Figure 1 comprises index display means 301 and display 302.

By visualizing the index in this way, it becomes possible to perform the authoring of multimedia information visually.

Figure 7 is a diagram showing a screen display example that visualizes the index. In Figure 7, image display region 401 is the region on the display in which the moving picture is shown. Index display region 402 is the region in which the multimedia information index is displayed. Voice index display region 403 is the region in which the audio index is displayed. Designated voice section 404 shows the voice section specified by the user to request output of the voice or moving picture. Designated image 405 shows the image specified by the user to request output of the voice or moving picture.

In Figure 7, the user first specifies a voice section on the audio index displayed in voice index display region 403 of the index display region, whereupon the corresponding voice or moving picture is output. Further, when the user requests, for the moving picture shown in image display region 401, the speech section or moving picture corresponding to the speech currently being output, the user specifies image 405, whereupon the voice or moving picture of the requested voice section is output.

Figure 8 shows, as another display example, a screen display example when the multimedia information authoring method of the present invention is used on a portable terminal. In Figure 8, the screen of the portable information terminal has video display area 702, document display area 703, and menu area 701. First, on the portable information terminal at the left of Figure 8, the item "extract speech" is selected from menu area 701. Then, during video playback in video display area 702, the position of the person image whose speech is to be extracted is designated with position input means 705. The operation up to this point extracts a scene including the speech section corresponding to the designated person image, using the multimedia information authoring method shown in Figure 1. On the portable information terminal at the right of Figure 8, icon 704, symbolizing the extracted scene, is then moved on the screen with a position input means such as a mouse and placed at an arbitrary position in document display area 703; this operation associates the extracted picture with the document in document display area 703.

Figure 9 shows a block configuration example of a multimedia information retrieval client-server system using the multimedia information authoring method of the present invention. In Figure 9, search key input means 601 is a means by which the user inputs a keyword, a position, or the like as a search key to find the object to be edited. Voice transmission request means 602 is a means for requesting the server side to transmit voice information. Multimedia information acquisition means 603 is a means for acquiring, from a database (not shown), the multimedia information including the voice information whose transmission is requested. Voice extraction means 604 is a means for extracting the voice information portion included in the multimedia information. Voice transmission means 605 is a means for transmitting the voice information to the client side. Voice search means 606 is a means that performs speech recognition on the voice information and searches the recognition result for the character string specified as the search key, or performs a speaker search. As a method of recognizing input speech, there is, for example, a means that compares the input speech with phoneme standard patterns in each short section, outputs the closest phoneme as the recognition result, and further compares the phoneme sequence with a word speech dictionary (see "Digital Speech Processing", Tokai University Press, "8.6 Word Recognition in Phoneme Units", p. 167). Moving picture transmission request means 607 is a means for requesting the server side to transmit the video information of a specific section. Scene extraction means 608 is a means for extracting the video information of the designated section from the whole moving picture. Moving picture transmission means 609 is a means for transmitting video information to the client side. Moving picture display means 610 is a means for displaying a moving picture.

In Figure 9, on the client side, a character string is first entered using search key input means 601. Voice transmission request means 602 then requests transmission of the audio information of particular multimedia information. Next, on the server side, after receiving the transmission request for the audio information, multimedia information acquisition means 603 acquires from the database the multimedia information including the requested audio information. The audio information portion of the acquired multimedia information is then extracted by voice extraction means 604, and voice transmission means 605 transmits only the voice information portion to the client. On the client side, voice search means 606 searches the received voice information for the specified character string; here, a speech retrieval method is assumed in which speech recognition is performed once on the received audio information and the recognition result is searched for the string. Moving picture transmission request means 607 then requests transmission of the moving picture corresponding to the speech segment containing the string. Finally, on the server side, based on the received moving picture transmission request, scene extraction means 608 extracts the moving picture of the requested section from the whole moving picture, and moving picture transmission means 609 transmits it to the client side.
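The exchange can be pictured as an in-process simulation in which only the audio of a document and one video section ever cross the client-server boundary. The transport, the recognizer stub, and all names below are assumptions, not the patent's protocol definition:

    # Illustrative simulation of the Figure 9 exchange: the client requests
    # only the audio of a document, searches it locally, then requests only
    # the matching video section.
    DATABASE = {"doc1": {"audio": "...full audio...",
                         "video": {(2.0, 5.0): "clip A", (5.0, 9.0): "clip B"}}}

    # --- server side (means 603, 604, 605, 608, 609) ---
    def handle_voice_request(doc_id):
        return DATABASE[doc_id]["audio"]           # transmit only the audio part

    def handle_video_request(doc_id, section):
        return DATABASE[doc_id]["video"][section]  # transmit only one section

    # --- client side (means 601, 602, 606, 607, 610) ---
    def find_section(audio, keyword):
        # stand-in for speech recognition plus string search (means 606)
        return (2.0, 5.0)

    audio = handle_voice_request("doc1")           # voice transmission request
    section = find_section(audio, "hello")         # client-side voice search
    clip = handle_video_request("doc1", section)   # moving picture transmission request
    print(clip)                                    # -> "clip A"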

With the configuration described above, in a multimedia information retrieval client-server system supporting voice search, only the required information is transmitted from the server side to the client, instead of the entire multimedia information.

Figure 10 is a diagram showing another block configuration example of the multimedia information retrieval client-server system.

In Figure 10, on the client side, a speaker name is first entered using search key input means 601. Voice transmission request means 602 then requests transmission of the audio information of particular multimedia information. Next, on the server side, after receiving the transmission request for the voice information, multimedia information acquisition means 603 acquires from the database the multimedia information including the requested audio information. The audio information portion of the acquired multimedia information is extracted by voice extraction means 604, and voice transmission means 605 transmits only the voice information portion to the client. On the client side, voice search means 606 searches the received voice information for the specified speaker; here, a speech retrieval method is assumed in which speaker identification is performed on the received audio information and the identification result is searched for the specified speaker name. Moving picture transmission request means 607 then requests transmission of the moving picture of the speech section corresponding to the designated speaker name. Finally, on the server side, based on the received moving picture transmission request, scene extraction means 608 extracts the moving picture of the requested segment from the whole moving picture, and moving picture transmission means 609 transmits it to the client side.

With the above processing, in a multimedia information retrieval client-server system supporting speaker search, only the required information is transmitted from the server side to the client, instead of the entire multimedia information.

Figure 11 is a screen display example of the multimedia information authoring system of the present invention. In Figure 11, image display area 1201 is the area on the display for displaying the moving picture. Index display area 1202 is the area for displaying the multimedia information index on the display. Speaker name display area 1203 is the area that displays the speaker name corresponding to each voice section. As display methods for the speaker names, displaying the speaker name for each voice section, or displaying the speaker names grouped by speaker, can be considered. In Figure 11, based on the speaker names displayed in speaker name display area 1203, the user enters a person's name using a character input means such as a keyboard; alternatively, the user inputs the speaker name by specifying a speaker displayed in the speaker name display area with a position input means such as a mouse. Based on the input speaker name, the voice or moving picture of that speaker's speech segments can be output.

According to the present invention, for video including the voices of a plurality of speakers, the voice or moving picture of the speech sections corresponding to each speaker's speech can be output.

When multiple person images are present in the same picture, by specifying a voice section it is possible to extract the person image corresponding to the speech of that section, together with the sound of the specified voice section and the sound of all speech segments of the same speaker.

Similarly, by specifying an image, it is possible to output the voice or moving picture of the speech section corresponding to the specified image, the sound of all speech segments of the same speaker, and the corresponding person image.

Industrial Applicability

The present invention is suitable for portable information terminals such as PDAs (Personal Digital Assistants) and notebook computers, and for multimedia terminals such as personal computers and workstations, that is, for equipment that handles pictures including audio information. It can thus provide a system equipped with an authoring method for easily extracting video of different speakers.

Claims

1. An authoring system for multimedia information, comprising:
means (105) for storing multimedia information including audio information and moving picture information;
index creation means (104) for creating an index indicating the correspondence between the sound information and the moving picture of the multimedia information;
means (106) for storing the index;
search key input means (101) for inputting search information about a desired moving picture or desired acoustic information;
multimedia information retrieval means (102) for retrieving the moving picture or sound information corresponding to the search information by referring to the index; and
search result output means (103) for outputting the search results.
2. The multimedia information authoring system according to claim 1, wherein the index creation means comprises:
voice section detection means (201) for detecting speech sections of the sound information included in the multimedia information; and
audio index creation means (202) for creating an audio index based on the speech sections.
3. The multimedia information authoring system according to claim 1, wherein the search result output means comprises index display means and a display, and displays the search results and the index.
4. The multimedia information authoring system according to claim 1, wherein the search result output means comprises index display means and a display and displays the search results and the index, and wherein a voice section specified using the index displayed on the display is designated as the search information.
5. The multimedia information authoring system according to claim 1, wherein the search information is an arbitrary image of the moving picture.
6. The multimedia information authoring system according to claim 1, wherein the index creation means comprises:
voice section detection means (201) for detecting speech sections of the sound information included in the multimedia information;
speaker identification means (801, 802) for identifying, for every speech section detected by the voice section detection means, the speaker of the voice of that section; and
audio index creation means (202) for creating an audio index based on the speakers and the speech sections.
7. The multimedia information authoring system according to claim 1, wherein the search information is a person's name, and the corresponding person image, or the voice or moving picture of the speech sections of that person, is retrieved.
8. The multimedia information authoring system according to claim 1, wherein the multimedia information retrieval means (102) comprises:
lip recognition means (1501) which detects the movement of the lips from a person image in the moving picture and identifies the phonemes corresponding to the motion of the lips;
speech recognition means (1506) which recognizes the audio information in the moving picture based on phoneme standard patterns;
image-speech collation means (1502) which compares and collates the phoneme identification result output from the lip recognition means with the speech recognition result output from the speech recognition means; and
scene extraction means (1503) which extracts the moving picture of the speech segment judged to match by the image-speech collation means,
whereby the person image corresponding to the speech of a speech segment, or the sound of the speech section corresponding to a person image, is obtained.
9. The multimedia information authoring system according to claim 1, wherein the search information is a person image in the moving picture, and the voice or moving picture of the speech sections of the person in that image is retrieved.
10. A multimedia information retrieval client-server system comprising:
a multimedia information display client (hereinafter, client) having:
voice transmission request means (602) for transmitting a voice transmission request protocol;
voice search means (606) for searching the voice; and
moving picture transmission request means (607) for transmitting a moving picture transmission request protocol; and
a multimedia information search server (hereinafter, server) having:
information acquisition means (603) which receives the voice transmission request protocol and acquires the multimedia information specified in the protocol;
voice extraction means (604) for extracting the voice from the multimedia information;
voice transmission means (605) for transmitting the voice; and
moving picture transmission means (609) for transmitting the moving picture,
wherein the server further has scene extraction means (608) which, after receiving the moving picture transmission request protocol, extracts the moving picture of the segment specified in the protocol, and
wherein only the information of the desired section of the multimedia information is communicated, instead of the entire information.
PCT/JP1995/001746 1995-09-01 1995-09-01 Authoring system for multimedia information including sound information WO1997009683A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP1995/001746 WO1997009683A1 (en) 1995-09-01 1995-09-01 Authoring system for multimedia information including sound information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP1995/001746 WO1997009683A1 (en) 1995-09-01 1995-09-01 Authoring system for multimedia information including sound information

Publications (1)

Publication Number Publication Date
WO1997009683A1 (en) 1997-03-13

Family

ID=14126227

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP1995/001746 WO1997009683A1 (en) 1995-09-01 1995-09-01 Authoring system for multimedia information including sound information

Country Status (1)

Country Link
WO (1) WO1997009683A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002169592A (en) * 2000-11-29 2002-06-14 Sony Corp Device and method for classifying and sectioning information, device and method for retrieving and extracting information, recording medium, and information retrieval system
WO2002075590A1 (en) * 2001-03-15 2002-09-26 Flanderit - Mobile Solutions Ltd A system to visualize an electronically recorded presentation
JP2002539528A (en) * 1999-03-05 2002-11-19 Canon Inc. Database annotation and search
JP2002354452A (en) * 2001-05-28 2002-12-06 Ricoh Co Ltd Document preparation system, document preparation server, document preparation program, and medium recording program for making document
US7017115B2 (en) 2000-12-07 2006-03-21 Nec Corporation Portable information terminal equipment and display method therefor
JP2006333065A (en) * 2005-05-26 2006-12-07 Fujifilm Holdings Corp Photo album producing method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07226931A (en) * 1994-02-15 1995-08-22 Nippon Telegr & Teleph Corp <Ntt> Multi-medium conference equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TRANSACTIONS OF 1989 SYMPOSIUM LECTURE ON INFORMATICS (TOKYO), 17 January 1989, RYUICHI OGAWA et al., "Support System for Preparing Hypermedia Including Voices and Animations", pages 43-52. *
TRANSACTIONS OF LOCAL LECTURE BY THE TOHOKU BRANCH OF THE JAPAN SOCIETY OF MECHANICAL ENGINEERS - PRECISION ENGINEERING SOCIETY, 1993, YONEZAWA, MASAMI NAKANO et al., "Study on Mechanic Lip Reading by Stereovision (Identification of Vocal Mouth Shape)", pages 255-257. *



Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CN JP KR US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase