WO1997009683A1

WO1997009683A1 - Authoring system for multimedia information including sound information

Info

Publication number: WO1997009683A1
Application number: PCT/JP1995/001746
Authority: WO
Inventors: Hideaki Kikuchi; Nobuo Hataoka; Toshiyuki Aritsuka
Original assignee: Hitachi, Ltd.
Priority date: 1995-09-01
Filing date: 1995-09-01
Publication date: 1997-03-13

Abstract

An authoring system by which retrieval of a moving picture or sound information from video information including sound information is facilitated using a portable information terminal such as a PDA (Personal Digital Assistant) or a notebook computer, or using a multimedia terminal such as a personal computer or a workstation. The authoring system is provided with at least a retrieval key-inputting means through which a retrieval key such as a key word or an attribute value is inputted, retrieval result outputting means which outputs the retrieved sound information or moving picture, multimedia information retrieving means which retrieves multimedia information including sound information and moving picture information, and index generating means which generates indexes representing the correspondences between the sound information and the moving picture information with respect to multimedia information including sound information. A desired moving picture or sound information can be readily retrieved from other corresponding information.

Description

Authoring method for multimedia information including audio information

The present invention relates to an authoring method that makes it possible to easily extract, for each speaker, video including audio information on a portable information terminal such as a PDA (Personal Digital Assistant) or a notebook personal computer, or on a multimedia terminal such as a personal computer or a workstation.

Background art

In conventional video authoring methods, a scene for each person is extracted from a video by extracting the person image from the image frames. Because no correspondence is established between the person image and the voice, the extracted scene section does not necessarily match the person's voice section. On the other hand, there is a method in which image features and voice features are stored in advance for each person, the person image and the speaker of the voice are identified from those features, and the person image and voice of the same person are associated with each other. In reality, however, it is not possible to hold image feature amounts and sound feature amounts for every person, and the feasibility of this method is low.

With the conventional technology, it is difficult to automatically extract a scene including a person image and a corresponding voice section.

An object of the present invention is to provide a system that can automatically extract a scene including the image appearance section and the utterance voice section of a corresponding person when the person is specified in an image using a mouse or a person name is input using a keyboard.

Disclosure of the invention

In order to solve the above-mentioned problems, the multimedia information authoring method of the present invention comprises at least: search key input means for inputting a search key such as a keyword or an attribute value; search result output means for outputting acoustic information or a moving image as a search result; and multimedia information search means for searching multimedia information including audio information and moving image information. By further providing index creation means for creating, for multimedia information including audio information, an index indicating the correspondence between the audio information and the moving image, a desired moving image or desired sound information can easily be retrieved from the other, corresponding information.

The index creating means includes voice section detecting means for detecting a voice section of the audio information included in the multimedia information, and voice index creating means for generating a voice index based on the voice section. This makes it possible to easily obtain the moving image corresponding to the voice of a voice section, or the voice of the voice section corresponding to a moving image.

By having an index display means for displaying the index of the multimedia information on a display, authoring of the multimedia information can be performed visually.

By specifying a voice section on the index displayed on the display by the index display means, the voice or moving image in that voice section is searched for. By specifying an arbitrary image of the moving image, the voice or moving image in the voice section corresponding to the specified image is searched for using the index created by the index creating means.

By using position input means such as a mouse to specify a range of desired multimedia information and then specifying an arbitrary position in another window with the position input means, reference information to the multimedia information is added at that position. It is thus also possible to configure an authoring method for hyperlink-type multimedia information that enables use of the multimedia information.

The index creating means includes voice section detecting means for detecting a voice section of the audio information included in the multimedia information, speaker identification means for identifying the speaker of the voice in the voice section detected by the voice section detecting means, and voice index creation means for creating a voice index based on the speaker and the voice section. This makes it possible to easily obtain the moving images corresponding to the voice in all voice sections of the same speaker, or the voice of the same speaker in all voice sections corresponding to a moving image.

By using a character input means such as a keyboard to specify a person name, a voice or a moving image in the voice section of the person is searched.

The multimedia information search means includes lip recognition means for detecting lip movement from a person image in a moving image and identifying the phoneme corresponding to the lip movement; voice recognition means for recognizing the voice information in the moving image by comparison with phoneme standard patterns; image/sound matching means for comparing the phoneme identification result output by the lip recognition means with the voice recognition result output by the voice recognition means; and scene extracting means for extracting the moving image of a voice section in which the two are determined to agree well. This makes it possible to easily obtain the person image corresponding to the voice of a voice section, or the voice of the voice section corresponding to a person image.

The multimedia information retrieving means includes a person image extracting means for extracting a person image existing at the position in a moving image in accordance with a position input by a position input means such as a mouse. By using the input means, a person image in the moving image can be specified, and the voice or the moving image in the voice section of the person can be automatically searched.

Further, the multimedia information search client-server system of the present invention comprises a multimedia information display client (hereinafter, a client) and a multimedia information search server (hereinafter, a server). The client includes voice transmission request means for transmitting a voice transmission request protocol, voice search means for searching voice, and moving image transmission request means for transmitting a moving image transmission request protocol. The server includes information acquisition means for receiving the voice transmission request protocol and acquiring the multimedia information specified by the protocol, voice extraction means for extracting voice from the multimedia information, voice transmitting means for transmitting the voice, and moving image transmitting means for transmitting a moving image. The server further includes scene extracting means for extracting, after the moving image transmission request protocol is received, the moving image in the section designated by the protocol. This makes it possible to communicate only the information in the desired section rather than all of the multimedia information.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is an overall configuration diagram of the multimedia information authoring system. FIG. 2 is a configuration example of the index creation means. FIG. 3 is another configuration example of the index creation means. FIG. 4 is a configuration example of the multimedia information search means. FIG. 5 is another configuration example of the multimedia information search means. FIG. 6 is a configuration example of the search result output means. FIG. 7 is an example of a screen display of the present invention. FIG. 8 is another example of a screen display of the present invention. FIG. 9 is a configuration example of a multimedia information search client-server system. FIG. 10 is another configuration example of the multimedia information search client-server system. FIG. 11 is another example of a screen display of the present invention.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, embodiments will be described in detail with reference to the drawings. In the following, multimedia information is information including at least sound and a moving image. The description here assumes that the multimedia information terminal is a portable information terminal that has functions for browsing and editing multimedia information. However, the present invention is not limited to such portable information terminals and can be applied to multimedia information devices in general, including multimedia terminals such as personal computers and workstations, home video recorders, VCRs with editing functions for English language learning, and videophone answering machines.

FIG. 1 is a block diagram of a multimedia information authoring system according to the present invention.

In FIG. 1, the search key input means 101 is a means by which the user inputs a keyword, a position, or the like serving as a search key in order to search for an object to be edited. The multimedia information search means 102 is a means for retrieving audio or video in an arbitrary section from the multimedia information. The search result output means 103 is a means for outputting the search results of the multimedia information search means 102 for presentation to the user. The index creating means 104 is a means for creating, for the multimedia information, an index indicating the correspondence between the audio information and the moving image.

More specifically, the voice included in the multimedia information 105 is divided into voice sections in which voice is present, and into sections by the speaker name corresponding to each voice. For the moving image, moving image sections are defined according to an arbitrary rule, for example one section for each person appearing in the image. The index creation means 104 may be executed by the user or automatically at any time after the multimedia information has been stored. In the following, it is assumed that it is executed by the user.

The user first enters a search key using the search key input means 101 in order to search for an object to be edited. Here, the search key can be a character string, an arbitrary partial image in a still image, or a section. The search key input means 101 accepts single or multiple inputs for all of these search keys. Next, the multimedia information search means 102 uses the search key input by the search key input means 101 to search the multimedia information index 106 for an index entry matching the search key, and retrieves the multimedia information of the specific section that has that index entry. Further, the search result output means 103 displays the index of the multimedia information on a display and outputs the voice or moving image retrieved by the multimedia information search means 102. Specifically, audio is output from a speaker, headphones, or the like, and a moving image is shown on a display or the like.

For example, when a search key indicating a specific section of the voice included in the multimedia information is input through the search key input means 101, the multimedia information search means 102 searches for the voice, or the moving image corresponding to the voice, using the multimedia information index 106 created in advance by the index creation means 104 based on the voice sections of each speaker. The retrieved sound or moving image is output by the search result output means 103.
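The index lookup described above can be pictured as a filter over index entries. The following is a minimal sketch, not the patented implementation: the `IndexEntry` record, the `search_index` function, and the sample times and speaker labels are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IndexEntry:
    """One voice section in a hypothetical multimedia information index."""
    start: float            # section start time in seconds
    end: float              # section end time in seconds
    speaker: Optional[str]  # speaker name, if speaker identification was run


def search_index(index: List[IndexEntry],
                 time: Optional[float] = None,
                 speaker: Optional[str] = None) -> List[IndexEntry]:
    """Return the entries that match every search key that was supplied."""
    hits = []
    for entry in index:
        if time is not None and not (entry.start <= time < entry.end):
            continue
        if speaker is not None and entry.speaker != speaker:
            continue
        hits.append(entry)
    return hits


# Illustrative index: three voice sections from two speakers.
index = [
    IndexEntry(0.0, 4.2, "A"),
    IndexEntry(5.0, 9.5, "B"),
    IndexEntry(10.1, 12.0, "A"),
]

# A position on the time axis selects the section that contains it.
by_time = search_index(index, time=6.0)
# A speaker name selects every section of that speaker.
by_speaker = search_index(index, speaker="A")
```

The start and end times of each hit are what a search result output stage would need in order to play back the matching audio or video range.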

FIG. 2 shows a configuration example of the index creating means 104 of the present invention. In FIG. 2, a voice index 204 is created as a multimedia information index 106. Therefore, the index creating means 104 is composed of the speech section detecting means 201 and the speech index creating means 202.

The voice section detecting means 201 is a means for detecting human voice sections from the acoustic information included in the stored multimedia information 203. One method of detecting a voice section in acoustic information is to check whether short-time power at or above a certain threshold has continued for at least a certain time (see "Digital Speech Processing", Tokai University Press, p. 153, "8.2 Voice Detection"). The voice index creating means 202 creates an index based on the information on the voice sections detected by the voice section detecting means 201. Here, the index created by the voice index creating means 202 includes, for example, the start and end times of each detected voice section and the voice section length.

By creating an index based on voice sections in this way, it becomes possible to easily obtain the moving image corresponding to the voice in a voice section, or conversely the voice in the voice section corresponding to a moving image.
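The short-time power criterion mentioned for the voice section detecting means 201 can be sketched as follows. This is only an illustration under stated assumptions: the frame length, power threshold, and minimum duration are arbitrary values, and the function name `detect_voice_sections` is hypothetical. Each returned tuple carries the start time, end time, and section length that the voice index is said to contain.

```python
from typing import List, Sequence, Tuple


def detect_voice_sections(samples: Sequence[float],
                          sample_rate: int,
                          frame_len: int = 160,
                          power_threshold: float = 0.01,
                          min_duration: float = 0.1
                          ) -> List[Tuple[float, float, float]]:
    """Return (start, end, length) in seconds for each detected voice section."""
    n_frames = len(samples) // frame_len
    # Short-time power: mean squared amplitude of each non-overlapping frame.
    active = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        power = sum(x * x for x in frame) / frame_len
        active.append(power >= power_threshold)

    frame_dur = frame_len / sample_rate
    sections = []
    run_start = None
    # The trailing False is a sentinel that closes a run reaching the end.
    for i, is_active in enumerate(active + [False]):
        if is_active and run_start is None:
            run_start = i
        elif not is_active and run_start is not None:
            start, end = run_start * frame_dur, i * frame_dur
            # Keep the run only if the power stayed high for long enough.
            if end - start >= min_duration:
                sections.append((start, end, end - start))
            run_start = None
    return sections


# Toy signal at 8 kHz: silence, 0.3 s of sound, silence, a 0.02 s click, silence.
signal = ([0.0] * 4000 + [0.5, -0.5] * 1200 + [0.0] * 1600
          + [0.5] * 160 + [0.0] * 1440)
sections = detect_voice_sections(signal, sample_rate=8000)
```

In this toy input the short click exceeds the power threshold but does not last long enough, so only the longer burst is kept as a voice section.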

FIG. 3 shows another example of the configuration of the index creating means 104 of the present invention. In FIG. 3, the speaker identification means 801 is a means for comparing a voice with the voice standard pattern of a specific speaker to identify whether the voice is that of the specified speaker. One speaker identification method extracts features from the speech waveform, computes the distance or similarity to the standard pattern stored in advance for each registered speaker, and decides the recognition result based on that value (see "Digital Speech Processing", Tokai University Press, p. 196, "9.3 Configuration of Speaker Recognition System").

In FIG. 3, first, human voice sections are detected by the voice section detecting means 201 from the acoustic information of the stored multimedia information 203. Next, the speaker in each detected voice section is identified by the speaker identification means 801 based on the voice standard pattern 802. As a result of speaker identification, the corresponding speaker name is obtained for the voice in each voice section. The voice index creating means 202 then associates each voice section with its speaker name and creates an index 204 as the multimedia information index. Here, the index created by the voice index creating means 202 includes, for example, the start and end times of each detected voice section, the voice section length, and the speaker name.

By creating an index based on the speaker name of each voice section in this way, it becomes possible to easily obtain the video corresponding to the voice of a voice section, or conversely the voice corresponding to a video.
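The distance-based speaker identification and the resulting speaker-labelled index can be sketched as below. This is a simplified illustration, assuming each voice section has already been reduced to a small feature vector and that Euclidean distance with a fixed acceptance threshold is used; the text only says that a distance or similarity to stored standard patterns is evaluated. The function names, feature values, and speaker labels are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def identify_speaker(features: Sequence[float],
                     standard_patterns: Dict[str, Sequence[float]],
                     max_distance: float) -> Optional[str]:
    """Return the registered speaker with the nearest pattern, or None if too far."""
    best_name, best_dist = None, math.inf
    for name, pattern in standard_patterns.items():
        dist = math.dist(features, pattern)  # Euclidean distance (assumption)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None


def build_speaker_index(sections: List[Tuple[float, float, Sequence[float]]],
                        standard_patterns: Dict[str, Sequence[float]],
                        max_distance: float = 1.0) -> List[dict]:
    """Attach a speaker name to each (start, end, features) voice section."""
    index = []
    for start, end, features in sections:
        index.append({
            "start": start,
            "end": end,
            "length": end - start,
            "speaker": identify_speaker(features, standard_patterns, max_distance),
        })
    return index


# Illustrative standard patterns and voice sections (made-up feature vectors).
patterns = {"speaker_a": [1.0, 0.0], "speaker_b": [0.0, 1.0]}
voice_sections = [
    (0.0, 2.0, [0.9, 0.1]),
    (3.0, 4.5, [0.1, 1.1]),
    (5.0, 6.0, [5.0, 5.0]),  # far from every registered speaker
]
speaker_index = build_speaker_index(voice_sections, patterns)
```

A section whose features are far from every registered pattern is left without a speaker name rather than being forced onto the nearest speaker.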

FIG. 4 is a diagram showing a configuration example of the multimedia information search means 102 of the present invention. In FIG. 4, the lip recognition means 1501 is a means for recognizing lip movement from a human face image extracted from an input image and outputting the phonemes corresponding to the lip movement. One method of recognizing phonemes from lip movement first extracts a two-dimensional shape by image processing and then performs phoneme identification on that data using a neural network (see "Nonverbal Interface", Ohm, p. 149, "Recognition of quarrels"). The speech recognition means 1506 is a means for performing speech recognition on speech information. One method of recognizing input speech compares the input speech with standard phoneme patterns for each short section to obtain a distance, outputs the phoneme with a small distance as the phoneme recognition result, and then compares the resulting phoneme sequence with a word speech dictionary (see "Digital Speech Processing", Tokai University Press, p. 166, "8.6 Word Speech Recognition in Phonemes"). The image/sound matching means 1502 is a means for matching the phoneme sequence corresponding to the lip movement in a person image against the input sound. The scene extracting means 1503 is a means for cutting out the video of a designated section.

In FIG. 4, first, the lip recognition means 1501 recognizes the lip movement by comparing feature amounts such as the mouth shape and mouth area with the standard pattern 1504, and outputs a phoneme sequence as the lip recognition result. Next, the speech recognition means 1506 calculates the similarity between the speech spectrum in each phoneme section and each phoneme pattern in the phoneme standard pattern dictionary 1507, and outputs a phoneme sequence as the speech recognition result. The image/sound matching means 1502 then compares the phoneme sequence output from the lip recognition means 1501 with the phoneme sequence output from the speech recognition means 1506. As a result, the lip movement in the person image can be matched against the preceding and following voice sections and associated with them. Finally, the scene extracting means 1503 extracts, from the entire video, the video of the voice section associated with the person image.

With the above processing, it is possible to extract a video section including a voice section from an input image for a video of a person designated by a position input unit such as a pen. In addition, it is possible to easily obtain a person image corresponding to all voice sections of the same speaker, or a voice of all voice sections of the same speaker corresponding to the person image.
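The comparison of the two phoneme sequences can be sketched as follows. The text does not specify how agreement is measured, so this sketch assumes a generic sequence-similarity ratio from Python's standard `difflib` module and an arbitrary acceptance threshold; the function names and phoneme strings are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Sequence, Tuple


def phoneme_agreement(lip: Sequence[str], speech: Sequence[str]) -> float:
    """Similarity in [0, 1] between two phoneme sequences (assumed measure)."""
    return SequenceMatcher(None, list(lip), list(speech), autojunk=False).ratio()


def match_voice_section(lip_phonemes: Sequence[str],
                        sections: List[Tuple[float, float, Sequence[str]]],
                        min_agreement: float = 0.6
                        ) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the voice section that best agrees with the lips."""
    best, best_score = None, 0.0
    for start, end, speech_phonemes in sections:
        score = phoneme_agreement(lip_phonemes, speech_phonemes)
        if score > best_score:
            best, best_score = (start, end), score
    # Reject the association when even the best candidate agrees poorly.
    return best if best_score >= min_agreement else None


# Phonemes read from the lips of the designated person (illustrative).
lip_sequence = ["o", "h", "a", "y", "o"]
# Phonemes recognized from the audio of each candidate voice section.
candidate_sections = [
    (0.0, 1.0, ["a", "r", "i", "g", "a", "t", "o"]),
    (2.0, 3.0, ["o", "h", "a", "y", "o", "u"]),
]
matched = match_voice_section(lip_sequence, candidate_sections)
unmatched = match_voice_section(["p", "q"], candidate_sections)
```

The returned time range is what a scene extraction stage would cut out of the full video.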

FIG. 5 is a diagram showing another configuration example of the multimedia information search means 102 of the present invention. In FIG. 5, the person image extracting means 1901 is a means for automatically detecting the presence or absence of a person in an input image and detecting the person's face. One method of automatically detecting the presence of a person and the face from an input image matches against pyramid images obtained by sampling the image at a plurality of resolutions (see "Digital Signal Processing Handbook", published by the Institute of Electronics, Information and Communication Engineers). The lip recognition means 1902 is a means for recognizing lip movement from a human face image extracted from an input image and outputting the phonemes corresponding to the lip movement. The speech recognition means 1907 is a means for performing speech recognition on speech information. The image/sound matching means 1903 is a means for matching the input sound against the phoneme sequence corresponding to the lip movement in a person image. The scene extracting means 1904 is a means for cutting out the video of a designated section.

In FIG. 5, first, based on the position coordinates on the screen input using the position input means, the person image extracting means 1901 detects whether a person image is present in the area of the input image near the input position coordinates, and then extracts the person's face image. If one person image is detected in the input image, it becomes the designated person image. If more than one person image is detected, the designated person image is the one that contains, or is closest to, the coordinate point input by the position input means 101. Next, for the face image extracted by the person image extracting means 1901, the lip recognition means 1902 recognizes the lip movement by comparing feature amounts such as the mouth shape and mouth area with the standard pattern 1905, and outputs a phoneme sequence as the lip recognition result. Next, the speech recognition means 1907 calculates the similarity between the speech spectrum in each phoneme section and each phoneme spectrum in the phoneme standard pattern dictionary 1908, and outputs a phoneme sequence as the speech recognition result.

Here, the image/sound matching means 1903 compares the phoneme sequence output from the lip recognition means 1902 with the phoneme sequence output from the speech recognition means 1907. This makes it possible to match the lip movement in the person image against the preceding and following voice sections and associate them. Finally, the scene extracting means 1904 extracts, from the entire video, the video of the voice section associated with the person image.

According to the above processing, it is possible to extract a video section including a voice section from an input video with respect to a video of a person specified by position input means such as a pen.
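The rule for choosing the designated person image in FIG. 5 (the detected person that contains the input point, otherwise the closest one) can be sketched as below. It assumes each detected person image is represented by an axis-aligned bounding box in pixel coordinates, which the text does not state; the names `distance_to_box` and `select_person` and the sample boxes are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]


def distance_to_box(point: Point, box: Box) -> float:
    """Distance from a point to a box; zero when the point lies inside it."""
    x, y = point
    left, top, right, bottom = box
    dx = max(left - x, 0.0, x - right)
    dy = max(top - y, 0.0, y - bottom)
    return (dx * dx + dy * dy) ** 0.5


def select_person(point: Point, boxes: List[Box]) -> Optional[int]:
    """Return the index of the person box containing or nearest to the point."""
    if not boxes:
        return None  # no person image was detected
    return min(range(len(boxes)), key=lambda i: distance_to_box(point, boxes[i]))


# Two detected person images (illustrative coordinates).
detected = [(10, 10, 50, 80), (100, 20, 140, 90)]
inside_second = select_person((120, 50), detected)  # point inside the second box
near_first = select_person((60, 40), detected)      # point between the boxes
```

Because a point inside a box has distance zero, containment automatically takes priority over proximity.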

FIG. 6 is a diagram showing an example of a block configuration for displaying the index in the multimedia information authoring method of the present invention. In FIG. 6, the index creating means 303 corresponds to the index creating means 104 in FIG. 1. The index display means 301 is a means for visualizing the multimedia information index and displaying it on a display. In FIG. 6, first, the multimedia information index 304 created by the index creating means 303 is visualized by the index display means 301 and displayed on the display 302. For example, an index created from voice sections can be displayed by drawing the start time, end time, and length of each voice section as a bar in a two-dimensional coordinate system with time on the horizontal axis. Alternatively, an index of speech divided into sections for each speaker can be displayed by arranging one row of bars for each speaker.

Note that, specifically, the search result output means 103 in FIG. 1 is composed of an index display means 301 and a display 302.

By visualizing such an index, it is possible to visually perform multimedia information authoring.
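The bar-style display with time on the horizontal axis and one row per speaker can be approximated in plain text. This is only a rough sketch of the idea, assuming a fixed character width for the time axis; the function name `render_index` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def render_index(index: List[Tuple[float, float, str]],
                 total_seconds: float,
                 width: int = 40) -> List[str]:
    """Return one text row per speaker, with '#' marking that speaker's sections."""
    rows: Dict[str, List[str]] = {}
    for start, end, speaker in index:
        row = rows.setdefault(speaker, [" "] * width)
        # Map times to character columns along the horizontal time axis.
        first = int(start / total_seconds * width)
        last = max(first + 1, int(end / total_seconds * width))
        for col in range(first, min(last, width)):
            row[col] = "#"
    return [f"{speaker:>8} |{''.join(cells)}|" for speaker, cells in rows.items()]


# Illustrative index: (start, end, speaker) for a 20-second recording.
timeline = render_index(
    [(0, 5, "A"), (5, 10, "B"), (15, 20, "A")],
    total_seconds=20,
    width=20,
)
for line in timeline:
    print(line)
```

Each row reads left to right in time, so gaps in a row show where that speaker is silent.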

FIG. 7 is a diagram showing a screen display example in which the index is visualized. In FIG. 7, a video display area 401 is an area on the display for displaying a moving image. The index display area 402 is an area on the display for displaying the multimedia information index. The audio index display area 403 is an area on the display for displaying an audio index. The designated voice section 404 indicates a voice section specified by the user to request the output of a voice or a moving image. The designated image 405 indicates an image designated by the user to request output of a sound or a moving image.

In FIG. 7, first, the user designates, on the voice index displayed in the voice index display area 403 within the index display area 402, the voice section corresponding to a desired voice or moving image. The voice or moving image of that section can then be output. In addition, when the user wants the voice section or moving image that corresponds to the moving image currently displayed in the video display area 401, the user designates the image 405. As a result, the voice or moving image of the requested voice section can be output.

As another display example, FIG. 8 shows a screen display example in which the multimedia information authoring method of the present invention is used on a portable terminal. In FIG. 8, a video display area 702, a document display area 703, and a menu area 701 are provided on the screen of the portable information terminal. First, on the portable information terminal on the left side of FIG. 8, the user selects the item "extract dialogue" from the menu area 701. Next, during video playback in the video display area 702, the user designates, with the position input means 705, the position of the person image whose dialogue is to be extracted. Through these operations, a scene including the voice section corresponding to the designated person image is extracted using the multimedia information authoring method described above. The portable information terminal on the right side of FIG. 8 shows the operation of associating the extracted video with the document in the document display area 703: the icon 704, which represents the extracted scene, is moved on the screen using a position input device such as a mouse and placed at an arbitrary position in the document display area 703.

FIG. 9 is an example of a block configuration of a multimedia information search client-server system using the multimedia information authoring method of the present invention. In FIG. 9, the search key input means 601 is a means by which the user inputs a keyword or a position serving as a search key in order to search for an object to be edited. The voice transmission requesting means 602 is a means for requesting the server side to transmit voice information. The multimedia information acquiring means 603 is a means for acquiring, from a database (not shown), the multimedia information including the audio information requested to be transmitted. The voice extracting means 604 is a means for extracting the voice information part included in the multimedia information. The voice transmitting means 605 is a means for transmitting voice information to the client side. The voice search means 606 is a means for performing voice recognition on voice information and, based on the voice recognition result, searching for a character string designated as a search key or searching for a speaker. One method of recognizing input speech compares the input speech with standard phoneme patterns for each short section to obtain a distance, outputs the phoneme with a small distance as the phoneme recognition result, and then compares the resulting phoneme sequence with a word speech dictionary (see "Digital Speech Processing", Tokai University Press, p. 167, "8.6 Word Speech Recognition in Phonemes"). The moving image transmission requesting means 607 is a means for requesting the server side to transmit moving image information in a specific section. The scene extracting means 608 is a means for extracting moving image information of a designated section from the entire moving image. The moving image transmitting means 609 is a means for transmitting moving image information to the client side. The moving image presentation means 610 is a means for displaying a moving image.

In FIG. 9, on the client side, first, the characters input using the search key input means 601 are set as the designated character string. Next, the voice transmission requesting means 602 requests transmission of the audio information in specific multimedia information. On the server side, after the transmission request for audio information is received, the multimedia information acquiring means 603 acquires the multimedia information including the requested audio information from the database. The audio information part of the acquired multimedia information is then extracted by the voice extracting means 604, and the voice transmitting means 605 transmits only the audio information part to the client. On the client side, the voice search means 606 searches the received voice information for the designated character string. Here, a voice search method is assumed in which voice recognition is first performed on the received voice information and the designated character string is then searched for in the recognition result. Next, the moving image transmission requesting means 607 requests transmission of the moving image corresponding to the voice section containing the designated character string. On the server side, based on the received moving image transmission request, the scene extracting means 608 extracts the moving image of the requested section from the entire moving image, and the moving image transmitting means 609 transmits it to the client side.

With the above configuration, in a multimedia information search client server system capable of voice search, it is possible to transmit only necessary information without transmitting all multimedia information from the server side to the client.
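The two-step exchange of FIG. 9 (audio first, then only the matching video section) can be simulated in a single process. This is a toy sketch, not the protocol itself: network transport is replaced by direct method calls, speech recognition is replaced by a stub that reads text stored with each audio section, and all class, function, and data names are hypothetical.

```python
from typing import Dict, List, Tuple


class Server:
    """Holds full multimedia items and sends out only the part requested."""

    def __init__(self, database: Dict[str, dict]):
        self.database = database

    def handle_audio_request(self, media_id: str) -> List[dict]:
        # Information acquisition, voice extraction, and voice transmission.
        return self.database[media_id]["audio"]

    def handle_video_request(self, media_id: str,
                             start: float, end: float) -> List[Tuple[float, str]]:
        # Scene extraction and moving image transmission for one section only.
        frames = self.database[media_id]["frames"]
        return [(t, f) for t, f in frames if start <= t < end]


def recognize(audio_section: dict) -> str:
    """Stand-in for speech recognition: the toy audio already carries its text."""
    return audio_section["text"]


def client_search(server: Server, media_id: str,
                  query: str) -> List[Tuple[float, str]]:
    """Fetch audio, search it for the query, then request only the matching video."""
    audio = server.handle_audio_request(media_id)
    for section in audio:
        if query in recognize(section):
            return server.handle_video_request(
                media_id, section["start"], section["end"])
    return []


# Illustrative database with one item: two audio sections and six frames.
database = {
    "meeting": {
        "audio": [
            {"start": 0, "end": 2, "text": "good morning"},
            {"start": 2, "end": 5, "text": "budget review"},
        ],
        "frames": [(t, f"frame{t}") for t in range(6)],
    }
}
server = Server(database)
received = client_search(server, "meeting", "budget")
```

Only the frames inside the matching voice section cross the client-server boundary, which is the bandwidth saving the configuration aims for.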

FIG. 10 is a diagram showing another example of a block configuration of the multimedia information search client server system.

In FIG. 10, on the client side, first, the speaker name input using the search key input means 601 is set as the designated speaker name. Next, the voice transmission requesting means 602 requests transmission of the audio information in specific multimedia information. On the server side, after the transmission request for audio information is received, the multimedia information acquiring means 603 acquires the multimedia information including the requested audio information from the database. The audio information part of the acquired multimedia information is then extracted by the voice extracting means 604, and the voice transmitting means 605 transmits only the audio information part to the client. On the client side, the voice search means 606 searches the received voice information for the designated speaker. Here, a voice search method is assumed in which speaker identification is performed on the received voice information and the designated speaker name is searched for based on the identification result. Next, the moving image transmission requesting means 607 requests transmission of the moving image in the voice section corresponding to the designated speaker name. On the server side, based on the received moving image transmission request, the scene extracting means 608 extracts the moving image of the requested section from the entire moving image, and the moving image transmitting means 609 transmits it to the client side.

With the above processing, in a multimedia information search client-server system capable of speaker search, it becomes possible to transmit only the necessary information without transmitting all the multimedia information from the server side to the client.

FIG. 11 is a screen display example of the multimedia information authoring method of the present invention. In FIG. 11, the video display area 1201 is an area on the display for displaying a moving image. The index display area 1202 is an area on the display for displaying the multimedia information index. The speaker name display area 1203 is an area for displaying the speaker name corresponding to each voice section. Speaker names can be displayed either for each voice section or grouped by speaker. In FIG. 11, based on the speaker names displayed in the speaker name display area 1203, the user inputs a person name using character input means such as a keyboard. Alternatively, the speaker name is input by designating a speaker displayed in the speaker name display area with position input means such as a mouse. Based on the input speaker name, the voice or moving image of that speaker's voice sections can be output.

Advantage of the invention

According to the present invention, for a video containing the voices of a plurality of speakers, the voice or moving image of the voice section corresponding to each speaker's voice can be output.

When multiple person images exist in the same image, specifying a voice section makes it possible to extract the person image corresponding to the voice in that voice section, as well as the voices and person images in all voice sections of the same speaker as the voice in the specified voice section.

Similarly, specifying an image makes it possible to output the voice or moving image of the voice section corresponding to the designated image, or the person images corresponding to the voices in all voice sections of the same speaker as the voice in that voice section.

Industrial applicability

The present invention is suitable for portable information terminals such as PDAs (Personal Digital Assistants) and notebook personal computers, and for multimedia terminals such as personal computers and workstations, that handle video including audio information. It makes it possible to provide such systems with an authoring method for easily extracting video for each speaker.

Claims

1. An authoring method for multimedia information, consisting of:

means (105) for storing multimedia information including sound information and moving image information;

index creation means (104) for reading the multimedia information and creating an index indicating the correspondence between the sound information and the moving image information;

means (106) for storing the index;

search key input means (101) for inputting search information relating to a desired moving image or desired sound information;

multimedia information search means (102) for searching for a moving image or sound information corresponding to the search information by referring to the index; and

search result output means (103) for outputting the search result.

2. The authoring method for multimedia information according to claim 1, wherein the index creation means further comprises:

voice section detection means (201) for detecting a voice section of the audio information included in the multimedia information; and

voice index creation means (202) for creating a voice index based on the voice section.

3. The authoring method for multimedia information according to claim 1, wherein the search result output means has an index display means and a display, and displays the search result and the index.

4. The authoring method for multimedia information according to claim 1, wherein the search result output means has an index display means and a display and displays the search result and the index, and

a voice section designated using the index displayed on the display is specified as the search information.

5. The authoring method for multimedia information according to claim 1, wherein the search information is an arbitrary moving image.

6. The authoring method for multimedia information according to claim 1, wherein the index creation means consists of:

voice section detection means (201) for detecting a voice section of the audio information included in the multimedia information;

speaker identification means (810, 802) for identifying the speaker of the voice in each voice section detected by the voice section detection means, for all voice sections; and

voice index creation means (202) for creating a voice index based on the speaker and the voice section.

7. The authoring method for multimedia information according to claim 1, wherein the search information is a person name, and a voice in a voice section of that person, or a person image corresponding to the voice, is searched for.

8. The authoring method for multimedia information according to claim 1, wherein the multimedia information search means (102) has:

lip recognition means (1501) for detecting lip movement from a person image in a moving image and identifying phonemes corresponding to the lip movement;

speech recognition means (1506) for recognizing speech information in the moving image based on a phoneme standard pattern;

image/speech matching means (1502) for comparing and collating the phoneme identification result output by the lip recognition means with the speech recognition result output by the speech recognition means; and

scene extracting means (1503) for extracting a moving image of the voice section so determined,

whereby a person image corresponding to the voice of the voice section, or a voice of the voice section corresponding to the person image, is obtained.

9. The multimedia information authoring method according to claim 1, wherein the search information is a person image in a moving image, and a sound or a moving image in a voice section of the person image is searched.

10. A multimedia information search client-server system, having:

a multimedia information display client (hereinafter referred to as a client) provided with:

voice transmission request means (602) for transmitting a voice transmission request protocol;

voice search means (606) for searching for voice; and

moving image transmission request means (607) for transmitting a moving image transmission request protocol; and

a multimedia information search server (hereinafter referred to as a server) provided with:

information acquisition means (603) for receiving the voice transmission request protocol and acquiring the multimedia information specified in the protocol;

voice extracting means (604) for extracting voice from the multimedia information;

voice transmitting means (605) for transmitting the voice; and

moving image transmitting means (609) for transmitting a moving image,

wherein the server has scene extracting means (608) for extracting, after receiving the moving image transmission request protocol, a moving image of the section designated by the protocol, and

only information of a desired section is communicated, without communicating all of the multimedia information.