WO2005069171A1

WO2005069171A1 - Document correlation device and document correlation method

Info

Publication number: WO2005069171A1
Application number: PCT/JP2005/000333
Authority: WO
Inventors: Kyoji Hirata
Original assignee: Nec Corporation
Priority date: 2004-01-14
Filing date: 2005-01-14
Publication date: 2005-07-28
Also published as: JP4600828B2; JPWO2005069171A1

Abstract

A document correlation method is provided that includes a step (a) of preparing content containing at least one of audio information and video information in which a plurality of speakers appear, together with a document describing that content; and a step (b) of correlating the content with the document on a per-speaker basis.

Description

Specification

Document association device and document association method

Technical field

The present invention relates to a document associating device and a document associating method, and more particularly, to a document associating device and a document associating method that derive a correspondence relationship between content such as video or audio and document information related to the content.

Background art

[0002] Methods are known for automatically associating document data with the corresponding part of an audio recording or of a video recording accompanied by audio. For example, Japanese Unexamined Patent Application Publication No. 7-199379 proposes a method in which the voice in a voice recording or in a video accompanied by voice is converted into text by voice recognition processing, the text is compared with document information stored in order in a document storage device, and the two are considered to be the same if they contain the same set of characters. In this method, an automatic speech recognizer decodes the speech, and the decoded text is collated with the document information by identifying similar words or clusters of words.

[0003] Also, Japanese Patent Application Laid-Open No. 2000-270263 proposes a system for broadcast programs in which the announcement manuscript and the subtitle content are extremely similar. The system performs voice recognition processing on the announcement, derives the correspondence between the speech recognition result and the subtitle texts arranged in order of presentation time, and detects and records the timing information of a start point and an end point as synchronization points.

[0004] Furthermore, Japanese Patent Application Laid-Open No. 8-212190 proposes a system that, when associating a scenario text with a moving image accompanied by voice, predicts the silent sections that would occur if the scenario text were converted to speech, and associates the voice with the text by comparing the prediction result with the silent sections of the voice signal in the moving image.

[0005] The first problem with these conventional methods for associating content, such as video or audio, with a document is that the accuracy of the correspondence between audio information and document data depends heavily on the accuracy of voice recognition. Therefore, when sufficient speech recognition accuracy is not obtained, the correspondence between the speech information and the document data is not derived accurately. In the conventional methods described in the above-mentioned Japanese Patent Application Laid-Open No. 7-199379 and Japanese Patent Application Laid-Open No. 2000-270263, speech is first converted into text by speech recognition processing, and the converted text is then synchronized with the document data. As a result, if the text output by speech recognition contains many errors, the correspondence also contains many errors, such as text that cannot be matched to the document data or text that is matched to a completely different part of the document. In general, speech recognition accuracy is known to decrease markedly when sound other than the uttered voice is loud, such as when background music is superimposed on the voice, or in conversations recorded under high noise such as outdoors. Even in ordinary conversation, there are many cases where high recognition accuracy cannot be expected because of the positional relationship between the microphone and the speaker, the manner of speaking, the conversation style, and the characteristics of the speaker. If the conversation content is limited to a specific topic, the accuracy of speech recognition can be improved by measures such as selecting the recognition dictionary best suited to the estimated topic. In general, however, the topic often cannot be estimated in advance.
In such a case, if an incorrect dictionary is used, the accuracy of speech recognition is reduced further. When a voice recording or a video recording accompanied by voice is associated with document information on the basis of speech recognition results that contain many errors, the number of association errors increases, making the result difficult to use for simultaneous text display or for cueing by keyword search.

[0007] The second problem with the conventional methods is that, when the document information is not a faithful reproduction of the voice but merely a summary of its contents, the document information and the voice information sometimes cannot be matched correctly. For example, when audio information from a lecture is associated with explanatory materials or summary documents created by the speaker, the document contains no part that directly corresponds to the text created from the audio information, so the document information and the audio information cannot be matched correctly.

[0008] The third problem with the conventional methods is that, because matching based on speech recognition operates on individual words, the appearance of a single word can greatly shift the correspondence when the document content and the speech information do not match completely.

[0009] As a related technique, Japanese Patent Application Laid-Open No. 2000-348064 (priority claim number: 09/288724, priority claim country: United States) discloses a method and apparatus for retrieving speech information using content information and speaker information. The method retrieves speech information from one or more speech sources, and comprises receiving a user query specifying at least one content constraint and one speaker constraint, and comparing the user query with the content index and speaker index of the audio source to identify audio information that matches the user query.

As a related technique, Japanese Patent Application Laid-Open No. 2002-189728 discloses a multimedia information editing apparatus, a method and a recording medium, and a multimedia information distribution system. This multimedia information editing apparatus edits multimedia information and is characterized by comprising storage means, voice discriminating means, document converting means, and multimedia structuring means. The storage means stores multimedia information such as audio and moving images. The voice discriminating means determines whether voice is added to the multimedia information stored in the storage means. The document converting means converts the voice information into document information when the voice discriminating means determines that voice is added. The multimedia structuring means performs linguistic analysis of the document converted by the document converting means, and structures the document and associates it with the multimedia information.

[0011] As a related technique, Japanese Patent Application Laid-Open No. 2002-236494 discloses techniques of a speech section discrimination device, a speech recognition device, a program, and a recording medium. This voice section discriminating apparatus is characterized by comprising acoustic analysis means, standard pattern storage means, matching means, determining means, and voice section discriminating means. The acoustic analysis means acoustically analyzes a voice input from the outside at a predetermined cycle, and obtains an acoustic feature amount based on the analysis result. The standard pattern storage means stores standard patterns corresponding to a single speaker's voice and to a mixed voice of a plurality of speakers, on the premise that the voices of a plurality of speakers may be mixed in the input voice. The matching means performs matching between the standard patterns stored in the standard pattern storage means and the acoustic feature amount obtained by the acoustic analysis means. The determining means determines, at each of the predetermined cycles, which of the standard patterns the input voice is similar to, based on the processing result of the matching means. The voice section discriminating means discriminates the voice section of each speaker based on the result of the determination by the determining means.

[0012] As a related technique, Japanese Patent Application Laid-Open No. 2002-366552 (priority claim number: 09/962659, priority claiming country: United States) discloses a method and system for searching recorded voice and retrieving related segments. The method searches recorded voice in a database and comprises: a) converting the recorded speech to text using a speech recognition system; b) creating a full-text index of the recorded speech using an information extender, wherein the full-text index includes a plurality of timestamps indicating the occurrence of words in the recorded speech; c) searching the text with a full-text server using the full-text index; and d) storing the search text, the full-text index, and the recorded speech in the database. Specific content of the recorded audio can thus be played back using the full-text index without listening to the entire recording.

[0013] As a related technique, Japanese Patent Application Laid-Open No. 11-242669 discloses a technique of a document processing apparatus. This document processing apparatus is characterized by comprising voice input means, extraction means, attribute generation means, document storage means, instruction means, output means, and attachment means. The voice input means inputs voice. The extraction means extracts information for specifying the speaker from the voice input by the voice input means. The attribute generation means generates speaker attribute information by comparing the extracted information with predetermined reference information. The document storage means stores a document. The instruction means designates a position in the document to which the input voice is to be attached. The output means outputs the document. The attachment means stores, in the document storage means, group information including information on the position in the document designated by the instruction means, the input voice, and the speaker attribute information generated by the attribute generation means.

Disclosure of the invention

An object of the present invention is to provide a document associating device and a document associating method for accurately associating significant sections defined in content such as audio and video with sections in a document.

Another object of the present invention is to provide a document associating device and a document associating method for accurately associating significant sections in content with sections in a document without being affected by the state of the content.

Another object of the present invention is to provide a document associating device and a document associating method for accurately associating significant sections in content with sections in a document without being affected by the type of document.

[0017] These objects, other objects, and advantages of the present invention can be easily confirmed by the following description and the accompanying drawings.

[0018] In order to solve the above problems, the document association method of the present invention includes: (a) a step of preparing content including at least one of audio information and video information in which a plurality of speakers appear, and a document describing that content; and (b) a step of deriving the correspondence between the content and the document for each speaker.

In the above document association method, the (b) step includes: (b1) dividing the content into a plurality of content sections by dividing the content for each speaker; (b2) dividing the document into a plurality of document sections by dividing the document for each speaker; and (b3) associating the plurality of content sections with the plurality of document sections.

[0020] In the above-described document association method, the (b2) step includes: (b21) extracting the point in time at which the speaker changes from one of the plurality of speakers to another; and (b22) dividing the content for each speaker based on the point in time when the speaker changes.

In the above-described document association method, the step (b21) includes a step (b211) of extracting the change point of the voice of the speaker from the voice information, wherein the content is the voice information.

[0022] In the above-mentioned document association method, the step (b21) includes a step (b212) of extracting the change point of the video of the speaker from the video information, wherein the content is the video information.

In the above document association method, the content is audio-video information in which the audio information and the video information are synchronized.

In the above-described document association method, the step (b21) includes a step (b213) of analyzing a change point of a sound feature of the audio information and deriving a time point at which the speaker changes.

[0025] In the above document association method, the step (b21) includes a step (b214) of analyzing a change point of a visual feature of the video information and deriving a time point at which the speaker changes.

[0026] In the above document association method, the step (b21) includes a step (b215) of performing a change point analysis of a visual feature of the video information and a change point analysis of a sound feature of the audio information, and deriving the point in time at which the speaker changes.

[0027] In the above document association method, the step (b) includes the step of (b4) analyzing the structure of the document, and dividing the document for each speaker.

[0028] In order to solve the above-mentioned problems, the computer program product of the present invention has program code means for executing, when used on a computer, all of the steps described in any one of the above methods.

[0029] The computer program product having the above program code means is stored in a computer-readable storage means.

[0030] In order to solve the above problem, the document association apparatus of the present invention includes a content section extraction unit, a document section extraction unit, and a section correspondence deriving unit. The content section extraction unit extracts a plurality of content sections by dividing, for each speaker, content that includes at least one of audio information and video information in which a plurality of speakers appear. The document section extraction unit extracts a plurality of document sections by dividing the document describing that content into speaker units. The section correspondence deriving unit derives a correspondence between the plurality of content sections and the plurality of document sections.

[0031] In the above document associating device, the content is the audio information. The content section extraction unit analyzes the sound characteristics of the audio information and extracts the plurality of content sections.

[0032] In the above document associating device, the content is the video information. The content section extraction unit analyzes the visual characteristics of the video information and extracts the plurality of content sections.

[0033] In the above document associating device, the content is audio-video information in which the audio information and the video information are synchronized. The content section extraction unit integrates the result of the analysis of the sound features of the audio information and the result of the analysis of the visual features of the video information to extract the plurality of content sections.

[0034] In the above document associating device, the content section extraction unit includes an audio section extraction unit, a video section extraction unit, and an audio-video section integration unit. The audio section extraction unit analyzes the sound characteristics of the audio information, divides the audio information into speaker units, and extracts a plurality of audio sections. The video section extraction unit analyzes the visual characteristics of the video information, divides the video information into speaker units, and extracts a plurality of video sections. The audio-video section integration unit extracts the plurality of content sections based on a plurality of pieces of audio section information regarding the plurality of audio sections and a plurality of pieces of video section information regarding the plurality of video sections.

[0035] In the above document associating device, the content section extraction unit extracts, as a speaker change point, the time at which the speaker in the content changes from one of the plurality of speakers to another, and extracts the plurality of content sections based on the speaker change points.

[0036] In the above document associating device, the content includes the audio information. The content section extraction unit extracts the speaker change point based on a change in at least one feature of the prosodic information in the audio information, namely utterance pitch, utterance speed, and utterance loudness.
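As a rough illustration of extracting speaker change points from a change in a prosodic feature, the sketch below flags the frames at which the mean utterance pitch before and after the frame differs sharply. It is a minimal sketch under stated assumptions, not the procedure of the specification: the function name `speaker_change_points`, the per-frame pitch input, the window size, and the threshold are all hypothetical.

```python
from statistics import mean


def speaker_change_points(pitch, window=3, threshold=40.0):
    """Return frame indices where the mean pitch jumps by more than `threshold`.

    `pitch` is a hypothetical list of per-frame pitch values in Hz.
    `window` and `threshold` are illustrative tuning parameters.
    """
    # Score each frame by the difference between the mean pitch of the
    # `window` frames before it and the `window` frames starting at it.
    scores = {}
    for i in range(window, len(pitch) - window + 1):
        before = mean(pitch[i - window:i])
        after = mean(pitch[i:i + window])
        scores[i] = abs(after - before)

    # Keep a frame only if its score exceeds the threshold and is the
    # largest score within +/- `window` frames (a local peak).
    points = []
    for i, score in scores.items():
        neighbours = [scores.get(j, 0.0) for j in range(i - window, i + window + 1)]
        if score > threshold and score == max(neighbours):
            points.append(i)
    return points
```

For a pitch track that stays at 120 Hz for six frames and then at 210 Hz for six frames, the function reports frame 6 as the single change point. The same windowed comparison could in principle be applied to utterance speed or loudness.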

[0037] In the above document associating device, the content includes the audio information. The content section extraction unit extracts the speaker change point based on a change in the conversation mode in the voice information.

[0038] In the above document associating device, the content includes the video information. The content section extraction unit extracts the speaker change point based on a change in a visual feature of a person in the video information.

In the above-described document association device, the content includes the video information. The content section extraction unit extracts the speaker change point based on a change in a facial feature of a person in the video information.

[0040] In the above document associating device, the content includes the video information. The content section extraction unit extracts the speaker change point based on a change in a visual feature of a person's clothing in the video information.

[0041] In the above document associating device, the document section extracting unit extracts the plurality of document sections based on the format information of the document.

[0042] In the above document associating device, the document section extracting unit extracts the plurality of document sections based on the description about the speaker written in the document.
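One way to picture extraction based on a description about the speaker is a transcript in which each utterance begins with a speaker label. The sketch below is a minimal illustration under that assumption: the `Name: text` line layout, the pattern `SPEAKER_LINE`, and the function `split_document_by_speaker` are hypothetical and are not taken from the specification.

```python
import re

# Hypothetical layout: a line such as "Chair: Let us begin." starts a new section.
SPEAKER_LINE = re.compile(r"^(?P<speaker>[^:]{1,40}):\s*(?P<text>.*)$")


def split_document_by_speaker(lines):
    """Group transcript lines into document sections, one per labelled utterance."""
    sections = []
    for line in lines:
        match = SPEAKER_LINE.match(line)
        if match:
            # A labelled line opens a new document section.
            sections.append({"speaker": match["speaker"].strip(),
                             "text": match["text"]})
        elif sections:
            # An unlabelled line continues the current section.
            sections[-1]["text"] += " " + line.strip()
        # Unlabelled lines before the first label are ignored in this sketch.
    return sections
```

Each returned section carries both the speaker name and the text, so the same pass yields the document sections and the speaker information used later for matching.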

[0043] In the above document associating device, the document section extracting unit extracts the plurality of document sections based on the tag information of the structured document in the document.
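For a structured document, the tag information can mark the section boundaries directly. The following sketch assumes a hypothetical XML layout in which each utterance is an `utterance` element with a `speaker` attribute; neither the element name nor the attribute name comes from the specification, and the function `sections_from_tags` is illustrative only.

```python
import xml.etree.ElementTree as ET


def sections_from_tags(xml_text, tag="utterance", attr="speaker"):
    """Return (speaker, text) pairs, one per element named `tag`, in document order."""
    root = ET.fromstring(xml_text)
    return [(element.get(attr), (element.text or "").strip())
            for element in root.iter(tag)]
```

With this layout no text analysis is needed: the document sections and their speakers are read straight from the markup.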

[0044] In the above document associating device, the document section extracting unit extracts the plurality of document sections based on a change in conversation characteristics in the document.

[0045] In the above document associating device, the section correspondence deriving unit associates the plurality of content sections with the plurality of document sections based on a comparison between the section lengths of the plurality of content sections and the document amounts of the plurality of document sections.
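The comparison of section lengths with document amounts can be pictured by placing both sequences on a common 0-to-1 scale and pairing each content section with the document section it overlaps most. This is a minimal sketch under assumptions: measuring content sections in seconds and document sections in characters, the largest-overlap rule, and the names `normalise` and `match_by_length` are all hypothetical.

```python
def normalise(lengths):
    """Convert section lengths into (start, end) positions on a 0..1 scale."""
    total = sum(lengths)
    bounds, acc = [], 0.0
    for n in lengths:
        start = acc / total
        acc += n
        bounds.append((start, acc / total))
    return bounds


def match_by_length(durations, doc_amounts):
    """For each content section, return the index of the document section
    whose normalised span overlaps it the most."""
    content = normalise(durations)
    docs = normalise(doc_amounts)
    result = []
    for c_start, c_end in content:
        overlaps = [max(0.0, min(c_end, d_end) - max(c_start, d_start))
                    for d_start, d_end in docs]
        result.append(overlaps.index(max(overlaps)))
    return result
```

For example, content sections of 30, 60, and 30 seconds and document sections of 100, 220, and 80 characters pair up in order, because their normalised spans nearly coincide. The rule relies only on relative lengths, so it needs no speech recognition at all.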

In the above document associating apparatus, the section correspondence deriving unit performs the association based on the result of dynamic programming matching performed on the plurality of content sections and the plurality of document sections.
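Dynamic programming matching of two section sequences can be sketched with the familiar dynamic-time-warping recurrence. The code below is one possible illustration, not the procedure of the specification: the local cost (absolute difference of normalised section lengths), the three allowed moves, and the name `dp_match` are assumptions.

```python
def dp_match(content, docs):
    """Align two sequences of normalised section lengths.

    Returns the list of (content_index, document_index) pairs on the
    minimum-cost alignment path.
    """
    n, m = len(content), len(docs)
    inf = float("inf")
    # cost[i][j] is the best cost of aligning the first i content sections
    # with the first j document sections.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(content[i - 1] - docs[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],  # advance both
                                     cost[i - 1][j],      # advance content only
                                     cost[i][j - 1])      # advance document only

    # Trace the path back from the end, preferring the diagonal on ties.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        best = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if best == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif best == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Because the path may advance one sequence while holding the other, a single content section can be paired with several document sections and vice versa, which tolerates sections that were split or merged on one side.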

[0047] In the above document associating device, the section correspondence deriving section includes a content speaker identifying section, a document speaker information extracting section, and a section matching section. The content speaker identification unit specifies a speaker in at least one of the plurality of content sections. The document speaker information extracting unit specifies a speaker in at least one of the plurality of document sections, and obtains speaker information as information of the speaker. The section matching unit matches the plurality of content sections with the plurality of document sections based on the speaker information.

[0048] In the above document associating device, the content speaker identification unit includes a content feature amount extraction unit, a speaker information storage unit, and a feature amount matching identification unit. The content feature amount extraction unit extracts a feature amount in at least one of the plurality of content sections. The speaker information storage unit stores the feature amount and the speaker in association with each other. The feature amount matching identification unit identifies the speaker based on a comparison between the stored feature amount and the extracted feature amount.
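The comparison between stored and extracted feature amounts can be illustrated as a nearest-neighbour lookup. This is a minimal sketch under assumptions: the two-element feature vector (mean pitch in Hz and speaking rate), the Euclidean distance, and the function name `identify_speaker` are hypothetical choices made for illustration.

```python
import math


def identify_speaker(extracted, stored):
    """Return the speaker whose stored feature vector is closest to `extracted`.

    `stored` maps a speaker name to a feature vector of the same length
    as `extracted`.
    """
    return min(stored, key=lambda name: math.dist(stored[name], extracted))
```

For instance, with stored vectors `(110.0, 4.2)` for speaker A and `(220.0, 5.1)` for speaker B, an extracted vector `(205.0, 5.0)` is attributed to speaker B. A practical system would likely scale the features first so that no single dimension dominates the distance.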

[0049] In the above document associating device, the content speaker identification unit identifies the speaker based on at least one feature of the prosodic information in the audio information, namely voice pitch, voice length, and voice strength.

[0050] In the above document associating device, the content speaker identification unit specifies the speaker based on the feature amount representing the conversation mode in the audio information.

[0051] In the above document associating device, the content speaker identification unit specifies the speaker based on the visual feature amount of the person in the video information.

[0052] In the above document associating device, the content speaker identification unit uses the facial features of the person as the visual features of the person.

[0053] In the above document associating device, the document speaker information extracting unit specifies the speaker based on the description about the speaker written in the document.

[0054] In the above document associating device, the document speaker information extracting unit specifies the speaker based on the metadata of the structured document in the document.

[0055] In the above document associating device, the section matching unit associates the plurality of content sections with the plurality of document sections so that the speakers of the plurality of content sections match the speakers of the plurality of document sections.

[0056] In the above document associating device, the section matching unit associates the plurality of content sections with the plurality of document sections based on the result of dynamic programming matching performed on the plurality of content sections and the plurality of document sections.

In the above-described document association device, the content includes audio information. The document associating device further includes a speech recognition unit that extracts speech contents in the plurality of content sections and outputs speech text information. The section correspondence deriving unit associates the plurality of content sections with the plurality of document sections based on the similarity between the uttered text information and the document information of the document.

[0058] In the above document associating device, the section correspondence deriving unit matches the utterance text information with the document information by dynamic programming matching between the words that appear in the utterance text information and the words that appear in the document information.

[0059] In the above document associating device, the section correspondence deriving unit includes a basic word extraction unit and a basic word group similarity deriving unit. The basic word extraction unit extracts one or more first basic words used in each of the plurality of content sections in the utterance text information, and one or more second basic words used in each of the plurality of document sections. The basic word group similarity deriving unit measures the similarity between the plurality of first basic words and the plurality of second basic words. The section correspondence deriving unit derives the correspondence based on the similarity.
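As an illustration of measuring the similarity between two groups of basic words, the sketch below extracts a word set from each text and compares the sets with the Jaccard index. The passage above does not name a similarity measure, so the Jaccard index is an assumption swapped in for illustration, as are the small stop-word list and the names `basic_words` and `similarity`.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller one.
STOPWORDS = frozenset({"the", "a", "of", "and", "to", "is", "in"})


def basic_words(text):
    """Return the set of lower-cased words in `text`, excluding stop words."""
    return {word for word in re.findall(r"[a-z]+", text.lower())
            if word not in STOPWORDS}


def similarity(first, second):
    """Jaccard index of two word sets: shared words divided by all words."""
    if not first and not second:
        return 0.0
    return len(first & second) / len(first | second)
```

Because the comparison works on sets of words rather than on word order, a few recognition errors in the utterance text lower the score only slightly instead of breaking the match.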

[0060] In the above document associating device, the section correspondence deriving unit derives a correspondence by associating the similarities by dynamic programming matching.

[0061] The above document associating device further includes a content input unit for inputting the content, a content storage unit for storing the content, a document input unit for inputting the document information, a document storage unit for storing the document, and an output unit for outputting information relating to the correspondence.

[0062] According to the present invention, content can be accurately associated with sections in a document even when sufficient speech recognition accuracy is not obtained because of background music, noise, the speaker's utterance style, the sound collecting environment, and the like. The reason is that content such as audio or video is matched with document sections on a per-speaker basis (at the points where the speaker changes), which is easier than speech recognition. Recognizing a point where the speaker has changed is more robust to noise and to the sound collecting conditions than recognizing what the speaker is saying, because only a difference needs to be recognized. Also, since the correspondence focuses on the speaker rather than on the content of the voice, visual information can also be used; if the speaker change points are extracted from visual information, a correspondence that does not depend on the sound collecting conditions can be derived. Further, according to the present invention, the association can be performed even when the document to be associated does not faithfully represent the conversation in the audio or video. The reason is that, because matching is not performed at the word level, speakers and topics can be associated over relatively long sections, and it is not necessary to make detailed correspondences between individual utterances.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a diagram showing a configuration of an embodiment of a document association device of the present invention.

FIG. 2 is a block diagram showing an example of a configuration of a content section extracting means 5 in the embodiment of the document association device of the present invention.

FIG. 3 is a flowchart showing an example of the operation of the content section extracting means 5 in the embodiment of the document association method of the present invention.

FIG. 4 is a block diagram showing another example of the configuration of the content section extracting means 5 in the embodiment of the document association device of the present invention.

FIG. 5 is a flowchart showing another example of the operation of the content section extracting means 5 in the embodiment of the document association method of the present invention.

FIG. 6 is a block diagram showing still another example of the configuration of the content section extracting means 5 in the embodiment of the document association device of the present invention.

FIG. 7 is a flowchart showing yet another example of the operation of the content section extracting means 5 in the embodiment of the document association device of the present invention.

FIG. 8 is a block diagram showing another example of the configuration of the content section extracting means 5 in the embodiment of the document association device of the present invention.

FIG. 9 is a flowchart showing another example of the operation of the content section extracting means 5 in the embodiment of the document association device of the present invention.

FIG. 10 is a flowchart showing an example of the operation of the document section extraction means 6 in the embodiment of the document association device of the present invention.

FIG. 11A is a diagram showing an example of a method of using document format information in the embodiment of the document association method of the present invention.

FIG. 11B is a diagram showing an example of a method of using document format information in the embodiment of the document association method of the present invention.

FIG. 11C is a diagram showing an example of a method of using document format information in the embodiment of the document association method of the present invention.

FIG. 11D is a diagram showing an example of a method of using document format information in the embodiment of the document association method of the present invention.

FIG. 12A is a diagram showing another example of a method of using document format information in the embodiment of the document association method of the present invention.

FIG. 12B is a diagram showing another example of a method for using document format information in the embodiment of the document association method of the present invention.

FIG. 12C is a diagram showing another example of a method of using document format information in the embodiment of the document association method of the present invention.

FIG. 13 is a diagram showing still another example of a method using document format information in the embodiment of the document association method of the present invention.

FIG. 14 is a block diagram showing an example of the configuration of the section correspondence relation deriving means 7 in the embodiment of the document association device of the present invention.

FIG. 15 is a flowchart showing an example of a correspondence deriving method executed by the section correspondence deriving means 7 in the embodiment of the document associating method of the present invention.

FIG. 16A is a diagram showing the correspondence between content information and document information in the correspondence derivation method.

FIG. 16B is a diagram showing the correspondence between the content information and the document information in the correspondence deriving method.

FIG. 17 is a diagram illustrating normalization in the correspondence deriving method.

FIG. 18A is a diagram showing the correspondence between content information and document information in the correspondence derivation method.

FIG. 18B is a diagram showing the correspondence between the content information and the document information in the correspondence deriving method.

FIG. 19 is a block diagram showing another example of the configuration of the section correspondence relationship deriving means 7 in the embodiment of the document association device of the present invention.

FIG. 20 is a flowchart showing another example of the correspondence derivation method executed by the section correspondence derivation means 7 in the embodiment of the document association method of the present invention.

[FIG. 21] FIG. 21 is a diagram showing the correspondence between content information and document information in the correspondence deriving method.

FIG. 22 is a diagram showing a correspondence between content information and document information in a correspondence derivation method.

FIG. 23 is a block diagram showing another example of the configuration of the section correspondence relation deriving means 7 in the embodiment of the document association device of the present invention.

FIG. 24 is a block diagram showing an example of a configuration of a candidate text document corresponding unit 62.

FIG. 25 is a flowchart showing another example of the correspondence deriving method executed by the section correspondence deriving means 7 in the embodiment of the document associating method of the present invention.

FIG. 26 is a diagram showing a correspondence between content information and document information in a correspondence deriving method.

FIG. 27 is a diagram showing a correspondence between content information and document information in a correspondence derivation method.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, embodiments of a document association apparatus and a document association method according to the present invention will be described in detail with reference to the accompanying drawings.

The configuration of the embodiment of the document association apparatus of the present invention will be described.

FIG. 1 is a diagram showing a configuration of an embodiment of a document association device of the present invention. The document associating device 10 includes content input means (content input unit) 1, document input means (document input unit) 2, content storage means (content storage unit) 3, document storage means (document storage unit) 4, content section extracting means (content section extracting unit) 5, document section extracting means (document section extracting unit) 6, section correspondence deriving means (section correspondence deriving unit) 7, and output means (output unit) 8. The content input means 1 inputs content including information (data) such as audio and video. The document input means 2 inputs a document related to the content. The content storage means 3 stores the content obtained from the content input means 1. The document storage means 4 stores the document obtained from the document input means 2. The content section extraction means 5 extracts a single-speaker section from the content. The document section extracting means 6 extracts a single-speaker section from the document. The section correspondence deriving means 7 derives a correspondence between the content section extracted by the content section extracting means 5 and the document section extracted by the document section extracting means 6. The output means 8 outputs the correspondence derived by the section correspondence deriving means 7.

[0066] The content input means 1 is for inputting target content. The content input means 1 is, for example, a video camera or a microphone. Here, the content is exemplified by video information, audio information, or video information accompanied by audio information. The content input means 1 may be a device such as a video player or a recording player that reads and outputs video information or audio information recorded on a recording medium such as a video tape.

[0067] The document input means 2 is for inputting a document related to the content. The document input unit 2 is, for example, a text input device such as a keyboard, a pen input device, and a scanner. The document input unit 2 may be an input device that reads document data created using document creation software.

[0068] The content storage means 3 is, for example, an internal storage device or an external storage device for recording the content from the content input means 1. The storage medium used in the content storage means 3 is exemplified by RAM, CD-ROM, DVD, flash memory, and hard disk.

[0069] The document storage unit 4 is an internal storage device or an external storage device that records the document from the document input unit 2. Recording media used in the document storage means 4 are exemplified by RAM, CD-ROM, DVD, flash memory, and hard disk.

[0070] The content section extraction means 5 divides the content (information) stored in the content storage means 3 into sections for each speaker, and extracts content sections each spoken by a single speaker. A single-speaker content section (hereinafter, also referred to as a “single-speaker section”) is a section from the time at which the speaker changes until the time at which the speaker next changes. A single-speaker section is extracted such that only a single speaker is included in the section and the speakers in adjacent sections are different. It is desirable that the single-speaker sections extracted by the content section extraction means 5 contain no errors, but they may contain errors because the content section extraction is automated.

The document section extraction means 6 extracts a section (document section) corresponding to each speaker from the document stored in the document storage means 4. The extracted document section describes the document information corresponding to the utterance of a single speaker. The document section extraction means 6 extracts the document section by using, for example, a method using the format information of the document, a method using the description about the speaker written in the document, or a method using the metadata in a structured document. The section correspondence deriving means 7 derives the correspondence between the content section extracted by the content section extracting means 5 and the document section extracted by the document section extracting means 6, and outputs it to the output means 8. The output means 8 displays, outputs, or stores the correspondence on a display device, a printer, an internal storage device, an external storage device, or the like.

When the document associating device 10 is realized by a computer, the content section extracting means 5, the document section extracting means 6, and the section correspondence deriving means 7 can be realized by a processing device of the computer (for example, a CPU) and a program that implements the functions of the means 5, 6, and 7.

FIG. 2 is a block diagram showing an example of a configuration of the content section extraction means 5 in the embodiment of the document association device of the present invention. The content section extracting means 5 includes a voice dividing unit 21, a voice feature deriving unit 22, a primary storage unit 23, a voice feature matching unit 24, and an output unit 25. The voice dividing unit 21 performs the first division of the audio by extracting silent sections from the content read from the content storage means 3. The voice feature deriving unit 22 derives a voice feature for the first voice section obtained by the first division. The primary storage unit 23 stores the start time of the first voice section and the voice feature. The voice feature matching unit 24 compares the voice feature derived by the voice feature deriving unit 22 with the voice feature stored in the primary storage unit 23. The output unit 25 outputs the processing result of the voice feature matching unit 24 to the section correspondence deriving means 7.

An example of the operation of the content section extracting means 5 in the embodiment of the document association method of the present invention will be described. FIG. 3 is a flowchart showing an example of the operation of the content section extracting means 5 in the embodiment of the document association method of the present invention. FIG. 3 shows the operation of the configuration shown in FIG. 2. Here, a case where the content is a video including audio, and audio analysis is used to extract the content section, will be described as an example.

The voice dividing unit 21 performs the first division of the audio (step S101). That is, as the first division of the audio, the voice dividing unit 21 extracts silent sections of the input video and detects the voice section between two silent sections. A silent section is extracted by measuring the audio power of the audio track of the input video or of the input audio. The voice feature deriving unit 22 derives a voice feature for the first voice section obtained by the first division of the audio (step S102). Examples of the voice feature include the average fundamental frequency, the average speech time length, and the average audio power of the audio in the section. When the voice feature deriving unit 22 derives the voice feature, the primary storage unit 23 determines whether or not a start time of a first voice section and a voice feature are already stored (step S103). When they are not stored (step S103: NO), the primary storage unit 23 stores the start time of the first voice section and the voice feature (step S104).

If the start time of the first voice section and the voice feature have already been stored (step S103: YES), the voice feature matching unit 24 compares the new voice feature derived by the voice feature deriving unit 22 with the voice feature stored in the primary storage unit 23 (step S105). If the difference between the voice features of the two sections is smaller than a preset threshold (that is, the features are similar), the voice feature matching unit 24 determines that the utterance by the same person is continuing (step S106: YES). If the audio data has not ended (step S109: NO), the voice dividing unit 21 extracts audio information up to the next silent section (step S101).

If the voice features of the two sections are different (step S106: NO), the voice feature matching unit 24 determines that the speaker of the voice has changed. The output unit 25 outputs a section between the start time stored in the primary storage unit 23 and the start time of the current voice section as a speech section of a single speaker (step S107). That is, the utterance section of the single speaker is detected by analyzing the change point of the sound feature. At the same time, the primary storage unit 23 updates the voice feature amount and the start time to those newly obtained (step S108). If the audio data has not ended (step S109: NO), the audio dividing unit 21 continuously extracts a silent section of the next audio (step S101).

The above processing is continued until the audio data ends. Here, the average fundamental frequency, the average speech time length, and the average speech rate are used to capture changes in the characteristics of prosody information such as voice pitch, voice length, and voice loudness (an example of changes in voice characteristics). However, another measure representing prosody information may be used. Conversational features such as wording and verbal habits may also be used. In that case, a change in at least one of the characteristics of the prosody information should be used.

[0079] Further, here, the content section extraction means 5 detects a speaker change point based on the similarity of the voice features between voice sections, and specifies a speaker section accordingly. By detecting the point at which the speaker has changed rather than identifying the speaker, the speaker section can be detected with higher accuracy than with speaker identification or speech recognition. Of course, the content section extraction means 5 may instead identify the speaker from the voice feature at each time and extract the speaker section from the speaker identification result.

FIG. 4 is a block diagram showing another example of the configuration of the content section extracting means 5 in the embodiment of the document association device of the present invention. The content section extracting means 5 includes a scene dividing unit 31, a person extraction and person feature deriving unit 32, a primary storage unit 33, a person feature matching unit 34, and an output unit 35. The scene dividing unit 31 extracts a first video section composed of continuous frames by detecting scene changes in the content read from the content storage means 3. The person extraction and person feature deriving unit 32 derives a person feature for the first video section. The primary storage unit 33 stores the start time of the first video section and the person feature. The person feature matching unit 34 compares the person feature derived by the person extraction and person feature deriving unit 32 with the person feature stored in the primary storage unit 33. The output unit 35 outputs the processing result of the person feature matching unit 34 to the section correspondence deriving means 7.

[0081] Another example of the operation of the content section extracting means 5 in the embodiment of the document association method of the present invention will be described. FIG. 5 is a flowchart showing another example of the operation of the content section extracting means 5 in the embodiment of the document association method of the present invention. Here, as an example, a case will be described in which video information is the input and a speaker section is derived on the assumption that the speaker in a conversation appears in the video.
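The loop of steps S101 to S109 can be sketched roughly as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes the voice sections have already been obtained by silence detection, that each voice feature is a small numeric vector, and that similarity is measured by Euclidean distance. The names `feature_distance`, `single_speaker_sections`, and `end_time` are hypothetical, and closing the last section at the end of the data is an assumption that the text does not state.

```python
from math import sqrt

def feature_distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_speaker_sections(voice_sections, threshold, end_time):
    """voice_sections: list of (start_time, feature_vector) in time order.
    Returns a list of (start, end) single-speaker sections."""
    sections = []
    stored = None  # plays the role of the primary storage unit 23
    for start, feature in voice_sections:
        if stored is None:                       # S103: nothing stored yet
            stored = (start, feature)            # S104: store start time and feature
            continue
        if feature_distance(feature, stored[1]) < threshold:
            continue                             # S106 YES: same speaker continues
        sections.append((stored[0], start))      # S107: output the finished section
        stored = (start, feature)                # S108: update stored values
    if stored is not None:
        # Assumption: the final section is closed at the end of the audio data.
        sections.append((stored[0], end_time))
    return sections
```

In practice the feature components (fundamental frequency, speech time length, power) have different scales, so some normalization would probably be needed before a single threshold is meaningful.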

[0082] The scene dividing unit 31 measures the difference between frames of the input video to detect a portion where the video information has changed greatly and, based on the detection result, extracts a first video section composed of visually continuous frames (step S201). The person extraction and person feature deriving unit 32 extracts a person region appearing in the video and performs image processing on the person region to derive a person feature (step S202). As a method of extracting a person region when the only moving objects in the video are persons, a method is exemplified in which a region whose difference from the previous frame, or whose difference obtained by the background difference method widely used in the field of surveillance, is equal to or greater than a specific value is adopted as the person region. Examples of the person feature include a face feature that describes the face in detail, such as the shape of the face, and a low-order visual feature that describes the color distribution, pattern, and boundary shape of the entire person. By using the color distribution and patterns, the characteristics of the clothes worn by the person (the visual characteristics of the person's clothes) can be taken into account, so the method is fully applicable to extracting a change of person in simple meetings and the like.

[0083] When the person extraction and person feature deriving unit 32 derives the person feature, the primary storage unit 33 (the person feature and start time storage unit) determines whether or not a start time of a first video section and a person feature are already stored (step S203). If they are not stored (step S203: NO), the primary storage unit 33 stores the start time and the person feature of the first video section (step S204). That is, the first video section is detected by analyzing the change point of the visual feature in the video. If the start time of the first video section and the person feature have already been stored (step S203: YES), the person feature matching unit 34 compares the new person feature derived by the person extraction and person feature deriving unit 32 with the person feature stored in the primary storage unit 33 (step S205). If the person features of the two sections are more similar than a preset threshold, the person feature matching unit 34 determines that the utterance by the same person is continuing (step S206: YES). If the video data has not ended (step S209: NO), the scene dividing unit 31 extracts the next portion where the video information has changed greatly (step S201).
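As a rough illustration of the frame difference in step S201 and the background difference in step S202, the following sketch treats each frame as a small grayscale image stored as a list of rows. The function names and both thresholds are hypothetical; a real system would use an image processing library and tuned values.

```python
def mean_abs_difference(frame_a, frame_b):
    """Mean absolute pixel difference between two equal-sized grayscale frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count if count else 0.0

def scene_boundaries(frames, cut_threshold):
    """Indices i at which frame i differs strongly from frame i - 1 (assumed S201 rule)."""
    return [i for i in range(1, len(frames))
            if mean_abs_difference(frames[i - 1], frames[i]) > cut_threshold]

def person_mask(frame, background, pixel_threshold):
    """Background difference: True where the pixel differs from the background
    by at least pixel_threshold, i.e. a candidate person region."""
    return [[abs(p - b) >= pixel_threshold for p, b in zip(row, bg_row)]
            for row, bg_row in zip(frame, background)]
```

The mask marks only candidate pixels; grouping them into a connected person region is left out of this sketch.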

[0084] The person feature matching section 34 determines that the speaker in the video has changed when the person features in the two sections are different (step S206: NO). The output unit 35 outputs a section between the start time stored in the primary storage unit 33 and the start time of the current video section as an utterance section of a single speaker (step S207). At the same time, the primary storage unit 33 updates the person characteristic amount and the start time to those newly obtained (step S208). If the video data has not ended (step S209: NO), the scene division unit 31 extracts a portion where the next video information has changed significantly (step S201).

[0085] The above processing is continued until the video data ends. Examples of the image feature include low-order features such as color distribution, shape, and edge histogram, and higher-order features such as eye categories and the arrangement of the eyes, nose, and mouth. An appropriate one of these may be adopted as the feature, or a plurality of features may be combined. In addition, if the assumption that a person does not move greatly is introduced, the visual feature of the frame including background information can be used without extracting a person region.

FIG. 6 is a block diagram showing still another example of the configuration of the content section extraction means 5 in the embodiment of the document association device of the present invention. FIG. 6 shows the content section extracting means 5 for performing both the section extraction for audio and the section extraction for video. The audio section extraction unit 81 includes, for example, the voice dividing unit 21, the voice feature deriving unit 22, the primary storage unit 23, the voice feature matching unit 24, and the output unit 25 shown in FIG. 2. The video section extraction unit 82 includes, for example, the scene dividing unit 31, the person extraction and person feature deriving unit 32, the primary storage unit 33, the person feature matching unit 34, and the output unit 35 shown in FIG. 4. The audio/video section extraction unit (audio/video section integrating means) 83 determines a content section from the output of the audio section extraction unit 81 and the output of the video section extraction unit 82. The audio/video section extraction unit 83 determines a content section by adopting, for example, only the times at which both the output of the audio section extraction unit 81 and the output of the video section extraction unit 82 indicate that the speaker has changed.
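A low-order color distribution feature and its comparison might be sketched as below. Histogram intersection is used here purely as an assumed similarity measure; the text does not name one. The bin count and the function names are likewise hypothetical.

```python
def color_histogram(pixels, bins=4):
    """Normalized histogram of 0-255 intensity values quantized into `bins` bins."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    n = len(pixels)
    return [h / n for h in hist] if n else hist

def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for normalized histograms: 1 means identical distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

A person feature matching step could then treat two sections as showing the same person when the intersection exceeds a preset threshold.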

FIG. 7 is a flowchart showing still another example of the operation of the content section extracting means 5 in the embodiment of the document association device of the present invention. The audio section extraction unit 81 divides the input video into a plurality of audio sections based on the audio of the input video (Step S121). For example, the operation shown in FIG. 3 is performed. On the other hand, the video section extraction unit 82 divides the input video into a plurality of video sections based on the video of the input video (Step S122). For example, the operation shown in FIG. 5 is performed. However, step S121 and step S122 may be performed simultaneously, or step S122 may be performed first. Next, the audio / video section extraction unit (audio / video section integration means) 83 determines a content section based on the output of the audio section extraction unit 81 and the output of the video section extraction unit 82 (step S123). For example, the audio / video section extraction unit 83 determines the content section by adopting only the time when both the output of the audio section extraction unit 81 and the output of the video section extraction unit 82 indicate that the speaker has changed.
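The integration in step S123 can be sketched as follows. Because audio and video change points rarely fall on exactly the same instant, this sketch assumes a matching tolerance in seconds; the tolerance, like the function names, is an assumption and is not stated in the text.

```python
def integrate_change_points(audio_changes, video_changes, tolerance=0.5):
    """Keep an audio change time only if some video change lies within `tolerance` seconds."""
    return [t for t in audio_changes
            if any(abs(t - v) <= tolerance for v in video_changes)]

def to_sections(change_points, start, end):
    """Turn speaker change times into consecutive (start, end) content sections."""
    bounds = [start] + sorted(change_points) + [end]
    return list(zip(bounds[:-1], bounds[1:]))
```

For example, with audio changes at 5.0, 12.0, and 20.0 seconds and video changes at 5.2, 15.0, and 19.8 seconds, only 5.0 and 20.0 are adopted, which yields three content sections.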

FIG. 8 is a block diagram showing another example of the configuration of the content section extraction means 5 in the embodiment of the document association device of the present invention. FIG. 8 shows the content section extracting means 5 for extracting a single-speaker section of the content by using both audio analysis and video analysis.

[0089] The scene dividing unit 91 analyzes the features of the content and divides it into scenes. The scene dividing unit 91 may use the voice feature as in the voice dividing unit 21 shown in FIG. 2, or may use the visual feature as in the person extraction and person feature deriving unit 32 shown in FIG. 4. Further, the sum of the voice feature and the person feature may be calculated. In other words, in order to derive the point in time at which the speaker changes, analysis of the change point of the visual feature in the video and analysis of the change point of the sound feature in the audio may both be performed, and the results of both may be integrated. The voice feature deriving unit 92 derives the voice feature of the extracted scene. The visual feature deriving unit 93 derives the visual feature of the extracted scene. When no voice feature and visual feature are stored yet, the primary storage unit 94 stores the extracted voice feature and visual feature together with their start time. If a voice feature and a visual feature have already been stored, the voice feature matching unit 95 compares the voice feature input from the voice feature deriving unit 92 with the voice feature stored in the primary storage unit 94. Similarly, the visual feature matching unit 96 compares the visual feature input from the visual feature deriving unit 93 with the visual feature stored in the primary storage unit 94.

When the difference between the voice feature input from the voice feature deriving unit 92 and the voice feature stored in the primary storage unit 94 is larger than a predetermined threshold, or when the difference between the visual feature input from the visual feature deriving unit 93 and the visual feature stored in the primary storage unit 94 is larger than a predetermined threshold, the voice feature and the visual feature stored in the primary storage unit 94 are cleared, and the current time and the start time are sent to the output unit 97. The output unit 97 outputs them to the section correspondence deriving means 7. Alternatively, the current time and the start time may be sent to the output unit 97 only when the difference between the voice feature input from the voice feature deriving unit 92 and the voice feature stored in the primary storage unit 94 is larger than a predetermined threshold and the difference between the visual feature input from the visual feature deriving unit 93 and the visual feature stored in the primary storage unit 94 is also larger than a predetermined threshold.

FIG. 9 is a flowchart showing another example of the operation of the content section extracting means 5 in the embodiment of the document association apparatus of the present invention.

[0092] The scene dividing unit 91 analyzes the features of the content and divides it into scenes (step S141). The scene dividing unit 91 may use the voice feature as in the voice dividing unit 21 shown in FIG. 2, or may use the visual feature as in the person extraction and person feature deriving unit 32 shown in FIG. 4. Further, the sum of the voice feature and the person feature may be calculated. In other words, in order to derive the point in time at which the speaker changes, analysis of the change point of the visual feature in the video and analysis of the change point of the sound feature in the audio may both be performed, and the results of both may be integrated. The voice feature deriving unit 92 derives the voice feature of the extracted scene (step S142). The visual feature deriving unit 93 derives the visual feature of the extracted scene (step S143). However, step S142 and step S143 may be performed at the same time, or step S143 may be performed first. The primary storage unit 94 determines whether or not a voice feature and a visual feature are already stored (step S144). If they are not stored (step S144: NO), the primary storage unit 94 stores the extracted voice feature and visual feature together with their start time (step S145).

[0093] If a voice feature and a visual feature are already stored (step S144: YES), the voice feature matching unit 95 compares the voice feature input from the voice feature deriving unit 92 with the voice feature stored in the primary storage unit 94. Similarly, the visual feature matching unit 96 compares the visual feature input from the visual feature deriving unit 93 with the visual feature stored in the primary storage unit 94 (step S146).

[0094] If the difference between the voice feature input from the voice feature deriving unit 92 and the voice feature stored in the primary storage unit 94 is smaller than a predetermined threshold (that is, they are similar), and the difference between the visual feature input from the visual feature deriving unit 93 and the visual feature stored in the primary storage unit 94 is smaller than a predetermined threshold (that is, they are similar), the voice feature deriving unit 92 and the visual feature deriving unit 93 determine that the utterance by the same person is continuing (step S147: YES). If the data has not ended (step S150: NO), the scene dividing unit 91 continues the scene division (step S141).

[0095] If the difference between the voice feature input from the voice feature deriving unit 92 and the voice feature stored in the primary storage unit 94 is larger than a predetermined threshold, or if the difference between the visual feature input from the visual feature deriving unit 93 and the visual feature stored in the primary storage unit 94 is larger than a predetermined threshold, the voice feature deriving unit 92 or the visual feature deriving unit 93 determines that the utterance by the same person has ended (step S147: NO). The primary storage unit 94 then clears the stored voice feature and visual feature, and sends the current time and the start time to the output unit 97 (step S148). The output unit 97 outputs them to the section correspondence deriving means 7 (step S149).

[0096] Alternatively, it may be determined that the utterance by the same person has ended, and the current time and the start time may be sent to the output unit 97, only when the difference between the voice feature input from the voice feature deriving unit 92 and the voice feature stored in the primary storage unit 94 is larger than a predetermined threshold and the difference between the visual feature input from the visual feature deriving unit 93 and the visual feature stored in the primary storage unit 94 is also larger than a predetermined threshold. In this case, if the difference between the voice feature input from the voice feature deriving unit 92 and the voice feature stored in the primary storage unit 94 is smaller than a predetermined threshold, or if the difference between the visual feature input from the visual feature deriving unit 93 and the visual feature stored in the primary storage unit 94 is smaller than a predetermined threshold, it is determined that the utterance by the same person is continuing.

[0097] By doing so, a speaker section that is difficult to distinguish by audio can be detected from the image, and a speaker section that is difficult to detect from the image because visual features such as faces or clothes are similar can be extracted by the voice feature. That is, the content section can be detected accurately.
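The decision in step S147 can be sketched as a small predicate. The default mode ends the utterance when either difference exceeds its threshold; the stricter mode ends it only when both do. The function name and the `require_both` flag are hypothetical names introduced for illustration.

```python
def utterance_ended(audio_diff, visual_diff, audio_threshold, visual_threshold,
                    require_both=False):
    """Return True when the speaker is judged to have changed (step S147: NO)."""
    audio_changed = audio_diff > audio_threshold
    visual_changed = visual_diff > visual_threshold
    if require_both:
        # Stricter variant: both the voice and the visual feature must have changed.
        return audio_changed and visual_changed
    # Default variant: a change in either feature is enough.
    return audio_changed or visual_changed
```

The "either" rule tends to split more readily and so helps when one modality cannot separate two speakers, while the "both" rule suppresses splits caused by a change in only one modality, such as a camera cut during a single utterance.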

The document section extraction means 6 shown in FIG. 1 extracts a section (document section) corresponding to each speaker in the document from the document information stored in the document storage means 4. Document information corresponding to the utterance of a single speaker is described in the extracted document section. To extract the document section corresponding to the speaker from the document information, for example, a method using the format information of the document, a method using the description about the speaker written in the document, and using metadata in the structured document There is a way.

FIG. 10 shows a document section extracting unit 6 according to the embodiment of the document association apparatus of the present invention. 6 is a flowchart showing an example of the operation of FIG. The document section extraction means 6 extracts information indicating a document break (hereinafter, “document separation information”) from the document information stored in the document storage means 4 (step S161). Examples of the document delimiter information include a line feed (blank line) in the document, a difference in character font, a difference in character color, a character layout, and a description of a speaker's name. Next, the document section extraction means 6 selects an optimal document section extraction method based on the document section information (step S162). The correspondence (table) between the document division information and the method of extracting the document section is stored in a storage unit (not shown). The method of extracting the document section corresponding to the speaker from the document information includes, for example, a method using document format information, a method using a description about a speaker written in the document, and a method using metadata in a structured document. There is a way to do that. Then, the document section extracting means 6 extracts a section (document section) corresponding to each speaker in the document. In the extracted document section, document information corresponding to the utterance of a single speaker is described. However, if the document information is determined in advance, steps S161 and S162 may be omitted, and the method of extracting the document section corresponding to the document information may be immediately executed.
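Steps S161 and S162 amount to detecting which kind of delimiter a document uses and looking up a matching extraction method in a table. The sketch below is only one possible reading: the three delimiter labels, the regular expressions, their priority order, and the table contents are all assumptions chosen for illustration.

```python
import re

def detect_delimiter_info(text):
    """Return a label for the delimiter information found in the text (step S161)."""
    if re.search(r"<\s*Speaker\b", text):
        return "tag"                      # structured document with a Speaker tag
    if re.search(r"^[^\s:]{1,30}:", text, flags=re.MULTILINE):
        return "speaker_label"            # a line starting with a one-word label and colon
    if re.search(r"\n\s*\n", text):
        return "blank_line"               # remarks separated by blank lines
    return "unknown"

# Hypothetical correspondence table between delimiter information and method (step S162).
EXTRACTION_METHODS = {
    "tag": "use metadata in the structured document",
    "speaker_label": "use the speaker description in the document",
    "blank_line": "use document format information",
}

def select_method(text):
    """Look up the extraction method for the detected delimiter information."""
    return EXTRACTION_METHODS.get(detect_delimiter_info(text), "no method selected")
```

Font and color differences are not detectable in plain text, so a real table would need entries keyed on layout or styling information as well.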

Hereinafter, a specific example of the document section extraction method executed by the document section extraction means 6 will be described. FIG. 11A to FIG. 11D are diagrams showing an example of a method of using document format information in the embodiment of the document association method of the present invention. In the example shown in FIG. 11A, a blank line is inserted for a comment between speakers. Therefore, the document section extracting means 6 can extract the document section based on the blank line. In the example shown in FIG. 11B, a document in a conversation is illustrated. The remarks of the host are displayed in oblique characters. Therefore, the document section extraction unit 6 can extract the document section by identifying the content of the guest's statement and the content of the host's statement. In the example shown in FIG. 11C, the color is different for each speaker. Often used to distinguish between multiple speakers. Therefore, the document section extracting means 6 can extract the document section using the color information. In the example shown in Fig. 11D, the place to be described is arranged for each speaker. In this way, when the description places are arranged for each speaker, even if the name of the speaker is not directly entered, the document section extracting means 6 extracts the section estimated to be a single speaker. be able to. Note that the section extracted here is only a candidate, and it is desirable that the section is divided by the section of a single speaker. You don't have to. In the method described with reference to FIGS. 11A to 11D, an example of the structure analysis of the document is performed.
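For the blank-line case of FIG. 11A, candidate document sections can be obtained with a simple split. This sketch assumes plain text input; the function name is hypothetical.

```python
import re

def split_on_blank_lines(text):
    """Split text into candidate single-speaker sections at one or more blank lines."""
    parts = re.split(r"\n\s*\n", text.strip())
    # Collapse internal line breaks and repeated spaces within each section.
    return [" ".join(p.split()) for p in parts if p.strip()]
```

As the text notes, the result is only a list of candidates: a speaker who inserts a blank line within a single remark would be split into two sections.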

FIG. 12A to FIG. 12C are diagrams showing another example of the method of using the document format information in the embodiment of the document association method of the present invention. FIGS. 12A to 12C show a method of extracting a document section using a description about a speaker entered in a document. In the example shown in FIG. 12A, the speaker is written in the form of “name:” before the utterance. The document section extracting means 6 can extract a document section based on “name:”. In the example shown in FIG. 12B, expressions such as "Question" and "Answer" are used instead of names. The document section extracting means 6 can extract a document section based on “Question” and “Answer”. In the example shown in FIG. 12C, the names of the speakers are displayed in a separate column; this format is widely used in drama scripts and minutes. If such information is used, the document section extracting means 6 can easily extract the information on the speaker and the speaker section from the document. Note that the method described with reference to FIGS. 12A to 12C is also an example of document structure analysis.
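The “name:” form of FIG. 12A lends itself to a simple pattern match. The following is a minimal sketch only; the regular expression, the 30-character limit on a label, and the handling of continuation lines are assumptions of this sketch and are not taken from the specification.

```python
import re

# Hypothetical pattern: a short label at the start of a line, then a colon.
SPEAKER_LABEL = re.compile(r"^\s*([^:\n]{1,30}):\s*(.*)$")


def split_by_speaker_label(text):
    """Return a list of (speaker, utterance) document sections."""
    sections = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no utterance text
        match = SPEAKER_LABEL.match(line)
        if match:
            # A new label starts a new document section.
            sections.append([match.group(1).strip(), match.group(2).strip()])
        elif sections:
            # A line without a label continues the current speaker's section.
            sections[-1][1] = (sections[-1][1] + " " + line.strip()).strip()
    return [(speaker, utterance) for speaker, utterance in sections]
```

The same function would also handle the “Question” / “Answer” form of FIG. 12B when those words are followed by a colon, since the pattern does not depend on the label being a personal name.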

FIG. 13 is a diagram showing still another example of a method of using document format information in the embodiment of the document association method of the present invention. FIG. 13 shows a method of extracting a document section using a tag in a structured document. The document section extracting means 6 can extract a document section using, for example, a “Speaker” tag. In addition to the methods of extracting a document section illustrated in FIGS. 11A to 13, other document section extraction methods using the format information of the document or the description about the speaker are also possible. Further, the document section extracting means 6 can extract the speaker section with higher accuracy by combining these methods. Further, the document section extracting means 6 may derive a document section based on a change in conversation characteristics, such as habits of speech or phrasing, in the conversation-equivalent parts of the document, as in the case of voice. Note that the method described with reference to FIG. 13 is also an example of structure analysis of the document.
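For the structured-document case, a sketch using the Python standard library XML parser is given below. Only the “Speaker” tag name comes from the description; the enclosing element and the `name` attribute are hypothetical, since the specification does not give the document schema.

```python
import xml.etree.ElementTree as ET


def sections_from_speaker_tags(xml_text, tag="Speaker", name_attr="name"):
    """Return (speaker, text) pairs, one per Speaker element, in document order.

    The attribute holding the speaker name (name_attr) is an assumption.
    """
    root = ET.fromstring(xml_text)
    return [
        (element.get(name_attr), "".join(element.itertext()).strip())
        for element in root.iter(tag)
    ]
```

For example, `sections_from_speaker_tags('<Minutes><Speaker name="Yamamoto">Hello.</Speaker></Minutes>')` yields one section attributed to "Yamamoto".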

Next, the section correspondence deriving means 7 in the embodiment of the document association apparatus of the present invention will be described. FIG. 14 is a block diagram showing an example of the configuration of the section correspondence deriving means 7 in the embodiment of the document association device of the present invention. In the example shown in FIG. 14, the section correspondence deriving means 7 includes a content length normalizing unit 41, a document length normalizing unit 42, a section matching degree deriving unit (section matching means) 43, a section correspondence storage unit 44, a section integration unit 45, and an output unit 46. The content length normalizing unit 41 normalizes the content length in each extracted section. The document length normalizing unit 42 normalizes the length of each document section. The section matching degree deriving unit (section matching means) 43 derives the correspondence between the content section and the document section. The section correspondence storage unit 44 stores the correspondence for each section. The section integration unit 45 integrates adjacent sections and associates the content with the document on a one-to-one basis. The output unit 46 outputs the correspondence.

Next, a description will be given of a correspondence deriving method executed by the section correspondence deriving means 7 in the embodiment of the document matching method of the present invention. FIG. 15 is a flowchart showing an example of a correspondence deriving method executed by the section correspondence deriving means 7 in the embodiment of the document correspondence method of the present invention. FIG. 16A and FIG. 16B are diagrams showing the correspondence between content information and document information in the correspondence deriving method. FIG. 17 is a diagram for explaining the normalization in the correspondence deriving method. In the example shown in FIG. 16, for the sake of simplicity, it is assumed that the content section extracting means 5 has extracted six speaker sections ([a]-[f]) and that the document section extracting means 6 has extracted seven document sections ([1]-[7]).

[0105] The content length normalizing unit 41 normalizes the content length in each extracted section (step S301). In the normalization, if the content includes audio as shown in FIG. 17 (a), first, the silent parts in each section are extracted. Next, the extracted silent parts are removed from each section. Then, the length of each section is made proportional to the length of its audio part, and the lengths are normalized so that their sum is 1.0. This state is shown in FIG. 17. It is assumed that the content information shown in FIGS. 16A (a) and 17 (a) includes a silent part. Alternatively, as shown in FIG. 17 (c), the normalization may be performed in proportion to the plain section length without removing the silent parts. If the content does not include audio, persons may be detected using the video information, the parts in which no person appears may be removed from each section, the length of each section may be made proportional to the length of the remaining part, and the lengths may be normalized so that their sum is 1.0. Instead of excluding the parts that do not include a person, normalization may be performed in proportion to the plain section length. The document length normalizing unit 42 normalizes the length of each document section (step S302). For example, the length of each section is set to a length proportional to the document amount (or character amount) included in each section. FIG. 16A shows an example of a result obtained by normalizing both. FIG. 16A (a) shows the content information, and FIG. 16A (b) shows the document information.
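The normalization of steps S301 and S302 can be illustrated by the following sketch. It assumes that content sections and silent parts are given as (start, end) pairs in seconds and that document sections are represented by their character counts; these representations and the function names are assumptions of the sketch.

```python
def content_lengths(sections, silences=()):
    """Length of each content section after removing the silent parts.

    sections and silences are lists of (start, end) pairs in seconds.
    Passing no silences gives the plain section lengths.
    """
    lengths = []
    for start, end in sections:
        # Total duration of silence that falls inside this section.
        silent = sum(
            max(0.0, min(end, s_end) - max(start, s_start))
            for s_start, s_end in silences
        )
        lengths.append((end - start) - silent)
    return lengths


def normalize(lengths):
    """Scale the lengths so that their sum is 1.0."""
    total = sum(lengths)
    if total <= 0:
        raise ValueError("total length must be positive")
    return [length / total for length in lengths]


# Content side: two sections with a silence straddling their boundary.
content = normalize(content_lengths([(0, 10), (10, 30)], silences=[(8, 12)]))
# Document side: character counts of two document sections.
document = normalize([100, 300])
```

After this step both sides lie on a common axis from 0 to 1, which is what allows section lengths in seconds to be compared with document amounts in characters.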

The section matching degree deriving unit 43 derives an individual correspondence between the content section and the document section (step S303). For example, the overlap on the normalized axis is checked, and each section is assumed to correspond to the section with which it overlaps most. In the example shown in FIG. 16A, the correspondence viewed from the document information is [1] → [a], [2] → [a], [3] → [b], [4] → [c], [5] → [d], [6] → [f], [7] → [f]. Viewed from the content information, it is [a] → [2], [b] → [3], [c] → [4], [d] → [5], [e] → [5], [f] → [7]. The section correspondence storage unit 44 stores the correspondence for each section derived by the section matching degree deriving unit 43.
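The overlap check of step S303 might be sketched as follows. The input lengths are invented for illustration and are not the values of FIG. 16A; ties are resolved in favor of the earlier section, which is an assumption of this sketch.

```python
def to_intervals(normalized_lengths):
    """Lay the sections end to end on the normalized axis."""
    intervals, position = [], 0.0
    for length in normalized_lengths:
        intervals.append((position, position + length))
        position += length
    return intervals


def most_overlapping(source, target):
    """For each source section, the index of the target section it overlaps most."""
    target_intervals = to_intervals(target)
    result = []
    for s0, s1 in to_intervals(source):
        overlaps = [max(0.0, min(s1, t1) - max(s0, t0)) for t0, t1 in target_intervals]
        result.append(overlaps.index(max(overlaps)))
    return result


content = [0.4, 0.6]          # hypothetical normalized content sections
document = [0.25, 0.25, 0.5]  # hypothetical normalized document sections
document_to_content = most_overlapping(document, content)
content_to_document = most_overlapping(content, document)
```

The check is run in both directions because, as in the description, the result viewed from the document side and the result viewed from the content side need not be mirror images of each other.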

[0108] The section integration unit 45 determines whether or not the content and the document completely correspond one-to-one (step S304). If they do not (step S304: NO), the section integration unit 45 integrates adjacent sections, based on the correspondence between the sections stored in the section correspondence storage unit 44, until the content and the document completely correspond one-to-one (steps S304, S305). For example, by repeating the process of integrating adjacent sections that correspond to the same section (example: since [1] → [a] and [2] → [a], [1] and [2] are integrated), a one-to-one correspondence between the content and the document can be obtained. When the content and the document completely correspond one-to-one (step S304: YES), the output unit 46 regards the sections integrated by the section integration unit 45 as one section and outputs the correspondence (step S306).

[0109] In the example shown in FIG. 16A, the above processing allows the correspondences [[1] [2] ⇔ [a]], [[3] ⇔ [b]], [[4] ⇔ [c]], [[5] ⇔ [d] [e]], and [[6] [7] ⇔ [f]] to be extracted. As described above, the section correspondence deriving means 7 derives the correspondence by comparing the section length of the extracted content section with the document amount of the extracted document section.
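The integration of steps S304 and S305 can be illustrated with the correspondences listed for FIG. 16A. The sketch below groups sections that are linked in either direction by using a union-find structure; this formulation is an assumption chosen for brevity, not the procedure stated in the specification, although it yields the same groups for this example.

```python
def integrate(document_to_content, content_to_document):
    """Group document and content sections linked in either direction."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    for doc, content in document_to_content.items():
        union(("doc", doc), ("content", content))
    for content, doc in content_to_document.items():
        union(("content", content), ("doc", doc))

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    result = []
    for members in groups.values():
        docs = sorted(name for kind, name in members if kind == "doc")
        contents = sorted(name for kind, name in members if kind == "content")
        result.append((docs, contents))
    return sorted(result)


# Correspondences given in the description for FIG. 16A.
document_to_content = {1: "a", 2: "a", 3: "b", 4: "c", 5: "d", 6: "f", 7: "f"}
content_to_document = {"a": 2, "b": 3, "c": 4, "d": 5, "e": 5, "f": 7}
groups = integrate(document_to_content, content_to_document)
```

Each resulting pair holds the document sections and the content sections that are treated as one integrated section.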

[0110] The section correspondence deriving means 7 can also derive the correspondence by introducing the certainty factor of the change in the content. That is, in addition to the section information derived by the content section extracting means 5, the certainty factor of the change point extraction used for the section extraction is input as a score, and the correspondence is derived using this certainty factor. For example, at a boundary where the certainty factor of the change is high, the section integrating unit 45 does not integrate the sections on either side of that boundary, and instead integrates the section concerned with its other adjacent section. FIGS. 18A and 18B are diagrams showing the correspondence between content information and document information in this correspondence deriving method. In the example shown in FIG. 18A, the certainty factor of the change from [d] to [e] is 0.90 (high) and that of the change from [e] to [f] is 0.40 (low). In this case, the short section [e] is integrated with [f] to derive the correspondence. As a result, as shown in FIG. 18B, a correspondence that reflects the certainty factor can be derived.
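The use of the certainty factor can be sketched as a choice between the two boundaries of the section to be integrated. The function below is an illustrative assumption: it merges the section at a given index with the neighbor across the boundary that has the lower certainty factor, using the values 0.90 and 0.40 given in the description.

```python
def merge_across_weakest_boundary(labels, boundary_confidence, index):
    """Merge labels[index] with the neighbor across its less certain boundary.

    boundary_confidence[i] is the certainty factor of the change point
    between labels[i] and labels[i + 1].
    """
    if len(boundary_confidence) != len(labels) - 1:
        raise ValueError("need one certainty factor per boundary")
    groups = [[label] for label in labels]
    infinity = float("inf")
    left = boundary_confidence[index - 1] if index > 0 else infinity
    right = boundary_confidence[index] if index < len(labels) - 1 else infinity
    if left == infinity and right == infinity:
        return groups  # a single section has no neighbor to merge with
    if right <= left:
        groups[index:index + 2] = [groups[index] + groups[index + 1]]
    else:
        groups[index - 1:index + 1] = [groups[index - 1] + groups[index]]
    return groups


# [d] -> [e] has certainty 0.90 and [e] -> [f] has certainty 0.40.
merged = merge_across_weakest_boundary(["d", "e", "f"], [0.90, 0.40], index=1)
```

With these values, [e] is merged with [f] rather than with [d], matching the outcome described for FIG. 18B.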

[0111] Similar processing is possible when the certainty factor at the time of document section extraction is used instead of the certainty factor of the content section, or when certainty factors are used for both the content section and the document section.

FIG. 19 is a block diagram showing another example of the configuration of the section correspondence deriving means 7 in the embodiment of the document association device of the present invention. The section correspondence relation deriving means 7 includes a speaker information storage unit 51, a speaker identification unit 52, a document speaker information extraction unit 53, and a section matching degree derivation unit 54. The speaker information storage unit 51 stores a correspondence between a feature amount for specifying a person and the person. The speaker identification unit 52 specifies a speaker. The document speaker information extracting unit 53 extracts information on the speaker from the document. The section matching degree deriving unit 54 performs section matching based on the speaker information.

[0113] The speaker information storage unit 51 records in advance a correspondence between a feature amount (including a voice feature amount or a visual feature amount) for specifying a person and the person. The feature amount is set loosely for person identification. For example, when a speech feature is used, a speaker-specific feature such as the voice pitch for a specific phoneme or word is used for each speaker. In addition, information such as wording and speech habits may be used. When visual features are used, the shape, positional relationship, etc. of the eyes, nose, and mouth are used as features of the speaker's face. Known features used in face recognition technology or speaker identification technology can also be used as the feature amount.

[0114] The speaker identification unit 52 receives the information of the content section and the feature amounts included in the section from the content section extraction means 5, and specifies the speaker in one or a plurality of sections by comparing these feature amounts with the feature amounts stored in the speaker information storage unit 51. That is, the speaker identification unit 52, acting as the feature amount matching identification unit, identifies the speaker by comparing the feature amount stored in the speaker information storage unit 51 with the feature amount extracted by the content feature amount extraction unit (specifically, the content section extraction means 5). The speaker identification unit 52 extracts, for example, the person in the speaker information storage unit 51 whose feature amount is closest to the input feature amount. If the participants are limited in advance, as in a conference or a TV program, identification may be performed in consideration of such restriction information, or all speaker candidates may be listed. The document speaker information extraction unit 53 extracts information on a speaker (speaker information) from a document by specifying a speaker in one or a plurality of document sections. The section matching degree deriving unit 54 performs section matching based on the speaker information. That is, the speaker section is associated with the document section.
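Selecting the person whose stored feature amount is closest to the input can be sketched as a nearest-neighbor search. The two-dimensional feature vectors, the registered values, and the use of Euclidean distance are assumptions of this sketch; the specification does not fix the feature representation or the distance measure.

```python
import math


def identify_speaker(feature, registry):
    """Return the registered person whose stored feature is closest."""
    return min(registry, key=lambda name: math.dist(feature, registry[name]))


def rank_candidates(feature, registry, k=2):
    """Return the k closest registered persons, nearest first."""
    ordered = sorted(registry, key=lambda name: math.dist(feature, registry[name]))
    return ordered[:k]


# Hypothetical stored features: (mean pitch in Hz, speaking-rate score).
registry = {"Yamamoto": (120.0, 0.30), "Tanaka": (210.0, 0.50)}
speaker = identify_speaker((200.0, 0.45), registry)
candidates = rank_candidates((200.0, 0.45), registry, k=2)
```

Restricting `registry` to the known participants of a conference or program corresponds to the use of restriction information, and `rank_candidates` corresponds to listing several speaker candidates instead of a single one.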

[0115] Next, another correspondence deriving method executed by the section correspondence deriving means 7 in the embodiment of the document association method of the present invention will be described. FIG. 20 is a flowchart showing another example of the correspondence derivation method executed by the section correspondence derivation means 7 in the embodiment of the document correspondence method of the present invention. FIG. 21 and FIG. 22 are diagrams showing the correspondence between the content information and the document information in the correspondence deriving method. This example is effective when the speaker information is described in the document and can be extracted as shown in FIGS. 12A to 13.

[0116] The speaker identification unit 52 specifies the speaker (speaker section) in one or a plurality of sections by comparing the information of the content section input from the content section extracting means 5 and the feature amount included in the section with the feature amount stored in the speaker information storage unit 51 (step S321). Meanwhile, the document speaker information extraction unit 53 extracts information on the speaker (speaker information) from the document by specifying the speaker in one or a plurality of document sections (step S322). Step S321 and step S322 may be performed simultaneously, or step S322 may be performed first. Next, the section matching degree deriving unit 54 performs section matching based on the speaker information. That is, the speaker section is associated with the document section (step S323).

FIG. 21 ((a) content information, (b) document information) shows an example of the section matching processing by the section matching degree deriving unit 54. Sections are associated according to the person identification information obtained when the speaker identification unit 52 specifies the speaker from the content information (content section) using the feature amounts stored in the speaker information storage unit 51. For the section association, dynamic programming matching (DP matching) may be introduced. When the accuracy of speaker identification based on the content information is low and “Tanaka” is not extracted, as illustrated in FIG. 21, “Tanaka” can be skipped and the correspondence can still be derived.
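One common form of DP matching that permits skipping an unmatched speaker is a longest-common-subsequence alignment of the two speaker sequences. The sketch below uses that form as an assumption; the specification does not state which DP formulation is used, and the speaker sequences (including the name "Sato") are invented for illustration.

```python
def align_speakers(content_speakers, document_speakers):
    """Align two speaker sequences, skipping entries that have no partner.

    Returns (content index, document index) pairs of matched sections.
    """
    n, m = len(content_speakers), len(document_speakers)
    # score[i][j]: best number of matches between content[i:] and document[j:].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if content_speakers[i] == document_speakers[j]:
                score[i][j] = score[i + 1][j + 1] + 1
            else:
                score[i][j] = max(score[i + 1][j], score[i][j + 1])
    # Trace the best path from the start of both sequences.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if content_speakers[i] == document_speakers[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif score[i + 1][j] >= score[i][j + 1]:
            i += 1  # skip a content section
        else:
            j += 1  # skip a document section
    return pairs


# "Tanaka" was not extracted from the content, so that document section is skipped.
content_speakers = ["Yamamoto", "Sato", "Yamamoto"]
document_speakers = ["Yamamoto", "Tanaka", "Sato", "Yamamoto"]
pairs = align_speakers(content_speakers, document_speakers)
```

Because the alignment preserves order, a missed speaker on one side does not shift the remaining correspondences.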

FIG. 22 ((a) content information, (b) document information) is a diagram for describing an example of section matching processing by the section matching degree deriving unit 54 when the speaker identification unit 52 extracts a plurality of persons as candidates. In this case, the section [f] can be associated with the section [7] of the document information by using the person information based on the document information. Note that "Takagi" and "Yamashita" do not appear in the document. Also, the section [a], which is a section of either “Yamamoto” or “Tanaka”, is associated with [1] and [2] because both names appear in the document information.

FIG. 23 is a block diagram showing another example of the configuration of the section correspondence deriving means 7 in the embodiment of the document association device of the present invention. The section correspondence deriving means 7 includes a speech recognition unit 61 that performs speech recognition to generate a candidate text for the input speech, and a candidate text document correspondence unit 62 that associates the candidate text with the document in the document storage unit 4.

FIG. 24 is a block diagram showing an example of the configuration of the candidate text document correspondence unit 62. The candidate text document correspondence unit 62 includes a candidate text word extraction unit 71, an intra-document section word extraction unit 72, a candidate text/document section word similarity calculation unit 73, and a candidate text/document section correspondence unit 74. The candidate text word extraction unit 71 extracts one or a plurality of words from the candidate text of each section. The intra-document section word extraction unit 72 extracts one or more words in each document section. The candidate text/document section word similarity calculation unit 73 calculates the intra-section distance. The candidate text/document section correspondence unit 74 associates the sections with each other.

Next, another correspondence deriving method executed by the section correspondence deriving means 7 in the embodiment of the document associating method of the present invention will be described. FIG. 25 is a flowchart showing another example of the correspondence deriving method executed by the section correspondence deriving means 7 in the embodiment of the document correspondence method of the present invention. FIG. 26 and FIG. 27 are diagrams showing the correspondence between the content information and the document information in the correspondence deriving method. It is assumed that the content includes audio information.

The speech recognition unit 61 receives information on the content section from the content section extraction means 5. Content information is also input from the content storage means 3. Then, voice information is extracted from the content information, voice recognition is performed, and a candidate text for the input voice is generated (step S341). There are various speech recognition methods, and any of them may be used in this embodiment.

[0122] The candidate text document correspondence unit 62 receives the candidate text of each section of the content from the speech recognition unit 61, and associates the candidate text with the document in the document storage unit 4.

[0123] The candidate text document correspondence unit 62 compares a word in the candidate text with a word in the document section. Then, the content section including the matched word or a similar word is associated with the document section. Specifically, the candidate text word extraction unit 71 extracts one or more words used in each content section from the candidate text of the section (step S342). The intra-document section word extraction unit 72 extracts one or more words in each document section (step S343). Step S342 and step S343 may be performed simultaneously, or step S343 may be performed first. Next, the candidate text/document section word similarity calculation unit 73 calculates an intra-section distance for determining the similarity between the words in the content section and the words in the document section (step S344). The candidate text/document section correspondence unit 74 associates the content section with the document section by comparing the extracted word sets based on the intra-section distance, and outputs the result (step S345).

FIG. 26 shows an example of the correspondence between the candidate text and the document in the document storage unit 4 derived by the candidate text/document section correspondence unit 74. (a) shows the content section, (b) the start time of the content section, (c) the candidate text words, (d) the words in the document section, (e) the document section, and (f) the document. In the example shown in FIG. 26, the words (information communication, speech recognition, semantic information, ...), (security, video camera, moving object, ...), (experiment, ...), (research, ...) are extracted from the respective document sections. From the respective audio/video sections, that is, the content sections (13:41, 15:41), (15:41, 16:50), (16:50, 20:15), (20:15, 21:13), ..., the words (speech recognition, semantic information, ...), (information communication, semantic information, ...), (security, ...), (research, ...) are extracted. Such words may be obtained by simply extracting nouns from the document, or important words may be registered in a dictionary in advance and extracted by matching against the words in the dictionary. The importance of a word may also be determined by analyzing its frequency of use.

FIG. 27 shows an example of the correspondence between the candidate text and the document in the document storage unit 4 derived by the candidate text/document section correspondence unit 74. (a) shows a content section, (b) shows a time between content sections, (c) shows a document section, (d) shows a document, and (e) shows a correspondence table. The candidate text/document section correspondence unit 74 can derive the correspondence of each section by measuring the similarity (degree of overlap) of the word strings, as exemplified in the correspondence table in FIG. 27 (e). As shown in FIG. 26, if no correspondence can be found, the result may be "no correspondence". Dynamic programming matching (DP matching) may also be used to derive the correspondence between the content section and the document section.
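The degree of overlap between word sets can be measured in several ways. The sketch below uses the Jaccard index as one possible intra-section distance; this choice, the threshold parameter, and the final "weather" word set are assumptions, while the other word sets are loosely based on the words listed for FIG. 26.

```python
def jaccard(words_a, words_b):
    """Share of words common to both sets, relative to all words in either."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_sections(content_words, document_words, threshold=0.0):
    """For each content section, the best document section index or None."""
    result = []
    for words in content_words:
        scores = [jaccard(words, doc) for doc in document_words]
        best = scores.index(max(scores)) if scores else None
        # None stands for "no correspondence".
        result.append(best if best is not None and scores[best] > threshold else None)
    return result


document_words = [
    {"information communication", "speech recognition", "semantic information"},
    {"security", "video camera", "moving object"},
    {"experiment"},
    {"research"},
]
content_words = [
    {"speech recognition", "semantic information"},
    {"information communication", "semantic information"},
    {"security"},
    {"research"},
    {"weather"},  # hypothetical section sharing no word with the document
]
matches = match_sections(content_words, document_words)
```

Here the first two content sections both map to the first document section, which is the kind of many-to-one result that a correspondence table can record directly.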

As described above, the association between the content section and the document section is realized. The association may be realized by a combination of the components (FIGS. 14, 19, and 23) of the above-described section correspondence deriving means 7.

[0127] The output unit 8 shown in Fig. 1 outputs the correspondence between the audio or video derived by the section correspondence deriving unit 7 and the document section. As an example of the output form, as shown in FIG. 27 (e), there is a correspondence table in which the time in the content is added to the head of the section of the document. In addition, any output form may be used as long as it represents the correspondence between the time information of the content and the document section.
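An output form in which the time in the content is added to the head of each document section might be produced as follows. The minute:second format and the representation of each entry as a (start time in seconds, section text) pair are assumptions of this sketch.

```python
def format_time(seconds):
    """Format a time offset in seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def correspondence_table(entries):
    """Return one line per document section, prefixed with its start time.

    entries is a list of (start time in seconds, document section text).
    """
    return [f"{format_time(start)} {text}" for start, text in sorted(entries)]


table = correspondence_table([(941, "Second section"), (821, "First section")])
```

Any other layout would serve equally well as long as each document section remains paired with the time information of the content.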

Industrial applicability

The present invention can be applied to an information presenting apparatus that automatically associates content with document information and displays them, and to a multimedia display device and a multimedia search device that search for and display a corresponding portion of the content by means of text information. It is also applicable to applications such as a congressional video browsing device that checks actual contents while referring to the minutes of a meeting, a lecture support system that refers to lecture materials and lecture contents, and an education support system.

Claims

The scope of the claims

[1] (a) a step of preparing a content including at least one of audio information and video information in which a plurality of speakers appear as speakers, and a document describing the content of the content;

(b) deriving a correspondence relationship between the content and the document for each speaker.

Document matching method.

[2] In the document matching method according to claim 1,

The step (b) comprises:

(b1) a step of dividing the content for each speaker into a plurality of content sections;

(b2) a step of dividing the document for each speaker into a plurality of document sections; and

(b3) a step of associating the plurality of content sections with the plurality of document sections,

including

Document matching method.

[3] In the document matching method according to claim 2,

The step (b2) includes:

(b21) extracting, from the content, a point in time when the speaker changes from one of the plurality of speakers to another of the plurality of speakers;

(b22) dividing the content for each speaker based on a point in time when the speaker changes;

including

Document matching method.

[4] In the document matching method according to claim 3,

The (b21) step is:

(b211) the content is the voice information, and a step of extracting a change point of the voice of the speaker from the voice information is included.

Document matching method.

[5] In the document matching method according to claim 3,

The (b21) step is:

(b212) the content is the video information, and a step of extracting a change point of the video of the speaker from the video information is included.

Document matching method.

[6] The document matching method according to any one of claims 1 to 3,

The content is audio-video information in which the audio information and the video information are synchronized.

Document matching method.

[7] The document matching method according to claim 3 or 5,

The (b21) step is:

(b213) analyzing a change point of a sound feature of the voice information to derive a time point at which the speaker changes.

Document matching method.

[8] The document association method according to claim 3 or 5,

The (b21) step is:

(b214) analyzing a change point of a visual feature of the video information to derive a time point at which the speaker changes.

Document matching method.

[9] In the document matching method according to claim 3 or 6,

The (b21) step is:

(b215) performing a change point analysis of a visual feature of the video information and a change point analysis of a sound feature of the audio information, and integrating both results to derive a time point at which the speaker changes.

Document matching method.

[10] The document association method according to any one of claims 4 to 9, wherein

The step (b) comprises:

(b4) a step of analyzing the structure of the document and dividing the document for each speaker.

Document matching method.

[11] A computer program product having program code means for performing all the steps of any one of claims 1 to 10 when used on a computer.

[12] A computer program product having the program code means according to claim 11, stored in a computer-readable storage means.

[13] A content section extraction unit that, for content including at least one of audio information and video information in which a plurality of speakers appear as speakers, divides the content for each speaker to extract a plurality of content sections;

A document section extracting unit for extracting a plurality of document sections by dividing a document describing the content of the content for each speaker;

A section correspondence deriving unit for deriving a correspondence between the plurality of content sections and the plurality of document sections;

including

Document association device.

[14] The document association apparatus according to claim 13,

The content is the audio information,

The content section extraction unit extracts a plurality of content sections by analyzing a sound feature of the audio information.

Document association device.

[15] The document association apparatus according to claim 13,

The content is the video information,

The content section extraction unit extracts the plurality of content sections by analyzing visual characteristics of the video information.

Document association device.

[16] The document association device according to claim 13,

The content is audio-video information in which the audio information and the video information are synchronized,

The content section extraction unit extracts the plurality of content sections by integrating a result of analysis of a sound feature of the audio information and a result of analysis of a visual feature of the video information.

Document association device.

[17] The document association device according to claim 16,

The content section extraction unit,

A voice section extracting unit that analyzes a sound feature of the voice information and divides the voice information into speaker units to extract a plurality of voice sections;

A video section extracting unit that analyzes a visual feature of the video information and divides the video information into speakers to extract a plurality of video sections;

An audio / video section integrating unit that extracts the plurality of content sections based on a plurality of pieces of audio section information regarding the plurality of audio sections and a plurality of pieces of video section information regarding the plurality of video sections;

including

Document association device.

[18] The document association device according to claim 13,

The content section extraction unit extracts a speaker change point as a point in time when a speaker changes from one of the plurality of speakers to another of the plurality of speakers in the content, and extracts the plurality of content sections.

Document association device.

[19] The document association device according to claim 18,

The content includes the audio information,

The content section extraction unit extracts the speaker change point based on a change in a characteristic of at least one piece of prosodic information among the pitch, speed, and loudness of the utterance in the audio information.

Document association device.

[20] The document association device according to claim 18,

The content includes the audio information,

The content section extraction unit extracts the speaker change point based on a change in a conversation mode in the audio information.

Document association device.

[21] The document association apparatus according to claim 18, wherein

The content includes the video information,

The content section extraction unit extracts the speaker change point based on a change in a visual characteristic of a person in the video information.

Document association device.

[22] The document association device according to claim 18,

The content includes the video information,

The content section extraction unit extracts the speaker change point based on a change in a facial feature of a person in the video information.

Document association device.

[23] The document association device according to claim 18,

The content includes the video information,

The content section extraction unit extracts the speaker change point based on a change in a visual characteristic of a person's clothing in the video information.

Document association device.

[24] The document association device according to any one of claims 13 to 23,

The document section extracting unit extracts the plurality of document sections based on format information of the document.

Document association device.

[25] The document association apparatus according to any one of claims 13 to 23,

The document section extracting unit extracts the plurality of document sections based on a description about a speaker written in the document.

Document association device.

[26] The document association apparatus according to any one of claims 13 to 23,

The document section extraction unit extracts the plurality of document sections based on tag information of a structured document in the document.

Document association device.

[27] The document association device according to any one of claims 13 to 23,

The document section extraction unit extracts the plurality of document sections based on a change in a conversation feature in the document.

Document association device.

[28] The document association apparatus according to any one of claims 13 to 27,

The section correspondence deriving unit associates the plurality of content sections with the plurality of document sections based on a comparison between the section lengths of the plurality of content sections and the document amounts of the plurality of document sections.

Document association device.

[29] The document association device according to claim 28,

The section correspondence deriving unit performs the association based on a result of performing dynamic programming matching for the plurality of content sections and the plurality of document sections.

Document association device.
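
The comparison of section lengths with document amounts by dynamic programming matching can be sketched as follows. This is a minimal example under stated assumptions: section lengths are durations in seconds, document amounts are character counts, both are normalized to fractions of their totals, and the cost of leaving a section unmatched equals its normalized size. The function name `align_by_length` and this cost model are hypothetical and are not taken from the claims.

```python
from typing import List, Tuple


def align_by_length(durations: List[float], doc_sizes: List[float]) -> List[Tuple[int, int]]:
    """Return monotonic (content_index, document_index) pairs."""
    total_c, total_d = sum(durations), sum(doc_sizes)
    if total_c <= 0 or total_d <= 0:
        raise ValueError("both sequences need a positive total length")
    # Normalize so that seconds and character counts become comparable.
    c = [x / total_c for x in durations]
    d = [x / total_d for x in doc_sizes]
    n, m = len(c), len(d)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # Pair content section i-1 with document section j-1.
                cand = cost[i - 1][j - 1] + abs(c[i - 1] - d[j - 1])
                if cand < cost[i][j]:
                    cost[i][j], back[i][j] = cand, "match"
            if i > 0:
                # Leave content section i-1 without a partner.
                cand = cost[i - 1][j] + c[i - 1]
                if cand < cost[i][j]:
                    cost[i][j], back[i][j] = cand, "skip_content"
            if j > 0:
                # Leave document section j-1 without a partner.
                cand = cost[i][j - 1] + d[j - 1]
                if cand < cost[i][j]:
                    cost[i][j], back[i][j] = cand, "skip_document"
    # Trace the cheapest path back from the end of both sequences.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 or j > 0:
        step = back[i][j]
        if step == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_content":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs
```

With durations of 30, 5, and 60 seconds and document sections of 300 and 600 characters, the short middle content section is left unmatched and the other two are paired in order.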

[30] The document association apparatus according to any one of claims 13 to 29,

The section correspondence deriving unit includes:

A content speaker identification unit for identifying a speaker in at least one of the plurality of content sections;

A document speaker information extracting unit that specifies a speaker in at least one of the plurality of document sections and obtains speaker information as information of the speaker; and

A section matching unit that matches the plurality of content sections and the plurality of document sections based on the speaker information.

Document association device.

[31] The document association apparatus according to claim 30,

The content speaker identification unit includes:

A content feature amount extraction unit for extracting a feature amount in at least one of the plurality of content sections;

A speaker information storage unit for storing the feature amount and the speaker in association with each other; and

A feature matching identification unit for identifying the speaker based on a comparison between the stored feature amount and the extracted feature amount.

Document association device.
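
One possible reading of the stored-versus-extracted feature comparison is a nearest-neighbor lookup, sketched below. The class `SpeakerInfoStore`, its methods, the use of Euclidean distance, and the two-element feature vectors are assumptions for illustration. The claims do not fix a feature representation or a distance measure.

```python
import math
from typing import Dict, List, Sequence


class SpeakerInfoStore:
    """Hypothetical store that keeps one feature vector per known speaker."""

    def __init__(self) -> None:
        self._features: Dict[str, List[float]] = {}

    def register(self, speaker: str, feature: Sequence[float]) -> None:
        # Store the feature amount in association with the speaker.
        self._features[speaker] = list(feature)

    def identify(self, feature: Sequence[float]) -> str:
        """Return the stored speaker whose feature vector is closest."""
        if not self._features:
            raise ValueError("no speakers registered")
        # math.dist computes the Euclidean distance between two points of
        # equal dimension (available from Python 3.8).
        return min(
            self._features,
            key=lambda name: math.dist(self._features[name], feature),
        )
```

For example, after registering `[120.0, 0.3]` for one speaker and `[220.0, 0.5]` for another, a feature extracted from a content section as `[210.0, 0.45]` is attributed to the second speaker.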

[32] The document association apparatus according to claim 30 or 31, wherein

The content speaker identification unit identifies the speaker based on at least one prosodic feature among voice pitch, voice length, and voice strength in the voice information.

Document association device.

[33] The document association device according to claim 30 or 31, wherein

The content speaker identification unit specifies the speaker based on a feature amount representing a conversation mode in the voice information.

Document association device.

[34] The document association device according to claim 30 or 31, wherein

The content speaker identification unit specifies the speaker based on a visual feature of a person in the video information.

Document association device.

[35] The document association device according to claim 34,

The content speaker identification unit uses a facial feature of a person as a visual feature of the person.

Document association device.

[36] The document association apparatus according to any one of claims 30 to 35,

The document speaker information extracting unit specifies the speaker based on a description about the speaker written in the document.

Document association device.

[37] The document association apparatus according to any one of claims 30 to 35,

The document speaker information extracting unit specifies the speaker based on metadata of a structured document in the document.

Document association device.

[38] The document association device according to any one of claims 30 to 37,

The section matching unit associates the plurality of content sections with the plurality of document sections such that a speaker in each of the plurality of content sections matches a speaker in each of the plurality of document sections.

Document association device.

[39] The document association device according to claim 38,

The section matching unit associates the plurality of content sections with the plurality of document sections based on a result of performing dynamic programming matching on the plurality of content sections and the plurality of document sections.

Document association device.
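
A minimal sketch of matching sections so that speakers agree is given below. It assumes that each content section and each document section has already been labeled with a speaker name, and it uses a longest-common-subsequence recurrence as one simple form of dynamic programming matching. The function name `match_by_speaker` and the choice of this recurrence are assumptions, not the claimed method.

```python
from typing import List, Tuple


def match_by_speaker(content_speakers: List[str],
                     document_speakers: List[str]) -> List[Tuple[int, int]]:
    """Pair sections in order so that paired sections share a speaker."""
    n, m = len(content_speakers), len(document_speakers)
    # lcs[i][j] is the largest number of order-preserving pairs obtainable
    # from content sections i.. and document sections j..
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if content_speakers[i] == document_speakers[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk forward through the table to recover one optimal pairing.
    pairs: List[Tuple[int, int]] = []
    i = j = 0
    while i < n and j < m:
        if content_speakers[i] == document_speakers[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs
```

If the content yields speakers A, B, A, C and the document lists A, B, C, the third content section is left unpaired and the remaining sections are matched in order.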

[40] The document association apparatus according to any one of claims 13 to 39,

The content includes audio information,

The document association device further includes a speech recognition unit that extracts utterance content in the plurality of content sections and outputs utterance text information,

The section correspondence deriving unit associates the plurality of content sections with the plurality of document sections based on a similarity between the utterance text information and the document information of the document.

Document association device.

[41] The document association device according to claim 40,

The section correspondence deriving unit matches the utterance text information with the document information based on a result of dynamic programming matching between words appearing in the utterance text information and words appearing in the document information.

Document association device.

[42] The document association device according to claim 40 or 41,

The section correspondence deriving unit includes:

A basic word extraction unit that extracts one or more first basic words used in each of the plurality of content sections in the utterance text information and one or more second basic words used in each of the plurality of document sections; and

A basic word group similarity deriving unit that measures similarity between the plurality of first basic words and the plurality of second basic words,

and derives the correspondence based on the similarity.

Document association device.
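
The basic word extraction and similarity measurement can be illustrated with the sketch below. It assumes English text, approximates "basic words" as lowercased tokens outside a small stopword list, and measures similarity with the Jaccard index. The names `STOPWORDS`, `basic_words`, `jaccard`, and `similarity_matrix`, as well as these approximations, are hypothetical. A practical system would more likely use morphological analysis to obtain base forms.

```python
import re
from typing import List, Set

# Hypothetical stopword list used only for this example.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "was", "we"}


def basic_words(text: str) -> Set[str]:
    """Approximate the basic words of a section as its non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {token for token in tokens if token not in STOPWORDS}


def jaccard(first: Set[str], second: Set[str]) -> float:
    """Jaccard index of two word sets; 0.0 when both sets are empty."""
    union = first | second
    if not union:
        return 0.0
    return len(first & second) / len(union)


def similarity_matrix(utterances: List[str], doc_sections: List[str]) -> List[List[float]]:
    """Entry [i][j] is the similarity of content section i and document section j."""
    first_words = [basic_words(u) for u in utterances]
    second_words = [basic_words(s) for s in doc_sections]
    return [[jaccard(f, s) for s in second_words] for f in first_words]
```

The resulting matrix can then be handed to a matching step that selects an order-preserving set of high-similarity pairs.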

[43] The document association device according to claim 40 or 41,

The section correspondence deriving unit derives the correspondence by applying dynamic programming matching to the similarities.

Document association device.

[44] The document associating device according to any one of claims 13 to 43,

A content input unit for inputting the content;

A content storage unit for storing the content,

A document input unit for inputting the document information;

A document storage unit for storing the document,

An output unit that outputs information about the correspondence relationship;

Further comprising

Document association device.