WO2007043679A1 - Information processing device, and program - Google Patents

Information processing device, and program

Info

Publication number
WO2007043679A1
WO2007043679A1 (PCT/JP2006/320557)
Authority
WO
WIPO (PCT)
Prior art keywords
information
content
identifier
search
phoneme
Prior art date
Application number
PCT/JP2006/320557
Other languages
French (fr)
Japanese (ja)
Inventor
Masayoshi Ihara
Ryutaro Egawa
Hiroshi Otsuka
Kei Maruno
Shunji Mitsuyoshi
Original Assignee
Sharp Kabushiki Kaisha
Sgi Japan, Ltd.
Advanced Generation Interface Japan, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sharp Kabushiki Kaisha, Sgi Japan, Ltd., and Advanced Generation Interface Japan, Inc.
Priority to JP2007540220A
Publication of WO2007043679A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25: Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266: Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/26603: Channel or content management for automatically generating descriptors from content, e.g. when it is not made available by its provider, using content analysis techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40: Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47: End-user applications
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00: Details of television systems
    • H04N5/76: Television signal recording

Definitions

  • Content information acquisition means for acquiring content information; search condition input means for inputting search conditions; and specifying means for specifying, from the content information acquired by the content information acquisition means, content information that conforms to the search conditions input by the search condition input means, or a position within that content information.
  • As in Patent Document 1, a method has been proposed for detecting changes in content information during a content information search using a general information processing apparatus; it treats changes in volume as a feature amount and regards scenes in which the change exceeds a certain threshold as highlight scenes.
  • Here, a feature amount is a numerical value that quantifies, within a specified range of input information such as audio and video, time-series changes, changes between neighboring pixels, changes in color, acoustic frequency, and the like.
  • Various methods can be considered for converting such rates of change into numerical values. For audio, for example, the change along the frequency axis can be quantified using a cepstrum or an FFT; for video, the luminance and hue of adjacent pixels can be quantified as time-series changes, difference values, relative values, or absolute values. These are described in detail later in a modification example.
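To make this kind of quantification concrete, here is a minimal Python sketch (not from the patent; function and parameter names are illustrative) that turns an audio signal into per-frame cepstral feature vectors plus frame-to-frame difference values of the kind described above:

```python
import numpy as np

def audio_frame_features(samples, frame_len=512, hop=256):
    """Quantify an audio signal as per-frame spectral feature vectors.

    Illustrative sketch: low-order cepstral coefficients per frame, plus
    their frame-to-frame differences as a simple "rate of change" measure.
    """
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len, hop):
        frame = samples[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))     # change along the frequency axis
        log_spec = np.log(spectrum + 1e-10)
        cepstrum = np.fft.irfft(log_spec)[:20]    # low-order cepstral coefficients
        frames.append(cepstrum)
    feats = np.array(frames)
    deltas = np.diff(feats, axis=0)               # time-series change (difference values)
    return feats, deltas
```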
  • In Non-Patent Document 3, as an application of such technology, a search method based on phoneme symbol strings obtained by phoneme recognition and on image recognition has been proposed: a character string associated with an image is converted into a phoneme sequence or phoneme segment sequence, or conversely a phoneme sequence or phoneme segment sequence is converted into a character string, linking "still image - word set - text - voice - video" to one another.
  • In Patent Document 3, symbol strings based on phonemes and/or phoneme segments are registered in a database in association with geographical position information, and an information distribution device and a receiving device are proposed that search for and provide information containing the proper nouns common in city information. Patent Document 4 proposes retrieval of speech information indexed by phoneme recognition, and related techniques are also proposed in the documents cited therein.
  • A technique for recognizing emotions from voice feature information is disclosed in Patent Document 5, and a technique for detecting musical scales and instruments is proposed in Non-Patent Document 4.
  • Patent Documents 6 and 7 propose methods that recognize a moving image or still image, detect character strings and the like, and perform a search based on the detected character strings; methods for recognizing motion from images, known as gesture recognition and motion recognition, have also been proposed. Patent Document 8 proposes a method for recognizing facial images.
  • Patent Document 9 and the documents it cites propose related methods, but no method has been proposed for extracting the features of a specific scene by combining information obtained from multiple recognitions in a time-series manner, and thereby specifying a position on the content's time axis, a position on the display screen, or a position in a read-aloud text.
  • A state in which a plurality of different pieces of information arise in positional proximity to one another is generally called "co-occurrence". A "co-occurrence relation", "co-occurrence state", or "co-occurrence information" can be used to evaluate the conditions under which arbitrary information arises, by combining the information generated in the vicinity of some given information; for example, the meaning of sentences is estimated using a covariance matrix based on co-occurrence probabilities and co-occurrence information. Here, the positional neighborhood is a temporal and spatial neighborhood based on the time-series position, the reading position, and the display position.
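As an illustration of evaluating co-occurrence in a positional (here, temporal) neighborhood, the following Python sketch counts identifier pairs that arise within a small time window and turns the counts into co-occurrence probabilities. All names are assumptions for illustration, not the patent's:

```python
from collections import Counter

def cooccurrence_probs(events, window=3.0):
    """Estimate co-occurrence probabilities of identifier pairs.

    `events` is a list of (time_sec, identifier) tuples, e.g. outputs of
    phoneme, emotion, and image recognizers. Pairs closer than `window`
    seconds are counted as co-occurring. Illustrative sketch only.
    """
    events = sorted(events)
    pairs = Counter()
    for i, (t_i, id_i) in enumerate(events):
        for t_j, id_j in events[i + 1:]:
            if t_j - t_i > window:          # outside the temporal neighborhood
                break
            pairs[tuple(sorted((id_i, id_j)))] += 1
    total = sum(pairs.values()) or 1
    return {pair: n / total for pair, n in pairs.items()}

probs = cooccurrence_probs([(0.1, "phoneme:a"), (0.4, "emotion:joy"),
                            (0.5, "image:red"), (5.0, "phoneme:k")])
```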
  • Patent Document 10 proposes a method of indexing content information in a sensitivity-word space, and Non-Patent Document 5 proposes a search method that indexes video and audio with character strings based on utterance content; however, constructing an evaluation function for search using co-occurrence relationships among the recognition results in the content information, the feature values used for recognition, and the identifiers is not proposed.
  • In fields where human operators respond flexibly, such as product reputation surveys at call centers, searching content information such as moving images according to personal tastes, nursing patients in medical settings, and the reactions of virtual personalities in robots or agents, no method has been proposed that evaluates information using co-occurrence relationships constructed from multiple features and identifiers (symbols for distinguishing features) obtained from the environment, and that, based on the evaluation results, detects information and provides information and processing highly convenient for the user.
  • Regarding the syllables, phonemes, and phoneme segments in the present invention: for the Japanese "a-ka-sa-ta-na", a notation such as "a/ka/sa/ta/na" is used for syllables, and "a/k/a" for phonemes.
  • Phoneme recognition and phoneme segment recognition differ from general speech recognition. More specifically, because phoneme recognition and phoneme segment recognition do not use a grammar-related language model, the recognition result carries no semantic interpretation: the result is not converted into meaning-bearing symbols such as kanji, homonyms are not distinguished from one another, and nouns are not distinguished from verbs according to context. They are characterized by analyzing the utterance using a model and evaluating only the match between the utterance and the recognition symbols.
  • a "phoneme” refers to a vowel or consonant that is a component of speech
  • a "phoneme segment” is an element obtained by subdividing one phoneme into , The middle of “A”, the end of “A”, and the notation based on the change of phonemes for the utterances that are intermediate sounds such as the sound between “A” and “I” It may be written as “phoneme identifier” or “phoneme segment identifier”.
  • Patent Document 1: JP 2004-233541 A
  • Patent Document 2: JP 62-220998 A
  • Patent Document 3: JP 2004-54915 A
  • Patent Document 4: JP 2002-221984 A
  • Patent Document 5: JP 2002-91482 A
  • Patent Document 6: JP 2002-14973 A
  • Patent Document 7: JP 09-330400 A
  • Patent Document 8: JP 5-153581 A
  • Patent Document 9: JP 7-36883 A
  • Patent Document 10: JP 2005-107718 A
  • Patent Document 11: JP 2004-280158 A
  • Patent Document 12: JP 10-320400 A
  • Patent Document 13: Japanese Patent Application No. 2005-147048
  • Non-Patent Document 1: Masayuki Nakazawa, Takashi Endo, Kiyoshi Furukawa, Jun Toyoura, Takashi Oka (New Information Processing Development Corporation), "Study of speech summaries and topic summaries using phoneme symbol sequences from speech waveforms", IEICE Technical Report, SP96-28, pp. 61-68, June 1996.
  • Non-Patent Document 2: "Research and Development on Life Support Interface for Aged Society", Key Project Research Report by Aomori Prefectural Industrial Research Center, Vol. 5, Apr. 1998 - Mar. 2001.
  • Non-Patent Document 3: Takashi Oka, Hironobu Takahashi, Takuichi Nishimura, Nobuhiro Sekimoto, Hidehide Mori, Masanori Ihara, Hiroaki Yabe, Hiroaki Hashiguchi, Hiroshi Matsumura, "Pattern search algorithm map supporting 'CrossMediator'", Artificial Intelligence Study Group, volume 1, pages 1-6, Japanese Society for Artificial Intelligence, 2001.
  • Non-Patent Document 4: Masahiro Tani, "Integration of instrument sound features by Bayesian Network and application to instrument identification", 2003 IEICE General Conference, "D-14 Speech, Auditory", D-14-21, p. 188, March 2003.
  • Non-Patent Document 5: Satoshi Nagao, "Semantic Transcoding - Towards More Practical Semantic Web", Research on Human-Centered Intellectual Information Technology VI-3.6, Japan Information Processing Development Corporation Technical Research Institute, March 2003.
  • Conventional search methods have generally either used character strings and audio information associated with images and video, or evaluated identifiers and feature values obtained by a single recognition or feature extraction method. As a result, there has been the problem that searches based on abstract concepts difficult to express in language, on sensory concepts such as the excitement of a scene, or on personal tastes and subjectivity are difficult.
  • In Non-Patent Document 3, a search is performed using phoneme symbols acquired by phoneme recognition as identifiers; however, no method has been proposed for constructing a covariance matrix from co-occurrence information that combines identifiers and feature quantities from multiple recognition methods, such as emotion identifiers obtained by emotion recognition of speech, image identifiers obtained by image recognition, and motion identifiers obtained by motion recognition, and for constructing from it a new evaluation function for indexing or searching.
  • The inventor considered that by creating an evaluation function based on the co-occurrence relationships of the identifiers and feature quantities obtained as the results of such various recognitions, searches and indexing that were previously impossible, such as abstract searches for a "climax", become feasible. Furthermore, by letting the user or producer give an appropriate name to any evaluation function configured as an analysis result, and generating phoneme strings and phoneme segment strings from the named character string, user-defined evaluation functions and indexes can be used to specify search conditions, and the configured evaluation functions can be distributed; in this way a highly convenient search environment can be realized.
  • As technology related to the co-occurrence of information, methods have been proposed, based on Patent Document 9 and the documents it cites, that measure co-occurrence relations from the co-occurrence frequency of words and characters within the same sentence, using co-occurrence probabilities and covariance matrices, and that extract sentence features for semantic estimation.
  • In contrast, the present invention is characterized by using co-occurrence information, co-occurrence probabilities, covariance matrices, and co-occurrence matrices over the identifiers extracted by various recognition methods and the feature quantities used in recognizing them.
  • In Non-Patent Document 3, an image is uniformly segmented and the word strings statistically associated with the segmented image features are expanded into phonemes and phoneme segments, making it possible to search based on utterances or to find locations in a video where a given word is uttered. However, it was impossible to statistically classify, based on their co-occurrence states, combinations of the specific image feature tendencies, emotion feature tendencies, and voice feature tendencies obtained through recognition; to configure an evaluation function that assigns them an identifier; to associate a phoneme sequence or phoneme segment sequence with the utterance of a name denoting the identifier's target; and to construct an evaluation function indexed for searching those identifiers.
  • In view of the above, an object of the present invention is to provide an information search device and the like that can easily search arbitrary content information by using co-occurrence information based on various types of input information.
  • In order to solve the above problems, an information processing apparatus according to a first invention comprises: content information acquisition means for acquiring content information; search condition input means for inputting a search condition; and specifying means for specifying, from the content information acquired by the content information acquisition means, content information that conforms to the search condition input by the search condition input means, or a position within the content information. The apparatus further comprises: feature quantity extraction means for extracting a feature quantity from the content information; identifier generation means for generating an identifier from the feature quantity extracted by the feature quantity extraction means using an evaluation function; index information storage means for storing the feature quantity and/or the identifier as index information in association with the content or a position within the content; and search condition conversion means for converting the search condition input by the search condition input means into a feature quantity and/or an identifier. The specifying means comprises search specifying means for specifying the content or a position within the content by detecting a match between the index information and the search condition using the feature quantity and/or identifier converted by the search condition conversion means.
  • An information processing apparatus according to a second invention comprises: content information acquisition means for acquiring content information; search condition input means for inputting a search condition; and specifying means for specifying, from the content information acquired by the content information acquisition means, content information that conforms to the search condition input by the search condition input means, or a position within the content information. The apparatus further comprises: feature quantity extraction means for extracting a plurality of different feature quantities from the content information; identifier generation means for generating a plurality of different identifiers from the plurality of different feature quantities extracted by the feature quantity extraction means using evaluation functions; index information storage means for storing the plurality of different feature quantities and/or identifiers as index information in association with the content or a position within the content; and search condition conversion means for converting the search condition input by the search condition input means into a plurality of different feature quantities and/or identifiers. The specifying means comprises search specifying means for specifying the content or a position within the content by detecting a match between the index information and the search condition using the plurality of different feature quantities and/or identifiers converted by the search condition conversion means.
  • A third invention is the information processing apparatus according to the first or second invention, wherein the index information storage means further stores, in association with the content or a position within the content, index co-occurrence information configured based on the feature quantities and/or identifiers obtained from the content. The apparatus further comprises search condition co-occurrence information configuring means for configuring, as search condition co-occurrence information, co-occurrence information based on the feature quantities and/or identifiers converted from the search condition by the search condition conversion means, and the search specifying means comprises co-occurrence search specifying means for specifying the content or a position within the content by detecting a match between the search condition co-occurrence information configured by the search condition co-occurrence information configuring means and the index co-occurrence information.
  • In a fourth invention, the content includes character information, and the identifier generation means generates an identifier based on that character information.
  • In a fifth invention, associations between character information and identifiers are further stored as dictionary information, and the identifier generation means generates identifiers from the character information included in the content using the dictionary information.
  • A sixth invention is the information processing apparatus according to any one of the first to fifth inventions, further comprising: standard pattern dictionary information storage means for storing identifiers in association with standard patterns as standard pattern dictionary information; and identifier feature quantity conversion means for converting an identifier into a feature quantity of a standard pattern by using the standard pattern dictionary information.
  • In a seventh invention, the index information storage means further stores the feature quantities and/or identifiers in association with the content or a position within the content based on the real time of the content information, and the specifying means detects matches between the index information and the search condition for content information distributed in real time.
  • An eighth invention is the information processing apparatus according to any one of the first to seventh inventions, characterized by presenting, while content information is being searched and/or when a search or detection result is presented, advertisement information associated with the search and/or the index information.
  • In a ninth invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the content for use in phoneme recognition, or at least one of the identifiers is a phoneme identifier generated from that phoneme information.
  • In a tenth invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the content for use in phoneme segment recognition, or at least one of the identifiers is a phoneme segment identifier generated from that phoneme segment information.
  • In an eleventh invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the content for use in emotion recognition, or at least one of the identifiers is an emotion identifier generated from that emotion information.
  • In a twelfth invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from auditory information of the content for use in recognition, or an identifier generated from that auditory information.
  • In a thirteenth invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from visual information of the content for use in recognition, or an identifier generated from that visual information.
  • In a fourteenth invention, the content includes character information, and at least one of the plurality of different feature quantities extracted by the feature quantity extraction means, or of the identifiers generated by the identifier generation means, is a feature quantity extracted from the character information or an identifier generated from the character information.
  • A fifteenth invention is the information processing apparatus of the second invention, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means, or of the plurality of different identifiers generated by the identifier generation means, is a feature quantity or identifier extracted from program information.
  • A sixteenth invention is the information processing apparatus of the second invention, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means, or of the plurality of different identifiers generated by the identifier generation means, is a feature quantity or identifier extracted from sensor information.
  • A seventeenth invention further comprises evaluation function reconstructing means for reconstructing the evaluation function from co-occurrence information configured based on the feature quantities and/or identifiers obtained from the content.
  • An eighteenth invention is the information processing apparatus according to the third invention, comprising evaluation function reconstructing means for reconstructing the evaluation function from co-occurrence information configured based on the feature quantities and/or identifiers converted from the search condition by the search condition conversion means.
  • A nineteenth invention comprises: search result co-occurrence information configuring means for configuring co-occurrence information based on the result of specifying the content or a position within the content by the co-occurrence search specifying means; and evaluation function reconstructing means for reconstructing the evaluation function from the co-occurrence information configured by the search result co-occurrence information configuring means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, a phoneme feature quantity extracted from the content for use in phoneme recognition and/or a phoneme identifier obtained by phoneme recognition, and an emotion feature quantity extracted from the content for use in emotion recognition and/or an emotion identifier obtained by emotion recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, a phoneme segment feature quantity extracted from the content for use in phoneme segment recognition and/or a phoneme segment identifier obtained by phoneme segment recognition, and an emotion feature quantity extracted from the content for use in emotion recognition and/or an emotion identifier obtained by emotion recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, a phoneme feature quantity extracted from the content for use in phoneme recognition and/or a phoneme identifier obtained by phoneme recognition, an emotion feature quantity extracted from the content for use in emotion recognition and/or an emotion identifier obtained by emotion recognition, and a first feature quantity extracted from the content for use in a first recognition and/or a first identifier obtained by the first recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, a phoneme segment feature quantity extracted from the content for use in phoneme segment recognition and/or a phoneme segment identifier obtained by phoneme segment recognition, an emotion feature quantity extracted from the content for use in emotion recognition and/or an emotion identifier obtained by emotion recognition, and a first feature quantity extracted from the content for use in a first recognition and/or a first identifier obtained by the first recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, a phoneme feature quantity extracted from the content for use in phoneme recognition and/or a phoneme identifier obtained by phoneme recognition, and a first feature quantity extracted from the content for use in a first recognition and/or a first identifier obtained by the first recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, a phoneme segment feature quantity extracted from the content for use in phoneme segment recognition and/or a phoneme segment identifier obtained by phoneme segment recognition, and a first feature quantity extracted from the content for use in a first recognition and/or a first identifier obtained by the first recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, an emotion feature quantity extracted from the content for use in emotion recognition and/or an emotion identifier obtained by emotion recognition, and a first feature quantity extracted from the content for use in a first recognition and/or a first identifier obtained by the first recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • The first identifier and/or the first feature quantity is an identifier and/or feature quantity based on auditory information and/or visual information and/or character information and/or sensor information.
  • The inventor noted that evaluation using co-occurrence information can be applied to the probabilistic evaluation of the conditions under which arbitrary information arises, by combining the information generated in the vicinity of some given information, and that co-occurrence relationship information can therefore be used for searching and learning, since it has been used for searches based on the semantic estimation of content information.
  • For example, in action movies a "scream", an "explosion sound", and an "explosion video, i.e. a screen change accompanied by radial movement in red and yellow image information" are pieces of information having a characteristic co-occurrence relation; the invention aims to solve the problems by evaluating, interpreting, searching, and learning such relations.
  • Specifically, by combining various recognition methods from the prior art, phonemes, phoneme segments, emotions, and image features are recognized for each frame of the audio and video streams of a moving image; the moving image is indexed using the identifiers obtained as recognition results; co-occurrence probabilities of the identifiers are configured for each frame; the transitions of the co-occurrence probabilities are accumulated over multiple frames on the basis of a co-occurrence matrix to obtain a covariance matrix; and the eigenvalues and eigenvectors of the covariance matrix are obtained to construct an evaluation function.
  • indexing can be performed based on the co-occurrence information of various recognition results in the content information.
  • The evaluation functions may be reconstructed by multivariate analysis and their number increased arbitrarily, and the name of an added evaluation function may be defined manually from the image and voice tendencies it detects. The evaluation functions may also be reconfigured based on the user's operations on the search results, and an HMM may be trained instead of using the eigenvalues and eigenvectors.
  • An index based on an evaluation function configured in this way enables the user to obtain search results, based on combinations of image features, acoustic features, and detected emotions, that were previously impossible, and the function for reconstructing the evaluation function according to usage makes it possible to search for content information better suited to the user's own subjectivity.
  • The gist of the invention is thus search and detection based not on combinations of conventional word information, but on the phoneme sequences or phoneme segment sequences contained in the voice of the content information, emotion identifiers based on emotion recognition, and the image and audio features that characteristically co-occur in a scene of a video.
  • In other words, the present invention records, classifies, and accumulates the co-occurrence states of images, phonemes, and emotions that co-occur without humans being conscious of it, and, by assigning identifiers again on the basis of the accumulated information and making them available for search and detection, aims to solve problems that could not be solved by conventional single-recognition search.
  • Since visual information and text information can also be used independently, various applications are possible: for telephone calls, the co-occurrence relationship between utterance recognition and emotion recognition arising in the voice information can be used, and the invention can serve as a tool for robots and for video production and editing.
  • For example, a co-occurrence matrix with a total of 250 elements is constructed for each frame to determine the co-occurrence probabilities; a covariance matrix of the co-occurrence probabilities is constructed by accumulating over frames (for example, 3 seconds); and the eigenvalues and eigenvectors of the covariance matrix are obtained to construct an evaluation function.
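A minimal numerical sketch of this pipeline, assuming flattened 250-element co-occurrence probability vectors per frame and a 3-second window at 30 fps (all names are illustrative, not the patent's):

```python
import numpy as np

def build_evaluation_function(frame_cooc, top_k=3):
    """From per-frame co-occurrence vectors, derive eigenvector-based scorers.

    `frame_cooc`: array of shape (n_frames, n_elements), e.g. flattened
    co-occurrence probabilities per frame, accumulated over ~3 seconds.
    """
    cov = np.cov(frame_cooc, rowvar=False)        # covariance of co-occurrence probs
    eigvals, eigvecs = np.linalg.eigh(cov)        # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1][:top_k]     # strongest co-occurrence axes
    axes = eigvecs[:, order]
    mean = frame_cooc.mean(axis=0)

    def evaluate(cooc_vector):
        # Score a new frame's co-occurrence vector against each principal axis.
        return (cooc_vector - mean) @ axes
    return evaluate

frames = np.random.rand(90, 250)                  # e.g. 3 s of 30 fps frames
score = build_evaluation_function(frames)(np.random.rand(250))
```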
  • A function name and identifier can be given by the user, or a character string with a high probability of co-occurring with the function obtained by multivariate analysis can be assigned as the function name and identifier, so that the function can be invoked according to the user's instructions. In addition, a dictionary can be constructed based on the co-occurrence information between video- and emotion-related identifiers and their associated character strings and phoneme strings.
  • That is, rather than converting phonemes into indexed character strings and searching on them as in the prior art, the present invention performs mutual conversion between identifier strings consisting of phonemes or phoneme segments and the identifiers obtained by the evaluation functions used for recognition, and constructs evaluation functions using co-occurrence matrices over the phoneme-based identifiers and identifier strings, other identifiers and identifier strings such as those for emotions and video, and feature quantities, thereby performing indexing, search, detection, and learning automatically or recursively based on user instructions.
  • The identifiers used are not limited to the phonemes and color information described above; various identifiers are used according to purpose, such as "emotion identifiers", "scale identifiers", "environmental sound identifiers", characters obtained by image recognition, and "person identifiers" and "object identifiers" accompanying image recognition. Here, "identifier" refers to a symbol discriminated from audio and video feature quantities on the basis of probability, likelihood, or distance by an evaluation function or HMM: an "emotion identifier" from emotion recognition, an "environmental sound identifier" from environmental sound recognition, and "characters", "person identifiers", "expression identifiers", "object identifiers", and moving-image "motion identifiers" from face detection and image recognition. Publicly known recognition technology can be used, and program information, text information, sensor information, and the like may be combined.
  • As a result, content information is automatically indexed with identifiers and feature quantities of emotions and images together with phonemes and phoneme segments, and a search is performed by combining these identifiers; for example, a place can be detected where feature values identifiable as "laughter" appear in the surrounding features and where the phoneme sequence of a specific line of dialogue appears.
  • This realizes an information processing device that provides search capabilities a conventional video search system cannot, automatically records programs exhibiting a characteristic tendency, and delivers an e-mail upon detection. It is also possible to construct a "laughing state" identifier or discriminant function by performing face detection and facial feature extraction simultaneously with the laughter emotion identifier and learning the co-occurrence information of the identifiers and feature quantities.
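As an illustration of such detection, here is a sketch (identifier names are assumptions for illustration) that scans a per-frame identifier index for positions where a laughter emotion identifier, a detected face, and optionally a phoneme substring co-occur:

```python
def detect_scenes(index, required=("emotion:laugh", "face:detected"), phonemes=None):
    """Scan a frame index for positions where identifiers co-occur.

    `index` maps frame_no -> set of identifier strings; `phonemes` is an
    optional phoneme substring (e.g. a line of dialogue) that must also
    appear among the frame's phoneme identifiers.
    """
    hits = []
    for frame_no in sorted(index):
        ids = index[frame_no]
        if all(r in ids for r in required):
            if phonemes is None or any(i.startswith("phoneme:") and phonemes in i
                                       for i in ids):
                hits.append(frame_no)
    return hits

scenes = detect_scenes({10: {"emotion:laugh", "face:detected", "phoneme:kaka"}},
                       phonemes="kaka")
```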
  • Further, by continuously extracting features from the voices of consumers and operators at a consumer consultation desk, performing phoneme recognition, identifying the product from the recognized phonemes, and recording the detected emotion for the identified product, a user's emotional evaluation of a specific product can be recorded and used for product quality analysis; alternatively, the manual of the target product can be displayed on the terminal screen when the operator utters the product name. The invention aims to solve the problems in this way.
  • By combining scale features, phoneme features, and emotion features, the sung voice in a piece of music and the user's own singing voice are recognized, and the phoneme sequence of the lyrics and emotion identifiers are obtained. Music is then searched by expanding input character strings into phoneme symbol strings, comparing scale transition states and the appearance frequencies of emotion features, and retrieving music with high similarity; this makes possible music searches matched to personal taste that did not exist before, and solves the problems.
  • First, the user's utterance is converted into a phoneme string.
  • Next, the actor names in EPG, BML, RSS, and teletext are converted into phoneme strings.
  • Then, an actor-name phoneme string that matches the user's utterance phoneme string is searched for.
  • Finally, the cast name associated with the actor name of the matched phoneme string is detected.
  • The phoneme string may also be expanded from words or keywords input as characters.
  • Further, a phoneme sequence index is constructed while performing phoneme recognition on the audio synchronized with the distributed moving image, and locations are searched for that match the phoneme sequence of a cast name based on an actor name detected from EPG, BML, RSS, or teletext.
  • The emotional features contained in the audio signal around the cast name, and the program genre, may also be evaluated.
  • Upon detecting that the phoneme string based on the cast name matches together with the emotion feature specified by the user, recording is started, playback skips to only the target ranges, or a ranked list is created and output as a search result to guide the user's operation, thereby achieving a convenient search and solving the problem; a sketch of this matching pipeline is given below.
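A minimal sketch of the actor-name phoneme matching described above, under stated assumptions: a toy word-to-phoneme lexicon stands in for a real grapheme-to-phoneme converter, and `epg_actors` is assumed to be already parsed from EPG/BML/RSS elsewhere.

```python
import re

def to_phonemes(text, lexicon):
    """Expand a character string into a phoneme sequence via a lexicon.
    Real systems would use a grapheme-to-phoneme converter; this is a toy."""
    return [p for w in re.findall(r"\w+", text.lower()) for p in lexicon.get(w, [])]

def find_matches(utterance_phonemes, epg_actors, lexicon):
    """Return (actor, cast_name) pairs whose actor-name phoneme string
    appears inside the user's utterance phoneme string."""
    utter = "".join(utterance_phonemes)
    hits = []
    for actor, cast_name in epg_actors:     # e.g. parsed from EPG/BML/RSS
        actor_ph = "".join(to_phonemes(actor, lexicon))
        if actor_ph and actor_ph in utter:
            hits.append((actor, cast_name))
    return hits

lexicon = {"tanaka": ["t", "a", "n", "a", "k", "a"]}
hits = find_matches(["t", "a", "n", "a", "k", "a"], [("Tanaka", "Detective")], lexicon)
```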
  • It is also possible to solve the problem by classifying, using multivariate analysis, the character strings and phoneme or phoneme-segment symbol strings obtained from speech feature values, identifiers such as emotion, scale, instrument sound, and environmental sound, and the identifiers of shapes, colors, characters, and actions recognized from feature values obtained from video, and using these as identifiers in the present invention.
  • The device may also learn the feature quantities of information that the user frequently records or skip-plays, automatically start recording or skip playback upon detecting the learned feature quantities, or deliver an e-mail or RSS notification upon detection; any processing may be performed as long as the problem is solved.
  • In other words, the present invention is characterized by indexing and searching not with conventional identifiers associated with voice, but by combining identifiers extracted from images and voices as feature quantities, such as emotion identifiers recognized from voice, environmental sound identifiers and instrument identifiers, and video identifiers, motion identifiers, and shape identifiers, to obtain search results; by learning the co-occurrence states of identifiers and feature quantities in these processes; by distributing information on the emotion identifiers and other identifiers described in this embodiment; and by performing search and detection based on the distributed information.
  • Character information need not be parsed: the evaluation may use only co-occurrence states based on simple word appearance frequency, or, instead of co-occurrence information between words, the search may use co-occurrence states at the phonetic-symbol level, expanded into phonemes and phoneme segments that, unlike kanji, carry no dimension of meaning, and the search results so obtained may then be analyzed.
  • In contrast to the conventional technology, the present invention makes indexing possible by combining symbols, identifiers, characters, and the like based on the recognition of multiple audio, video, image, and text features. Reconstruction of evaluation functions enables search processing for complex expressions of content information that take into account human subjectivity and emotion, which was previously impossible; it allows abstract searches associated with the adjectives and adverbs contained in utterances and character strings; and it aims to solve the problem of the digital divide by reducing the complexity of using the underlying processing equipment.
  • Furthermore, by performing meta-indexing, in which content information expressing linguistic adjectives and adverbs is indexed based on the feature quantities and/or co-occurrence information of identifiers associated with the various recognitions, and by constructing annotation information through phoneme extraction, grounding based on multidimensional identifiers centered on phoneme sequences and emotions is implemented; by reusing these, information retrieval and knowledge sharing can be realized.
  • FIG. 1 A diagram showing a basic configuration example of the apparatus according to the present embodiment.
  • FIG. 3 A diagram showing the operation of generating an identifier by feature quantity identifier conversion.
  • FIG. 4 A diagram showing a configuration example of video index data.
  • FIG. 5 A diagram showing a configuration example of video index data in the unit time designation method.
  • FIG. 6 A diagram showing the operation of index co-occurrence state learning.
  • FIG. 8 A diagram showing an example of a co-occurrence matrix of emotions, phonemes, and images.
  • FIG. 9 A diagram showing an example of a covariance matrix of emotions, phonemes, and images.
  • FIG. 12 A diagram showing an example of learning with basic search conditions.
  • FIG. 14 A diagram showing a configuration example of an index information generating device.
  • FIG. 17 A diagram showing the operation of the search method.
  • FIG. 18 A diagram showing an operation procedure of a basic character string search request and execution method.
  • FIG. 19 A diagram showing an example of search processing.
  • FIG. 20 A diagram showing an example of a usage environment in the present embodiment.
  • FIG. 21 A diagram showing an example of a processing procedure on the transmission side.
  • FIG. 22 A diagram showing an example of a processing procedure on the receiving side.
  • FIG. 23 A diagram showing state transitions of search processing.
  • FIG. 25 A diagram showing an example of a basic procedure for acquiring external information.
  • FIG. 26 A diagram showing an example of a search and arbitrary processing method using EPG information.
  • FIG. 27 A diagram showing state transitions in a product reliability survey application based on consumer emotion.
  • FIG. 28 A diagram showing an example of a search procedure for language-specific phoneme symbols.
  • FIG. 29 A diagram showing an example of a phoneme symbol search procedure for language-specific character strings.
  • FIG. 30 A diagram showing a configuration example of a symbol conversion function.
  • FIG. 31 A diagram showing an example of an international phoneme symbol conversion procedure.
  • FIG. 32 A diagram showing an example of a conversion dictionary between Japanese phonemes and international phoneme symbols.
  • FIG. 33 A diagram showing an example of conversion from international phonemes to Japanese phonemes.
  • FIG. 34 A diagram showing an example of conversion from phonemes to phoneme segments.
  • FIG. 35 A diagram showing an example of conversion from phoneme segments to phonemes.
  • FIG. 36 A diagram showing an example of a search procedure for international phoneme symbols.
  • FIG. 37 A diagram showing an example of an international phoneme symbol search procedure.
  • FIG. 38 A diagram showing an example of an international phoneme symbol search procedure.
  • As in the basic apparatus configuration example of FIG. 1, the apparatus according to the present invention includes an information processing unit 10, a storage unit 20, an information input unit 30, an information output unit 40, and a communication line unit 50. The apparatus may have a built-in display device such as a TV or monitor, or the display may be held externally.
  • The communication line unit 50 is configured to communicate with other information processing apparatuses, whether wired or wireless, and to perform mutual communication and control with them; for example, devices using the present invention may search for, browse, and provide information to one another via communication lines.
  • The communication line unit 50 has a function of acquiring and distributing arbitrary information; more specifically, it is configured by combining devices such as Ethernet (registered trademark), ATM (Asynchronous Transfer Mode), Fibre Channel, wireless LAN, and infrared communication as required, and can use any communication protocol such as IP, TCP, UDP, and IEEE 802.
  • The information input unit 30 includes devices capable of inputting information, such as a keyboard, a pointing device, a moving-image capture device, a television broadcast receiving circuit, and a microphone input; it has the function of saving input to the storage unit according to instructions and outputting information to the information output unit based on the processing and instructions of the information processing unit.
  • The information input unit 30 may also be combined, as necessary, with terminals connecting to other input devices such as motion capture devices, cameras, RFID readers, barcode readers, image scanners, switch panels, OCR devices, card readers, and the sensors described later.
  • The information output unit 40 is configured by devices capable of outputting information, such as an image display device and speaker output; quantized information is stored in and reproduced from the storage unit according to instructions from the information processing unit, or information is output by the processing or instructions of the information processing unit.
  • The information output unit 40 may also include other output devices such as a printer, an arbitrary actuator, a modeling device, or a milling machine, combined with connecting terminals as necessary; for example, a poster may be printed by outputting information based on search results, or a resin product may be fabricated.
  • The information processing unit 10 is configured by an arithmetic circuit based on electronic circuits such as a CPU, and processes information acquired from the information input unit 30 and the storage unit 20. The processed results are stored in the storage unit 20, reproduced, processed, and output to the information output unit 40 or the storage unit 20, or transmitted to and received from other information processing apparatuses via the communication line unit 50 for information exchange and distribution. Further, as shown in FIG. 1, the information processing unit 10 may be configured by program module code realizing the various processes necessary for search, and by dedicated electronic circuits for executing them.
  • The information processing unit 10 is generally composed of combinations of DSPs, reconfigurable processors, FPGAs, ASICs, and the like, and the storage unit 20 is known to be composed of RAM, ROM, flash memory, hard disks, optical disks, removable disks, and the like. [0101] The information processing unit 10 evaluates the degree of coincidence between a search condition, consisting of feature quantities and identifiers, and the index information, and performs the search. It comprises: a co-occurrence information learning unit 104 for learning the co-occurrence information obtained from the feature quantities, search conditions, and search results; a dictionary extraction unit 106 for extracting conversion-target information from the dictionary information storage unit; an index information generation unit 108 that determines identifiers from the extracted feature quantities by recognition processing and performs indexing; an index symbol string synthesis unit 110 that synthesizes index information for content information; a control unit 112 that controls each functional unit; a meta-symbol extraction unit 114 that extracts the necessary index information from content information in the manner of MPEG7, acquires markup-language information such as RSS and XML from the communication line unit, or acquires EPG information based on broadcast waves received from the information input unit, and extracts arbitrary symbol information and attributes; a content information extraction unit 116 that extracts, as feature quantities processable by the information processing device, features from natural information obtained from outside via the information input unit and from the video, images, and voices acquired from the communication line unit and the storage unit; an identifier feature quantity conversion unit 118 for converting identifiers obtained by recognition, or obtained externally through storage media or communication, into standard feature quantities; a feature quantity identifier conversion unit 120 for converting feature quantities obtained from content information and user input into identifiers; and an evaluation list output unit 122 that outputs an evaluation list as a search result, performing search, detection, and indexing in combinations according to need.
  • Content information may include: music based on audio information; meta information attached to the content; documents and program information based on text information, such as EPG and BML; musical scales as score information; general still images and moving images; visual information that may include polygon data and vector data as 3D information, texture data, motion data, and still images and moving images based on visualized numerical data; and content information for advertising purposes, as well as auditory information, text information, and sensor information.
  • A "position" may be a chronological position, coordinate information in the display, the reading position in a text, or the recording order or identification-number order of a chart, and co-occurrence information may be composed from a neighborhood that may be spatio-temporal coordinates based on positions and coordinates calculated from visual and auditory information.
  • The storage unit 20 includes an information recording/accumulation unit 22 for accumulating and recording each piece of information under the control of the information processing unit 10. The information recording/accumulation unit 22 may be configured using, for example, a semiconductor storage device such as RAM or flash memory, or an external hard disk, optical disk, or magnetic disk via an arbitrary interface, and the storage unit may also be configured with a replaceable storage medium.
  • Specifically, the storage unit 20 includes: a content information storage unit 202 that stores the moving images, still images, audio, and documents to be searched; an evaluation function storage unit 204 that stores recognition templates for evaluation functions related to identifiers, such as HMMs, Bayesian discriminant functions, or arbitrary distance functions; an index information storage unit 206 that stores identifiers and arbitrary symbol strings as indexes for searching content information; a feature quantity storage unit 208 that stores the feature quantity information extracted from content information; a program storage unit 210 that stores the program module code and parameters for realizing the various processes necessary for search; a co-occurrence learning storage unit 212 that stores HMMs and evaluation functions, such as recognized identifier recognition templates and identifier recognition templates relearned using the present invention; and a dictionary information storage unit 214 that stores dictionary information consisting of conversion table information for mutually converting an arbitrary identifier or feature quantity and another arbitrary identifier or feature quantity.
  • The target content information is described in more detail in "Content information examples", the feature quantities and identifiers used in "Feature quantity and identifier examples", and the dictionaries used for mutual conversion of identifiers and feature quantities in "Dictionary configuration examples". To use the information processing device 1 as a search device, the following are generally required: a step of inputting content information into the device and indexing it; a step of constructing, based on user input, a query identifier string (query) used for the search; a step of referring to the index based on the query identifier string and narrowing down the search results; and a step of listing the search results. The functions required for these are described in detail in "Basic index processing examples" and "Basic search processing examples", and the procedure for learning the co-occurrence state of shared index information is described in detail in "Co-occurrence state learning processing examples".
• As described in detail in “Procedure examples of information processing devices used in terminals and base stations”, an arbitrary processing unit or storage unit may be divided between a server and a client connected by communication, and by exchanging information between the server and the client, indexing, detection, and arbitrary processing associated with detection may be performed.
• The basic operation (processing procedure) of the indexing means will now be outlined according to the operation flow shown in the figure.
• First, natural information is input, such as video or audio based on content information, text information input by the user, index information related to content information, extracted meta information, extracted text information, program information received from outside, and sensor information.
• The natural information is auditory information, visual information, or sensor information. It is obtained as content information or advertisement information from an external device connected to the information input unit 30, from an external information distribution device via the communication line unit 50, or as content information acquired from an exchangeable external storage medium, and is stored in the content information storage unit 202 or, for advertisement information, in the advertisement information storage unit 216.
• The feature quantity extraction process extracts feature quantities from the input natural information. For example, when speech is input, processing such as an FFT is performed; when an image is input, a feature value is extracted by quantizing colors within the color space. Note that the feature extraction method can take various forms as described below, so the choice may depend on the implementation.
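• The following is a minimal Python sketch of the two feature extractors just mentioned, assuming 16 kHz mono audio frames and 8-bit RGB images; all names are illustrative and implementation-dependent, as noted above:

```python
import numpy as np

def audio_features(frame: np.ndarray) -> np.ndarray:
    """Log power spectrum of one windowed audio frame, obtained via FFT."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    return np.log(spectrum + 1e-10)

def color_features(image: np.ndarray, levels: int = 6) -> np.ndarray:
    """Pixel histogram over a quantized RGB space (6*6*6 = 216 web colors)."""
    q = (image.astype(np.uint32) * levels) // 256    # quantize each channel to 0..5
    bins = q[..., 0] * levels * levels + q[..., 1] * levels + q[..., 2]
    return np.bincount(bins.ravel(), minlength=levels ** 3)
```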
• Next, the feature quantity identifier conversion unit 120 supplies the extracted feature quantity to a plurality of evaluation functions in order to evaluate which specific identifier, among the identifiers in the same field, it corresponds to, and an identifier generation process is performed by a feature quantity identifier conversion process that selects the identifier with the highest similarity (step S0203). The feature quantity identifier conversion process used for the identifier generation process will be described later.
• It is also possible to execute an identifier generation process (step S0203) that directly uses, as an identifier, a character string of meta information attached to the content, or character information that is program information such as BML or EPG, without using an evaluation function; or one that converts a character string into an ID by using a dictionary function consisting of the dictionary information storage unit 214 and the dictionary extraction unit 106 and uses that ID as an identifier.
• Identifiers in the same field are, in the case of phoneme recognition, for example, the vowels, consonants, and silences among phoneme identifiers, which can be classified into identifiers such as “a/i/u/e/o”; about 30 types of phoneme identifiers are generally known for Japanese.
• These identifiers differ depending on the purpose: in order to recognize a plurality of different kinds of information such as phonemes, phoneme pieces, images, faces, musical instruments, environmental sounds, figures, and actions, classification is performed according to the field of recognition by extracting feature quantities.
• Next, the index information generation unit 108 executes indexing processing on the content information in time series to generate an index (step S0204).
• The indexing process may record in association with each other not only the identifiers and feature quantities that can be acquired from audio and video, but also, as described above, character information input by the user, index information and meta information related to the content information, extracted character information, and program information, sensor information, other content information, advertisement information, and the like received from outside.
• The result is recorded in the database (step S0205a), the MPEG file is modified (step S0205b), and the index information is recorded (step S0205c).
• Then the evaluation function process is executed (step S0302).
• The evaluation function process evaluates the likelihood of the input feature quantity using an evaluation function such as a distance function. It is then determined whether all target evaluation functions have been evaluated for the feature quantity (step S0303); if evaluation functions remain, the evaluation function process is executed for the remaining functions (step S0303; No → step S0302).
• When all evaluations with the target evaluation functions are complete (step S0303; Yes), the identifier with the highest likelihood is selected from the evaluation results (step S0304). By executing the symbol identifier output step (step S0305), which outputs the selected identifier, an optimum identifier can be obtained as the evaluation result of a plurality of evaluation functions.
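• As a minimal sketch of this loop (steps S0302–S0305), the following fragment scores one feature vector with every registered evaluation function and outputs the identifier of the best score; diagonal-Gaussian log-likelihoods stand in here for whatever evaluation functions (HMM, Bayes, distance) an implementation actually uses:

```python
import numpy as np

def log_likelihood(x, mean, var):
    """Diagonal-Gaussian log-likelihood used as one stand-in evaluation function."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

def to_identifier(x, templates):
    """templates: {identifier: (mean, var)}. Returns the most likely identifier."""
    scores = {ident: log_likelihood(x, m, v) for ident, (m, v) in templates.items()}
    return max(scores, key=scores.get)   # step S0304: pick the highest likelihood
```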
• Index information can be recorded by combining the above methods and identifiers and storing the occurrence time and disappearance time of each identifier in relation to the time axis of the content information and the scene name, as shown in Fig. 4.
• As shown in Fig. 5, whistle sounds, explosion sounds, and utterance phonemes are indexed according to environmental sound recognition of sounds generated in a scene, together with changes in the image.
• An index for specifying a position in the content information can also be configured using character information input by the user as described above, index information related to the content information, character information extracted from meta information, and program information and sensor information received from outside.
• The feature quantities extracted in the feature quantity extraction process (step S0202) and the identifiers created from those feature quantities in the identifier generation process (step S0203) are acquired for both video and audio, and indexed.
• The identifiers obtained by indexing (step S0204) are shown in the figure.
• In the recording steps (S0205a, S0205b, S0205c), the phoneme symbols and phoneme recognition feature values are recorded in the row of the phoneme identification type item in association with the time axis information of the content information: the phoneme identifier is associated with the phoneme symbol, and the feature value for phoneme recognition is associated with the time axis information of the content information.
• In the recording steps (S0205a, S0205b, S0205c), “index co-occurrence information” is recorded as index information of a positional neighborhood in the content information, based on the plurality of identifiers and feature amounts associated with recognition feature extraction.
• This index co-occurrence information can be generated and recorded for use in the learning described later.
• Index information can be realized by describing these items as a text string; when modifying the MPEG file (step S0205b), the index symbol synthesis unit 110 may combine the index information into the meta information description area extracted from the MPEG file by the meta symbol extraction unit 114.
• The index information need not consist of character string information; it may be a numeric ID with a one-to-one relationship, such as a character ID, or an ASCII code converted from a character string.
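• A minimal sketch of such an index record and its text-string serialization follows, assuming each entry stores an identifier with its occurrence and disappearance times on the content time axis (the field layout is illustrative):

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    identifier: str   # e.g. phoneme "a", emotion "joy", a color ID
    onset: float      # occurrence time in seconds on the content time axis
    offset: float     # disappearance time in seconds

def serialize(entries) -> str:
    """One tab-separated record per line, ready for a meta information area."""
    return "\n".join(f"{e.identifier}\t{e.onset:.2f}\t{e.offset:.2f}" for e in entries)
```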
  • FIG. 6 is a diagram showing a basic processing procedure of index co-occurrence state learning processing.
• “Index co-occurrence information” is index information of positional neighbors configured based on a plurality of identifiers and feature quantities.
• First, an index of auditory information is extracted by a phoneme identifier consisting of phoneme symbols based on the phonemes recorded for each frame by the indexing means (step S0601).
• Next, an index of visual information is extracted by extracting the color identifier from the feature values of the image data of the same frame as the detected phoneme (step S0602).
• An emotion information index is then extracted based on the emotion identifier obtained by emotion recognition in the same frame (step S0603).
• A co-occurrence matrix (FIG. 8) for each frame constituting the co-occurrence information is constructed based on each piece of extracted index information (step S0604).
• In this way, “index co-occurrence information” is obtained as index information of a positional neighborhood composed of a plurality of identifiers and feature quantities.
• Index co-occurrence information may be configured around the boundary values of 14 Hz, 27 Hz, 55 Hz, and 110 Hz at which humans perceive continuity.
• Character co-occurrence information may also be included in the index co-occurrence information.
• Next, learning processing (step S0605) is executed based on the “index co-occurrence information” formed from the identifiers and feature quantities of positional neighbors, using the co-occurrence matrix formed from the index information in step S0604.
• As an example of the neighborhood, the feature values and identifiers used for learning are aggregated (steps S0605a and S0605b): for a moving image of 30 frames per second, aggregation may be performed every 90 frames (3 seconds) or at another predetermined interval; aggregation may continue until a statistical test shows the distance from the past average exceeding a threshold; or the range of information detected by a known detection technique may be held constant. Steps S0605c and S0605d are then executed.
• As a result, the evaluation function is generated and reconstructed, and the generated and reconstructed evaluation function is stored as learning information in the co-occurrence learning storage unit 212 (step S0606).
• In step S0605, the co-occurrence information of the identifiers is first totaled for each frame (step S0605a).
• The time width over which co-occurrence information is counted is a predetermined number of frames or length of time; for example, inter-frame co-occurrence information is generated by adding the co-occurrence information of the identifiers every 90 frames (3 seconds) (step S0605b).
• Next, a covariance matrix is generated from the inter-frame co-occurrence information, and the eigenvalues and eigenvectors of the co-occurrence matrix are calculated from the covariance matrix to generate learning information (step S0605c).
• Finally, a standard template of the evaluation function is generated as the learning result (step S0605d).
  • An evaluation function is constructed by executing these processes.
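• A minimal sketch of steps S0605a–S0605d follows, assuming per-frame identifier count vectors (such as the 250-element phoneme/emotion/color vocabulary described later) at 30 frames per second:

```python
import numpy as np

def learn_cooccurrence(frame_counts: np.ndarray, window: int = 90) -> dict:
    """frame_counts: (num_frames, num_identifiers) identifier counts per frame."""
    # S0605a/S0605b: aggregate the counts over fixed windows (90 frames = 3 s)
    n = (len(frame_counts) // window) * window
    agg = frame_counts[:n].reshape(-1, window, frame_counts.shape[1]).sum(axis=1)
    # S0605c: covariance of the aggregated co-occurrence vectors, then eigenpairs
    cov = np.cov(agg, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric matrix, so eigh suffices
    # S0605d: the mean/covariance pair serves as a standard template
    return {"mean": agg.mean(axis=0), "cov": cov,
            "eigvals": eigvals, "eigvecs": eigvecs}
```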
• The frame width to be aggregated and the time length of one frame can be specified arbitrarily depending on the device configuration; co-occurrence information may be configured around the boundary values of 14 Hz, 27 Hz, 55 Hz, and 110 Hz at which humans perceive continuity, and the aggregated inter-frame information may be used as the “index co-occurrence information”.
• The standard template (function parameters) of the configured evaluation function is stored on the storage medium so that it can be reused (step S0606).
• Specifically, the evaluation function and related data generated in step S0605d are stored in the co-occurrence learning storage unit 212.
• By using the evaluation function configured in this way for the feature quantity identifier conversion in step S0203 of the indexing flow and performing the indexing procedure, the evaluation function based on co-occurrence information can be used for indexing content information.
• The co-occurrence information based on the index information used for this learning will now be described specifically with reference to the figures.
• For the identifiers, there are 30 phonemes (5 vowels, 24 consonants, 1 silence), 4 emotions (joy, anger, sorrow, pleasure), and the 216 Web Colors (also called “web-safe colors” or “browser common colors”), whose identifiers indicate the number of display pixels of each color; combining these yields a 250-element-by-250-element co-occurrence matrix and covariance matrix.
• In this configuration, sensor information based on sensor inputs associated with the content in time series may be used as necessary, so that terms are included in the co-occurrence matrix according to the type of sensor information.
• Items may also be added to the co-occurrence matrix according to index information related to the content or character information in the meta information, and character information may be used to designate the name of a standard pattern of an evaluation function consisting of co-occurrence information when set as a search condition.
• Fig. 8 shows an example of co-occurrence information. The same elements are entered on the horizontal axis and the vertical axis, and the number of appearances relating to image and sound within a frame of the moving image is entered at each intersection.
• The number of occurrences indicates how many times an identifier appears in a frame; it is a count of how many arbitrary phoneme, pixel, and emotion identifiers occur within a short time frame.
• In this example, the matrix contains “0” for the co-occurrence of the emotion “joy” and the vowel “a”, and “6” for the appearance frequency of red as an image identifier together with the emotion “joy”. Since these values are extracted from content information and the number of identifiers recognized per frame is not necessarily constant, the occurrence counts may be normalized for each type of identifier to obtain probability values, and a probability transition matrix between frames may be constructed based on the occurrence probabilities.
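• A minimal sketch of building such a per-frame co-occurrence matrix and normalizing it into probability values, assuming each frame's observations are given as an {identifier: count} mapping over one shared vocabulary (phonemes, emotions, colors):

```python
import numpy as np

def cooccurrence_matrix(frame_obs, vocab):
    """frame_obs: iterable of per-frame {identifier: count} dicts."""
    idx = {ident: i for i, ident in enumerate(vocab)}
    mat = np.zeros((len(vocab), len(vocab)))
    for obs in frame_obs:
        for a, ca in obs.items():
            for b, cb in obs.items():
                mat[idx[a], idx[b]] += ca * cb   # identifiers seen in the same frame
    return mat

def normalize_rows(mat):
    """Optional normalization into probability values, as suggested above."""
    sums = mat.sum(axis=1, keepdims=True)
    return np.divide(mat, sums, out=np.zeros_like(mat), where=sums > 0)
```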
• Fig. 9 shows an example of a covariance matrix of emotion features, phoneme features, and video features.
• The horizontal axis and the vertical axis carry the names of the respective feature quantities, and each entry shows how the feature quantities acquired over several frames of a moving image lasting several seconds scatter around the average over the whole span.
• The emotion features indicate how much each of the four emotions varies; for phonemes and images, the distance evaluation results indicate how far each distance deviates from the average.
• For example, the covariance of the fourth emotion parameter and the first emotion parameter is “0.42”, and the correlation between the first video parameter and the first emotion parameter is “0.32”; since these values are extracted from content information, they are not always constant.
• The present invention is characterized in that, from co-occurrence conditions specified by a person for search, co-occurrence information detected during indexing, and co-occurrence information frequently used by users as search results, it constructs a co-occurrence matrix based on identifiers of different natures, co-occurrence probabilities based on that matrix, a covariance matrix based on features of different natures, and an evaluation function for searching, and performs search and detection with them.
• The standard pattern is extracted by learning feature quantities based on input of natural information whose identifier is specified in advance, and the evaluation function is configured from the extracted standard pattern.
• The standard pattern may also be extracted using identifiers configured by self-organization through multivariate analysis.
• The obtained standard pattern is stored in the evaluation function storage unit 204 as necessary, and the standard pattern dictionary information is stored in the dictionary information storage unit 214 as association information for mutually converting identifiers and standard patterns.
• The standard pattern is used in combination with the evaluation function to identify identifiers. It consists of the mean and variance of a population of feature quantities attributed to a specific identifier, against which sample feature quantities of unspecified identifier are evaluated; the evaluation function evaluates, for example, the Euclidean distance or Mahalanobis distance. It is sometimes called a standard template, standard parameters, or evaluation function parameters.
• The standard pattern is a parameter that can be generated by methods such as multivariate analysis using the input feature values; based on it, any identifier evaluation function such as an HMM, Bayes discriminant function, Mahalanobis distance, or Euclidean distance may be used. It is generally known that the parameters constituting these evaluation functions are configured by mathematical methods such as multivariate analysis, so the extraction method and learning method depend on the implementation.
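• A minimal sketch of one such distance-based evaluation function follows, assuming a standard pattern of population mean and covariance per identifier; the classifier attributes a feature vector to the identifier at the smallest Mahalanobis distance:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def classify(x, patterns):
    """patterns: {identifier: (mean, cov)} learned from labeled populations."""
    return min(patterns, key=lambda k: mahalanobis(x, *patterns[k]))
```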
• Multivariate analysis is performed using the evaluation functions, and classification is performed by self-organization.
• Each classified evaluation function may be manually given a name and an identifier, or a character string contained in the content information that co-occurs with high probability with the evaluation function obtained by multivariate analysis may be given as the function name or identifier, so that the evaluation function can be used for search and detection by the user specifying the evaluation function name.
• The identifier information recorded in association may be the symbol of a phoneme or phoneme piece; a designation or name, identifier, or identifier string given to the population used to construct an identifier evaluation function; or the representative feature average itself. Not only phonemes and phoneme pieces but also the separately described identifiers, features, and combinations of images, sounds, and emotions may be used.
• Character information input by the user, index information and meta information related to the content, character information extracted from meta information, and program information and sensor information received from outside may also be used.
• When, at an arbitrary position of advertisement information indexed by the same method and of the indexed content information, the identifiers and feature quantities of the content information and those of the advertisement information are evaluated as similar by the above-described evaluation functions, a step of associating the advertisement may be executed during indexing; or, only while playback of the content information is paused, an arbitrary advertisement, or the advertisements associated with the evaluation function, may be played back. These evaluation functions may also be reconstructed using the “Example of identifier reconstruction” and the identifier learning with search, detection, and indexing described later.
• Meta information or EPG information recorded in the content can also be used as identifiers: by acquiring the identifiers used for the index as shown in Fig. 7, an evaluation function that evaluates the co-occurrence state can be constructed and searched.
• Processing for acquiring program information such as EPG and BML may be added, and the co-occurrence state may be configured and indexed using the EPG and BML program information acquired during broadcasting.
• The EPG acquired as character information in step S0701 is used directly as an identifier of program information, and the other identifiers and features are obtained in steps S0601 to S0603.
• Using the program information as an identifier, the co-occurrence matrices of Fig. 8 and Fig. 9 are constructed with the feature quantities of the other identifiers that have a co-occurrence relationship within the same program information.
• The name of an evaluation function may also be indexed by associating character information with program information, using the character string of the program field based on the character information and the program information.
• Then, based on the acquired co-occurrence information, the learning process from step S0703 to step S0705, corresponding to step S0605, is performed, and an evaluation function for constructing an identifier can be constructed; if necessary, the content may be indexed again using the acquired function.
• When a search condition such as image capture, speech, or character string input is input (step S1001), query generation processing is executed based on the input search condition (step S1002), and a query is generated.
• If the input is speech, a phoneme sequence is generated by phoneme recognition, or a phoneme-piece sequence by phoneme-piece recognition, of the user's utterance, and a query is generated based on the phoneme sequence or its conversion into a phoneme-piece string sequence; if the input is image capture, a query based on image recognition is generated. In this way a search condition is generated by each recognition method.
• Search conditions are configured using identifiers acquired by multiple recognition methods for input character strings, visual information, and auditory information, and search is performed based on the co-occurrence relationships between the specified identifiers and character strings.
• “Search condition co-occurrence information” is constructed in the same way as the “index co-occurrence information”, and can be used in queries for similarity evaluation against the “index co-occurrence information” of the present invention.
• The input character strings and the identifiers obtained from the respective recognition results are converted into, or associated with, character strings and identifiers by the dictionary function based on the dictionary information storage unit and the dictionary extraction unit.
• “Search condition co-occurrence information” can thus be configured based on the co-occurrence relationship between “information recognized from the search conditions” and “information entered as search conditions”, or between “information recognized from the search conditions” and “information related to the information entered as search conditions”, and used as a query; it can also be used as the “index co-occurrence information” used for the learning of the present invention.
• Character information input by the user, index information related to the content, character information extracted from meta information, program information received from outside, and sensor information may also be used.
• Character strings indicating emotion identifiers and image identifiers, and symbol strings such as phoneme strings and phoneme-piece strings, may be entered by text input, menu selection, or voice input; they can be converted into identifiers, feature quantities, or identifier strings for search, and used to specify a position in the content information.
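• A minimal sketch of converting an input character string into a query identifier string through such a conversion dictionary follows; the dictionary entries below are illustrative assumptions, not part of the stored dictionary information:

```python
# Assumed dictionary mapping designations to phoneme strings.
phoneme_dictionary = {
    "explosion": ["b", "o", "k", "a", "a", "n"],
    "whistle":   ["p", "i", "i"],
}

def query_from_text(text: str):
    """Look up each word; unknown words fall back to the raw string itself."""
    return [p for word in text.lower().split()
              for p in phoneme_dictionary.get(word, [word])]

print(query_from_text("explosion"))   # ['b', 'o', 'k', 'a', 'a', 'n']
```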
• The search is performed repeatedly over the content information to be searched, and a search process that evaluates the match between the index and the query is executed for all the content information (step S1003).
• As the search process executes, the “index co-occurrence information” based on the identifiers or feature amounts of the content information to be searched is compared with the “search condition co-occurrence information”, and search results are obtained.
• The match between “index co-occurrence information” and “search condition co-occurrence information” may be evaluated by DP or a distance function; each piece of co-occurrence information may be evaluated by an evaluation function and the similarity, identity, and degree of matching of the acquired identifiers compared by match and distance evaluation; or, instead of evaluating all identifiers and feature quantities, similarity, identity, and degree of coincidence may be compared by comparing selected feature quantities.
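• A minimal sketch of DP-based match evaluation between a query identifier string and an index identifier string, using plain edit distance (CDP and identifier-specific distance functions would refine this):

```python
def dp_match(query, index_seq):
    """Edit distance between two identifier strings; smaller = better match."""
    m, n = len(query), len(index_seq)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if query[i - 1] == index_seq[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]
```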
• Based on the acquired search results, the degree of matching of the search evaluation results is evaluated and the results are ranked (step S1004). An evaluation result list display process (step S1005) is then executed to create and display an evaluation result list based on the ranked search evaluation results.
• At this time, advertisement information in the storage unit may be displayed to the user, an advertisement obtained through the communication line may be presented, or the advertisement content associated during the earlier indexing may be acquired from the storage unit or the communication line unit and presented to the user.
• As shown in Fig. 13, a step of acquiring content in a time-sharing manner (step S1301), a step of confirming completion of content acquisition (step S1302), a step of indexing while extracting features and generating identifiers from the acquired content (step S1303), and a step of comparing the “search condition co-occurrence information” with the “index co-occurrence information” and detecting matching points (step S1304) are executed; step S1305 branches according to the detection, and an arbitrary process (step S1306) may be executed as described later, such as starting recording, switching channels, notification, e-mail delivery, or changing a robot's operation.
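• A minimal sketch of this time-sharing detection loop (steps S1301–S1306), with all four callbacks assumed to be supplied by the implementation:

```python
def detect_loop(acquire_chunk, build_index, matches_condition, on_detect):
    while True:
        chunk = acquire_chunk()        # S1301: acquire content in a time-sharing way
        if chunk is None:              # S1302: acquisition finished
            break
        index = build_index(chunk)     # S1303: extract features, generate identifiers
        if matches_condition(index):   # S1304: compare with the search condition
            on_detect(chunk)           # S1305/S1306: branch into recording, channel
                                       # switching, notification, e-mail delivery, ...
```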
• For the “index co-occurrence information” applied to the content by the above-described method, a dictionary is consulted based on an input character string given as a search condition, and the string is converted into the feature quantities and identifiers related to character strings for search; the phoneme sequence and phoneme-piece sequence generated from input speech, and other features and identifiers, are used for direct search, or the dictionary is consulted based on the phoneme sequence and phoneme-piece sequence generated from the input speech.
• By comparing the index given by the “index co-occurrence information” for the content information with the search condition given by the “search condition co-occurrence information” input by the user, it is possible to search for and find places where the “search condition co-occurrence information” and the “index co-occurrence information” match, and to specify the position on the time axis, the position on the display screen, or the position in reading aloud within the content.
• As the matching evaluation method for the search evaluation results, methods using an HMM or Bayes discriminant function with probabilities and distances, attribution to clustered populations by multivariate analysis, and symbol string matching methods such as DP and CDP are well known; more details are given in “Examples of methods for evaluating matching between feature quantities and identifier strings”.
• An identifier used in a query generated from an input character string, input speech, or input image is converted into a feature quantity by the identifier feature amount conversion process executed by the identifier feature amount conversion unit 118. This identifier feature quantity conversion process will be described below.
• First, target symbol extraction processing is executed (step S1102).
• The target symbol extraction process selectively extracts, for an input identifier (or identifier string), the feature quantities associated with the identifier using dictionary information, in order to convert the identifier into feature quantities.
• Next, it is determined whether identifier subdivision, which divides phonemes into phoneme pieces as necessary, is required (step S1103). If further subdivision is judged necessary (step S1103; Yes), for example when the identifier is a phoneme, symbol subdivision processing is executed (step S1104), and the target symbol extraction processing is executed again after the subdivision.
• If it is determined that subdivision is not necessary (step S1103; No), the feature values are output based on the selected feature quantities so that the distance between feature values can be evaluated according to the identifier (step S1105).
• Since the identifier feature quantity conversion process described above is executed, an input identifier or identifier string is converted into feature quantities, and a search based on feature quantities becomes possible.
• By adding to the normal search procedure a process of learning the co-occurrence state of the search conditions (step S1202), a process of learning the co-occurrence state of the search results (step S1206), and a process of learning the co-occurrence state of the search results selected by the user (step S1209), it becomes possible to learn co-occurrence information associated with searches according to the user's intent and usage, and an evaluation function for indexing can be configured to suit the user.
• First, search conditions are input by the user through voice input, character string input, or image input (step S1201). Then, based on the input character string as search condition information, or on the feature quantities and identifiers obtained from the phoneme string or phoneme-piece string of an utterance or from an image, co-occurrence information of related features and identifiers extracted from the dictionary information storage unit 214 by the dictionary information extraction unit 106 is acquired (step S1202). An evaluation function is constructed based on the learned co-occurrence information, and the evaluation function is stored (step S1203).
• For example, when search processing is selected via the command dictionary by the keyword “search”, the phoneme string “b/o/k/a/a/a/a/n/n/n/n/n” is set as the search condition phoneme string, and the identifier of an explosion sound evaluation function configured by collecting explosion sound features, together with the image feature “an image in which the area of the warm color system increases in time series”, are set as search conditions, so that a co-occurrence state of multiple identifiers and feature quantities can be configured.
• In addition, by setting the similar explosion onomatopoeia “Dokaan” as “d/o/k/a/a/a/a/n”, the search conditions may be configured so that a search can be performed using related identifiers, feature quantities, or identifier strings derived from the associated imitative sound; these may also be converted into feature quantities, identifiers, or identifier strings by different recognition methods and added as search conditions.
• In this way, “search condition co-occurrence information” similar to the “index co-occurrence information” is constructed based on the phoneme sequence, the identifier of the explosion sound evaluation function, and the image feature, and the search conditions can be learned by configuring an evaluation function according to the procedure of the “co-occurrence state learning process” described above.
• The feature value that the warm color system spreads can be measured by evaluating the time-series increase in the screen area occupied by warm reds and yellows.
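• A minimal sketch of that measurement, assuming 8-bit RGB frames and a crude warm-color test (the thresholds are illustrative):

```python
import numpy as np

def warm_area(frame: np.ndarray) -> int:
    """Number of pixels judged warm (reddish/yellowish) in one RGB frame."""
    r, b = frame[..., 0].astype(int), frame[..., 2].astype(int)
    return int(np.count_nonzero((r > 150) & (b < 100)))

def warm_growth(frames) -> float:
    """Positive when the warm-colored area grows over the frame sequence."""
    areas = [warm_area(f) for f in frames]
    return float(np.polyfit(np.arange(len(areas)), areas, 1)[0])   # fitted slope
```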
• If the character string, phoneme sequence, or phoneme-piece sequence input in step S1201 is stored in the dictionary information storage unit 214, it may be converted into another identifier or feature amount based on the information extracted from the dictionary information storage unit 214 by the dictionary extraction unit 106 and then used as co-occurrence information for learning; or the identifier feature amount conversion unit 118 may be used to convert an identifier into a feature amount for use in a search.
• Next, a search based on the co-occurrence information specified as the search condition described above is executed (step S1204), and search results with a high degree of matching with the search condition are acquired. Then, for example, for target scenes obtained from the content information as search results with a matching rate exceeding 80%, the co-occurrence information based on the feature amounts and identifiers used in the attached index information is acquired from the search results, and step S1206 is executed. The learned co-occurrence information is then stored.
• Next, when a search result is selected by the user (hereinafter, the search result selected by the user is referred to as the “selected search result”) (step S1208; Yes), the co-occurrence information of the search result is learned based on the selected search result (step S1209).
• The evaluation function is reconstructed based on the co-occurrence information learned in step S1209 and stored (step S1210).
• When the user again selects a search result to use from among the search results (step S1211; Yes), the process is executed again from step S1209.
• The content information searched in this way may fall within a range classified as one content genre or category: a single image as content, a photo collection of images, a piece of music, the chorus part of a piece of music, a movie or video work, a scene in a work, or the common image or sound features of works in a specific field. Since search results can be acquired based on the co-occurrence tendencies of specific identifiers and feature quantities in such content information, content scene search and title search by user instruction become possible.
• In step S1212, it is selected whether the user wishes to input a search again, that is, whether to end the process.
• When the operation of inputting a search condition again is performed (step S1212; No), the process transitions to step S1201 and is executed.
• When an operation indicating that no search condition is to be input is performed (step S1212; Yes), the process ends.
• As described above, identifiers and feature quantities determined by evaluation functions based on the co-occurrence information of the configured plurality of feature quantities and plurality of identifiers are recorded in association with the content information, and search and detection are performed. It becomes possible to search for more complex hobbies, preferences, and interests using phonemes, phoneme pieces, and/or emotions, and/or other identifiers, and/or the co-occurrence states of their features, improving the convenience of information retrieval.
• Co-occurrence information is constructed from the phoneme symbol strings, phoneme-piece symbol strings, and various identifier strings acquired as search results and search conditions, as in the indexing described above; match evaluation can then be performed, and a search for content information with high similarity to the search conditions can be executed.
• Combined index information configured in the same manner as the above-described indexing may also be built using the co-occurrence information of the selected color identifiers.
• Identifiers such as emotion, musical scale, musical instrument sound, and environmental sound, and/or features obtained from video, may be used.
• The identifiers may be constructed by analyzing, classifying, and learning, with multivariate analysis techniques, identifiers such as shapes, colors, characters, and actions and the features associated with the identifiers described above and later; new identifiers may be constructed and used according to the implementation, as detailed in “Examples of identifier reconstruction”.
• The identification function information configured in this way can also be exchanged and distributed based on the “Example of information sharing procedure between users” and used to improve convenience; and, as detailed in “Examples of procedures for information processing devices used in terminals and base stations”, a server-client model may divide processing between servers and clients and exchange information between devices, so that equivalent services and infrastructure, search, indexing, and arbitrary processing associated with detection may be provided.
• For example, a temperature sensor may be attached to a surveillance camera or the like to detect ambient temperature changes and image feature changes: when an explosion occurs, the phoneme identifiers in the co-occurrence information described above, the feature of an increase in the number of warm-colored pixels in the screen, and the temperature rise added to the co-occurrence matrix as temperature sensor information are recorded together, so that the co-occurrence information associated with explosions can be learned, indexed, and searched.
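• A minimal sketch of appending such a sensor term to one frame's co-occurrence observations; the identifier name and temperature threshold are assumptions:

```python
def add_sensor_terms(frame_obs: dict, temperature_delta: float) -> dict:
    """frame_obs: {identifier: count} for one frame; adds a sensor identifier."""
    if temperature_delta > 5.0:              # an assumed rise worth recording
        frame_obs["sensor:temp-rise"] = 1    # co-occurs with phonemes, colors, etc.
    return frame_obs
```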
• Input may also come from multiple channels.
• By constructing features and identifiers that exploit stereo images and stereo audio using the input differences between channels, position and movement can be estimated; and the fact that certain events are related even when they differ in time series may be detected by evaluating the co-occurrence relationships of the identifiers and features of different channels over a time-series width of several seconds to several minutes or more.
• The first feature of the present invention is, in the indexing of content information described later, to perform various kinds of indexing, combine phoneme information, phoneme-piece information, and/or emotion information with auditory information, visual information, character information, program information, sensor information, and the like, learn based on the co-occurrence information of the attached indexes, and perform search based on that co-occurrence information. The second feature is that, as in the example of the search process of the present invention, a search by speech input, image input, or character string input is performed on the identifiers and feature quantities used in the present invention, using a dictionary that assigns phoneme strings and phoneme-piece strings based on each designation.
• As for phonemes and phoneme pieces, information indicating their continuous state, that is, how these elements change, may be treated as “continuous phonemes” and “continuous phoneme strings”; a “phoneme string” or “phoneme-piece string” refers to an information string in which these phonemes or phoneme pieces are arranged as symbols or identifiers.
• Various identifiers that can be expressed as “strings” can likewise be subjected to matching evaluation as identifier strings.
• As for each identifier recognition method, feature quantity extraction method, identifier string match evaluation method, information classification method, information learning method, communication transmission procedure, type of storage medium, type of communication medium, configuration of the information processing device, configuration of terminals and distribution base stations, shape of the device, size of the device, installation location of the device, and the sensors used in the device, devices may be combined arbitrarily as necessary and implemented as programs.
• Advertisements and promotions implemented based on the present invention may be combined arbitrarily with conventional inventions: fees may be changed depending on the frequency of access to an advertisement, the frequency of use of the content, and the quality, size, and duration of the advertisement; prizes may be provided through quizzes or questionnaires; and interactive advertising can be realized by statistically processing advertisement results relating to objects detected using the present invention.
• Using the search specification function to search for and specify content according to the search conditions, information matching the conditions distributed in real time can trigger e-mail delivery, switching to a channel that matches the conditions, starting recording or playback, having a robot or agent start an utterance, playing back the recorded content of another channel retroactively to the detection time, or changing device settings; it is also possible to construct a shortcut including a link to a detection result, or to present to the user contents aggregated using the detected information.
• The present invention performs indexing using the other feature quantities and identifiers described later; learns new identifiers and reconstructs identifiers according to the co-occurrence state of the indexes; sets search conditions using co-occurrence states; learns new identifiers and reconstructs existing identifiers based on search conditions specified by the user; learns new identifiers and reconstructs existing identifiers based on the co-occurrence states in the search results acquired according to the search conditions; and performs search and detection based on co-occurrence information combining the new identifiers and the reconstructed existing identifiers, and based on identifiers and feature values constructed by multivariate analysis and learning of a large amount of co-occurrence information.
• For the identifiers and feature quantities used in the present invention, phoneme strings, phoneme-piece strings, and emotion identifiers or their symbol strings based on their designations are used.
• A conversion dictionary and/or a co-occurrence dictionary is configured between input character strings given as search and detection conditions, identifiers associated with recognition of input voice and input images, internal IDs associated with feature quantities, nominal character strings, and the symbol strings of the phoneme strings and phoneme-piece strings used for recognition. After extracting identifiers and feature quantities based on the input speech, input character strings, and input images given as search and detection conditions, the necessary targets are selected using the identifier conversion dictionary, the co-occurrence dictionary, and an evaluation function based on the covariance matrix of the feature quantities; the input character strings given as search and detection conditions, the identifiers associated with recognition of input speech and input images, and the images associated with utterance phoneme sequences and/or utterance phoneme-piece sequences are then used for the search and detection described above.
• Identifiers and feature quantities can be stored and recorded in association with content by recording them together with time information in a dedicated database, by creating an index file as a separate file usable simultaneously with the video and audio information, by inserting them into an MPEG file or other video stream, or by updating the vacant area, comment area, or meta information description area of an MPEG file; they may also be broadcast using program information, or text broadcasting described in a markup language such as EPG or BML, then received by the user and stored on a storage medium by the methods described above, so that the index information according to the present invention can be used.
• Hereinafter, the following are described: “Examples of content information” for the target content information; “Examples of features and identifiers” for the feature quantities and identifiers usable for co-occurrence information; “Example of dictionary configuration” for converting identifiers and feature quantities into phoneme and phoneme-piece symbol strings and between identifiers; “Example of methods for converting natural information into feature quantities” for constructing dictionaries and converting content information into identifiers; “Example of methods for evaluating matching between feature quantities and identifier strings” for evaluating similarity to detect the target range in a search; and “Example of search method based on the present invention”.
• The contents are generally well known to include movies, dramas, photographs, news reports, illustrations, paintings, music, promotional videos, novels, magazines, games, papers, textbooks, dictionaries, books, comics, catalogs, posters, broadcast program information, and the like; in the present invention, public information, map information, product information, sales information, advertisement information, reservation status, viewing status, road status, information such as questionnaires, surveillance camera images, satellite photos, blogs, models, dolls, and robot camera and microphone inputs may also be included.
• Time-series changes in video, time-series changes in speech, text for which time-series changes in the reader's reading position are expected, electronic information in markup-language notation such as HTML, and the search indexes generated from them: for such information too, a plausible reading position may be interpreted as a time axis, and punctuation marks, sentences, and paragraphs may be treated as frames.
• Content information thus consists of natural information including: meta information attached to the content; EPG and BML as document and program information based on text information; musical scales as musical score information; general still images and moving images; visual information that may include polygon data and vector data as 3D information, texture data, motion data, and still images and moving images based on visualized numerical data; content information for advertising and promotional purposes; as well as auditory information, text information, and sensor information.
• The feature values and identifiers used in the present invention are defined mainly over auditory information, visual information, and sensor information as natural information.
• Phonemes, phoneme pieces, and emotions are associated with the auditory information, visual information, and sensor information, and search is performed by evaluating the co-occurrence state of such information.
• Evaluation functions such as HMMs are constructed from them. Then, based on the nominal phoneme sequence, nominal phoneme-piece sequence, character string ID, or numeric ID associated with an evaluation function, identifiers such as environmental sound identifiers, noise identifiers, and mechanical sound identifiers can be configured.
• Usable identifiers include landscape identifiers indicating urban areas, green spaces, coasts, mountains, deserts, weather, facial expressions, and how sunlight falls according to time and season; object identifiers indicating objects such as cars, people, faces, flowers, animals, and plants; image identifiers indicating image features such as brightness, color, and contour; motion identifiers indicating the movement speed of an object, changes in motion, and changes in state associated with behavior; and display position identifiers indicating the appearance positions of image-system identifiers within the image range. Feature values extracted from moving images and still images are classified into populations collected for each designation, and based on the classified populations, evaluation functions such as distance functions by multivariate analysis and HMMs by learning are constructed.
• Scene identifiers, object identifiers, and motion identifiers based on the feature quantities of moving images, and/or identifiers based on single moving images or still images, can thus be configured.
• As emotion information by simple recognition, general emotions such as joy, anger, sorrow, and pleasure based on facial expressions and voice tone can be used; words indicating emotions and mental states described in the literature on psychology may also be detected and recognized and used as identifiers.
• The identifiers and feature quantities are not limited to the frequency of appearance of colors in one frame, as in the previous embodiment, or to phonemes: identifiers and feature quantities spanning multiple frames; identifiers and feature quantities based on the transition information of identifiers and feature quantities across multiple frames; feature quantities and identifiers with coordinate information in the display screen; feature quantities and identifiers with coordinate information in a spatial coordinate system obtained by arithmetically estimating positions using visual and auditory information; feature quantities extracted in association with the time axis; the depth restored from detected feature quantities by arithmetic spatial calculation; or the depth as the coordinate information of 3D image information may all be used.
• Identifiers may be represented using identifiers that associate character string IDs or numeric IDs with evaluation functions based on features of speech, moving images, or still images.
• Usable identifiers, such as arbitrary character strings recognized by the user for speech, video, or still images, may be combined and used as identifier strings, and the evaluation values obtained when identifiers are recognized using the evaluation functions may be shared.
• Arbitrary combinations of evaluation values, such as occurrence probabilities, covariance matrices of feature values, HMM output probabilities, HMM transition probabilities, distance values of distance functions, and DP match evaluation values, can be used to reconstruct HMMs and evaluation functions, and identifier strings can be configured based on the chronological changes of identifiers.
• An identifier may be given to a population learned by self-organization associated with multivariate analysis to perform search, recognition, detection, and indexing; these identifiers may be used as search conditions, or used as feature quantities for evaluating the identifier of a population learned by combining feature quantities associated with the plurality of images, videos, and sounds used for the multivariate analysis.
• Hash values obtained arithmetically from the feature value average, variance, nominal character string, phoneme sequence, or phoneme-piece sequence associated with an arbitrary identifier may also be used for indexing.
• Phoneme durations may be classified into several types such as “long, medium, short”, and symbols or identifiers carrying length information, such as “identifier-long” and “identifier-short”, or carrying position information within the range of one phoneme, such as “phoneme-front” and “phoneme-rear”, may be used; new identifiers may also be constructed by combining those identifiers and symbols as symbol strings or identifier strings.
• The evaluation values output by the evaluation function over the interval in which an identifier continues may be used as the length information for classifying the identifiers described above, or as weight information associated with identifier lengths.
• Text information and character information using character strings can be combined with any document processing method, and can be realized by combining feature extraction methods for character strings as described in the patents and documents related to them; information evaluation methods that use co-occurrence states, such as the “Examples of search and optional processing with multiple identifiers and multiple search conditions” described later, can also be used.
• A symbol co-occurrence frequency or the like may be used as a feature amount, or an identifier associated with sentence analysis or recognition based on that feature amount may be used.
• Environmental sounds may be processed as onomatopoeia and evaluated based on recognition of phoneme sequences and phoneme-piece sequences, so that environmental sound features, environmental sound identifiers, sound effect identifiers, and phoneme and phoneme-piece features may be learned to construct new features or new identifiers as onomatopoeia features or onomatopoeia identifiers; person identifiers based on voice quality or its changes, or the models used for recognition with emotion identifiers, may also be used to improve the recognition rate.
• An identifier specified in an arbitrary protocol and the name of an article related to that identifier may be used in association with each other.
• For example, in the method called General MIDI, an ID and a musical instrument name are associated.
• Co-occurrence information, and the feature quantities and identifiers configured based on the co-occurrence state, concern any identifier, any feature quantity, or the distance information or probability output as a recognition result corresponding to any identifier.
• The information is based on having occurred simultaneously within a specified time range.
• The specified time range need not be a general expression of time: the co-occurrence range may be determined in units that take into account time-series transitions, such as the number of frames (fields) in a time-series moving image, the degree of deviation from the average of neighboring frames, or, for reading aloud, the number of characters, character position, number of words, number of sentences, number of paragraphs, and number of chapters and pages.
• The character information may include text information and program information.
• Solid (3D) information may be generated using 2.5D features and 3D features based on image features obtained from multiple pieces of imaging information; the degree of coincidence may then be evaluated against the generated 3D information, polygon information, and texture information by evaluating distances and the matching of 3D shapes in pseudo-3D and 3D image search, using the coordinate positions from the centroid of the pseudo-3D and 3D information and the eigenvalues and eigenvectors of the coordinate group.
• The scale type is scale information such as “do-re-mi-fa-sol-la-si-do”.
• The tempo, rhythm, chord information, and the like associated with the transition state of scale identifiers along the time axis may also be included.
• Recognition of the instrument type can be realized by learning the acoustic information of instruments collectively; according to the known literature, a recognition rate of over 90% is achieved for single tones.
• Recognition of environmental sound types can use frequency characteristics such as FFT, cepstrum, mel cepstrum, directional patterns, and formant extraction; volume characteristics; changes of those characteristics over time; sound characteristics based on differences in volume, phase, and frequency components of sound recorded at different positions; sound source position based on left-right phase differences and volume differences; and timbre based on frequency distribution characteristics and pitch transitions, for wave sounds, wind sounds, and the like. As with instrument identification, recognition is possible by applying evaluation functions collectively to each body of characteristic information, and the same approach can be applied to machine sound types.
  • examples include engine sounds and the exhaust sound of a steam locomotive, the sound of running on track, wind, animals and insects, birds, waves, trees, horns, screams, cries, weeping, and laughter.
  • identifiers may be based on information such as natural sounds, mechanical sounds, sounds produced by living things, explosion sounds, and so on; acoustic identifiers include scale identifiers, volume identifiers, tone identifiers, chord identifiers, and the like.
  • a sound-position or sound-source-direction identifier distinguishes whether sound is generated above, below, to the left, or to the right; an echo-state identifier distinguishes the size of a room based on the speed of indoor reflected sound.
  • a timbre identifier discriminates, for example, a trumpet from a piano; a machine-sound identifier covers machine sounds, engine sounds, tappet sounds, screw sounds, exhaust sounds, tool sounds, furniture sounds, flight sounds, and noise; a nature-sound identifier covers wind sounds, wave sounds, roaring sounds, and explosion sounds (environmental-sound or sound-effect identifiers); a speech identifier may comprise language identifiers, speech-speed identifiers, exclamation identifiers, cheer identifiers, and hoarseness identifiers. Combining features specific to each of these identifiers is conceivable.
  • image-type identifiers start from feature amounts such as contours based on luminance differentiation, hue differences, color density, or differences between them, including face types that recognize a human face from its shape.
  • motion types such as sign language, gestures, dance, and animal behavior may also be extracted.
  • since the numerical value at the upper right of the screen has a high co-occurrence rate with time-of-day caption images, feature amounts and identifiers can be configured from information indicating the position at which time information is detected, display-position types relating to the orientation of objects within the display range, and image identifiers.
  • examples include luminance identifiers, saturation identifiers, hue identifiers, contour identifiers, motion identifiers, image-position identifiers, speed identifiers, and moving-direction identifiers; object identifiers such as animal, plant, machine, tool, and furniture identifiers; person identifiers, material identifiers, sign identifiers, landscape identifiers, and shape identifiers; face identifiers, facial-expression identifiers, mouth-shape identifiers, clothing identifiers, hairstyle identifiers, skin identifiers, body identifiers, posture identifiers, and waveform-shape identifiers; and character identifiers such as language identifiers, font identifiers, character-size identifiers, and symbol types. Combining feature quantities specific to these identifiers is conceivable.
  • the present invention concerns indexing the co-occurrence states associated with image, voice, and emotion recognition results and using them for learning, search, and search-result presentation, including generating search conditions from the phonemes and phoneme segments corresponding to designated names; the individual recognition technologies themselves are not the subject of the invention.
  • further examples are character and symbol types, which can be distinguished by assigning meanings and sounds to symbols; sign types, whose meanings are discriminated from graphical symbols; shape types, which discriminate elements of the image features mentioned above, such as corners, curves, and contours; graphic-symbol types, which distinguish shapes and element combinations whose meanings are fixed to some extent; and EPG, text broadcasting, BML, and RSS for discriminating the contents of broadcast programs as program information.
  • the program content can likewise be obtained via BML.
  • a temperature sensor, a gas sensor, or a motion sensor may be added to the present invention to provide identifiers accompanying sensor input; the input information from those sensors may be classified by whether it poses a danger to human life, identifiers may be constructed from those classes, co-occurrence information associated with the related images and sounds may be collected, and the results may be used for protective evaluation of human safety by robots and for safety evaluation of the device itself.
  • a heart-rate sensor, brain-wave sensor, muscle-current sensor, and skin-resistance sensor may be combined to constitute a medical psychoanalysis apparatus.
  • location identifiers can be acquired from location information such as GPS and linked to perform searches or to learn co-occurrence states.
  • services and devices based on the co-occurrence state of feature quantities and identifiers may be configured using multi-layer Bayes classifiers, multi-layer HMMs, multi-layer neural networks, and the like for recognition, classification, discrimination, and evaluation.
  • identifiers may be constructed that respond only to specific sounds: a particular noise, particular instruments such as piano or drums, dogs and cats, the mechanical sounds of cars and factories, cheers, scales, and so on.
  • the same feature extraction and recognition processing is performed on video input from a device external to the apparatus of the present invention, identifying persons and facial expressions from faces; identifying articles, characters, figures, symbols, and signs from shapes and colors; and identifying movements from inter-frame differences and changes in sound-source position. These results may be recorded and used for indexing, and in the future indexing may extend to odor, taste, temperature, humidity, weight, hardness, viscosity, density, size, environment, chemical composition, and other physical properties.
  • the information processing section handles natural information obtained from the outside via the information input section, as well as music, documents, and musical-score information based on video, image, sound, and voice information acquired from the communication line section and the storage section.
  • there is a feature amount extraction unit that extracts feature amounts from content information the device can process, such as music, still images and moving images, polygon data and vector data, and numerical data, as well as content obtained from other information processing devices.
  • there is a co-occurrence information learning unit that learns the co-occurrence information obtained from feature quantities, search conditions, and search results.
  • there is an index information generation unit that performs indexing by determining identifiers from the extracted feature amounts through recognition processing; the index is made up of features and index identifiers.
  • there is an index search evaluation unit that evaluates the degree of matching between the search condition and the index information.
  • there is an evaluation list output unit that outputs the result as an evaluation list of search results.
  • there is a feature-value-to-identifier conversion unit that converts feature quantities acquired from content information and user input into identifiers, whether the identifiers are obtained from the outside through a storage medium or communication or extracted internally from the content.
  • there is an identifier-to-feature-quantity conversion unit that converts identifiers into feature quantities, a dictionary extraction unit that extracts conversion information from the dictionary information storage section, and index information for the content such as MPEG7 metadata.
  • there is a meta-symbol extraction unit that, after acquiring information from markup languages such as RSS and XML via the communication line section, or acquiring EPG, BML, RSS, and text-broadcasting information from the received broadcast wave, extracts the instructions, variables, and attributes in arbitrary symbol information; search, detection, and indexing may be performed by combining these units as necessary.
  • identifiers and feature quantities associated with content can be stored and recorded, together with time information, in a dedicated database or in an index file kept as a separate file and used together with the video and audio information.
  • for MPEG files and other video streams, the empty areas, comment areas, and meta-information description areas of the MPEG file may be updated, or the data may be received and saved using markup languages such as EPG, BML, RSS, and teletext as described above.
  • character strings extracted from BML, RSS, and websites related to the content, and the co-occurrence information of those character strings, can be combined arbitrarily to form an identification function or HMM, or to configure an identifier corresponding to the configured HMM or identification function.
  • using the distances, matching degrees, and HMM output probabilities produced as identification results as features, co-occurrence information can be learned as described in "Examples of identifier learning based on search, detection, and indexing" and "Example of identifier reconstruction".
  • learning or identifier construction may be performed in this way, or an arbitrary classification evaluation function may be configured by combining the various feature quantities mentioned above, classifying them by multivariate analysis, and assigning identifiers.
  • dictionary function for mutually converting the identifiers and feature quantities used in the present invention will be described using the dictionary information storage unit 214 of the storage unit 20 and the dictionary extraction unit 106 of the information processing unit 10.
  • these dictionaries can be implemented by general-purpose programs using well-known algorithms and structures such as hash tables, map structures, and databases; the dictionary information used by the dictionary function is stored on a storage medium. Relating groups of information through an index is generally well known and can be implemented by any published method, so the details depend on the implementation.
  • one method is based on a step of inputting an identifier and a step of selecting and outputting another identifier associated with that identifier; another is based on a step of inputting an identifier and a step of selecting and outputting the identification function associated with the input identifier.
  • an identifier is information for classifying information recognized by an evaluation function; an identifier string is information in which identifiers of the same system are arranged in time series; and co-occurrence information is preferably a collection of arbitrary identifiers that stand in a co-occurrence relationship.
  • these dictionaries are indexed by arbitrary keywords and IDs. More specific examples are composed of symbol identifiers, variables, and feature quantities, as in a control dictionary or a Japanese-phoneme to international-phoneme-symbol conversion dictionary; any combination of the above identifiers and features may also be used, as in a Japanese word-to-phoneme-sequence conversion dictionary, a motion-identifier-to-name phoneme-sequence conversion dictionary, or a face-image-identifier-to-name phoneme-sequence conversion dictionary.
  • the character string "Japanese” is converted to a phoneme string "n / i / h / 0 / n / g / o”
  • the action identifier caller string conversion dictionary If it is a face image identifier name phoneme sequence conversion dictionary, it is converted to an identifier indicating “nodding motion”, r u / n / a / z / u / k / uj, and V phoneme sequence symbol. Executed according to the identifier indicating “Taro's face” and conversion to the rt / a / r / o / uj t ⁇ ⁇ phoneme sequence symbol.
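A minimal sketch of such mutual-conversion dictionaries as plain hash maps, using the example entries from the text; the identifier names (MOTION_NOD, FACE_TARO) are hypothetical:

```python
# Conversion dictionaries as hash maps; each maps one kind of identifier
# to its associated phoneme string, as in the examples above.
word_to_phonemes = {"nihongo": "n/i/h/o/n/g/o"}       # word -> phonemes
motion_to_phonemes = {"MOTION_NOD": "u/n/a/z/u/k/u"}  # motion id -> name
face_to_phonemes = {"FACE_TARO": "t/a/r/o/u"}         # face id -> name

# Inverse dictionaries support many-to-one / one-to-many lookups.
phonemes_to_face = {v: k for k, v in face_to_phonemes.items()}

print(word_to_phonemes["nihongo"])    # n/i/h/o/n/g/o
print(phonemes_to_face["t/a/r/o/u"])  # FACE_TARO
```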
  • correlations that are one-to-one, one-to-many, or many-to-one are recorded and stored quantitatively, enabling processing such as identifier conversion; these dictionaries are composed of reference information groups for conversion. For many-to-one dictionaries, the dictionary information may be configured based on co-occurrence information, using eigenvalues and eigenvectors derived from that co-occurrence information.
  • a dictionary of evaluation functions and identifiers may be configured so that features and identifiers can be converted using phoneme strings, phoneme-segment strings, numeric IDs, and character-string IDs.
  • conversely, a dictionary may be configured that converts a phoneme sequence, phoneme-segment sequence, numeric ID, or character-string ID into an evaluation function or an identifier.
  • a dictionary associating identifiers with language-dependent words may be constructed for each of the feature quantities and identifiers described above, so a dictionary combining arbitrary identifiers and feature quantities may be constructed.
  • by associating abstract words, adverbs, adjectives, and unknown nouns with those words and phoneme strings and learning the co-occurrence state of their features, identifiers usable for retrieval can be rebuilt; such identifiers may be used as feature values, or hash values may be computed from the phoneme sequences or phoneme-segment sequences associated with these identifiers by an arithmetic process such as MD5 or CRC.
  • phoneme-string sequences and hash values are stored in association with each other to search the dictionary efficiently by the identifiers and feature values related to phonemes and phoneme segments, or to relate different identifiers, identifiers and feature values, phoneme sequences, phoneme-segment sequences, and identifiers and phoneme strings. A hashing sketch follows.
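A sketch of hashing a phoneme sequence with MD5 or CRC32 so the hash can serve as a compact dictionary key; the associated identifiers are illustrative:

```python
import hashlib
import zlib

def phoneme_hash(phoneme_seq: str) -> str:
    """MD5 hash of a phoneme sequence, usable as a compact dictionary key."""
    return hashlib.md5(phoneme_seq.encode("utf-8")).hexdigest()

def phoneme_crc(phoneme_seq: str) -> int:
    """CRC32 alternative when a short integer key is preferred."""
    return zlib.crc32(phoneme_seq.encode("utf-8"))

key = phoneme_hash("t/a/r/o/u")
index = {key: ["FACE_TARO", "VOICE_TARO"]}  # hash -> associated identifiers
print(index[phoneme_hash("t/a/r/o/u")])     # ['FACE_TARO', 'VOICE_TARO']
```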
  • using the indices associated with the dictionary, a dictionary can be constructed from the correlation between arbitrary features and identifiers extracted from video and audio; by combining dictionaries, identifiers of images, sounds, and emotions can be converted into phoneme sequences, phoneme-segment sequences, and character strings, and new dictionary information can be constructed as a conversion table that evaluates the co-occurrence state of identifiers and features related to images, sounds, and emotions.
  • a phoneme-string dictionary structure or a phoneme-segment-string dictionary structure may be used; because this depends on the combination of extracted features and identifiers, it is implementation-dependent.
  • an arbitrary word character string is selected and converted by the conversion dictionary into an arbitrary identifier or feature amount associated with that word.
  • a search may use an identifier evaluation function associated with an arbitrary phoneme sequence, phoneme-segment sequence, or identifier via the conversion dictionary, or speech may be searched directly using phoneme sequences, phoneme-segment sequences, or emotion identifiers.
  • a conversion dictionary can also be constructed from the co-occurrence information.
  • these dictionaries are not limited to phonemes or phoneme strings; they may be co-occurrence dictionaries configured from the co-occurrence states of the arbitrary identifiers and feature quantities described elsewhere.
  • conversion from an arbitrary name to co-occurrence information, and conversion to a phoneme string or phoneme-string sequence based on words in an arbitrary language associated with that co-occurrence information, may be used at any time.
  • users can have speech synthesized from recognized phoneme sequences and phoneme-segment sequences, search based on phoneme sequences and phoneme-segment sequences, and review the phonetic characters and words associated with them, or the device may ask the user to make the final decision.
  • feature extraction will be described based on the feature extraction program stored in the program storage section 210 of the storage section 20 and the feature quantity extraction unit 116 of the information processing section 10.
  • these feature-quantity extraction functions can be implemented by general-purpose programs using various known algorithms, so they basically depend on the implementation.
  • motion feature quantities are extracted from changes in image shape, moving images, and inter-frame differences, and can be combined with autocorrelation-coefficient extraction, higher-order autocorrelation extraction, and the like as feature quantities for video and images.
  • for audio, FFT features, cepstrum, mel-cepstrum, directional patterns, formant extraction, rhythm extraction, harmonics extraction, autocorrelation-coefficient extraction, higher-order autocorrelation extraction, frequency features, volume features, and the like can be used.
  • the changes of these features over time can also be extracted.
  • multi-order difference features may be used, such as frequency components, frequency distribution, volume, sound-source direction, differences between them, differences of differences, and the averages, variances, and standard deviations of these quantities or the exponent parts of their values; for images, color distribution, luminance distribution, hue distribution, the differential and integral values of color, luminance, and saturation, similarly analyzed RGB values, HSV values, Y/R-Y/B-Y values, and YCM values and their respective frequency-component distributions, as well as multi-order difference features such as differences in color, brightness, and frequency, differences of differences, and the averages, variances, standard deviations, and exponent parts of these values.
  • further examples include feature amounts based on the display position of an object within the image range of a recognized image-related identifier or image-related feature amount; for moving images, the time-axis transition of the image features mentioned above; and 3D image features such as 2.5D features, various 3D features restored from 2.5D features, 3D image coordinate information used in CG, 3D texture information, 3D motion information, 3D color-change information, 3D light-source-change information, and 3D hardness/texture information, together with arbitrary image recognition, 2.5D image feature extraction, and combinations of these feature-extraction methods. Identifiers recognized from these feature quantities, such as time information, weather information, season information, regional information, and cultural information, can also be used. A simple feature-extraction sketch follows.
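A hedged sketch of a few of the simpler image features listed above (mean color, luminance distribution, and a time-axis difference as a motion measure), using NumPy only; the bin count and the BT.601 luminance weights are our choices:

```python
import numpy as np

def frame_features(frame, prev_frame):
    """Simple per-frame image features: mean color, a luminance
    histogram, and an inter-frame difference as a motion measure."""
    rgb = frame.astype(float)
    luminance = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    hist, _ = np.histogram(luminance, bins=8, range=(0, 255))
    motion = np.abs(rgb - prev_frame.astype(float)).mean()
    return {
        "mean_rgb": rgb.reshape(-1, 3).mean(axis=0),
        "lum_hist": hist / hist.sum(),  # normalized luminance distribution
        "motion": float(motion),        # time-axis difference feature
    }

prev = np.zeros((16, 16, 3), dtype=np.uint8)
curr = np.full((16, 16, 3), 64, dtype=np.uint8)
print(frame_features(curr, prev)["motion"])  # 64.0
```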
  • conversion from feature quantities to identifiers will be described using the feature-quantity-to-identifier conversion program stored in the program storage section 210 and the feature-quantity-to-identifier conversion section 120 of the information processing section 10.
  • these conversion functions can be implemented by general-purpose programs using various known algorithms, so they basically depend on the implementation.
  • many methods have been proposed in the past: giving the feature amounts classified under the same identifier to an HMM and learning its transition and output probabilities for use as an evaluation function; obtaining the average, variance, and covariance matrix of the features classified under the same identifier, then finding the eigenvalues and eigenvectors to construct a distance function; using a Bayes discriminant function or the Mahalanobis distance between the centroid of the identifier's information group and the input sample; or simply using the Euclidean distance between the input sample and the average vector of the identifier group. Since these procedures depend on the implementation, any method can be used; a sketch follows.
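A sketch of the Mahalanobis-distance variant under stated assumptions: per-identifier mean and covariance are estimated from samples, and recognition picks the identifier with the smallest distance; the identifier names and data are synthetic:

```python
import numpy as np

class IdentifierEvaluator:
    """Per-identifier Gaussian statistics; classification picks the
    identifier whose Mahalanobis distance to the input is smallest."""

    def __init__(self):
        self.stats = {}  # identifier -> (mean, inverse covariance)

    def fit(self, identifier, samples):
        x = np.asarray(samples, dtype=float)
        mean = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
        self.stats[identifier] = (mean, np.linalg.inv(cov))

    def mahalanobis(self, identifier, v):
        mean, inv_cov = self.stats[identifier]
        d = np.asarray(v, dtype=float) - mean
        return float(np.sqrt(d @ inv_cov @ d))

    def recognize(self, v):
        return min(self.stats, key=lambda ident: self.mahalanobis(ident, v))

rng = np.random.default_rng(0)
ev = IdentifierEvaluator()
ev.fit("phoneme_a", rng.normal(0.0, 1.0, size=(100, 4)))
ev.fit("phoneme_i", rng.normal(3.0, 1.0, size=(100, 4)))
print(ev.recognize([2.9, 3.1, 3.0, 2.8]))  # -> phoneme_i
```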
  • recognition means selecting the correct identifier from among several candidates for an input feature amount, by comparing the evaluation function of each candidate identifier against the input and choosing the identifier with the minimum distance or maximum output probability. For an input feature quantity known in advance to belong to identifier X, similarity is evaluated with the evaluation functions of identifiers X, Y, and Z; if the best-scoring evaluation function is that of X, recognition of the identifier succeeded.
  • phonemes, phoneme segments, emotion identifiers, and other arbitrary identifiers are generally evaluated with an evaluation function that obtains a likelihood using the various distance and probability functions described above. These evaluations are segmented along the content's time axis and display positions and evaluated sequentially per segment to assign an identifier, or the time axis is divided into arbitrary unit times and each frame is evaluated sequentially to assign time-series identifiers; in this way the features used for indexing can be converted into identifiers.
  • one frame of FFT, cepstrum, mel-cepstrum, or directional-pattern data may be a vector of arbitrary dimension; for image features of moving and still images, one frame may be configured with an arbitrary pixel size, and inter-frame and inter-pixel error vectors may likewise be given in arbitrary dimensions. Since the method of obtaining the feature value at this point depends on the implementation, any method may be used.
  • distances that may be used include the Mahalanobis distance and the output of a Bayes discriminant function used as a distance; values derived from probabilities, such as the inverse of a probability, its natural logarithm, or the exponent part of such values; the city-block distance, chessboard distance, octagonal distance, and Minkowski distance; similarities and weighted variants of those distances; and distance calculations using combinations of eigenvalues and eigenvectors, eigenvalue and eigenvector norms, maximum eigencomponents, and the like. A sketch of the Minkowski family follows.
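For reference, the Minkowski distance unifies several of the distances named above: p = 1 gives the city-block distance, p = 2 the Euclidean distance, and the limit p to infinity the chessboard distance:

```python
import numpy as np

def minkowski(u, v, p):
    """Minkowski distance; p=1 is city-block, p=2 Euclidean, and
    p=inf the chessboard (maximum-coordinate) distance."""
    diff = np.abs(np.asarray(u, float) - np.asarray(v, float))
    return float(np.max(diff)) if p == np.inf else float((diff ** p).sum() ** (1 / p))

u, v = [0, 0], [3, 4]
print(minkowski(u, v, 1))       # 7.0  city block
print(minkowski(u, v, 2))       # 5.0  Euclidean
print(minkowski(u, v, np.inf))  # 4.0  chessboard
```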
  • output from a device with AD conversion, such as a microphone or camera sensor, is input.
  • depending on the identifier, features for speech include FFT, cepstrum, mel-cepstrum, and directional patterns; for images, the differential information of luminance and saturation, contour information, and difference information along the time axis.
  • the feature quantity is extracted by the method best suited to the identifier.
  • a feature evaluation step is performed using recognition based on Bayes classifiers, HMMs, and distance functions, and the step of selecting the most probable identifier (nearest in distance) is executed. By outputting the selected symbols and identifiers as recognition results, phoneme and phoneme-segment symbols, emotion identifiers, image identifiers, face IDs, recognized characters, environmental-sound IDs, mechanical-sound identifiers, landscape identifiers, scale identifiers, and the like are obtained and used for indexing.
  • these procedures are executed as a step in which an identifier is evaluated, selected, and output using a plurality of evaluation functions, via an evaluation-function processing step and a step of confirming that all evaluation functions have finished.
  • if the processor can handle analog values, they may be input directly; evaluation calculation and matching may then be performed on the analog values as they are, or digital values may be converted to analog values for the evaluation calculation.
  • identifiers and feature quantities of instrument types measure the distance from a population that collected the sounds of each instrument; likewise for engine sounds, exhaust sounds, door sounds, and so on.
  • identifiers and feature quantities of mechanical-sound types measure the distance from a population that collected the sounds of each mechanical sound, and identifiers and feature quantities of environmental-sound types measure the distance to a population collected for each environmental sound, such as wind, waves, birds, and animals.
  • an index based on image-related identifiers starts from the image type; to discriminate a person in the video, person types based on face type, clothes, and physique, along with gestures and facial expressions, can be used.
  • identifiers may also be used for indexing with sign types, with shape types for cars, ships, desks, telephones, and so on, or with graphic-symbol types such as toilet or emergency-exit pictograms.
  • as an indexing method, recognition can be run for every search, but once index information has been constructed it can be reused any number of times as long as the content does not change. Indexing may therefore be performed at any convenient time, such as when the content is first registered in storage, when it first becomes a search target, or when usage of the device itself is low; after indexing, the content information may be registered so that it can be handled by external devices.
  • this indexing is not limited to indexing recorded content by recognizing at an appropriate unit time (for example, every 16 milliseconds) at recording time; during a live broadcast program, the index information may be distributed in real time, indexing simultaneously with the broadcast.
  • the indexing device executes the audio/video input step (S0201), acquiring content information from outside.
  • the content acquired here is not limited to video and audio as described above; it may be arbitrary content information such as still images, document information, BML, EPG, recognized subtitles, and character strings contained in the video.
  • content information acquired from the information input unit 30, the communication line unit 50, or the storage unit via an exchangeable storage medium is converted into numerical data as feature values by the feature value extraction unit 116, which executes the feature value extraction step S0202.
  • the feature quantities used in this conversion step S0202 are the same as those described in "Examples of converting natural information into feature quantities", "Examples of feature quantities and identifiers", and "Prior art".
  • feature extraction methods have been proposed for still images, audio, and sentences.
  • feature extraction is organized by type: a still-image feature extraction unit and a moving-image feature extraction unit for visual information; an emotion feature extraction unit, phoneme feature extraction unit, and phoneme-segment feature extraction unit for auditory information; and a program information extraction unit for character information.
  • the feature quantity is extracted by the applicable feature extraction method.
  • for speech waveforms, cepstrum analysis yields phonemes or symbol strings of phoneme segments; any known feature-quantity extraction method may be used.
  • in step S0203, an identifier is determined and assigned by the feature-value-to-identifier conversion unit 120.
  • conventional methods, as in "Example of feature value and identifier" and "Conventional technology", can be used.
  • step S0204, which performs indexing by associating an arbitrary feature amount, or an identifier recognized from it, with the time series of the content information, is executed to construct the index information.
  • the configured index information is recorded into the MPEG information by the index symbol string synthesizing unit 110, either as an additional stream or as a change to existing MPEG7 information, or it is stored in the information recording/accumulating unit 22 as a separate file.
  • the user can then search it as needed. A sketch of the flow follows.
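A minimal sketch of the S0201 to S0204 flow; extract_features and to_identifier stand in for the feature value extraction unit 116 and the conversion unit 120, which are not specified here:

```python
def build_index(content_frames, extract_features, to_identifier):
    """Associate each time-series frame with a recognized identifier."""
    index = []
    for t, frame in enumerate(content_frames):   # S0201: input
        features = extract_features(frame)       # S0202: feature extraction
        identifier = to_identifier(features)     # S0203: identifier assignment
        index.append((t, identifier, features))  # S0204: index entry
    return index

# Toy usage: "frames" are numbers, features are parity, identifiers labels.
index = build_index(range(4), lambda f: f % 2,
                    lambda x: "even" if x == 0 else "odd")
print(index)  # [(0, 'even', 0), (1, 'odd', 1), (2, 'even', 0), (3, 'odd', 1)]
```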
  • a symbol string based on several types of feature quantities and identifiers is generated in association with the content and can be configured as "index co-occurrence information"; metadata using this "index co-occurrence information" can then be constructed as content-accompanying information.
  • co-occurrence information refers, for example, to emotional changes associated with audio, acoustic changes associated with video changes, emotional changes associated with video changes, and changes in subtitles, EPG, BML, RSS, and text broadcasting associated with video and audio changes.
  • the content is indexed by phonemes and/or phoneme segments and/or emotion identifiers, and similarly by other identifiers such as scales, environmental sounds, recognized character strings, and image identifiers.
  • the index is constructed from correlated changes in the content and is characterized by searching with it, or by learning the feature quantities extracted from search conditions and search results and constructing new identifiers.
  • in the step of learning the co-occurrence state, the content is indexed while the co-occurrence states of various identifiers and feature quantities are learned and autonomously classified by multivariate analysis, cluster analysis, and the like. Indexing may be performed per cluster, and the user may assign an arbitrary character string or phoneme/phoneme-segment string to each classified cluster for use in search.
  • a step of inputting a symbol string or identifier string requiring conversion is performed by the user or the device. For an ordinary input character string, a target extraction step is executed that extracts a phoneme string, phoneme-segment string, or arbitrary identifier via the conversion dictionary, based on the input word.
  • if necessary, an identifier segmentation process is executed that refines the obtained identifier, for example from phoneme to phoneme segment, or from image to image element.
  • an image element here is a partial element of an image: taking a face image as an example, the face image shows the entire face, while a face image element is a component of the face, such as the eyes, nose, or mouth; it is an element assigned an identifier based on the classification obtained when an arbitrary image tendency is separated into parts.
  • an identifier-average setting step is performed using the sample average value of the corresponding identifier, and a feature value constituted by that average is output. Because a feature value converted this way always represents the centroid of the identifier's population, its distance to the identifier centroid is 0 when given to the identifier's evaluation function, and it is always recognized correctly. [0262] Through this conversion, the distance between a feature quantity converted from an arbitrary identifier X and a feature quantity converted from an arbitrary identifier Y can be evaluated; since distance evaluation between identifiers is realized in a common feature space, a conversion dictionary between identifiers can also be constructed. A sketch follows.
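A sketch of identifier-to-feature conversion via the population mean, as described above; the populations and identifier names are synthetic:

```python
import numpy as np

# Converting an identifier back to a feature value by using the sample
# mean of its population: given to the identifier's own evaluation
# function, this mean has distance 0 to the identifier centroid.
populations = {
    "sea": np.array([[0.8, 0.10], [0.9, 0.20], [0.7, 0.15]]),
    "forest": np.array([[0.1, 0.90], [0.2, 0.80], [0.15, 0.85]]),
}

def identifier_to_feature(identifier):
    return populations[identifier].mean(axis=0)

# Distance between two identifiers via their mean feature vectors.
d = np.linalg.norm(identifier_to_feature("sea") - identifier_to_feature("forest"))
print(identifier_to_feature("sea"), round(d, 3))
```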
  • this applies to discriminators for arbitrary audio not tied to linguistic speech information, such as scales, environmental sounds, noise, laughter, and emotional characteristics obtainable from voice: for a scale identifier, the average feature of each note can be used; for an environmental-sound identifier, the average feature of each wave or wind sound; for an emotion identifier, the average value of the feature type associated with that emotion. These can be used for conversion into feature values.
  • an image identifier is selected through a conversion dictionary using a word such as "maru" (circle) or "batsu" (cross), converted into the phoneme or phoneme-segment sequence associated with the image identifier, and then converted into a speech feature value.
  • for example, one can search for the location where "maru" or "batsu" is uttered, or search for where a circle or cross is displayed using the image feature values associated with the "maru" or "batsu" image identifier; identifiers can thus serve searches with different goals.
  • an identifier sequence can be constructed by determining identifier transition probabilities according to spatial changes (front, back, left, right) of the image features, or by determining the time series of those features.
  • the feature identifier may be configured after converting a motion identifier into an identifier sequence with an optimal spatial and time-series arrangement.
  • a method using a distance function is well known for evaluating feature quantities.
  • feature quantities are composed of vectors.
  • the Euclidean distance between feature quantities is measured: for a first input vector and a second input vector obtained by the same feature extraction method, the cumulative value of the squared differences of each element in the feature vectors is obtained.
  • the distance between vectors can thus be measured by giving two vectors of the same dimensionality, produced by the same feature extraction method, to the distance function.
  • the distance between feature quantities is easily obtained by any known method, but by itself it cannot be used for match/mismatch evaluation of the identifiers related to those features, so the user must set an arbitrary threshold. For example, if the input feature value deviates by more than 3σ from the average and standard deviation of the samples classified into the same population, it is judged a mismatch; if less, a match. This makes it possible to determine whether a feature value matches an identifier, and to evaluate the match and similarity between index co-occurrence information and search-condition co-occurrence information. A sketch follows.
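A sketch of the 3σ match/mismatch rule using the Euclidean distance to the population mean; the exact threshold style (mean spread plus n·σ of the spread) is one possible reading:

```python
import numpy as np

def matches_population(v, samples, n_sigma=3.0):
    """Match/mismatch by the 3-sigma rule described above: the input
    matches the identifier if its distance to the population mean is
    within n_sigma standard deviations of the population's own spread."""
    x = np.asarray(samples, float)
    mean = x.mean(axis=0)
    dists = np.linalg.norm(x - mean, axis=1)  # spread of the samples
    d = np.linalg.norm(np.asarray(v, float) - mean)
    return d <= dists.mean() + n_sigma * dists.std()

samples = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.05, 1.0]]
print(matches_population([1.0, 1.05], samples))  # True  (close to the cluster)
print(matches_population([9.0, 9.0], samples))   # False (clearly deviates)
```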
  • DP matching is well known for evaluating match/mismatch between identifier strings, and it can locate an identifier sequence within a longer identifier sequence. More specifically, "a, a, a, a, b, b, b, b" and "a, a, a, a, a, b, b" match 100% in symbols and order, while "a, a, a, a, b, b, b, b" and "a, a, a, c, c, b, b, b" are estimated to match 75%. For matching evaluation of identifier strings, any matching function such as CDP, Shift-CDP, mp-CDP, RIF-CDP, or Self-applicative CDP can be used as needed.
  • if frames match, the local evaluation is "0"; if they do not, it is "1". If all frames match, the cumulative value is "0" and the degree of mismatch is 0%; if no frames match, the cumulative value equals the number of frames and the degree of mismatch is 100%.
  • because sample frame lengths vary, the difference in length can be corrected by dividing the cumulative distance resulting from DP matching by the sum of the two frame counts. A sketch follows.
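A sketch of DP matching with a 0/1 local cost and length normalization by the sum of both frame counts, as described above (CDP and its variants are not reproduced here):

```python
def dp_mismatch(seq_a, seq_b):
    """DP matching with 0/1 local cost (0 if frame identifiers match,
    1 otherwise); the cumulative distance is divided by the sum of both
    frame counts to correct for length differences (0.0 = perfect match)."""
    n, m = len(seq_a), len(seq_b)
    d = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m] / (n + m)

a = list("aaaabbbb")
b = list("aaaccbbb")      # two frames replaced by 'c'
print(dp_mismatch(a, a))  # 0.0   -> identical sequences
print(dp_mismatch(a, b))  # 0.125 -> two mismatching frames out of 16
```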
  • the identifier whose matching-function result distance is smallest (the smallest cumulative distance, hence the highest match rate) can be output as the recognition result.
  • indexing may group consecutive identical identifiers by detecting where the identifier changes between time-series frames.
  • the number of consecutive frames can be used as a weight in matching evaluation: if the difference between the weights of the same identifier is small, the identifiers are evaluated as matching; alternatively, the distance from the population centroid of the time-series identifier can serve as the feature amount.
  • a matching-score evaluation function can be constructed using the transitions of multiple identifier distances in time series, and identifier information may be reduced, for example, to one frame per 20 seconds or, conversely, refined to a much finer granularity.
  • the distances output from the distance evaluation function, and the mean distance over consecutive intervals, may be utilized in evaluating identifier boundaries.
  • time-series changes of identifiers are evaluated by a matching-degree procedure such as DP or CDP, and with the obtained evaluation values the degree of matching may be displayed on screen, ranked, presented as a list, or announced by speech synthesis.
  • the search device indexes various contents as described above.
  • this indexing may cover real-time distribution information such as TV broadcast programs, indexed at an appropriate unit time (for example, every 16 milliseconds) at recording time; alternatively, only the locations where changes occur may be recorded. The index information can be distributed via EPG, BML, RSS, teletext, and so on, or recorded in association with DVD files; for a text file, index information may be configured per word, sentence, section, or chapter. The indexed information is searched by converting the user input into identifiers matching those used for indexing.
  • the search device executes a speech/character-string input step to specify a search condition for the indexed content.
  • search conditions can be broadly classified into audio, character strings, and moving or still images.
  • in voice search, phonemes, phoneme segments, and emotion identifiers are recognized from the speech the user utters for the search, and a direct search is performed using the phoneme and phoneme-segment strings.
  • an identifier conversion dictionary may be consulted to include other feature quantities and identifiers associated with the phoneme strings and phoneme-string sequences in the search condition, and an instruction dictionary may be applied to the recognized phonemes and phoneme strings.
  • the search can then use the other feature values or identifiers associated with the phoneme sequences or phoneme-segment sequences, excluding any detected commands, and processing may take the user's emotion into account based on the recognized emotion identifiers.
  • search by character string is performed either by using the search string directly or by referring to the identifier conversion dictionary to obtain other feature quantities associated with the string.
  • the search character string may also be converted into a phoneme or phoneme-segment sequence using the identifier conversion dictionary before searching, and processing may take the user's emotion into account based on recognized emotion identifiers.
  • search by moving or still image recognizes the image identifier or motion identifier to be used from video, moving images, or still images captured by the user, and either executes the search directly with that identifier,
  • or includes other feature quantities or identifiers associated with the recognized image or motion identifier in the search condition,
  • or, excluding any detected commands, searches using the other feature quantities or identifiers associated with the image or motion identifier, converting the related identifiers into phoneme strings and phoneme-segment strings using the identifier conversion dictionary.
  • the search may then be performed, with processing optionally taking the user's emotion into account based on recognized emotion identifiers.
  • the common feature of these search-condition construction methods is that information not yet symbolized is first converted into identifiers, then converted via the identifier conversion dictionary into other associated identifiers; if necessary, an identifier is converted into its average feature value so that the search can use feature values.
  • for example, Taro's face image is presented and a voice search based on the recognized name finds scenes where Taro is called by someone; by adding the condition that the voice calling Taro has Hanako's voice quality, one can search for scenes where Hanako calls Taro.
  • the search conditions acquired here are information entered according to user instructions and use not only video and audio but also still images, document information, EPG, BML, RSS, and character broadcasting.
  • feature quantities and identifiers may be configured from them.
  • step S1001 is executed, in which a search condition suitable for the search is input, converting identifiers to feature quantities as in the "Example of a method for converting identifiers to feature quantities" above.
  • in step S1001, identifiers and feature quantities for the user-specified search condition are selected on the same basis as the content-information index, and the query generation step S1002 that configures the search condition is executed.
  • the various identifiers and feature quantities available for search may be combined, even when the search condition is given only as a general character string.
  • search-condition co-occurrence information, a search condition combining different modalities such as the visual and the auditory, is constructed and given to the search device.
  • suppose the device is instructed with the character string "search for sea images"; it contains the content string "sea" and the command string "image search".
  • excluding the command string, the image feature values associated with the string "sea" are used to construct search conditions from the co-occurrence information of color features and motion features, or from the co-occurrence information of color identifiers and motion identifiers, in order to detect "sea".
  • the search condition is configured by the evaluation function so constructed; if the index was built with a "sea" evaluation function, the search condition is configured by converting to the "sea" identifier. "Search-condition co-occurrence information" can be configured in this way. A query-generation sketch follows.
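A sketch of query generation (step S1002) for the "sea" example; the command-dictionary and conversion-dictionary contents are illustrative:

```python
COMMANDS = {"image search", "voice search"}
CONVERSION = {"sea": {"color_id": "blue_dominant", "motion_id": "wave_motion"}}

def build_query(tokens):
    """Split an instruction into command and condition terms, then expand
    each condition term into associated identifiers via the dictionary."""
    commands = [t for t in tokens if t in COMMANDS]
    conditions = {}
    for t in tokens:
        if t not in COMMANDS and t in CONVERSION:
            conditions.update(CONVERSION[t])
    return {"commands": commands, "conditions": conditions}

print(build_query(["sea", "image search"]))
# {'commands': ['image search'],
#  'conditions': {'color_id': 'blue_dominant', 'motion_id': 'wave_motion'}}
```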
  • the search condition may also be given by voice.
  • a command corresponding to the command utterance's phoneme sequence is registered in the command dictionary under "voice search", for when the user gives spoken instructions such as "voice search, explosive utterance".
  • excluding the command phoneme sequence, the phoneme and phoneme-segment sequence of the remaining part is used to detect and search, by conventional methods, utterances in the content with explosive pronunciation; utterances co-occurring with a feeling of sadness, such as "I don't like it!", may likewise be detected and searched for, and for a weekly serial drama,
  • the audio may be compared with the theme song, and if the degree of match is high, the scene may be evaluated as a highlight.
  • "search-condition co-occurrence information" is constructed from combinations of search conditions used at the same time. It can serve as a search condition for evaluating matches and similarities, or such information can be collected from multiple users, and an evaluation function can be constructed from the collected "search-condition co-occurrence information".
  • the index information is read from the information recording/accumulating unit of the storage unit, and the read index information and the given search-condition information are evaluated by DP, distance functions, and the like; based on the saved index information, a search following the "Example of how to evaluate matching between feature quantities and identifier strings" selects content and positions within content.
  • the search step (S1003) is thus executed.
  • for each identifier and feature quantity, frame portions with high similarity to the search condition and index portions with high similarity are detected, favoring locations where similarity is high across multiple identifiers and feature quantities.
  • similarity may be implemented by combining the similarity evaluation methods mentioned above, such as the degree of coincidence by DP, distance evaluation methods, and probability evaluation methods.
  • this evaluation may produce an evaluation list without any ranking; a list simply ranked by the maximum and minimum of the sum of each identifier's evaluation distance and evaluation probability; or, based on a logical expression such as an OR or AND expression, a list ranked by values selected by narrowing down or by values calculated according to the logical expression. An evaluation list based on values calculated from a logical expression corresponds, for example, to the condition "(blue or green) and large amount of motion" for a video, expressed as a function.
  • a distance evaluation function can be constructed, a probability function can be constructed from co-occurrence probabilities, or multiple pieces of co-occurrence information can be combined, enabling a search that evaluates similarity based on co-occurrence information. Since similarity is considered high when the distance value is small, or when the probability value is large, ranking over multiple identifiers and feature quantities can be realized as the evaluation of search results.
  • the blue feature in this example is the appearance frequency, over the entire screen, of pixels whose hue lies within ±15 degrees of blue; the blue feature average can be taken as the average of the blue features over all content in the archive, and the same applies to green and red. Any method can be used, as this depends on the implementation.
  • natural colors differ by season, and this can be taken into account.
  • the motion feature is based on the time-axis difference of the video, or it may be the magnitude of the motion vectors used in MPEG4 and similar codecs.
  • features may be based on image-change information occurring at arbitrary time intervals, such as ±15 frames from the current frame, and those features, which may be averaged over the archive, are optionally normalized and corrected.
  • the composition of evaluation formulas from these feature quantities depends on the implementation, so any combination can be used; one possible sketch follows.
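One possible sketch of the "(blue or green) and large amount of motion" condition: OR is taken as the maximum of the normalized color scores and AND as the minimum with the normalized motion score; the hue convention (degrees, blue at 240) and the normalization by archive averages are assumptions:

```python
import numpy as np

def blue_feature(hsv_frame, center_deg=240.0, width_deg=15.0):
    """Fraction of pixels whose hue lies within +/-15 degrees of blue,
    as in the example above (hue channel assumed in degrees, 0-360)."""
    hue = hsv_frame[..., 0]
    return float(np.mean(np.abs(((hue - center_deg + 180) % 360) - 180) <= width_deg))

def motion_feature(frame, prev_frame):
    """Mean absolute inter-frame difference as a simple motion measure."""
    return float(np.mean(np.abs(frame.astype(float) - prev_frame.astype(float))))

def rank_score(blue, green, motion, blue_avg, green_avg, motion_avg):
    """'(blue or green) and large motion': OR as the max of the normalized
    color scores, AND as the min with the normalized motion score."""
    color = max(blue / blue_avg, green / green_avg)
    return min(color, motion / motion_avg)

frame = np.zeros((8, 8, 3)); frame[..., 0] = 240  # all-blue HSV frame
print(blue_feature(frame))                         # 1.0
print(rank_score(1.0, 0.1, 0.8, blue_avg=0.3, green_avg=0.3, motion_avg=0.4))
# 2.0 -> both the color and the motion condition exceed their averages
```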
  • instead of color features alone, image recognition and speech recognition technology can be combined to configure an arbitrary evaluation function from the obtained face IDs, motion IDs, image IDs, and phoneme or phoneme-segment identifiers.
  • distances between identifiers can be evaluated using DP and the like as described above, and distances between features with an arbitrary distance function; similarity evaluation can likewise apply HMMs and distance functions, as detailed in the descriptions of identifiers, features, and their mutual conversion above. Performance can of course also be improved through efficient classification combining evaluation methods such as multi-layer Bayes classifiers and neural networks.
  • the search results obtained here are listed in descending order of similarity and presented to the user, with the similarity value displayed as a ranking index.
  • the browsing step (S1005) is executed: the search result list is output to the output unit and displayed on screen, or sent to the user terminal via the communication line unit and presented to the user.
  • a user processing-continuation confirmation step (S1006) then evaluates whether the user has requested another search.
  • identifiers may be learned using co-occurrence information from search results and auxiliary information associated with EPG, RSS, HTML, XML, BML, teletext, and so on; a service can be realized that takes an arbitrary configuration and executes searches by selectively using arbitrary identifiers or feature amounts.
  • the character string for the search can be acquired from the broadcast receiving unit, from the communication line unit connected to the Internet, or from recorded information in the storage unit, by any means such as XML, HTML, MPEG7, RSS, teletext, BML, or EPG; the search can then be performed by converting the feature quantities serving as the search index into identifier strings based on those character strings. This too may be realized as a service that executes searches by selectively using feature quantities: search conditions can be generated from the search string.
  • search by character string is implemented by selecting and using the identifiers and identifier features associated with an arbitrary character string, via the character-string-to-identifier conversion dictionaries and identifier-to-feature conversion dictionaries belonging to each feature extraction method.
  • for example, a content search can be performed by converting a performer's name into a phoneme or phoneme-segment sequence; or, from the word "action movie", the appearance frequency of explosion sounds in content classified as action movies can be determined, the average explosion-sound frequency over multiple action movies computed, an action-movie evaluation function configured, and content indexed and searched with that function. A sketch follows.
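A sketch of the action-movie example: the average explosion-sound appearance frequency over known action movies defines a simple evaluation function; all numbers are invented:

```python
import numpy as np

# Explosion-sound appearance rates (per minute) measured over several
# known action movies; their mean and spread define the evaluation function.
action_movie_rates = [0.8, 1.2, 0.9]
mean_rate = float(np.mean(action_movie_rates))
std_rate = float(np.std(action_movie_rates))

def action_movie_score(explosion_rate: float) -> float:
    """Higher when the content's explosion-sound frequency is close to
    the average learned from the action-movie population."""
    return float(np.exp(-((explosion_rate - mean_rate) ** 2)
                        / (2 * std_rate ** 2 + 1e-9)))

print(round(action_movie_score(1.0), 3))   # close to 1: action-movie-like
print(round(action_movie_score(0.05), 3))  # near 0: unlikely to be one
```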
  • a co-occurrence state of arbitrary feature amounts or identifiers can be derived from the search results.
  • this co-occurrence state can be configured using co-occurrence probabilities, co-occurrence matrices, and covariance matrices; for example, the co-occurrence information within the top 10 results with a matching rate of 70% or higher under certain conditions can be selected and used for learning. If co-occurrence information configured this way is viewed by the user many times, or used from outside many times via the information-sharing method described later, it is judged to be highly useful.
  • an evaluation function based on the co-occurrence state can then be constructed, and the co-occurrence learning storage unit and the evaluation function storage unit record the new identifier and feature co-occurrence information and evaluation functions.
  • the identification function can also be reconfigured using the bias toward "sea" feature quantities, as in the "Example of identifier reconstruction", and reflected in the learned co-occurrence information.
  • if an image has the horizon in the center, the lower half of the image will show more blue with wave motion, so an evaluation function can be constructed from the image features for "sea" and "coast".
  • a new function can be configured to remove excluded results from the previous search results, or to exploit cases where the co-occurrence probability is low for a given target identifier or condition yet high for another identifier.
  • a user interface for evaluating the search results may be provided so that users can improve the performance.
  • search by character string can be combined with the content title, and content attributes such as genre and director can be used.
  • search efficiency may be improved by combining such attributes, or an arbitrary name may be given to the co-occurrence state derived from the search conditions, identifiers, and features so that it can be reused for repeated searches, detection, and instructions.
  • the search conditions and search expressions can be exchanged and distributed via a communication line.
  • any markup-language information such as HTML, XML, RSS, and BML can be converted into the identifiers described above, such as phonemes, environmental-sound identifiers, and image identifiers, using the phoneme-symbol conversion dictionary in the dictionary information storage unit of the storage unit; arbitrary processing associated with search and detection can be performed, or the usage status can be recorded and identifiers re-learned using the co-occurrence information of frequently used search conditions according to the recorded results.
  • the identification functions configured as described above, the search results, and the co-occurrence state information in the search results can be made browsable and obtainable by other devices via the communication line, as in the "Examples of information sharing procedures between users", or on any site, and reused using technology such as P2P software.
  • any user may use the information with billing applied, or it may be sold on a storage medium.
  • the usage fee may vary with the accuracy and detail of the information, the speed of processing, the number of uses, the usage time, and so on; search results obtained using the present invention may themselves be charged for, priced differently, or encrypted to protect the value of that information.
  • co-occurrence state information, evaluation functions, and evaluation parameters are stored in the storage unit of the device, or acquired externally via the communication line unit as necessary; meta-information generated using the evaluation functions and identifiers may be presented to other users or sold.
  • the user inputs a detection condition that triggers an arbitrary process, in the same manner as a search condition.
  • the input may be audio, video information, a character string, an identifier obtained by the present invention, or a combination thereof.
  • the present invention executes the steps of configuring the co-occurrence state from the combination of feature quantities and identifiers and setting the detection condition, in the same procedure as when searching with the detection condition.
  • detection is triggered, for example, on the condition that the distance from the centroid of a specific identifier, identifier string, or feature quantity is within 1σ, that the probability of a specific identifier, identifier string, or feature quantity is 60% or more, or that the matching degree between identifier strings exceeds 60%. The 60% value reflects the fact that in phoneme recognition, emotion recognition, and image recognition, results of roughly 60% or more are generally considered practical; it may be changed to any rate depending on the user environment. If the recognition rate stays below 20% continuously, the current process may be stopped, or a flag may be set marking the content for fast-forward or deletion. A sketch follows.
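A sketch of the detection trigger with the 60% and 20% thresholds named above; the callback and flag are illustrative:

```python
def check_detection(match_rate: float, on_detect, flags: dict):
    """Fire the registered process above 60%; mark content for
    fast-forward or deletion when the rate stays under 20%."""
    if match_rate >= 0.60:
        on_detect()                     # execute the registered process
    elif match_rate < 0.20:
        flags["skip_or_delete"] = True  # candidate for fast-forward/delete

flags = {}
check_detection(0.72, lambda: print("detected: explosive utterance"), flags)
check_detection(0.12, lambda: None, flags)
print(flags)  # {'skip_or_delete': True}
```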
  • the information acquired from the broadcasting station, the network, or the imaging device is recognized, and it is detected whether the content information is what the user intended.
  • the input search condition obtains the cast name associated with the information the user entered by referring to EPG, BML, RSS, and teletext, and a phoneme and phoneme-segment search is executed by the method described above.
  • While recording constantly, recording may be kept retroactively for one hour from the location where the cast name is spoken; alternatively, the range to be deleted may be determined for each change detected via EPG, BML, RSS, or text broadcasting for each program, each CM period, or each change in screen features, or a boundary may be set in the content at each such change and used as an indicator the user can point to. The specified range is thus composed of multiple detection locations within the content, and it is classified and stored as either a storage target or a deletion target.
  • New identifiers can be learned from the co-occurrence states, namely "co-occurrence information based on search results", "co-occurrence information extracted by indexing", and "co-occurrence information based on user-specified detection conditions and/or search conditions", by defining identifier strings through probability evaluation functions based on co-occurrence probabilities, distance evaluation functions based on eigenvalues, classification using HMMs, and classification based on multivariate analysis together with construction of evaluation functions. To that end, the collecting step and the step of constructing the co-occurrence probabilities, the co-occurrence matrix, and the covariance matrix are executed (a minimal construction sketch follows below).
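  • The following sketch shows one plausible construction of the co-occurrence matrix and the feature covariance matrix with its eigenvalues; the data shapes and helper names are assumptions for illustration only:

```python
import numpy as np
from collections import Counter
from itertools import combinations

def cooccurrence_counts(frames):
    """frames: list of sets of identifier labels observed in the same frame."""
    counts = Counter()
    for ids in frames:
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return counts

def cooccurrence_probability(counts, pair, total_frames):
    """Fraction of frames in which the identifier pair co-occurred."""
    return counts.get(pair, 0) / total_frames

def covariance_and_eigen(feature_vectors):
    """Covariance over per-frame feature vectors; its eigenvalues and
    eigenvectors can parameterize a distance evaluation function."""
    X = np.asarray(feature_vectors)          # shape: (frames, dims)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return cov, eigvals, eigvecs

# e.g. frames = [{"anger", "phoneme:k/o/r/a"}, {"joy", "env:applause"}]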
  • The unit frame can be specified arbitrarily, according to the implementation and the user's definition. If fine granularity is required it can be set to one frame of video, such as 16 ms; conversely, a coarser time unit such as 3 seconds (180 frames) may also be used.
  • Frames in which statistically distant features are detected are also acquired in this step.
  • Co-occurrence information is constructed from the information thus obtained; learning is performed using an HMM or a covariance matrix, or a distance function is constructed, and the result is saved to the storage unit.
  • For the information presented as a search result, or the information specified as a search condition or detection condition, a step is performed that collects the co-occurrence information of the identifiers and feature quantities of the content selected by the user as a sample. The co-occurrence information of identifiers and feature quantities is then acquired from the collected samples.
  • There are various combinations of co-occurrence information, as described separately and in "Examples of indexing and searching with multiple identifiers and multiple search conditions, optional processing" described later.
  • A learning sample can be obtained by collecting the specified search conditions and detection conditions as samples, and an evaluation function can be constructed from those learning samples.
  • As a range-specification method for the co-occurrence matrix, any method can be used: one work or one program, a range in which arbitrary identifiers co-occur, or a range specified by segments based on the appearance of a specific identifier. Image features and speech features may be categorized by multivariate analysis and the appearance times of those features evaluated; an evaluation function may be created by constructing a co-occurrence matrix, co-occurrence probabilities, and a covariance matrix from the classified information; and the occurrence frequency of the identifiers obtained as an evaluation result, such as the appearance histogram of those identifiers per unit time, may be used to construct and evaluate scene features and evaluation functions.
  • For a search result extracted under a given search condition, identifiers and feature quantities other than the search condition that show a high co-occurrence probability (for example, 70% or more) or a short distance (for example, within 3σ of the distance average) can be taken as new targets for the learning method used in the co-occurrence information composition step; conversely, those with a low probability of belonging (for example, farther than 3σ) can be excluded, with the same learning method applied in either case.
  • The feature quantities used for identifier reconstruction are configured arbitrarily from values such as the output value of the evaluation function, the output probability of the HMM, and the similarity between identifier strings. A combination of vector co-occurrence states may be used as a covariance matrix, or a co-occurrence matrix of identifiers may be constructed.
  • A character string is assigned to the evaluation function, the function is stored in the storage unit, and a learning result is thereby obtained.
  • The character strings assigned to identifiers and features can be used as tag names in markup languages such as XML, or the assigned character strings themselves can be converted into identifier symbol strings such as phonemes and phoneme segments so as to support user voice input; an evaluation function associated with facial-expression identifiers, shape identifiers, motion identifiers, and the like can likewise be configured so as to respond to user video input.
  • If the distance evaluation result between the search condition and the center of gravity of the co-occurrence information is within 3σ, or the probability evaluation result is 80% or more, the co-occurrence information of the index in the target range of the selected content is treated as a co-occurrence matrix or co-occurrence probability, and a new evaluation function is constructed from the identifiers and features used in that index. The evaluation function may be a Bayes discriminant function, a Mahalanobis distance function, or the like (see the sketch below).
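  • A minimal sketch of a Mahalanobis-style distance evaluation function learned from co-occurring feature samples; the class shape and the use of a pseudo-inverse are assumptions, not the patent's specified implementation:

```python
import numpy as np

class MahalanobisEvaluator:
    """Distance evaluation function fitted to co-occurring feature samples."""
    def __init__(self, samples):
        X = np.asarray(samples)             # shape: (n_samples, n_dims)
        self.mean = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        self.inv_cov = np.linalg.pinv(cov)  # pseudo-inverse for numerical stability

    def distance(self, x):
        d = np.asarray(x) - self.mean
        return float(np.sqrt(d @ self.inv_cov @ d))
```

A candidate whose distance from the centroid falls within the 3σ criterion above (or whose probability under a Bayes discriminant exceeds 80%) would then be treated as a match.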
  • The features of the present invention, built on various identifier recognition and feature extraction methods of the conventional art, are: indexing based on co-occurrence information of emotion identifiers, phonemes, and phoneme segments with other sound identifiers and image identifiers, free of fixed frame-width and time-width specification; the range selection method; the identifier string matching method; search and detection using that indexing; recording and playback processing started by detection; learning of co-occurrence information during indexing; use of search results; and an identifier conversion dictionary that allows the new identifiers and new features obtained above to be specified as search conditions using phoneme strings and phoneme-segment string sequences.
  • The co-occurrence information obtained by associating identifiers and feature quantities may be used to construct a discriminant function, or the co-occurrence probabilities of identifiers may be combined in the following configurations: learning with feature quantities, learning with a feature-quantity covariance matrix, learning with identifier co-occurrence probabilities together with a feature-quantity covariance matrix, learning with the output of a distance function as a feature quantity, learning that uses the output probability of an HMM evaluating an identifier as a feature quantity, and learning that uses the transition probability of an HMM evaluating an identifier as a feature quantity. Such combinations may be given as HMM learning parameters, or a covariance matrix may be formed to derive eigenvalues and eigenvectors.
  • Co-occurrence information can be constructed by acquiring the emotion identifiers that occur around an utterance part; by learning the co-occurrence state of the anger emotion and the phoneme string [k/o/r/a], together with the feature quantities in that co-occurrence state, a new identifier such as "angry [k/o/r/a]" can be constructed.
  • The information used for learning by reconstruction may employ the DP matching rate, the ratio of emotion features and emotion identifiers, or the likelihood, probability, or distance based on the evaluation function of the phoneme string or phoneme-segment string and the evaluation function of the emotion identifier. In this case it is also possible, for example, to combine associations using feature-quantity extraction methods such as video features, image features, moving-image features, still-image features, scale features, and environmental-sound features, constructing facial-expression identifiers with emotions that include facial features.
  • The information range targeted for re-learning an identifier may be configured based on a specified boundary condition, such as an arbitrary user-specified time width.
  • Examples of identifier associations include: program information with display position, emotion, phonemes and phoneme segments, landscape images, text, environmental sound, scale and tempo, chords and chord progressions, facial-expression images, object images, and motion information; display position with emotion, phonemes and phoneme segments, landscape images, text, environmental sound, scale and tempo, chords and chord progressions, facial-expression images, object images, and motion information; emotion with phonemes and phoneme segments, landscape images, text, environmental sound, scale and tempo, chords and chord progressions, facial-expression images, object images, and motion information; phonemes and phoneme segments with landscape images, text, environmental sound, scale and tempo, chords and chord progressions, facial-expression images, object images, and motion information; landscape images with text, environmental sound, scale and tempo, chords and chord progressions, facial-expression images, object images, and motion information; and text with environmental sound, scale and tempo, chords and chord progressions, and so on.
  • When the feature quantities and identifiers input as search conditions show high similarity (for example, 80%) to the content obtained as a search result, the other identifiers and features associated with that same content but not specified in the search conditions are recorded in the co-occurrence information storage unit together with the specified search conditions. When the accumulation of information related through such co-occurrence states exceeds a certain value (for example, 1000 samples, or n times the number of evaluation dimensions), a co-occurrence matrix based on the co-occurrence information can be constructed, the covariance matrix and co-occurrence probabilities obtained, and learning performed with a distance evaluation function or HMM to reconstruct the evaluation function (see the sketch below).
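  • A sketch of this accumulate-then-relearn trigger; the threshold rule (1000, or a multiple of the evaluation dimensionality) follows the text, while the store layout and factor value are illustrative assumptions:

```python
class CooccurrenceStore:
    """Accumulates co-occurrence records until relearning is warranted."""
    def __init__(self, n_dims: int, factor: int = 10):
        self.records = []
        self.threshold = max(1000, factor * n_dims)

    def add(self, search_conditions, other_identifiers, features) -> bool:
        self.records.append((search_conditions, other_identifiers, features))
        return len(self.records) >= self.threshold  # True -> relearn now

store = CooccurrenceStore(n_dims=64)
if store.add(("piano",), ("joy",), [0.2, 0.7]):
    pass  # rebuild the covariance matrix / HMM / distance function here
```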
  • Information with large variance and information with low probability can be excluded, improving calculation efficiency by reducing the number of evaluation dimensions; for fixed phrases such as command-control words or specific words, the result can be used directly as a command.
  • Rather than a phoneme string or phoneme-segment string expanded from a character string, the recognized phoneme string or phoneme-segment string may be used in accordance with the user's affirmative or negative response to the device's recognition, and the evaluation function template for identification may be updated accordingly.
  • If, within a few seconds before or after the phoneme or phoneme-segment string recognized as "Waichi", an explosion sound is recognized as an environmental sound above a certain detection rate, the phoneme or phoneme segment of "Waichi" becomes a learning target as co-occurrence information; by reconstructing the evaluation function, a search for "explosive sound" will then also cover the phoneme string "Waichi".
  • Motion feature amounts are likewise used as learning targets for co-occurrence information. The co-occurrence state of "Waichi" phonemes and phoneme segments, "radial" motion features, and "explosive sound" environmental-sound identifiers and sound-effect identifiers constitutes an identification function for searching for explosion scenes (a toy detector is sketched below).
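  • As a toy illustration of such an identification function over co-occurring cues (the identifier labels, the cue list, and the scoring rule are invented for the example):

```python
def explosion_score(window_scores: dict, min_score: float = 0.6) -> float:
    """Fraction of the explosion-scene cues observed in one evaluation window.

    window_scores maps identifier labels to recognition confidences, e.g.
    {"env:explosive_sound": 0.9, "motion:radial": 0.7, "phoneme:w/a": 0.4}.
    """
    cues = ("env:explosive_sound", "motion:radial", "phoneme:w/a")
    hits = [c for c in cues if window_scores.get(c, 0.0) >= min_score]
    return len(hits) / len(cues)

# e.g. flag the window as an explosion scene when explosion_score(...) >= 2/3
```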
  • Using the co-occurrence matrix of identifiers as co-occurrence information, it is possible to determine co-occurrence probabilities from its configuration, or to determine eigenvalues and eigenvectors from the covariance matrix of feature vectors and construct a Bayes discriminant function or Mahalanobis distance. By then constructing a likelihood evaluation function for the content information to be searched, the presence or absence of a phoneme string such as "Waichi" can be evaluated when searching for a "sad scene".
  • An input search-condition character string containing emoticons such as "(1)" or "(;;)" can be treated as an emotion identification string such as "joy" or "sadness": it can be converted into emotion feature quantities and emotion identifiers via the character-string identifier conversion dictionary and used for the search.
  • For the likelihood evaluation function configured as described above, the parameters and templates of the evaluation function are stored in the co-occurrence learning storage unit and the evaluation function storage unit, and the relationships between specified character strings or words and the phoneme strings and phoneme-segment strings based on their utterance are registered in the dictionary unit. Search conditions may also be made available via communication lines so that the utility value of search conditions can be evaluated. The evaluation of utility value may use third-party usage frequency as a learning sample; "output probabilities of various identifiers" and/or "co-occurrence probabilities of various identifiers" and/or "transition probabilities of various identifiers" and/or "various feature quantities" may be combined into one set of feature quantities, evaluated via the covariance matrix to derive eigenvalues and eigenvectors, used to construct evaluation functions, or given to an HMM as features for training.
  • Identifiers and feature quantities whose divergence from the distance average, computed from identifiers and feature quantities with high co-occurrence probability arising during indexing and in frequently used search results, exceeds 3σ, or whose co-occurrence probability and/or appearance probability is particularly high relative to the average, may be used as distinguishing flags.
  • The dictionaries may be switched according to the co-occurrence state: the phoneme and phoneme-segment recognition dictionaries may be switched according to emotion recognition or according to changes in the recognized environmental sound, the image recognition dictionary for display objects may be switched using the recognized landscape image, or the phoneme and phoneme-segment recognition dictionaries may be switched according to the recognized image. Information based on such co-occurrence relationships can be treated as sensitivity (kansei) information and used to search content information.
  • The devices and terminals are configured as shown in FIG. 20: a user terminal, a distribution base station, a device such as a robot controlled by the terminal and the base station, and a remote controller to be controlled. A user of either a terminal or a base station speaks to the terminal, and the terminal or base station executes one of the following processing procedures for recognition.
  • In the first method, feature values are extracted from the speech of the utterance or from captured video images, and the feature values are transmitted to the target relay point or base station apparatus. The base station apparatus receiving the feature values generates a phoneme symbol string and/or phoneme-segment symbol string, an emotion symbol string, and other image identifiers from them, and then selects and executes the matching control means based on the generated symbol string.
  • The second method performs feature extraction from the speech of the utterance and the captured video image, generates in the terminal the identifiers that accompany recognition, such as the phoneme symbol string and/or phoneme-segment symbol string, emotion symbol string, and other image identifiers, and transmits the generated symbol string to the target relay point or base station apparatus. The controlled base station apparatus selects and executes the matching control means based on the received symbol string.
  • The third method performs feature extraction from the speech of the utterance and the captured video image, recognizes the phoneme string and/or phoneme-segment symbol string, emotion symbol string, and other image identifiers from the features generated in the terminal, selects the control content based on the recognized symbol strings, and transmits it to the base station apparatus or information relay apparatus that performs the control.
  • The fourth method transmits the speech waveform of the utterance or the captured video image as-is from the terminal to the controlling base station apparatus; the controlling apparatus recognizes the phoneme symbol string and/or phoneme-segment symbol string, emotion symbol string, and other image identifiers, selects a control means based on the recognized symbol string, and the selected control is executed at the relay point or base station apparatus. Similarly, emotion identifiers can be extracted from voice features and symbols, as can sound and video features and identifiers such as environmental sounds.
  • In short, the terminal may simply transmit only the waveform, transmit the feature amounts, transmit the recognized identifier string, or transmit the processing procedure such as the command or message associated with the identifier string (the four levels are sketched below).
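  • A small protocol sketch of the four terminal-to-base-station exchange levels just described; the enum and message shape are assumptions for illustration:

```python
from enum import Enum, auto

class Level(Enum):
    WAVEFORM = auto()      # fourth method: send raw audio/video as-is
    FEATURES = auto()      # first method: extract features, recognize remotely
    IDENTIFIERS = auto()   # second method: recognize locally, send symbol strings
    COMMAND = auto()       # third method: select the control content locally

def build_message(level: Level, waveform=None, features=None,
                  identifiers=None, command=None) -> dict:
    """Package exactly the payload that corresponds to the chosen level."""
    payload = {Level.WAVEFORM: waveform, Level.FEATURES: features,
               Level.IDENTIFIERS: identifiers, Level.COMMAND: command}[level]
    return {"level": level.name, "payload": payload}

# e.g. build_message(Level.IDENTIFIERS, identifiers=["k", "o", "r", "a"])
```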
  • The configuration shown in Fig. 21 is the transmission side and the configuration shown in Fig. 22 is the reception side; the two can also transmit and receive between each other.
  • An instruction dictionary for converting an input phoneme string or phoneme-segment string into an associated processing procedure can hold new control commands and media information, and can be used on either the terminal side or the distribution base station side.
  • A user provides a speech waveform to the terminal or device by uttering speech.
  • the terminal-side device analyzes the given voice and converts it into features.
  • The converted features are recognized and converted into identifiers using recognition technology combining HMM and Bayes methods.
  • The converted identifier means a phoneme, a phoneme segment, an emotion identifier, or one of various image identifiers; as described elsewhere, for audio it may be an identifier based on phonemes, environmental sounds, or musical scales, and otherwise an identifier based on images or actions. Based on the obtained identifier, the phoneme/phoneme-segment symbol string dictionary is consulted by DP matching to select an arbitrary processing procedure (see the sketch below), and the selected processing procedure is transmitted to the target device to execute control. It is therefore possible to use a mobile terminal as a remote control, or to control home appliances through a robot, and to smoothly detect the face, voice, and facial expression of the other party at the communication destination. The device may also be configured to display an emotion index or a rendering of utterances, or as an interactive device for communication with a disabled person, provided with a braille output unit or the like.
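  • A minimal sketch of DP matching a recognized phoneme string against a command dictionary; the dictionary entries and the reuse of the 60% criterion are illustrative assumptions:

```python
def edit_distance(a, b) -> int:
    """Classic dynamic-programming (DP) string alignment over symbol sequences."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

COMMANDS = {
    ("r", "o", "k", "u", "g", "a"): "start_recording",  # "rokuga" (record)
    ("s", "a", "i", "s", "e", "i"): "start_playback",   # "saisei" (play)
}

def select_command(phonemes):
    """Pick the closest dictionary entry; require a 60% normalized match."""
    query = tuple(phonemes)
    best = min(COMMANDS, key=lambda k: edit_distance(k, query))
    score = 1 - edit_distance(best, query) / max(len(best), len(query))
    return COMMANDS[best] if score >= 0.6 else None
```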
  • Information processed in such a procedure can be sent at any chosen conversion level: natural information such as video and audio can be transmitted as-is without conversion to feature values, transmitted after conversion to feature values, transmitted after conversion to identifiers, or transmitted after control information has been selected, and the receiving side can accept any of these states. The receiver is configured to process the received information and, based on the acquired information, send it on to the distribution station or control device, or perform arbitrary processing such as searching, recording, mail distribution, communication, machine control, and device control.
  • Identifier strings, character strings, and feature amounts formed into appropriate queries are transmitted to the distribution-side base station, and information matching the query is obtained.
  • The control dictionary configuration shown in Fig. 24 is used so that control items can be selected over the communication line when controlling by voice, even while advertisements are displayed during communication or search wait times; control dictionaries are exchanged as in the example. This control command dictionary is composed of phonemes, phoneme segments, emotion identifiers, and any other identifiers, feature quantities, and device control information as described above. It can be made reusable, and by updating or reconfiguring the dictionary information for a search that associates arbitrary identifiers with feature quantities, trendy search keywords can be kept up to date.
  • Infrared control information to be transmitted to a product controllable by a conventional infrared remote controller may be selected as the device control information, or a series of operations may be batch-processed by combining such control information. Depending on the CPU performance of the apparatus, the feature information may be transmitted to the information processing apparatus for voice-based control without recognizing the identifiers locally.
  • A server-client model may be introduced in this way: the server and client divide the processing steps arbitrarily, are connected by communication, and exchange arbitrary information, on which basis search and indexing may be implemented.
  • Information acquired by client terminals such as DVD recorders, network TVs, STBs, HDD recorders, music recording/playback devices, and video recording/playback devices from core servers at the communication destination can be transmitted via infrared communication, FM or VHF frequency-band communication, or IEEE 802-series wireless communication. Teletext can be used on a mobile terminal or mobile phone; the client terminal can be controlled by voice, character-string input, or gestures such as shaking the mobile terminal or phone; and the mobile terminal or phone may also be used as a general remote control for operating the client terminal.
  • In the environment shown in FIG. 20, the user selects the search condition formula constructed on his or her own device, together with the identifiers, feature quantities, and/or function parameters used in it, and via the communication line and/or a storage medium the search condition formula and/or identifiers and/or feature quantities and/or function parameters may be disclosed or released to any third party, or shared using P2P software. Combinations of search conditions, identifiers, feature quantities, and function parameters reflecting the preferences and values of celebrities, specialized magazines, and professionals can also be sold via communication lines or as magazine supplements.
  • When another party's search condition formulas and/or function parameters have been copied to a storage medium or downloaded via the communication line by the procedure shown in Fig. 25 and are used for indexing, those search condition formulas can be used on one's own device provided that the identifiers selected by the feature-quantity extraction method or the discriminant function have the same configuration. Measures should be taken to prevent viruses from being included in this distributed information.
  • A user who can acquire or convert information such as evaluation functions and search conditions related to a search can share them with others on other devices, and a search condition formula can be acquired by the same method.
  • An identifier co-occurrence matrix can be used to convert between identifiers based on co-occurrence information, such as the conversion between international phoneme symbols and language-dependent phoneme symbols described later, or to convert other identifiers into phoneme symbols; transformation in the information space using evaluation functions such as HMM, Bayes, and membership probability is also possible.
  • The dictionary that converts between phoneme strings or phoneme-segment strings and processing procedures is distributed on the terminal side as well. New control commands, media types, format types, device names, phoneme symbol strings, image features, emotion identifiers, and similar symbol strings related to the base station may be expressed in markup languages such as XML and HTML described later, in RSS, or via CGI, and information configured in this way may be transmitted, received, or distributed.
  • Terminal A, the first user's device, attempts to connect to another terminal C or to an information processing device that can communicate with base station B via the Internet, and confirms what information can be used for search by other devices using RSS or CGI.
  • Terminal A executes an evaluation function acquisition step to acquire detailed information on the target search execution method using the communication line or infrared, obtaining the numerical information, identifier symbol strings, and evaluation expressions necessary for function construction, together with any other information necessary for the search.
  • Identification functions, DP templates, and HMMs that are used less frequently are deleted, and a new evaluation function is stored based on the information acquired earlier; the evaluation function switching step is executed so that functions can be reused without being re-acquired and re-registered every time. Alternatively, the evaluation function may be acquired by communication each time and stored in the storage unit, with the stored evaluation function deleted when the service ends or the power is turned off, or it may be obtained from a distributed storage medium (a cache sketch follows below).
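  • One plausible realization of this switching step is a least-recently-used cache of evaluation functions fetched over the communication line; the class and its capacity are assumptions, not the patent's design:

```python
from collections import OrderedDict

class EvaluationFunctionCache:
    """Keep recently used evaluation functions; evict the least-used ones."""
    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.cache = OrderedDict()   # name -> (parameters, templates)

    def get(self, name, fetch):
        """fetch: callable acquiring parameters from terminal C / base station B."""
        if name in self.cache:
            self.cache.move_to_end(name)         # mark as recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # drop the least-used function
            self.cache[name] = fetch(name)
        return self.cache[name]
```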
  • The information exchange target need not be a base station or another terminal; any embodiment may be considered as long as the device includes, in a configuration related to the present invention, an information processing unit such as a robot or remote controller using the present invention, an information input/output unit, and a storage unit.
  • A control method is obtained by a method similar to the procedure example of the information processing apparatus used in the terminal and base station described above, and a dictionary is provided that converts phoneme-string symbols for input commands into control commands, so that voice operation can be realized by recognizing a person's utterance and executing the target command. At this time emotions are analyzed from the voice information; if the detected result is the emotion of "sadness" and an associated phoneme or phoneme segment is detected in the utterance, a comforting context is selected, while if a feeling of "anger" is detected together with a phoneme or phoneme segment associated with the utterance "Kora", a soothing context is selected; such processing means may be implemented.
  • A message apologizing to the user may be presented by voice or as a character string, or a camera or the like may be used.
  • The above-mentioned arbitrary identifiers and features, such as phonemes, phoneme segments, emotion identifiers, and image identifiers, can be used; recognition based on volume may also be performed, and processing may be selected and changed according to the combination of identifiers. Recognition results such as emotion identifiers, instrument identifiers, scale identifiers, and environmental-sound identifiers may likewise be used.
  • The recognition results of emotions, phoneme strings, and phoneme-segment strings associated with the user's utterance at the time of evaluation are associated with positive or negative meanings. Reinforcement learning is performed when a phoneme string with a positive meaning, or an emotion identifier associated with a positive meaning such as "joy" or "relief", is detected. Conversely, if the recognition result is linked to a negative meaning, such as a phoneme or phoneme-segment symbol string like "no use", or the associated emotions are "sadness", "anger", or "disappointment", the target may be removed from the next round of reinforcement learning, or a new feature group of negative meanings may be created and reinforcement learning performed to learn the negative targets.
  • Keywords related to operable processes may be displayed on the screen, and a phoneme string or phoneme-segment string list may be selected or spoken and presented to the user. This makes it possible to realize a voice user interface based on emotion-aware phoneme/phoneme-segment recognition that does not rely on general-purpose speech recognition.
  • The dictionary that converts phoneme strings, phoneme-segment strings, and emotion identifiers into processing procedures may reside on the terminal side or the distribution base station side. New control commands, media types, format types, device names, and symbol strings such as phoneme symbol strings, image features, and emotion identifiers can be transmitted, received, and distributed using markup languages such as XML, HTML, and RDF, RSS, and CGI, described later; combining these well improves convenience.
  • A procedure for using combinations of co-occurrence information based on multiple identifiers and feature quantities will now be described more specifically. As an outline, an example search processing procedure using multiple types of identifiers and an arbitrary processing procedure based on such a search are shown, followed by specific examples of combinations associated with each identifier.
  • The combination of these identifiers and feature quantities may involve two or three as required, or the co-occurrence probabilities of four or more, even more than ten, identifiers may be combined in an implementation.
  • The co-occurrence state or co-occurrence information in the present invention is based on information configured using natural information, including auditory and visual information, sensor information, and identifiers and feature quantities acquired from video and/or audio. It comprises multiple related pieces of information, including sensor information detected as distributed text information, and is characterized by the fact that their identifiers and features occur within an appropriate unit time according to usage; it can also be constructed as time transitions over multiple pieces of co-occurrence information. Probabilistic transition matrices of these are used as "index co-occurrence information" for content index information and as "co-occurrence search condition information" configured from the search conditions entered by users.
  • The boundary of the range over which identifiers and feature quantities are evaluated may be a number of frames divided on the time axis, a point where the divergence of the feature quantity obtained by an arbitrary identification method exceeds or falls below a threshold, or an identifier boundary obtained by any detection or identification method.
  • The distribution information is indexed using EPG, BML, RSS, text broadcasting, and text contained in subtitles and video, while also checking the bias of which identifiers co-occur in any given range.
  • The character string or identifier ID associated with an identifier obtained as a search result is converted into another identifier or identifier string by the conversion dictionary, and the part of the content information matching that identifier or identifier string is searched.
  • The name of a performer is obtained from EPG, BML, RSS, teletext, or text information such as recognized subtitles and text contained in the video, or from the phoneme string entered or spoken by the user; the matching performer name is detected, and the places in the video information where that name is spoken or displayed in subtitles are detected. The detected location is treated as a scene related to the user's purpose, and the content information is played back, recorded, or skipped there, or recording is started at a specific title image feature.
  • From EPG, MPEG7, BML, RSS, XML, Web sites, recognized subtitles, and character strings contained in the video, program information such as the composition of performers, titles, directors, producers, sports team names, actor casts, family relationships, and human relationships can be used as identifiers.
  • Searches for scenes in which the main character and an enemy co-occur, or the main character and a lover co-occur, can be given multivariate analysis based on image features, the emotions expressed in scenes, the phoneme strings and phoneme-segment strings associated with the voices produced in scenes, and the changes in video features within scenes. A method of indexing, searching, detecting, and learning using these phoneme strings, phoneme-segment strings, program information, image features, or image identifiers is likewise possible.
  • An input character string is converted into a symbol string using phonemes or phoneme segments, or symbol information based on phonemes or phoneme segments from the user's utterance is used together with identifiers recognized from emotions, environmental sounds, or image features; a query is constructed, and recording of broadcast contents to the information storage device based on the present invention is started. The symbol string is evaluated at the same time as recording, the match against pre-registered symbol strings is evaluated, and if the match exceeds a certain percentage, the hour before and after that point is registered for long-term storage.
  • A method of narrowing down detection targets using a co-occurrence matrix or co-occurrence probabilities from statistical processing may be used, or a co-occurrence dictionary may be configured by classifying identifiers and feature quantities.
  • The input content information is indexed by emotions and environmental sounds recognized from audio, and by image features, motion identifiers, and object identifiers recognized from video, and is recorded as a database according to the present invention. The speech or character string input by the user is converted into a symbol string using phonemes or phoneme segments and given to the recorded database as a query, and the search result is presented as the detected target information.
  • Sounds generally called onomatopoeia, such as "wan-wan" and "dokan", are recognized relatively consistently as phonemes or phoneme segments, so they can serve as search indexes assisting the environmental-sound identifiers. Likewise, emotion identifiers for the search can be selected from emoticons such as "(;;)" in a character string, setting the emotion identifier to "joy" or "sorrow", and the search condition can be configured accordingly to perform the search (a conversion-dictionary sketch follows below).
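  • A toy sketch of the two conversion dictionaries just mentioned; every entry here is an invented example, not content of the patent's actual dictionaries:

```python
EMOTICON_TO_EMOTION = {
    "(;;)": "sorrow",
    "(^_^)": "joy",
}

ONOMATOPOEIA_TO_PHONEMES = {
    "dokan": ["d", "o", "k", "a", "N"],        # explosion-like sound
    "wanwan": ["w", "a", "N", "w", "a", "N"],  # dog bark
}

def build_query(text: str):
    """Turn raw user input into identifier-based search conditions."""
    conditions = []
    for token in text.split():
        if token in EMOTICON_TO_EMOTION:
            conditions.append(("emotion", EMOTICON_TO_EMOTION[token]))
        elif token in ONOMATOPOEIA_TO_PHONEMES:
            conditions.append(("phonemes", ONOMATOPOEIA_TO_PHONEMES[token]))
    return conditions

# e.g. build_query("dokan (;;)") ->
#   [("phonemes", [...]), ("emotion", "sorrow")]
```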
  • The search technology of the present invention may also be used as artificial intelligence for chats, agents, and robots, classifying the identifiers and feature quantities usable for dialogue between the device and humans to form a co-occurrence dictionary.
  • A proper noun is converted into a phoneme or phoneme-segment symbol string, the emotion features or emotion identifiers occurring near the proper noun are evaluated, and by evaluating the bias of the user's emotion when the proper noun is uttered, it is possible to search according to the user's preference.
  • Facial-expression features under specific emotions are detected, and facial features are statistically learned so that facial expressions can be discriminated; alternatively, the face is normalized to a fixed direction and size using 3D or 2.5D features, after which parts showing change or movement are learned as separate items. It is also possible to assign identifiers separating parts of the face, such as eyes and mouth, and learn facial-expression changes, or to classify bodies, machines, and devices in the same way for use in other searches.
  • Linked to any tag or name in EPG, BML, RSS, or teletext, a sports program is detected from the EPG, a change in score is detected from the BML, and when a change in score is displayed, the playback position is moved to the place where excitement is detected from the emotion features, thereby detecting the sport's highlight; by learning the image features around that time, the highlight can subsequently be detected from the images alone.
  • When the partial motion features are large and their motion directions are not parallel, and warm-colored red and yellow features appear on the screen, index information is recorded in synchronization with the moving image as an explosion scene.
  • When blue dominates the screen and wave sounds are detected, the scene is recorded as a seaside scene.
  • When a slowly moving white block is detected against blue and a wind sound is detected, the index information is recorded as a sky scene (these three rules are sketched below).
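  • The three indexing rules above, written as a small rule table over recognized identifiers; the cue labels are invented stand-ins for the actual feature detectors:

```python
SCENE_RULES = [
    ("explosion", {"motion:large_nonparallel", "color:warm_red_yellow"}),
    ("seaside",   {"color:blue_region", "env:wave_sound"}),
    ("sky",       {"color:blue_region", "shape:slow_white_block", "env:wind_sound"}),
]

def index_frame(identifiers) -> list:
    """Return scene labels whose cue sets are fully present in this frame."""
    present = set(identifiers)
    return [label for label, cues in SCENE_RULES if cues <= present]

# e.g. index_frame({"color:blue_region", "env:wave_sound"}) -> ["seaside"]
```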
  • With this index information in place, the frequency of index appearance is calculated over the entire video length, and the similarity of these frequencies is evaluated to detect the bias of expressions on screen and the user's browsing tendencies; a search based on the user's browsing status and the frequency of identifiers appearing in the content is thereby realized (see the sketch below). By setting evaluation functions and evaluation-result thresholds that match the user's hobbies and preferences, arbitrary processing such as recording and playback of content information can be performed and searches executed.
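  • A minimal sketch of comparing index-appearance histograms (identifier frequency normalized by content length) with cosine similarity, one plausible way to realize the frequency-similarity evaluation above:

```python
import math

def frequency_histogram(index_entries, content_length: float) -> dict:
    """Identifier appearance counts normalized by the content length."""
    hist = {}
    for identifier in index_entries:
        hist[identifier] = hist.get(identifier, 0) + 1
    return {k: v / content_length for k, v in hist.items()}

def cosine_similarity(h1: dict, h2: dict) -> float:
    keys = set(h1) | set(h2)
    dot = sum(h1.get(k, 0) * h2.get(k, 0) for k in keys)
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```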
  • The composition of performers, titles, directors, producer names, and the family and human relationships of actors as casts may be used as identifiers, or converted to phonemes so that matches can be evaluated jointly.
  • The identifiers and feature values obtained serve as index information; by evaluating the distance and coincidence rate against index information consisting of music identifiers and feature values registered in the database, music information matching the user's hobbies and interests can be searched.
  • For instrument types, scenes and pages where any instrument is played or displayed can be searched from the co-occurrence information of instrument name and acoustic features, or instrument name and image features. A scene where a piano appears can be searched as a phoneme string by pronouncing "piano [p/i/a/n/o]", or based on that phoneme string; the audio or video stream detected by the search instruction may then be recorded or skip-played according to those features, the EPG, and the BML.
  • A co-occurrence dictionary may be configured by classifying the acceptable identifiers and feature quantities.
  • The above scenes can also be searched using car tappet sounds, engine sounds, and locomotive exhaust sounds; the names of these sounds can be converted into phoneme strings or phoneme-segment strings and used for the search. If the search condition is "engine sound", engine sounds are searched; for an engine scene, a search combining the engine's image features with engine sound at volume can be performed in the same manner.
  • Scene search based on a person's emotional behavior can be performed by association with the person type; a phoneme or phoneme-segment symbol string can be used by presenting a word, and the expression or emotion name can be converted into a phoneme string or phoneme-segment string for use in the search. Since the action type is related to the face type and the emotion type as described above, it can be associated with the person type so that scenes can be searched based on a person's emotional behavior, gestures, actions, and gait, and input video information can be detected by associating motion identifiers with phoneme or phoneme-segment strings. When sign-language information is detected and uttered by speech synthesis, the utterance is converted to a phoneme string; CG can be used to reproduce the actions associated with the phoneme string and display sign language, and the names of these operations can be converted into phoneme strings or phoneme-segment strings for use in searches.
  • For landscape types, natural images and city images are classified based on co-occurrence information of image features such as color features and the existence probability of straight lines and curves per unit area; based on scene names, phoneme strings can be converted to features, or the phoneme or phoneme-segment strings of content uttered while viewing the scene can be indexed and searched.
  • For location information, by associating landscape types with phoneme strings, information on any area can be searched from a large accumulation of movies and broadcast images based on arbitrary image characteristics. A travel guide can be built from the image features of locations used in famous scenes, and similar landscapes can be detected; location names may be converted into phoneme strings and used for the search.
  • With the display position type, it is possible to evaluate what kind of image is at which position in the screen, specify the range, display it, and have the user call out its name; using the device as an index for learning the display content is one conceivable method. For example, numbers are displayed at the detected positions and the user is asked "Who is No. 1?", "Who is No. 2?"; the system learns from the user calling out names, speaking and confirming the phoneme strings and phoneme-segment strings, or uses phoneme strings such as "w/a/k/a/r/a/n/a" associated with keywords for specific control.
  • A character string identified by the recognition process is converted into a phoneme string or phoneme-segment string to be searched; for a still image, the audio and video related to the word that was clicked or range-specified can be displayed and searched, and characters and font names can be converted into phoneme strings and phoneme-segment strings for use in the search.
  • With shape types, round objects, square objects, and pointed objects can be detected, allowing detection of obstacles that hinder a robot's movement or objects dangerous to humans; alternatively, a search can be performed using phoneme or phoneme-segment strings of abstract keywords based on the associated image features. Fixed video such as an opening telop in a given program can be associated with the phoneme or phoneme-segment strings of fixed utterances such as opening announcements and searched, and the names of these shapes can be converted into phoneme or phoneme-segment strings for the search. It is also possible to use a waveform shape type, statistically analyzing changes in brain waves and pulse waves extracted from multiple locations and assigning an identifier for use in searching.
  • Program information such as performers, authors, moderators, and program titles can be used as an index, and the names of program genres and categories may be converted into phoneme strings or phoneme-segment strings and used for the search.
  • Examples of search with images and environmental sounds, search with environmental sounds and EPG, BML, RSS, and teletext, audio/video search with multiple identifiers, and an application example of optional processing triggered by identifier detection will be described with reference to FIG.
  • Phonetic symbols and emotion symbols are indexed at the same time as recording; the recording range may be decided from markup languages such as EPG, BML, RSS, and teletext, or from ranges associated with services using CGI, and unnecessary parts may be deleted or scenes skipped automatically during playback. To this end, a specific keyword is converted into phonemes, recording proceeds as a temporary file while phoneme matches are confirmed, and an index together with emotion features is constructed whenever a target keyword is detected.
  • EPG, BML, RSS, and text broadcasting can be used to classify files and file names, target video and still images, audio, text, and the related information concerning their time-series presentation order.
  • Devices that perform playback and recording may construct phoneme strings and phoneme-segment strings for the information to be presented and distribute them via EPG, BML, RSS, and teletext; for the user's convenience, the device can search, record, and play back recorded content and recording targets using the phoneme and phoneme-segment strings based on the received EPG, BML, and RSS.
  • A device that executes these services may be a desktop information processing device or a portable information terminal, and the contents of the present invention can be implemented via a communication base station using them: a device using the present invention at home can be called from the mobile terminal, or information recognized by the mobile terminal can be mailed to the device at home.
  • When the home device using the present invention receives the keywords "famous husband (Arinao [/a/r/i/n/a/o/])" and "record (Rokuga [/r/o/k/u/g/a/])", it starts recording all receivable channels; excluding the command section, the keywords are expanded into phonemes and recorded, and the recorded content is detected by searching the phoneme symbol strings.
  • The matching degree is set to 60%; content is recorded while a save-flag boundary is set every minute, and if no location exceeding the 60% match occurs within a one-minute segment, the recorded content information is deleted after one hour. When a location where the keyword matches 60% or more is detected, the content is kept, for example, from up to one hour before, and/or up to the program boundary given by EPG, BML, RSS, text broadcasting, and the like.
  • Broadcasts containing the word "famous husband (Arinao [/a/r/i/n/a/o/])" thus automatically save about one hour around the occurrence of that word (a retention sketch follows below).
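  • A sketch of this retention policy: record continuously, mark a save boundary every minute, and discard segments whose neighborhood never reached the 60% keyword match within one hour. The class shape is an assumption; the 60% and one-hour values follow the text:

```python
import time

class RetentionBuffer:
    """Keyword-triggered keep/delete policy over one-minute recording segments."""
    MATCH_THRESHOLD = 0.60
    TTL_SECONDS = 3600                 # delete unmatched content after one hour

    def __init__(self):
        self.segments = []             # (boundary_time, matched_flag)

    def close_minute(self, best_match_rate: float, now: float = None):
        """Close the current one-minute segment with its best keyword match."""
        now = now if now is not None else time.time()
        self.segments.append((now, best_match_rate >= self.MATCH_THRESHOLD))

    def sweep(self, now: float = None):
        """Keep matched segments; drop unmatched segments older than the TTL."""
        now = now if now is not None else time.time()
        self.segments = [(t, hit) for t, hit in self.segments
                         if hit or now - t < self.TTL_SECONDS]
```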
  • The video recorded by the present invention may be ranked according to the number of appearances and the degree of coincidence of the words and displayed as a list.
  • Face detection may be performed at the same time, and learning may be repeated in association with the actor's name and facial features to learn whether or not a specific person is on screen.
  • This improves learning efficiency and the performance of automatic detection recording; the device may also perform the learning independently.
  • When a celebrity (Arina) is an actor, he or she may be called by a different name in video and audio works.
  • In that case, the program search can be executed by the following procedure. Actor names in the performer lists of various programs in EPG, BML, RSS, and text broadcasting are given in kanji; using information converted from the kanji or English words into symbol strings of phonemes and phoneme segments, the actor name is looked up from the user's utterance or from text entered in the usual way, and the target actor name is extracted. Next, the cast (role) name associated with the actor name is extracted, and a symbol string of phonemes and phoneme segments based on the cast name is constructed while referring to the dictionary. A search using this phoneme- or phoneme-segment-based symbol string is then performed on video and audio work information indexed by such symbol strings (see the sketch below). As a result, scenes associated with the target actor's cast name can be found, linking the search to EPG, BML, RSS, and teletext in a way that was not possible with conventional phoneme and phoneme-segment searches, and improving the convenience of searching within video and audio works.
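  • A sketch of this pipeline, with the two dictionaries as illustrative stand-ins; match_fn could be the normalized DP score from the command-dictionary sketch earlier:

```python
ACTOR_TO_CAST = {"actor_name": "cast_name"}      # from EPG/BML/RSS/teletext
CAST_TO_PHONEMES = {"cast_name": ["k", "a", "s", "u", "t", "o"]}  # dictionary lookup

def search_by_actor(actor, indexed_content, match_fn, threshold: float = 0.6):
    """indexed_content: iterable of (time_offset, phoneme_index) pairs."""
    cast = ACTOR_TO_CAST.get(actor)
    query = CAST_TO_PHONEMES.get(cast, [])
    hits = []
    for position, phonemes in indexed_content:
        if query and match_fn(query, phonemes) >= threshold:
            hits.append(position)    # scene where the cast name is uttered
    return hits
```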
  • If indexes by explosion-sound identifiers, indexes by phoneme symbol strings associated with shouted words, and times at which prosody, music, laughter, and the like were recorded as identifiers appear frequently, the video information is likely an action program; these can be aggregated to create an evaluation function that evaluates and searches for the degree to which content is an action program, along with phonemes associated with dark video periods and screams in the video information.
  • If the index appearance frequency of symbol strings or emotion identifier strings associated with screams, relative to the entire video time length, exceeds the average appearance frequency of scream-related indexes across many other video and audio works, a function that evaluates horror programs can be created to evaluate the degree of horror and search for it; similarly, by recording when and how conference information was captured, the undulations of emotion and changes in content can be classified, and such a device can be realized.
  • Environmental-sound segments are constructed for environmental sounds in the same way. Visemes can be decomposed into time series and viewed as viseme segments, video images can be viewed as motion elements and motion segments for moving images, and images can likewise be treated as image elements and image segments for image information; a new index for search may be reconstructed by viewing each as such segments.
  • The present invention may be implemented by a device whose functions are reduced to one of the three methods. It can be used for a surveillance camera and the like: a broken window or door is detected when the image-feature evaluation distance of the window/door discriminant function deviates from the average, and crimes are detected by noticing, for instance, that a person has not moved for a long time in front of a locked door. Scene boundaries of moving images can be detected for use in video editing machines; markup languages can be used; phonemes and phoneme segments can be derived from character strings; voice and other identifiers can be used; weather can be detected from image features and indoor equipment controlled for ventilation and lighting; and billing and payment can be performed through personal authentication by name, password, face recognition, and utterance.
  • The dictionary that converts phoneme strings and phoneme-segment strings into processing procedures, whether on the terminal side or the distribution base station side, can carry correction information, new programs, actor names, program genres, distribution station names, image features, voice features, emotion identifiers, and the like; these may be sent, received, and distributed using markup languages such as XML and HTML, RSS, CGI, and so on, and combining them well improves convenience.
  • A device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the contents of the present invention may be implemented via a communication base station.
  • FIG. 27 shows a CRM (Customer Relationship Management) system using the present invention, as an application of the above-mentioned search example using proper nouns and emotion identifiers and of the example of executing arbitrary processing by audio-video search with multiple identifiers.
  • Utterances carrying consumer emotions are analyzed and indexed using multiple analysis and identification devices according to the present invention. The reputation of a product as seen by consumers is derived from the phoneme string indicating the product of a specific model number and the accompanying emotions of anger and sadness. The number of occurrences of emotion features and of phoneme symbol strings identifying products can be analyzed quantitatively, the results can be displayed using a markup language such as HTML or XML, described later, or CGI, and the manuals of the identified products can be displayed.
  • the consumer requests a consultation from the consultation service operator over the phone or in the store.
  • The voice feature values of both operator and consumer are extracted, and emotions, phonemes, and phoneme segments are recognized from the extracted features; the phonemes, phoneme segments, and emotions recognized by the above-described method are stored in the information storage device.
  • As a relevance evaluation method, the consumer evaluation may be rated low when anger or sadness emotions occur in audio information in which a specific product model number is detected. By evaluating the distribution of phoneme symbol strings and emotion identifiers recognized in the speech information in this way, the consumer's feelings about the product can be evaluated quantitatively, enabling quantitative analysis of the product's reliability (see the sketch below).
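  • A sketch of this quantitative evaluation: count how often negative emotion identifiers co-occur with a product's phoneme string in recorded calls. The record layout and scoring rule are illustrative assumptions; match_fn could again be a DP-based phoneme match:

```python
NEGATIVE_EMOTIONS = {"anger", "sadness"}

def product_sentiment(calls, product_phonemes, match_fn, threshold: float = 0.6):
    """calls: iterable of {"phonemes": [...], "emotions": {...}} records."""
    mentions, negative = 0, 0
    for call in calls:
        if match_fn(product_phonemes, call["phonemes"]) >= threshold:
            mentions += 1                              # product model number heard
            if NEGATIVE_EMOTIONS & set(call["emotions"]):
                negative += 1                          # with anger/sadness nearby
    return {"mentions": mentions,
            "negative_ratio": negative / mentions if mentions else 0.0}
```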
  • The manual for the searched product "1X5 [/i/ch/i/e/cl/k/u/s/u/g/o/]" is displayed on the operator's screen, so that consumer questions can be answered. At this time, the consumer's emotion can be recognized and stored in association in the information storage device, allowing the emotional evaluation of a product to be recorded quantitatively.
  • When a device using the present invention searches for a target product name, the criterion for the matching degree of phoneme and phoneme-segment symbol strings is set to 60%, a list of products exceeding 60% is constructed and displayed, and the operator may select the manual for the target product from it.
  • the dictionary for converting between phoneme sequences or phoneme-piece sequences and processing procedures is not limited to the terminal side or the distribution base station side. Even when phoneme symbol strings for new product names or product genres, together with symbol strings such as image features, voice features, and emotion identifiers, are sent, received, and distributed using markup languages such as XML and HTML, or RSS and CGI, convenience can be achieved by combining them well.
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal. It is also possible to analyze the operator's psychological state and confirm that the work does not cause excessive stress.
  • the contents of the present invention may be implemented via a communication base station using these devices.
  • the user speaks into the user-side browser.
  • the features of the spoken speech are extracted.
  • in the first method, these feature values are transmitted to the target device, and the device that has received them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature values. Then, based on the generated symbol strings, the matching control means is selected and executed.
  • in the second method, a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string are generated in the user's browser, and the generated symbol strings are transmitted to the target device.
  • the controlled device selects and executes the matching control means based on the received symbol string.
  • the third method recognizes phonemes and/or phoneme pieces and emotion symbol strings based on the feature values generated in the user's browser, selects the control content based on the recognized symbol strings, and sends it to the device to be controlled.
  • in the fourth method, the speech waveform is transmitted as it is from the user's browser; the controlling device recognizes the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string, selects a control means based on the recognized symbol strings, and the controlled device executes the selected control.
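The four methods differ only in where the recognition pipeline is cut between the user-side browser and the controlled device. A schematic sketch, with stage names chosen purely for illustration:

```python
# Stages of the voice-operation pipeline, in order:
#   capture -> feature extraction -> symbol recognition -> control selection -> execution
# Each method names the last stage performed on the browser side; everything
# after it runs on the controlled (or distribution) device.
PIPELINE = ["capture", "features", "symbols", "control_selection", "execution"]

METHODS = {
    1: "features",           # browser sends feature values
    2: "symbols",            # browser sends phoneme/phoneme-piece/emotion symbols
    3: "control_selection",  # browser sends the selected control content
    4: "capture",            # browser sends the raw speech waveform
}

def split(method):
    """Return (browser-side stages, device-side stages) for a given method."""
    cut = PIPELINE.index(METHODS[method]) + 1
    return PIPELINE[:cut], PIPELINE[cut:]

for m in sorted(METHODS):
    local, remote = split(m)
    print(f"method {m}: browser={local} device={remote}")
```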
  • the dictionary that converts between phoneme sequences or phoneme-piece sequences and processing procedures may be on the terminal side or on the distribution base station side. Even when phoneme symbol strings related to correction information, new tags, variables, and attributes, together with symbol strings such as image features, audio features, and emotion identifiers, are sent, received, and distributed using markup languages such as XML and HTML, or RSS and CGI, convenience can be achieved by combining them well.
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal.
  • the contents of the present invention may be implemented through a communication base station or may be implemented in combination.
  • the in-car audio device detects the phoneme sequence "accident situation (j/i/k/o/j/o/u/k/y/o/u)", and the information is transmitted to the base station and received via VICS, a mobile phone, or any other communication means.
  • the information transmitted from each vehicle may also be captured by roadside equipment such as Orbis cameras and transmitted to the base station.
  • the dictionary for converting between phoneme sequences or phoneme-piece sequences and processing procedures is not limited to the terminal side or the distribution base station side. Even when phoneme symbol strings for new place names, titles, addresses, or roads, together with symbol strings such as image features and emotion identifiers, are sent, received, and distributed using VICS, markup languages such as XML and HTML, or RSS and CGI described later, convenience can be achieved by combining them well.
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal.
  • the contents of the present invention may be implemented by performing a search via a communication base station or performing a search independently.
  • next, examples of karaoke and music sales systems will be described.
  • song titles and chorus lyrics are recorded as phoneme strings, phoneme-piece strings, and musical scale strings, and can be used for title search in karaoke by searching for matching portions. Furthermore, in addition to such feature structures, it is possible to compare the appearance frequency, appearance distribution structure, and appearance position distribution of scale symbols and to search for items with high coincidence.
  • the user's preference may be learned from the co-occurrence information obtained by such searches. When the user selects a piece of music, plays it back, and then repeatedly selects it or listens to it to the end, the user may be judged to have affirmed the search result; conversely, if the music is played only once or the user immediately skips to the next song, it may be interpreted as a negative judgment.
  • the "00 band" of the query may be spoken by voice and processed with natural language processing, may be searched by expanding a character string input into phonemes, or may be searched as a character string; the similarity of music features and emotion features may also be evaluated.
  • the tendency of emotion identifiers generated according to music is extracted by statistical processing for each music genre and subjected to multivariate analysis. By searching for music genre identifiers based on the user's sensitivity parameters according to the present invention, or by evaluating the similarity of the appearance tendencies of emotion identifiers in music, music close to the user's sensitivity trend can be found and presented, so a service that recommends music according to the user's preference is also possible.
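A hedged sketch of such a recommendation, assuming each song's emotion identifiers have already been aggregated into an appearance-frequency distribution and comparing it to a learned user profile with cosine similarity (the songs, labels, and numbers are invented):

```python
import math

# Hypothetical appearance frequencies of emotion identifiers per song; in the
# scheme above these would be extracted from the music by statistical processing.
songs = {
    "song_a": {"joy": 0.7, "sadness": 0.1, "anger": 0.2},
    "song_b": {"joy": 0.1, "sadness": 0.8, "anger": 0.1},
    "song_c": {"joy": 0.6, "sadness": 0.3, "anger": 0.1},
}
user_profile = {"joy": 0.65, "sadness": 0.25, "anger": 0.10}  # learned preference

EMOTIONS = ["joy", "sadness", "anger"]

def cosine(p, q):
    """Cosine similarity between two emotion-identifier distributions."""
    dot = sum(p[e] * q[e] for e in EMOTIONS)
    norm_p = math.sqrt(sum(p[e] ** 2 for e in EMOTIONS))
    norm_q = math.sqrt(sum(q[e] ** 2 for e in EMOTIONS))
    return dot / (norm_p * norm_q)

ranked = sorted(songs, key=lambda s: cosine(songs[s], user_profile), reverse=True)
print(ranked)  # recommend songs whose emotion distribution is closest to the user's
```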
  • the dictionary for converting between phoneme sequences or phoneme-piece sequences and processing procedures is not limited to the terminal side or the distribution base station side. Even when symbol strings such as scale symbol strings and emotion identifiers are sent, received, and distributed using markup languages such as XML and HTML, or RSS and CGI, convenience can be achieved by combining them well.
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal.
  • the contents of the present invention may be implemented via a communication base station.
  • humming-and-lyrics search in the prior art separates the act of humming from the act of uttering lyrics, and thus differs from the search based on co-occurrence information in the present invention.
  • as an application of voice operation of the present invention, the user speaks into an information terminal and/or a terminal-side browser.
  • the feature amount is extracted from the spoken voice.
  • in the first method, these feature amounts are transmitted to the target device, and the distribution device that has received them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature amounts. Then, based on the generated symbol strings, a matching control unit on the distribution apparatus side is selected and executed.
  • in the second method, a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string are generated in the information terminal and/or the terminal-side browser, and the generated symbol strings are transmitted to the target distribution device. The distribution apparatus side then selects and executes the matching control and distribution means based on the received symbol strings.
  • the third method recognizes phonemes and/or phoneme pieces and emotion symbol strings based on the feature values generated in the information terminal and/or the terminal-side browser, and the control contents are selected based on the recognized symbol strings and transmitted to the distribution apparatus side that executes the control.
  • the distribution apparatus that has received the control method performs information processing based on the control method and provides information.
  • in the fourth method, the information terminal and/or terminal-side browser transmits the speech waveform as it is; the controlling distribution apparatus recognizes the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string, selects a control means based on the recognized symbol strings, and executes the selected control.
  • note that, just as emotion identifiers can be extracted from voice and features can be extracted from symbols, acoustic and video features and identifiers such as environmental sounds can be handled in the same way.
  • phoneme symbol strings may be embedded in the CGI and HTML for the displayed products, so that search and evaluation based on those symbols moves to the matching page and displays product orders and details.
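For instance, the phoneme string could be carried as a markup attribute of each product link and matched against the recognized query; the `data-phonemes` attribute name and page markup below are assumptions for illustration, not part of the specification:

```python
import re

# Each product page carries its phoneme symbol string in the markup so that a
# spoken query, once recognized as phonemes, can be matched against the pages.
pages = [
    '<a href="/products/1X5" data-phonemes="i ch i e k u s u g o">1X5 details</a>',
    '<a href="/products/1X6" data-phonemes="i ch i e k u s u r o k u">1X6 details</a>',
]

def find_page(recognized, pages):
    """Return the href of the first page whose embedded phoneme string matches."""
    for page in pages:
        m = re.search(r'data-phonemes="([^"]+)"', page)
        if m and m.group(1) == recognized:
            return re.search(r'href="([^"]+)"', page).group(1)
    return None

print(find_page("i ch i e k u s u g o", pages))  # -> /products/1X5
```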
  • searches may be performed on books, AV content, digital materials, cosmetics, pharmaceuticals, food, automobiles, and other industrial products and items with any proper nouns.
  • a method may also be considered in which each proper noun is uttered by multiple speakers so that the same word is provided with recognition templates for multiple phonemes and phoneme pieces, thereby improving the search rate for the phoneme strings of the pages to be used.
  • an application system such as an expert system may be constructed by using a part of the processing procedure of such an ordering system.
  • the dictionary that converts between phoneme sequences or phoneme-piece sequences and processing procedures may be on the terminal side or on the distribution base station side, and is not limited to phoneme symbol strings related to new products or product genres. Symbol strings such as image features, voice features, and emotion identifiers may also be sent, received, and distributed using markup languages such as XML and HTML, or RSS and CGI described later.
  • these services themselves may be content distribution services such as movies, photographs, and novels, or even digital material distribution services and product sales services. The device that executes them may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the contents of the present invention may be implemented via a communication base station.
  • in a book-reading service offered when a book is sold, it is possible to retrieve the position of any speech or sentence by using phonemes or phoneme pieces, or by evaluating the emotions contained in them based on recognition.
  • when speech synthesis is used for reading aloud, the utterance dictionary and template can be changed to the voice of a favorite celebrity by changing the speech synthesis template for the speaker's phonemes.
  • the utterance dictionary or template for the speech synthesis parameters used in reading aloud can also be changed in accordance with changes of emotion, and these can be combined for convenience.
  • the user speaks into the remote control.
  • the feature amount is extracted from the spoken voice.
  • in the first method, these feature values are transmitted to the target device, and the device that has received them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature values, and a matching control means is selected and executed.
  • in the second method, a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string are generated in the remote controller, and the generated symbol strings are transmitted to the target device.
  • the controlled device selects and executes a matching control means based on the received symbol string.
  • the third method recognizes phonemes and/or phoneme pieces and emotion symbol strings based on the feature values generated in the remote control, selects the control content based on the recognized symbol strings, and sends it to the device to be controlled.
  • the fourth method transmits the speech waveform as it is from the remote controller; the controlling apparatus recognizes the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string, selects a control means based on the recognized symbol strings, and the controlled device executes the selected control.
  • Such remote control technology may be introduced into a robot to perform home appliance control, or may be incorporated into a car navigation system to perform control.
  • any new control symbol string information is distributed to the operated device using markup languages such as RSS, HTML, and XML, or CGI described later, and phonemes, phoneme pieces, and speech waveforms are transmitted.
  • the remote control to be used may receive or transmit the updated phoneme symbol string information of the mobile terminal via infrared or wireless communication.
  • the dictionary that converts between phoneme sequences or phoneme-piece sequences and processing procedures is not limited to the terminal side or the distribution base station side. Even when symbol strings such as voice features and emotion identifiers are sent, received, and distributed using markup languages such as XML and HTML, or RSS and CGI, convenience can be achieved by combining them well.
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal.
  • the contents of the present invention may be implemented via a communication base station.
  • the user speaks into the mobile terminal.
  • the features of the spoken speech are extracted.
  • in the first method, these feature amounts are transmitted to the target device, and the device that has received them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature amounts, and the matching control means is selected and executed.
  • in the second method, a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string are generated in the mobile terminal, and the generated symbol strings are transmitted to the target device.
  • the controlled device selects and executes a matching control means based on the received symbol string.
  • the third method recognizes phonemes and/or phoneme pieces and emotion symbol strings based on the feature values generated in the mobile terminal, selects the control content based on the recognized symbol strings, and sends it to the device to be controlled.
  • the fourth method transmits the speech waveform as it is from the mobile terminal; the controlling device recognizes the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string, selects a control means based on the recognized symbol strings, and the controlled device executes the selected control.
  • note that, just as emotion identifiers can be extracted from voice and features can be extracted from symbols, acoustic and video features and identifiers such as environmental sounds can be handled in the same way.
  • whether the infrared port of the mobile device is used to control a DVD deck, TV, air conditioner, or other device, whether the IP address of the device is acquired using infrared or wireless LAN to control it, or whether the control information of the target device is acquired and controlled via the mobile Internet or an indoor LAN, voice control from the mobile terminal or mobile phone can be realized by acquiring the control list using the present invention.
  • any method may be used: the mobile device may send its own IP address or e-mail address to the target device, and the target device may connect to an arbitrary port based on that IP address and send the control information, or attach the control information to an e-mail sent to the portable terminal, or the control information may simply be acquired by exchanging infrared signals.
  • a search service may be implemented by performing phoneme recognition, phoneme-piece recognition, emotion recognition, environmental sound recognition, and scale recognition on input from the microphone of a mobile terminal.
  • the dictionary that converts between phoneme sequences or phoneme-piece sequences and processing procedures may be on the terminal side or on the distribution base station side, and is not limited to correction information or to new content, program genres, or actor names. Convenience can be achieved by combining symbol strings such as phoneme symbol strings, image features, voice features, and emotion identifiers, which can be sent, received, and distributed using markup languages such as XML and HTML (described later), RSS, and CGI.
  • a plurality of microphones, both low-performance and high-performance, may be provided so that high-quality audio is recorded for recognition by raising the recording sampling rate, while the voice for call transmission is converted to a lower sampling rate to form compressed voice information for the call.
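As an illustrative sketch of the dual-rate idea (the 16 kHz and 8 kHz rates are assumptions; any high/low pair would do), the high-rate stream could be kept for recognition while a downsampled copy feeds the call:

```python
import numpy as np
from scipy.signal import resample_poly

RECOGNITION_RATE = 16000   # record at a high rate for phoneme/emotion recognition
CALL_RATE = 8000           # telephone-band rate for the actual voice call

def split_streams(pcm_16k):
    """Keep the 16 kHz stream for recognition; downsample a copy for the call."""
    recognition_stream = pcm_16k
    call_stream = resample_poly(pcm_16k, CALL_RATE, RECOGNITION_RATE)
    return recognition_stream, call_stream

one_second = np.random.randn(RECOGNITION_RATE).astype(np.float32)  # stand-in audio
rec, call = split_streams(one_second)
print(len(rec), len(call))  # 16000 8000
```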
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a mobile phone information terminal, or a wearable information terminal.
  • the content of the present invention may be implemented via a communication base station.
  • detection and recording functions equivalent to those described above are performed using the image recognition function and voice recognition function associated with the attached imaging device, microphone, and recording device.
  • a robot or agent using the present invention can observe the identifiers and feature amounts extracted while the user browses content, together with the user's facial expressions and utterances. By observing the feature quantities and identifiers related to phonemes, phoneme pieces, and emotions, it becomes possible to observe the co-occurrence state of the user's feature quantities and identifiers with those of the content.
  • identifiers and feature quantities related to emotions and phonemes may be acquired from a content playback device using the present invention, or identifiers and feature amounts related to user emotions and phonemes may be extracted from the content using the indexing function in the device itself.
  • for example, a "comedy program user situation evaluation function" is composed of the feature values and identifiers collected while comedy programs are viewed. If the feature quantities and identifiers of the user and the content are close to the center of gravity of the features in this evaluation function, the robot or agent can express the emotion "fun" and thereby produce a pseudo-emotional performance. Of course, other emotions may also be learned in the same manner based on the co-occurrence state of the feature quantities and identifiers obtained from the user and the content.
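A minimal sketch of such a pseudo-emotion, treating the "comedy program user situation evaluation function" as a nearest-centroid test in a joint feature space (the centroids, threshold, and three-dimensional features are invented for illustration):

```python
import numpy as np

# Hypothetical joint feature vectors (content features plus user expression and
# voice features) collected while users watched comedy programs, reduced to one
# centroid per learned pseudo-emotion category.
centroids = {
    "fun":     np.array([0.8, 0.7, 0.2]),
    "boredom": np.array([0.1, 0.2, 0.9]),
}

def pseudo_emotion(observation, centroids, threshold=0.5):
    """Express the emotion whose centroid the current observation is closest to."""
    name, dist = min(
        ((n, float(np.linalg.norm(observation - c))) for n, c in centroids.items()),
        key=lambda item: item[1],
    )
    return name if dist <= threshold else None  # None: no emotion expressed

now = np.array([0.75, 0.65, 0.25])  # features observed from user and content
print(pseudo_emotion(now, centroids))  # -> "fun": robot/agent performs "fun"
```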
  • similarly, by learning feature amounts from the image of an object for which an identifier has been obtained, the sound when it is struck, and the sound when it is operated, properties such as the mass and weight of the object, whether it can be transported, whether it should be avoided in the event of a collision, and how the emotion should be expressed when it is presented to the user can be recorded and learned automatically by the device itself.
  • the reaction of the CG character, robot, or agent is performed according to the sensitivity and responses of the user, using a knowledge database for virtual personalities, and can be used to change the facial expression of the CG character, robot, or agent. Broadcast information such as TV can be acquired using external information such as EPG, BML, RSS, and text broadcasting, and information on entertainers and current circumstances matching the user's preference may be provided using a method that analyzes the number of recordings and the playback viewing time obtained by the aforementioned video search and recording means. The user's preference may thus be analyzed, and a robot using the present invention may acquire the control methods of surrounding devices via infrared communication, wireless LAN, or the like, improving the convenience of device control according to the user's voice, and may also acquire the identifiers and feature amounts of the currently displayed information.
  • the dictionary that converts between phoneme sequences or phoneme-piece sequences and processing procedures may be on the terminal side or on the distribution base station side; phoneme symbol strings related to correction information, new information, and the functions of the robot, together with symbol strings such as image features, audio features, and emotion identifiers, can be combined and sent, received, and distributed using markup languages such as XML and HTML, or RSS and CGI, so that convenience can be achieved.
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal.
  • the contents of the present invention may be implemented via a communication base station.
  • an analyzer as a medical application will be described.
  • the facial expressions, utterances, and gestures of the subject of observation are captured using a pulse sensor, electroencephalogram sensor, muscle current sensor, skin resistance sensor, weight scale, sphygmomanometer, and thermometer, and the emotions associated with the user's utterances and the feature quantities obtained from sensors such as brain waves and pulse are recorded.
  • by learning the co-occurrence state of each identifier and feature, specifying each identifier or feature as a search condition with phonemes, phoneme pieces, or character strings, and performing indexing on content information, it becomes possible to search, record, distribute, and receive information based on those conditions. As a countermeasure against changes in the sound environment, and in consideration of the inclusion of overseas content, foreign-language phonemes and Japanese phonemes can be converted into one another to cope with differences in international pronunciation. The aim is to resolve such problems by adjusting search conditions and performing searches that convert information that co-occurs for humans, such as conversion between phoneme sequences and image features, between sound effects and phoneme sequences, and between emotions and character strings.
  • when the co-occurrence state is used, a new "troubled attitude" identifier can be constructed by combining video features and motion features, identifiers related to video and audio, identifiers such as chords and environmental sounds, and emotion identifiers, and learning the co-occurrence state of, for example, the voice features output when the facial expression shows trouble. It is also possible to configure and use multi-layer Bayes functions and multi-layer HMMs, with a layer that acquires voice and image identifiers, a layer that processes the co-occurrence state, and a layer that processes the time-series transitions of the co-occurrence state.
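A toy sketch of building such a composite identifier from windowed co-occurrence of lower-layer identifiers (the frame labels, window length, and hit threshold are illustrative assumptions):

```python
# Per-frame identifier outputs from independent recognizers (illustrative).
frames = [
    {"face": "neutral",  "voice_emotion": "neutral"},
    {"face": "troubled", "voice_emotion": "neutral"},
    {"face": "troubled", "voice_emotion": "sadness"},
    {"face": "troubled", "voice_emotion": "sadness"},
    {"face": "neutral",  "voice_emotion": "neutral"},
]

def troubled_state(frames, window=3, min_hits=2):
    """Fire the composite 'troubled attitude' identifier when a troubled facial
    expression and a negative voice emotion co-occur often enough in a window."""
    flags = []
    for i in range(len(frames)):
        chunk = frames[max(0, i - window + 1) : i + 1]
        hits = sum(
            1 for f in chunk
            if f["face"] == "troubled" and f["voice_emotion"] in ("sadness", "anger")
        )
        flags.append(hits >= min_hits)
    return flags

print(troubled_state(frames))  # [False, False, False, True, True]
```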
  • the feature of this identifier conversion is that phonemes and phoneme pieces are recognized in different language environments, and language-dependent phoneme notations are converted based on the co-occurrence information of international phoneme symbols or phoneme symbols with different language characteristics. Specifically, the output probability of the international phoneme symbol HMM is used as a feature amount and HMM learning is performed based on the language-specific phoneme symbols; conversely, for an HMM based on language-specific phoneme symbols, the output probabilities of the language-specific phoneme symbols are learned against international phoneme symbols. Similarly, phoneme-to-phoneme-piece and phoneme-piece-to-phoneme conversion can each be learned, and instead of HMMs, distances such as the Bayes discriminant function and the Mahalanobis distance, or methods using likelihood or probability, may be used. An application example using Japanese phonemes for the probability of belonging to an international phoneme is also shown.
  • the phoneme attribution probability in each language for each international phoneme symbol is obtained and a correspondence table is created; a phoneme is identified from the feature quantity using the international phoneme dictionary and converted into a phoneme symbol string dependent on each language. By obtaining the phoneme attribution probabilities between different languages and evaluating them in descending order, the speech features of people who speak other languages as their native language can be converted into the language features of the device being used. Note that phoneme and phoneme-piece conversion can be configured based on the co-occurrence of these identifiers, not only between the phonemes and phoneme pieces of different languages but also between image identifiers, and between image identifiers and phoneme or phoneme-piece string sequences.
  • the correspondence with international phoneme symbols can be related to UPA numbers, IPA symbols, and UCS code numbers by referring to the International Phonetic Alphabet guidebook of the International Phonetic Association, and these symbols and numbers may be used as identifiers for conversion management. Also, when converting phoneme symbols between different languages via international phoneme symbols, the phoneme probability table and the transition probabilities between preceding and following phonemes may be used, or the output probabilities may be re-learned and symbols converted using an HMM or the like; alternatively, an evaluation function such as a Euclidean distance function or a Bayes discriminant function may be constructed using the co-occurrence information of output probabilities and feature quantities and used as a symbol conversion function.
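A simplified sketch of conversion through such attribution-probability tables, picking the highest-probability symbol at each step (the probabilities and symbol inventories below are invented, not learned values):

```python
# Hypothetical attribution probabilities learned from shared utterances:
# P(international phoneme symbol | Japanese phoneme) and
# P(English phoneme | international phoneme symbol).
ja_to_intl = {
    "r": {"ɾ": 0.7, "l": 0.2, "ɹ": 0.1},
    "u": {"ɯ": 0.8, "u": 0.2},
}
intl_to_en = {
    "ɾ": {"r": 0.5, "l": 0.4, "d": 0.1},
    "l": {"l": 0.9, "r": 0.1},
    "ɹ": {"r": 1.0},
    "ɯ": {"u": 0.6, "ʊ": 0.4},
    "u": {"u": 1.0},
}

def convert(symbol, table):
    """Pick the target symbol with the highest attribution probability."""
    return max(table[symbol].items(), key=lambda kv: kv[1])[0]

def ja_phoneme_to_en(ja_symbols):
    """Japanese phonemes -> international symbols -> English phonemes."""
    intl = [convert(s, ja_to_intl) for s in ja_symbols]
    return [convert(s, intl_to_en) for s in intl]

print(ja_phoneme_to_en(["r", "u"]))  # -> ['r', 'u'] via ['ɾ', 'ɯ']
```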
  • Japanese speech → Japanese phoneme sequence → Japanese keywords → English translation → English phoneme sequence → English DB phoneme sequence search
  • identifiers such as phonemes and phoneme pieces differ depending on the language, and identifiers whose notation does not necessarily match are treated in the same way.
  • the identifier evaluation functions configured in each language environment perform recognition on the same utterance, the co-occurrence state of each identifier is observed, and symbols can be converted between identifiers by learning the identifiers output as recognition results together with their output probabilities, likelihoods, distances, feature quantities, and so on, as shown in the procedures of FIGS.
  • the utterance information in English and the utterance information in Japanese are converted into feature quantities in the same manner as in the indexing or searching.
  • Japanese and English phoneme and phoneme-piece recognition are executed.
  • voice information dependent on each language is thus indexed with identifiers through a recognition process dependent on that language.
  • the index formed by the implemented identifier sequences is observed, along with the co-occurrence state of each identifier and the transitions of the output probabilities.
  • co-occurrence information for Japanese phonemes as recognized by English recognizers can thus be constructed.
  • an evaluation function such as an HMM or a Bayes discriminant function is constructed as an English phoneme recognition function for utterances in a Japanese phoneme sequence, and the internal constants of the discriminant function can be saved as files on any storage medium and reused.
  • any speech waveform can be indexed simultaneously with phonemes and phoneme pieces, or indexed with Japanese phonemes, English phonemes, and international phoneme symbols at the same time.
  • when indexing with language-dependent phonemes and phoneme pieces, the co-occurrence state may be observed and a recognition function based on an HMM or Bayes function may be constructed.
  • this method identifies the current phoneme based on the output probability from the phoneme HMM or phoneme-piece HMM, or inputs it to a conversion HMM layer, or evaluates the probability outputs of multiple Bayes functions in parallel so that the array of distance information is configured as a feature quantity; a multi-layer Bayes method can also be used. The output probability of the source phoneme HMM is input to an HMM classified by international phoneme symbols and learned; based on this learning, output probabilities are evaluated and international phoneme symbols are assigned. In this case, the co-occurrence matrix and co-occurrence probabilities are used for learning, the output probability values and features are given as sample vectors for the Bayes function, and a teacher signal may also be obtained and used as an evaluation function.
  • for example, whether a transition is made from "silence" to an "A" utterance is determined from the output probability of the phoneme HMM in the current frame, the output probability of the phoneme HMM in the next frame, and the output probability of the previous frame: a frame with a high probability of silence is labeled "Pau", a frame with an increased output probability of "A" is labeled "A", and these symbols are arranged in time series so that a symbol based on the phoneme transition, such as "Pau-A-A", is assigned.
  • the first frame and the last frame are padded with the same identifier as the frame itself, because the preceding and following frames are missing.
  • symbols may also be assigned probabilistically: in the process of transitioning from "silence" to an "A" utterance, frames where the ratio of silence is high in the output probabilities of the speech-unit HMM in the current frame and the next frame are given "Pau", and frames where the ratio of the "A" phoneme symbol is high are given "A". For example, in the second frame, "Pau-A-A" may account for 60% and "A-A-A" for 20%, with the remaining 20% omitted from the notation.
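A small sketch of the first, deterministic variant of this labeling, assigning each frame a previous-current-next symbol from per-frame HMM output probabilities (the two-symbol inventory and probability values are illustrative):

```python
# Per-frame output probabilities of two phoneme HMMs while the speaker moves
# from silence into an "A" utterance (values are illustrative).
frame_probs = [
    {"Pau": 0.9, "A": 0.1},
    {"Pau": 0.6, "A": 0.4},
    {"Pau": 0.2, "A": 0.8},
    {"Pau": 0.1, "A": 0.9},
]

def best(probs):
    return max(probs, key=probs.get)

def transition_symbols(frame_probs):
    """Label each frame with previous-current-next best phonemes, padding the
    first and last frames with their own identifier."""
    labels = [best(p) for p in frame_probs]
    symbols = []
    for i, cur in enumerate(labels):
        prev = labels[i - 1] if i > 0 else cur
        nxt = labels[i + 1] if i < len(labels) - 1 else cur
        symbols.append(f"{prev}-{cur}-{nxt}")
    return symbols

print(transition_symbols(frame_probs))
# ['Pau-Pau-Pau', 'Pau-Pau-A', 'Pau-A-A', 'A-A-A']
```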
  • character strings in any language, such as Japanese, English, French, Spanish, German, Korean, Chinese, Hindi, Arabic, Hebrew, Aramaic, Vietnamese, or Greek, may be handled by constructing a phoneme sequence or phoneme-piece sequence based on the pronunciation of the character string, by converting it into phonetic notation such as Hiragana or Katakana, or by converting it into international phonetic notation. The string may also be converted into feature quantities so that the co-occurrence state can be confirmed and phoneme conversion between languages realized, or phonemes and phoneme pieces dependent on each language may be converted using international phoneme symbols as an intermediate form by the above method.
  • the present invention mainly refers to identifiers and emotion identifiers based on phoneme symbols and phoneme-piece symbols, but as described in "Prior Art", "Problems of the Prior Art", and "Means for Solving the Problems", the convenience of the embodiments of the present invention may be improved by executing the indexing method and search requests with combinations of identification symbols obtained by applying recognition or identification techniques to other feature amounts or identifiers.
  • for video and other parts to be selected or designated by the user's instructions, MPEG-4 and similar formats may be used, and the boundary of the selection range may be specified using the image outline of an image object or coordinate information in a 3D image. Boundaries detected from silent portions of the voice or frequency deviations may also be used, and indexing may be performed by selecting a display object in the image. It is also possible to advertise tourist information using location information such as the latitude and longitude of the shooting location in the program, and to carry out advertisement and promotion according to the recognized identifiers and extracted feature quantities, or to index content in order to run advertisements and promotions.
  • identifiers and feature quantities for search results and for content information indexed using the present invention may be added as markup language tags or attributes and distributed, in order to provide related content according to the user's operation, provide advertisements, sell products, or support content operations, content editing, and content use. Annotation processing that supplements or annotates content-related information using search results is also acceptable, and if content search is performed using the co-occurrence information used in the present invention, a bot system that autonomously collects and searches information on the network can be configured.
  • a phoneme piece is a phoneme symbol decomposed on the time axis into a central part, a front part, a rear part, or a plurality of segments, or information between phonemes, such as the transition state between a first phoneme and a second phoneme; it may be phoneme information with intermediate features based on the position at which the first phoneme changes into the second. Phoneme recognition dictionaries and phoneme templates may also be switched according to the detected emotion, environmental sound, or person.
  • the identifiers used in the present invention are identifiers extracted from emotion features (including the above-mentioned phonemes and phoneme pieces), image identifiers extracted from image features, and identifiers extracted from acoustic features.
  • a bias occurs in the feature amounts used for recognizing phonemes, phoneme pieces, and various identifiers; by learning this bias for each emotion within the same phoneme and re-learning the features, recognition of the emotions accompanying any phoneme and recognition of phonemes mixed with environmental sounds can be performed simultaneously, and the recognition rate may be improved. The intra-frame co-occurrence information of the content information and the inter-frame probability transition matrix may be used to search the content information or as an evaluation function for the content information.
  • depending on the user's positive responses and actions such as recording, detection information including EPG, BML, RSS, text broadcasting, image characteristics and identifiers, voice characteristics and identifiers, and various other identifiers and feature quantities may be produced as co-occurrence information in identifiers and feature quantities other than those the user specified for search, detection, and learning of information frequently recorded and played back. Such information may be collected autonomously, or the evaluation of the collected information may be presented by voice or text image to reflect the user's subjectivity.
  • identifiers such as emotions, scales, musical instrument sounds, and environmental sounds recognized from feature quantities obtained from speech, and/or identifiers such as shape, color, characters, and actions recognized from video, as well as program information identifiers, may be categorized by multivariate analysis based on quantification analysis classes I through IV and used as new identifiers in the present invention. They can also be used as an index for search results by evaluating, in three stages from the mean and variance, whether a sample belongs within 1σ, within 2σ, or within 3σ.
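A minimal sketch of the three-stage σ-band index (assuming a scalar score with known mean and variance; the numbers are illustrative):

```python
import math

def sigma_stage(value, mean, variance):
    """Three-stage index: 1 if within 1σ of the mean, 2 if within 2σ,
    and 3 for the 2σ-to-3σ band and beyond."""
    sigma = math.sqrt(variance)
    deviation = abs(value - mean)
    if deviation <= sigma:
        return 1
    if deviation <= 2 * sigma:
        return 2
    return 3

# Illustrative: category score of a search hit against the category statistics.
print(sigma_stage(7.5, mean=5.0, variance=1.0))  # deviation of 2.5σ -> stage 3
```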
  • the features in these processes may be composed of scalars, vectors, matrices, arbitrary-order tensors, multi-dimensional arrays, complex numbers, quaternions, octonions, or other hypercomplex numbers.
  • XML: eXtensible Markup Language
  • SOA: Service Oriented Architecture
  • RDF: Resource Description Framework
  • BML: Broadcast Markup Language
  • SMIL: Synchronized Multimedia Integration Language
  • Other markup languages may also be used, such as MathML (Mathematical Markup Language), XPath (XML Path Language), SML (Simple Markup Language), MCF (Meta Contents Framework), DDML (Document Definition Markup Language), DSSSL (Document Style Semantics and Specification Language), DSML (Directory Services Markup Language), DTD (Document Type Definition), GML (Geography Markup Language), and SGML (Standard Generalized Markup Language).
  • SOAP: Simple Object Access Protocol
  • UDDI: Universal Description, Discovery, and Integration
  • WSDL: Web Services Description Language
  • SVG: Scalable Vector Graphics
  • HTML: HyperText Markup Language
  • URI: Uniform Resource Identifier
  • WAP: Wireless Application Protocol
  • XQL: XML Query Language
  • VML: Vector Markup Language
  • URL: Uniform Resource Locator
  • EPG: Electronic Program Guide
  • DLNA: Digital Living Network Alliance
  • Various protocols such as these, and information processing language constructs such as markup language variables, schemas, attributes, arbitrary tags, and functions, may be used in any combination to implement the service.
  • correction information and new information may be expressed, written, and implemented using tags, variables, attributes, and instructions that indicate correction or new information, and convenience can be achieved by combining them with the device examples described above.
  • information input from the outside is not limited to voice or video; it may come from health management measuring instruments such as pulse meters and blood pressure monitors, taste sensors, olfactory sensors, human body sensors, heat sensors, humidity sensors, temperature sensors, illuminance sensors, and other environmental instruments, as well as Raman spectrometers, ultraviolet/infrared/visible spectrophotometers, laser-ablation inductively coupled plasma mass spectrometers, qualitative and quantitative analyzers, fluorescent X-ray elemental analyzers, light-scattering laser tomography instruments, Fourier transform infrared spectrophotometers, soft X-ray transmission devices, colorimeters, Spectrolino-type spectrometers, cap detectors, thermal analysis operation systems, simultaneous differential thermal and thermogravimetric measurement devices, differential scanning calorimeters, thermomechanical analyzers, thermal dilatometers, decomposition gas analyzers, automatic thermal analysis sample changers, humidity generators, plasma graft polymerizers, ultraviolet graft polymerizers, total organic carbon analyzers, gas chromatographs, liquid chromatographs, and the like.
  • such detections can be used as criteria, variables, and attributes for executing processes, and as criteria, variables, and attributes for robots and other behavior indicators; they may also be used for detecting and predicting risks occurring in the human body.
  • the present invention may be implemented on information processing devices, including information retrieval devices, artificial intelligence, and so-called artificial non-intelligence (simple chat bots), such as robots, personal computers, car navigation systems, backbone servers, and communication base stations, or on mobile terminals such as mobile phones, watches, accessory-type terminals, remote controls, PDAs, IC cards, intelligent RFIDs, and body-embedded terminals; if an information processing device is present, the present invention can be implemented on an apparatus including any information processing apparatus or an information distribution apparatus on a line.
  • information support based on location may be implemented in association with location information obtained by a combination of GPS and a geomagnetic location detection system; a co-occurrence matrix with arbitrary identifiers, or a distance function using feature values, may also be used.
  • user preference information may be configured and analyzed based on search conditions frequently used by the user, or the user preference information may be aggregated and subjected to multivariate analysis to create new preference categories; here too, a co-occurrence matrix with arbitrary identifiers or a distance function using feature quantities may be used.
  • advertisements may be provided by any means using a co-occurrence matrix, co-occurrence probabilities, or a distance function based on search conditions combining the above-mentioned arbitrary identifiers and feature amounts. Advertisements may be targeted on the basis of preference by evaluating the similarity of preference information with that of others, may be used for compatibility fortune-telling, and may be presented not only during search but also while waiting for user instructions, such as during learning, while presenting search results, or while the user waits.
  • the phoneme and phoneme-piece information, emotion information, environmental sound information, scale information, and musical instrument information of information distributed using the identifiers described above may be associated with one another, and further associated with image recognition information, face information, color space information, object information within images, and recognized character string information; information may then be registered in the database, the database searched, each content file corrected and changed, and the generation of attached files associated with the content files managed.
  • information registration and information retrieval can be realized easily and with high accuracy.
  • by statistically clustering the registered audio information and video information to be searched, an efficient service for registering recorded information and for browsing the registered contents can also be provided.
  • evaluation functions and HMMs for generating identifiers as described above, and for analyzing identifier categories to form categories, may be configured, and these evaluation functions and their configuration information may be distributed between users.
  • by associating phoneme and phoneme-piece information, emotion information, environmental sound information, scale information, musical instrument information, and the like on the basis of the associated voice information, and also linking image recognition information, face information, color space information, object information in images, motion information, recognized character string information, and recognized symbol information to the information database, setting search conditions from the database, and providing them to other information processing devices, arbitrary information registration and information retrieval can be realized easily and with high accuracy.
  • the above-mentioned operation characteristics may be information on the movement of sound sources outside the image, reflected-wave change information such as echo sounding, feedback or torque information from motors or pressure sensors, or robot operation information and contact information.
  • a symbol string or identifier using phonemes or phoneme pieces as described above may be transmitted to another device to change the processing content of that device, or a symbol string based on phonemes or phoneme pieces may be received from another device.
  • since the general recognition rate is about 60%, evaluation functions such as co-occurrence matrices, co-occurrence probabilities, Bayes functions, HMMs, probability functions, likelihood functions, and distance functions can be constructed based on evaluations against existing identifiers that show a match rate exceeding 60%.
  • new evaluation functions and identifiers may be configured, and arbitrary symbol string matching methods such as DP, CDP, and RIFCDP may be combined; learning efficiency may also be improved by combining these with neural networks, fuzzy logic, chaos, fractal, or genetic algorithms.
  • the information processing apparatus comprises, for example, an information storage unit including a main storage unit and an auxiliary storage unit, a calculation unit that performs information evaluation processing, a communication unit that exchanges information with external devices, an input unit that receives user instructions, and an output unit that presents processing results to the user; devices that can register and retrieve information on this basis include personal computers, backbone servers, and communication base stations. In addition, it is more preferable to use an apparatus that can analyze information using a program that statistically analyzes the information recorded in the database.
  • the service using the present invention and a billing system may be linked to provide added value to the user, realizing information distribution services and agent services that take into account the user's psychology and hobbies.
  • so that the user positively receives the results presented by the robot or agent, an algorithm may be constructed that increases the number of affirmations through a reinforcement learning algorithm and the evaluation function for search; a learning model may thus be constituted in which the robot or agent has a desire to be affirmed by the user and learns autonomously.
  • co-occurrence information based on learning results with low usage frequency may be automatically deleted on the condition of user evaluation and free space, or saved in an external storage device or a communication destination's storage device while the items in the device itself are deleted; alternatively, only an index or a simplified identification function may be kept, with the full information obtained from the outside via a communication line when necessary.
  • the portable information terminal is, for example, a mobile phone, a PDA (Personal Digital Assistant), a notebook computer, a wearable computer, a wristwatch computer, or an in-vehicle computer such as a car navigation system.
  • these information processing devices and portable information terminals include, in any combination necessary for execution, a feature extraction unit, a user information input unit, an information search unit, an information storage unit, and a query information transmission/reception unit.
  • information between these processes may be exchanged and mutually searched via communication networks such as the Internet and intranets, over wireless LAN, infrared communication, mobile phone networks, ordinary LANs, wired lines, or wireless lines; when a markup language is used, a markup language transmission/reception unit and a markup language interpretation unit may be added to the information input unit and the information output unit as necessary.
  • advertisement information may be acquired via a communication line and advertisements attached to content may be presented; the advertisement status may be recorded to verify advertising effectiveness. It is also possible to analyze search co-occurrence information with a high frequency of advertisement conversion, or to present advertisements whose co-occurrence information is highly similar to the co-occurrence information obtained at the time of indexing; such functions may be provided as a service.
  • arbitrary information in the storage unit may reside in the same device, may be acquired from another device via a communication line, or may be obtained from a content search service.
  • if the database and the index search evaluation unit are external to the information processing apparatus, the search system need not be included in the information processing apparatus; it can be realized by enabling communication by any means, whether wired or wireless.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Provided is an information retrieval device that can easily retrieve arbitrary content information by making use of co-occurrence information based on various pieces of input information. A feature quantity is extracted from the visual information, audio information, and character information of content information and from sensor information, and an identifier is created from the extracted feature quantity by means of a feature function. The feature quantity and/or the identifier are stored as index information in relation to the content or the position within the content. An input retrieval condition is converted into a feature quantity and/or an identifier, and the content or the position within the content is specified by using the converted feature quantity and/or identifier to detect the degree of fitness, based on the co-occurrence information between the index information and the retrieval condition, in the neighborhood within the content information.

Description

明 細 書  Specification
情報処理装置およびプログラム  Information processing apparatus and program
技術分野  Technical field
[0001] コンテンツ情報を獲得するコンテンツ情報獲得手段と、検索条件を入力する検索条 件入力手段と、前記コンテンツ情報獲得手段により獲得されたコンテンツ情報から、 前記検索条件入力手段により入力された検索条件に適合するコンテンツ情報又は 当該コンテンツ情報内の位置を特定する特定手段と、を備えた情報処理装置等に関 する。  [0001] Content information acquisition means for acquiring content information, search condition input means for inputting search conditions, and search conditions input by the search condition input means from the content information acquired by the content information acquisition means Content information conforming to the above or a specifying means for specifying a position in the content information.
背景技術  Background art
[0002] 従来、一般的な情報処理装置を用いたコンテンツ情報検索にお 、て、コンテンツ情 報の変化を検出する方法は特許文献 1のように提案されており、特徴量として音量の 変化を用い、一定の閾値を越える個所をハイライトシーンとして捉える方法が提案さ れている。  Conventionally, a method for detecting a change in content information in a content information search using a general information processing apparatus has been proposed as in Patent Document 1, and a change in volume as a feature amount is proposed. A method has been proposed that uses a scene that exceeds a certain threshold as a highlight scene.
[0003] ここで、特徴量とは入力された音声や動画などの情報に関し時系列的変化や隣接 画素との変化や指定した範囲内での色や音響周波数等の変化や割合を数量化した 値である。変化の割合を数値に変換する方法としては、種々の方法が考えられるが、 例えば、音声であればケプストラムや FFTを用いて周波数軸の変化に基づく数値に 変換したりする方法が考えられ、映像であれば時系列的変化や隣接画素における輝 度や色相の差分値や相対値や絶対値として数値にしたりする方法が考えられ、より 詳しくは変形例に別途後述する。  [0003] Here, the feature amount is a quantification of time-series changes, changes to neighboring pixels, changes in color, acoustic frequency, etc. within a specified range for information such as input audio and video. Value. Various methods can be considered for converting the rate of change into numerical values.For example, for audio, a method of converting to a numerical value based on changes in the frequency axis using a cepstrum or FFT can be considered. Then, it is possible to consider a method of making a numerical value as a time-series change, a difference value, a relative value, or an absolute value of luminance and hue in adjacent pixels, and will be described later in detail in a modification example.
[0004] また、コンテンツ情報に対し音声による検索を実行する場合、主人公の名前のよう な固有名詞は辞書に登録されていないことも多く検出を行うこが困難であったため非 特許文献 1のように語彙に依存しない音素認識を応用した検索技術として、任意のキ 一ワードを検索する方法が提案されており、この検索技術の基本となる音素認識や 応用技術である音素片認識は特許文献 2にあるように古くからの公知技術として用い られ、音素辞書を用いて装置を制御するユーザインタフェースとして非特許文献 2の ように音素辞書と音素認識により装置制御方法を辞書に登録する方法が説明されて いる。 [0004] Also, when performing a search by voice on content information, it is difficult to detect proper names such as the main character's name because they are not registered in the dictionary. As a search technology that applies vocabulary-independent phoneme recognition, a method of searching for arbitrary key words has been proposed. As described in Non-Patent Document 2, a method for registering a device control method in a dictionary by phoneme recognition and a phoneme dictionary as a user interface for controlling the device using a phoneme dictionary is described. The Yes.
[0005] また、このような技術の応用として非特許文献 3によれば、音素認識による音素記号 列や画像認識による検索する方法が提案されており、例えば、「静止画→単語集合 →テキスト→音声→動画」として画像に関連付けられた文字列を音素列や音素片列 に変換したり、音素列や音素片列を文字列に相互に変換して連鎖的に検索したりす る方法が提案されている。  [0005] Further, according to Non-Patent Document 3 as an application of such technology, a phoneme symbol string based on phoneme recognition and a search method based on image recognition have been proposed. For example, "still image → word set → text → A method has been proposed in which a character string associated with an image is converted to a phoneme sequence or a phoneme segment sequence, or a phoneme sequence or a phoneme segment sequence is converted into a character string and linked to each other as a `` voice → video ''. ing.
[0006] また、特許文献 3によれば、音素及び Z又は音素片による記号列を地理的な位置 情報と関連付けてデータベースに登録し、市街情報に多い固有名詞を伴う情報の検 索と提供を実現する情報配信装置と受信装置が提案されており、特許文献 4によれ ば音素片認識により索引付けされた音声情報の検索が提案されており、それらの引 用文献にも関連技術が提案されている。  [0006] Further, according to Patent Document 3, a phoneme and a symbol string based on Z or phoneme pieces are registered in a database in association with geographical position information, and search and provision of information with proper nouns that are common in city information is performed. An information distribution device and a receiving device have been proposed, and according to Patent Document 4, retrieval of speech information indexed by phoneme recognition is proposed, and related techniques are also proposed in these cited documents. ing.
[0007] また、他の認識技術に関しても、声の特徴情報から感情を認識する技術が特許文 献 5に開示されており、音階や楽器の検出技術に関しては非特許文献 4による提案 がなされており、動画像や静止画像を認識し、文字列などを検出することで、検出さ れた文字列に基づいて検索を実行する方法が特許文献 6により提案されており、特 許文献 7等によりジヱスチヤ認識や動作認識と呼ばれる画像力 動作を認識する方 法が提案されており、特許文献 8によれば顔画像の認識を行う方法が提案されてい るように、近年多様な入力に対する認識技術が提案 '発明されて ヽる。  [0007] Also, with respect to other recognition techniques, a technique for recognizing emotions from voice feature information is disclosed in Patent Document 5, and a technique for detecting scales and musical instruments has been proposed in Non-Patent Document 4. A method for performing a search based on a detected character string by recognizing a moving image or a still image and detecting a character string or the like has been proposed by Patent Document 6, and Patent Document 7 or the like. A method for recognizing image power and motion, called gesture recognition and motion recognition, has been proposed. According to Patent Document 8, a method for recognizing facial images has been proposed. Proposal 'Invented.
[0008] また、文章内の単語や文字の同一文中における同時出現頻度に基づいた共起関 係を共起確率や共分散行列を用いて計測し意味を推定するための文章特徴を抽出 する方法として特許文献 9やそれらの引用文献に基づく方法が提案されているが、複 数の認識に基づく情報に関し時系列的に近い情報を組合せることで特定のシーン特 徴を抽出'学習し検索によってコンテンッゃコンテンッ内の時間軸上の位置や表示 画面上の位置や音読上の位置を特定するために用いると!、う方法は提案されて 、な い。  [0008] Further, a method for extracting a sentence feature for estimating a meaning by measuring a co-occurrence relation based on the co-occurrence probability based on the co-occurrence frequency in the same sentence of words and characters in the sentence. Patent Document 9 and methods based on those cited documents have been proposed, but specific scene features can be extracted by combining information that is based on multiple recognitions in a time-series manner. There is no method proposed to specify the position on the time axis in the content, the position on the display screen, or the position on the reading aloud!
[0009] A state in which several different pieces of information occur in positional proximity to one another is generally known as "co-occurrence", and is also referred to as a "co-occurrence relation", "co-occurrence state", or "co-occurrence information". Information occurring in the vicinity of a given piece of information can be combined and used to evaluate the conditions under which arbitrary information occurs, and covariance matrices based on co-occurrence probabilities and co-occurrence information are used, for example, for estimating the meaning of sentences. In the present invention, positional proximity should be understood as spatiotemporal proximity based on time-series position, reading-aloud position, or display position. For example, in the sentence "a person is crying", "person" and "cry" appear in the same sentence and are therefore in positional proximity, so they can be said to be in a co-occurrence relation.
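As an illustration of the paragraph above, the following minimal Python sketch counts pairwise co-occurrence of tokens that share one unit of positional proximity (here a sentence; a frame or time window would work the same way). The tokenized sentences are hypothetical stand-ins, not data from the specification.

```python
from collections import Counter
from itertools import combinations

# Each sentence is one unit of "positional proximity"; tokens that
# appear in the same sentence are counted as co-occurring.
sentences = [
    ["person", "cry"],        # "a person is crying"
    ["person", "laugh"],
    ["explosion", "scream"],
]

cooccurrence = Counter()
for tokens in sentences:
    # Count each unordered pair of distinct tokens once per sentence.
    for a, b in combinations(sorted(set(tokens)), 2):
        cooccurrence[(a, b)] += 1

# "cry" and "person" co-occur once, so they stand in a co-occurrence relation.
print(cooccurrence[("cry", "person")])  # -> 1
```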
[0010] Patent Document 10 proposes a method of indexing content information in a sensitivity-word space, and Non-Patent Document 5 proposes a method of searching video and audio by giving them a character-string index based on utterance content. However, neither proposes constructing an evaluation function for search using co-occurrence relations based on the recognition results within the content information, or on the feature quantities and identifiers used for recognition.
[0011] Nor has any method been proposed that, in situations where humans respond flexibly to circumstances (product reputation surveys at call centers, taste-based searches of content information such as moving images, patient nursing in medical settings, or the reactions of the virtual personality of a robot or agent), or in imitations of such situations, performs evaluation using information based on co-occurrence relations constructed from multiple feature quantities and identifiers (symbols that discriminate features) obtained from the environment, performs detection based on the evaluation result, and provides information and processing highly convenient to the user.
[0012] By using the present invention, it is therefore possible, in an environment such as a call center handling telephone responses as in Patent Document 11, to extend the functions of a system that assigns an operator capable of smooth communication by evaluating the compatibility between operator and customer, or to improve the method of Patent Document 12, which extracts video feature quantities frame by frame and searches by evaluating whether the feature quantities match. To analyze such information, multivariate analysis using Patent Document 13 may be performed to analyze the co-occurrence relations.
[0013] Many conventional applications and documents confuse phonemes with syllables. In the present invention, taking the Japanese pronunciation "a-ka-sa-ta-na" (あかさたな) as an example, the syllable notation is "あ/か/さ/た/な" or "a/ ka/ sa/ ta/ na"; the phoneme notation is "a/ k/ a/ s/ a/ t/ a/ n/ a" or "a/ cl/ k/ a/ s/ a/ cl/ t/ a/ n/ a"; and the phoneme-piece notation, in the bigram case, is "a/ a-k/ k/ k-a/ a/ a-s/ s/ s-a/ a/ a-t/ t/ t-a/ a/ a-n/ n/ n-a/ a" or "a/ a-cl/ cl/ cl-k/ k/ k-a/ a/ a-s/ s/ s-a/ a/ a-cl/ cl/ cl-t/ t/ t-a/ a/ a-n/ n/ n-a/ a", while a trigram example is "a-a-a/ a-cl-cl/ cl-cl-cl/ cl-cl-k/ cl-k-k/ k-k-a/ a-a-a/ a-a-s/ s-s-s/ s-a-a/ ... / t-a-a/ a-a-n/ n-n-n/ n-a-a/ a-a-a". Phoneme pieces may also be obtained by division at arbitrary positions within a phoneme, such as its first, middle, and last parts. Here /cl/ denotes the silent or unvoiced portion before the release of an unvoiced plosive, and both phonemes and phoneme pieces may be written with different notation symbols under any refinement.
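The bigram phoneme-piece notation above can be produced mechanically from a phoneme sequence. The sketch below reproduces that expansion for the "a ka sa ta na" example; it is a symbolic illustration only, since actual phoneme pieces are derived from acoustic segmentation of the waveform rather than from the symbol string.

```python
def phoneme_bigrams(phonemes):
    """Expand a phoneme sequence into bigram-style phoneme pieces:
    each steady phoneme plus each transition piece between phonemes."""
    pieces = [phonemes[0]]
    for prev, cur in zip(phonemes, phonemes[1:]):
        pieces.append(f"{prev}-{cur}")  # transition piece
        pieces.append(cur)              # steady piece
    return pieces

phonemes = ["a", "k", "a", "s", "a", "t", "a", "n", "a"]
print("/".join(phoneme_bigrams(phonemes)))
# a/a-k/k/k-a/a/a-s/s/s-a/a/a-t/t/t-a/a/a-n/n/n-a/a
```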
[0014] To explain the difference between phoneme and phoneme-piece recognition and ordinary speech recognition: unlike general speech recognition, phoneme recognition and phoneme-piece recognition do not interpret meaning or content. More specifically, because they do not use a grammar-related language model, they do not capture meaning in the recognition result, do not convert it into meaning-bearing symbols such as kanji, do not disambiguate homophones or words with the same pronunciation but different written forms, and do not discriminate parts of speech, such as noun versus verb, according to context. They analyze the uttered sound using an acoustic model for each phonetic symbol and evaluate only the match between the uttered-sound symbols and the recognition symbols.
[0015] A "phoneme" refers to the vowels and consonants that are the elements composing speech, while a "phoneme piece" is an element obtained by dividing a single phoneme more finely, for example into the beginning of "a", the middle of "a", and the end of "a", or an intermediate sound such as the sound between "a" and "i". It is a notation that reflects how phonemes change within uttered speech, and these may also be written as "phoneme identifier" and "phoneme-piece identifier".
Patent Document 1: JP 2004-233541 A
Patent Document 2: JP 62-220998 A
Patent Document 3: JP 2004-54915 A
Patent Document 4: JP 2002-221984 A
Patent Document 5: JP 2002-91482 A
Patent Document 6: JP 2002-14973 A
Patent Document 7: JP 09-330400 A
Patent Document 8: JP 05-153581 A
Patent Document 9: JP 07-36883 A
Patent Document 10: JP 2005-107718 A
Patent Document 11: JP 2004-280158 A
Patent Document 12: JP 10-320400 A
Patent Document 13: Japanese Patent Application No. 2005-147048
Non-Patent Document 1: Masayuki Nakazawa, Takashi Endo, Kiyoshi Furukawa, Jun Toyoura, Takashi Oka (Real World Computing Partnership), "Study of speech summarization and topic summarization using phoneme-piece symbol sequences of speech waveforms", IEICE Technical Report, SP96-28, pp. 61-68, June 1996.
Non-Patent Document 2: "Research and Development on a Life-Support Interface for an Aging Society", Key Project Research Report of the Aomori Prefecture Industrial Research Center, Vol. 5, Apr. 1998 - Mar. 2001, 031.
Non-Patent Document 3: Takashi Oka, Hironobu Takahashi, Takuichi Nishimura, Nobuhiro Sekimoto, Yasuhide Mori, Masanori Ihara, Hiroaki Yabe, Hiroki Hashiguchi, Hiroshi Matsumura, "An algorithm map for pattern search: what supports 'CrossMediator'", SIG Notes, Vol. 1, pp. 1-6, Japanese Society for Artificial Intelligence, 2001.
Non-Patent Document 4: Masahiro Tani, "Integration of musical instrument sound features using Bayesian networks and its application to instrument identification", 2003 IEICE General Conference, "D-14 Speech and Hearing", D-14-21, p. 188, March 2003.
Non-Patent Document 5: Katashi Nagao, "Semantic Transcoding: Toward a More Practical Semantic Web", Survey Research on Human-Centered Intelligent Information Technology, VI-3.6, Advanced Information Technology Research Institute, Japan Information Processing Development Corporation, March 2003.
Disclosure of the Invention
Problems to be Solved by the Invention
[0016] Conventional search has generally relied on methods that search using character strings or audio information associated with images and video, or that evaluate identifiers and feature quantities obtained by a single recognition method or feature extraction method. There was therefore the problem that searches based on abstract concepts that are hard to express in language, searches based on sensory concepts such as the excitement of a scene, and searches reflecting individual tastes and subjectivity were difficult.
[0017] Although Non-Patent Document 3 performs search using phoneme symbols obtained by phoneme recognition as identifiers, no method has been proposed that combines, as co-occurrence information, identifiers and feature quantities from multiple recognition methods, such as image feature quantities obtained from image or video information, image identifiers from image recognition, motion identifiers from motion recognition, emotion identifiers from emotion recognition on audio information, and phoneme identifiers from phoneme recognition, constructs a covariance matrix from that information, and constructs new evaluation functions for indexing and search.
[0018] The inventors therefore considered that, by creating evaluation functions based on the co-occurrence relations of the identifiers and feature quantities obtained as the result of such diverse recognition, and performing search and indexing with them, previously impossible abstract searches such as the excitement of a scene would become possible. Furthermore, by letting users or producers give appropriate names to any evaluation function constructed as an analysis result, and generating phoneme sequences or phoneme-piece sequences from the named character string, a highly convenient search environment could be realized in which users specify search conditions using the evaluation functions and indexes they have constructed and named, and in which constructed evaluation functions can be exchanged and distributed.
[0019] As stated above, regarding techniques for the co-occurrence of information, Patent Document 9 and the documents it cites propose methods of extracting sentence features for estimating meaning by measuring, with co-occurrence probabilities and covariance matrices, co-occurrence relations based on the frequency with which words and characters appear together in the same sentence. The present invention, by contrast, is characterized by using co-occurrence information such as co-occurrence probabilities, covariance matrices, and co-occurrence matrices built from identifiers extracted by various recognition methods and from the feature quantities used to recognize them.
[0020] When various devices are examined in light of these problems, it is, for example, difficult to search for the "signature lines" commonly spoken of in movie series; it is even more difficult to judge whether a similar line is being used as material in a comedy program, and likewise difficult to discriminate such lines and record them automatically, or to browse while skipping to just the signature lines. It is difficult to judge whether the protagonist's name is being called in tears, in anger, or in joy, and to search according to the excitement of the story. Identifying words by speech recognition from the audio stream of a moving image is difficult for current speech recognition systems; even when phonemes are recognized from the audio stream, a character's name in content such as a movie can be turned into symbols, but it is difficult to symbolize the name of the actor associated with that character; and since character names and actor names in a video stream are character-string symbols, searching can be carried out only with character-string symbols. There was also the problem that the emotional excitement of a scene in video or audio could not be searched.
[0021] This problem arises from several factors: it had been assumed that the search and detection intended by the user could easily be achieved mainly by recognizing the spoken words and image information in the content, whereas actual content information is not something a single recognition result can capture; recognition was conventionally attempted at the word level, whereas in content, sounds that do not form words, such as screams and cries, affect the excitement of a scene; simple recognition, indexing, and search cannot narrow down the search results; the emotions recognizable in the voices occurring in a scene were not taken into account; and no method had been realized that recognizes and indexes phoneme sequences based both on environmental sounds such as mechanical sounds and explosions and on utterances, and detects, based on co-occurrence information, the intervals in which they occur almost simultaneously.
[0022] In addition, recognition of phonemes and phoneme pieces differs, and interpretation is biased, from language to language, and the notation of phoneme symbol strings cannot necessarily be unified across people with different native languages. International use is therefore not sufficiently practical; versatility when providing information to arbitrary terminals is low; and the differences between international phonetic symbols and the phoneme symbols of regional languages cannot be sufficiently absorbed.
[0023] Furthermore, in a CRM system that records and analyzes dialogues with consumers, it is difficult to grasp, record, and analyze customers' evaluations of products objectively and quantitatively while recording the voice features of dialogues at a consumer consultation desk, and it is difficult for the consultation-desk operator to obtain the manual of the product in question immediately from the state of the dialogue.
[0024] In karaoke and the like, when one wants to sing a song whose title one does not know, or when searching music data, it is difficult to search by the emotional excitement of the music or video, to search through enormous numbers of music and video titles, to search for the position where a specific keyword appears, or to search from the hook of the lyrics.

[0025] Conventional text searches over EPG, BML, RSS, teletext, and the like require cumbersome input. No search had been implemented that generates, from information extracted from video and audio streams, phoneme symbols and phoneme-piece symbols, emotion identifiers for identifying emotions, instrument identifiers for identifying musical instruments, scale identifiers for identifying musical scales, speech features for identifying language, phonemes, phoneme pieces, emotions, instruments, and scales, acoustic features for identifying the reverberation of indoor sounds and the positions of sounds, and image features and image recognition results for discriminating the shapes and movements of landscapes, people, objects, animals, characters, and so on, and that combines them to target abstract concepts such as the excitement of a scene. Searches with a high degree of freedom therefore could not be performed. These combinations are described in more detail later.
[0026] According to the prior art proposed as a general countermeasure to such problems, methods have been proposed that realize mutual search by converting between phoneme sequences and character strings. However, it was not possible to perform composite searches using indexes produced by recognition under different evaluation criteria, such as evaluating and learning co-occurrence states based on image recognition results, emotion recognition results, and phoneme recognition results, or carrying out more complex searches using the learning results.
[0027] In the prior art, emotion recognition based on facial expressions exists, but no method has been proposed for associating facial-expression images from image input with the phoneme sequences or phoneme-piece sequences and emotion identifiers obtained from speech input, and classifying, performing search evaluation on, and learning from them; nor is recognition by phonemes and phoneme pieces proposed there. Consequently, no use has been made of such techniques for searching and detecting relevant scenes, with their emotions, phoneme sequences, and image features, from content such as movies and dramas, starting recording based on a detection, playing back, skip-playing disliked parts, playing announcements, delivering e-mail, or generating RSS. The prior art therefore does not solve the problems addressed by the present invention concerning search, detection, and indexing involving voice input that takes account of user emotions and the emotions expressed in content. Moreover, since the present invention is not a device that generates or controls emotions or sensibilities, its field of invention as a device also differs.
[0028] Furthermore, a system such as that of Non-Patent Document 3 can segment images uniformly, expand word character strings statistically associated with the segmented image features into phonemes and phoneme pieces so as to search based on utterances, and search for the places in a video where speech occurs. It cannot, however, statistically classify on the basis of co-occurrence states that combine the specific image-feature, emotion-feature, and voice-feature tendencies accompanying recognition, construct evaluation functions and assign identifiers to them, associate the phoneme sequences and phoneme-piece sequences produced by uttering the name that denotes an identifier's target, or construct indexing evaluation functions for searching for those identifiers.
[0029] For this reason, it is impossible to search based on a trend analysis in which an uttered phoneme sequence or phoneme-piece sequence is associated with image features, or image features are associated with emotion identifiers; and since co-occurrence information including emotion identifiers is not used, searches touching the excitement of scenes in content information, such as "an explosion scene accompanied by screams" or "a scene where the protagonist's name is shouted through tears", could not be performed.
[0030] As described above, conventional search technology has had difficulty realizing searches with a high degree of freedom that take human senses, tastes, subjectivity, and emotions into account. An information gap known as the digital divide therefore arises between people who are poor at complex input to search devices and those who are good at it, which has become a general problem in the information society.
[0031] In view of the problems described above, an object of the present invention is to provide an information search device and the like that can easily search arbitrary content information by using co-occurrence information based on various kinds of input information.
Means for Solving the Problems
[0032] To solve the above problems, an information processing device according to a first invention comprises: content information acquisition means for acquiring content information; search condition input means for inputting a search condition; and specifying means for specifying, from the content information acquired by the content information acquisition means, content information that matches the search condition input by the search condition input means, or a position within that content information. The device further comprises: feature quantity extraction means for extracting a feature quantity from the content information; identifier generation means for generating an identifier from the feature quantity extracted by the feature quantity extraction means using an evaluation function; index information storage means for storing the feature quantity and/or the identifier as index information in association with the content or a position within the content; and search condition conversion means for converting the search condition input by the search condition input means into a feature quantity and/or an identifier. The specifying means has search specifying means for specifying the content or a position within the content by detecting a match between the index information and the search condition using the feature quantity and/or identifier converted by the search condition conversion means.
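As a concrete reading of the index information of the first invention, the following sketch stores extracted feature quantities and identifiers keyed to a content ID and a position within the content, and specifies positions whose entries match all identifiers converted from a search condition. The field names, example values, and the all-identifiers matching rule are illustrative assumptions, not part of the specification.

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    content_id: str
    position: float   # e.g. seconds from the start of the content
    identifier: str   # identifier produced by an evaluation function
    feature: tuple    # raw feature quantity the identifier came from

index = [
    IndexEntry("movie01", 12.0, "phoneme:a", (0.31, 0.12)),
    IndexEntry("movie01", 12.0, "emotion:anger", (0.88,)),
    IndexEntry("movie01", 95.5, "phoneme:k", (0.27, 0.44)),
]

def search(index, wanted_identifiers):
    """Return (content_id, position) pairs whose index entries contain
    every identifier converted from the search condition."""
    hits = {}
    for e in index:
        hits.setdefault((e.content_id, e.position), set()).add(e.identifier)
    return [k for k, ids in hits.items() if wanted_identifiers <= ids]

print(search(index, {"phoneme:a", "emotion:anger"}))  # [('movie01', 12.0)]
```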
[0033] An information processing device according to a second invention comprises: content information acquisition means for acquiring content information; search condition input means for inputting a search condition; and specifying means for specifying, from the content information acquired by the content information acquisition means, content information that matches the search condition input by the search condition input means, or a position within that content information. The device further comprises: feature quantity extraction means for extracting a plurality of different feature quantities from the content information; identifier generation means for generating a plurality of different identifiers from the plurality of different feature quantities extracted by the feature quantity extraction means using evaluation functions; index information storage means for storing the plurality of different feature quantities and/or identifiers as index information in association with the content or a position within the content; and search condition conversion means for converting the search condition input by the search condition input means into a plurality of different feature quantities and/or identifiers. The specifying means has search specifying means for specifying the content or a position within the content by detecting a match between the index information and the search condition using the plurality of different feature quantities and/or identifiers converted by the search condition conversion means.
[0034] In a third invention, in the information processing device of the first or second invention, the index information storage means further stores co-occurrence information, constructed based on the feature quantities and/or identifiers acquired from the content, in association with the content or a position within the content. The device further comprises search condition co-occurrence information construction means for constructing, as search condition co-occurrence information, co-occurrence information based on the feature quantities and/or identifiers converted from the search condition by the search condition conversion means. The search specifying means has co-occurrence search specifying means for specifying the content or a position within the content by detecting a match between the search condition co-occurrence information constructed by the search condition co-occurrence information construction means and the index co-occurrence information.
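A minimal sketch of the co-occurrence matching of the third invention: both the stored index side and the search-condition side are reduced to co-occurrence vectors over a shared identifier vocabulary, and a match is detected by similarity. The cosine measure, the 0.5 threshold, and the tiny vocabulary are illustrative assumptions.

```python
import numpy as np

VOCAB = ["phoneme:a", "phoneme:k", "emotion:joy", "emotion:anger"]

def cooccurrence_vector(identifiers):
    """Binary co-occurrence vector: which identifiers occur together
    in one unit (frame, scene, or search condition)."""
    return np.array([1.0 if v in identifiers else 0.0 for v in VOCAB])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

index_cooc = {  # per-scene co-occurrence information stored as index
    ("movie01", 12.0): cooccurrence_vector({"phoneme:a", "emotion:anger"}),
    ("movie01", 95.5): cooccurrence_vector({"phoneme:k", "emotion:joy"}),
}
query = cooccurrence_vector({"phoneme:a", "emotion:anger"})

for pos, vec in index_cooc.items():
    if cosine(query, vec) > 0.5:   # illustrative threshold
        print("match:", pos)       # -> match: ('movie01', 12.0)
```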
[0035] In a fourth invention, in the information processing device of any one of the first to third inventions, the content includes character information, and the identifier generation means generates identifiers based on the character information.
[0036] In a fifth invention, the information processing device of the fourth invention further comprises dictionary information storage means for storing character information and identifiers in association with each other as dictionary information, and the identifier generation means generates identifiers from the character information included in the content using the dictionary information.
[0037] In a sixth invention, the information processing device of any one of the first to fifth inventions further comprises standard pattern dictionary information storage means for storing, in the dictionary information storage means, the identifiers and standard patterns in association with each other as standard pattern dictionary information, and further has identifier feature quantity conversion means for converting an identifier into a feature quantity based on a standard pattern by using the standard pattern dictionary information.
[0038] In a seventh invention, in the information processing device of any one of the first to sixth inventions, the index information storage means further stores the feature quantities and/or identifiers in association with the content or a position within the content based on the real time of the content information, and the specifying means is means for detecting a match between the index information and the search condition from content distributed in real time.
[0039] In an eighth invention, the information processing device of any one of the first to seventh inventions is characterized by presenting, during a search of content information and/or with respect to the search results or detection results, advertisement information associated through the co-occurrence information and/or the index information.
[0040] In a ninth invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the phoneme information used for phoneme recognition of the content, or a phoneme identifier generated from the phoneme information.
[0041] In a tenth invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the phoneme-piece information used for phoneme-piece recognition of the content, or a phoneme-piece identifier generated from the phoneme-piece information.
[0042] In an eleventh invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the emotion information used for emotion recognition of the content, or an emotion identifier generated from the emotion information.
[0043] In a twelfth invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the auditory information used for recognition based on auditory information of the content, or an identifier generated from the auditory information.
[0044] In a thirteenth invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the visual information used for recognition based on visual information of the content, or an identifier generated from the visual information.
[0045] In a fourteenth invention, in the information processing device of the second invention, the content includes character information, and at least one of the plurality of different feature quantities extracted by the feature quantity extraction means or of the identifiers generated by the identifier generation means is a feature quantity extracted from the character information or an identifier generated from the character information.
[0046] In a fifteenth invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means or of the plurality of different identifiers generated by the identifier generation means is a feature quantity extracted from program information or an identifier based on program information.
[0047] In a sixteenth invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means or of the plurality of different identifiers generated by the identifier generation means is a feature quantity extracted from sensor information or an identifier based on sensor information.
[0048] In a seventeenth invention, the information processing device of the third invention comprises evaluation function reconstruction means for reconstructing the evaluation function from co-occurrence information constructed based on the feature quantities and/or identifiers acquired from the content.
[0049] In an eighteenth invention, the information processing device of the third invention comprises evaluation function reconstruction means for reconstructing the evaluation function from co-occurrence information constructed based on the feature quantities and/or identifiers converted from the search condition by the search condition conversion means.
[0050] In a nineteenth invention, the information processing device of the third invention comprises search result co-occurrence information construction means for constructing co-occurrence information based on the results of specifying content or positions within content by the co-occurrence search specifying means, and comprises evaluation function reconstruction means for reconstructing the evaluation function from the co-occurrence information constructed by the search result co-occurrence information construction means.
[0051] A twentieth invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with each other as an index, phoneme feature quantities for use in phoneme recognition extracted from the content and/or phoneme identifiers obtained by phoneme recognition, and emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0052] A twenty-first invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with each other as an index, phoneme-piece feature quantities for use in phoneme-piece recognition extracted from the content and/or phoneme-piece identifiers obtained by phoneme-piece recognition, and emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0053] A twenty-second invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with one another as an index: phoneme feature quantities for use in phoneme recognition extracted from the content and/or phoneme identifiers obtained by phoneme recognition; emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition; and first feature quantities for use in a first recognition extracted from the content and/or first identifiers obtained by the first recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0054] A twenty-third invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with one another as an index: phoneme-piece feature quantities for use in phoneme-piece recognition extracted from the content and/or phoneme-piece identifiers obtained by phoneme-piece recognition; emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition; and first feature quantities for use in a first recognition extracted from the content and/or first identifiers obtained by the first recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0055] A twenty-fourth invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with each other as an index, phoneme feature quantities for use in phoneme recognition extracted from the content and/or phoneme identifiers obtained by phoneme recognition, and first feature quantities for use in a first recognition extracted from the content and/or first identifiers obtained by the first recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0056] A twenty-fifth invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with each other as an index, phoneme-piece feature quantities for use in phoneme-piece recognition extracted from the content and/or phoneme-piece identifiers obtained by phoneme-piece recognition, and first feature quantities for use in a first recognition extracted from the content and/or first identifiers obtained by the first recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0057] A twenty-sixth invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with each other as an index, emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition, and first feature quantities for use in a first recognition extracted from the content and/or first identifiers obtained by the first recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0058] In a twenty-seventh invention, in the information processing device of any one of the twenty-second to twenty-sixth inventions, the first identifier and/or first feature quantity is an identifier and/or feature quantity based on auditory information and/or visual information and/or character information and/or sensor information.
[0059] Regarding the present invention, the inventors considered that evaluation using co-occurrence information, which makes it possible to combine information occurring in the vicinity of a given piece of information and use it for probabilistic evaluation of the conditions under which arbitrary information occurs, could be exploited so that such co-occurrence-relation information is used for search and learning, in order to apply it to searches grounded in estimating the meaning of content information.
[0060] For example, in an action movie, a "scream", an "explosion sound", and an "explosion image appearing as a screen change with radial movement in red and yellow image information" are evaluated and interpreted as information having co-occurrence characteristics, and search and learning are carried out accordingly, so as to solve the problems described above.
[0061] More specifically, combining the various recognition methods of the prior art, phonemes and phoneme pieces, emotions, and image features are recognized frame by frame from the audio stream and video stream of a moving image, and the moving image is indexed using the identifiers obtained as recognition results. At the same time, the co-occurrence probabilities of the identifiers are constructed for each frame, the transitions of the co-occurrence probabilities are aggregated over multiple frames based on the co-occurrence matrix, the covariance matrix is obtained, and an evaluation function is constructed from the eigenvalues and eigenvectors of the covariance matrix.
[0062] By then indexing the content information using the constructed evaluation function, indexing can be carried out according to the co-occurrence information of the diverse recognition results within the content information. At this point, the evaluation functions may be reconstructed by multivariate analysis so as to increase their number arbitrarily; evaluation function names may be defined manually from the image and audio tendencies detected by the added evaluation functions so that they can be selected as conditions at search time; the evaluation functions may be reconstructed based on the user's operations on the search results; and learning by HMM may be carried out instead of relying only on eigenvalues and eigenvectors.
[0063] With an index based on evaluation functions constructed in this way, the user can obtain previously unobtainable search results based on combinations of image features, acoustic features, and detected emotions, and the capability of reconstructing the evaluation functions to match the usage situation makes it possible to search for content information that better fits the user's own subjectivity.
[0064] Then, by indexing content based on the constructed evaluation functions and detecting arbitrary tendencies in content being distributed, searches impossible with conventional index search by phonemes or phoneme pieces can be carried out, enabling search matched to a person's tastes and preferences and the recording, collection, distribution, and reuse of information. The inventors considered that this overcomes previously insurmountable problems; the information processing device thereby realizes detection of content matching the user's preferences, scene search, product reputation surveys, consideration of a driver's emotions, and medical uses through the detection of moans and emotions.
[0065] Conventional search based on co-occurrence information has been used exclusively for word search in speech recognition and for search using the co-occurrence information of word strings within documents, and it has generally been used to interpret context and identify sentences from the co-occurrence probabilities of recognized character strings across a wide variety of texts.
[0066] In the present invention, however, attention is paid to this use of co-occurrence information: by constructing evaluation functions that evaluate the co-occurrence states not of combinations of conventional word information, but of the phoneme sequences and phoneme-piece sequences contained in the audio of content information, emotion identifiers from emotion recognition, and image features and image-related identifiers, search and detection are performed based on the image features and audio features that characteristically co-occur in a given scene of a video. This is the main point of the invention.
[0067] In this way, the present invention records, classifies, and accumulates the co-occurrence states of images, phonemes, and emotions that co-occur without humans being conscious of it, and makes it possible to construct identifiers again from the recorded, classified, and accumulated information and use them for search and detection, thereby solving problems that searches by conventional simple recognition could not. Since search based on co-occurrence relations can be performed even with auditory information, visual information, or character information alone, various applications become possible by using the co-occurrence relation between utterance recognition and emotion recognition that arises in the voice information of telephone support; it can be judged applicable to medicine, customer consultation, sales, and marketing, and it can also be used as a tool for robots and for video production and editing.
[0068] More specifically, in step with the time-series changes of moving image content information, a co-occurrence matrix with 250 elements in total, consisting of 30 phonemes (vowels, consonants, and silence), 4 emotions (joy, anger, sorrow, and calm), and the 216 colors of the Web Color color space (Web Color is also called the "web-safe colors" or "browser-common colors"), is constructed for each frame to obtain co-occurrence probabilities. These are aggregated over 90 frames (3 seconds) to construct the covariance matrix of the co-occurrence probabilities, and the eigenvalues and eigenvectors of the covariance matrix are obtained to construct evaluation functions.
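A numerical sketch of the computation described in paragraphs [0061] and [0068], with the per-frame co-occurrence information simplified to a 250-element binary occurrence vector (30 phonemes, 4 emotions, 216 Web Colors). The random stand-in data, the 30 fps assumption behind the 90-frame window, and the use of only the strongest eigenvector are illustrative, not from the specification.

```python
import numpy as np

N_ELEMENTS = 30 + 4 + 216   # phonemes + emotions + Web Colors = 250
WINDOW = 90                 # 90 frames = 3 seconds, assuming 30 fps

# Stand-in for recognition output: one binary occurrence vector per frame
# (1 = that phoneme / emotion / color element was detected in the frame).
rng = np.random.default_rng(0)
frames = (rng.random((WINDOW, N_ELEMENTS)) < 0.05).astype(float)

# Covariance of the per-frame occurrence vectors over the 90-frame window.
cov = np.cov(frames, rowvar=False)       # 250 x 250

# Eigen-decomposition; eigenvectors with the largest eigenvalues define
# the evaluation functions used for indexing and search.
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
evaluation_fn = eigvecs[:, -1]           # strongest co-occurrence axis

def evaluate(frame_vector):
    """Score one frame against the learned co-occurrence axis."""
    return float(evaluation_fn @ frame_vector)

scores = np.array([evaluate(f) for f in frames])
print("frames above threshold:", np.flatnonzero(scores > scores.mean()))
```

Re-indexing then amounts to scoring every frame with each such evaluation function and recording the frames whose scores exceed a threshold, which corresponds to the frame-by-frame indexing described in the next paragraph.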
[0069] By re-indexing the content information frame by frame with the evaluation function constructed in this way, indexing based on co-occurrence states becomes possible. Multivariate analysis is performed using the evaluation function constructed in this manner, and each classified piece of information can be given a function name or identifier manually, or a character string with a high probability of co-occurring with a function obtained by the multivariate analysis can be given as the function's name or identifier, so that it can be used in accordance with instructions from the user.
[0070] Then, on the basis of the phoneme sequences or phoneme-segment sequences of natural utterances and of emotion identifiers, it becomes possible to convert them into arbitrary word character strings, to search speech directly by phoneme sequence or phoneme-segment sequence, to convert registered keywords into phoneme sequences or phoneme-segment sequences, to register phoneme sequences absent from the dictionary, to convert identifiers obtained as image-feature recognition results into phoneme sequences or phoneme-segment sequences, and to construct a dictionary based on the co-occurrence information of the video- and emotion-related identifiers associated with those phoneme sequences and phoneme-segment sequences.
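A minimal sketch of the keyword/phoneme conversions described above, assuming a toy pronunciation dictionary; the entries and the character-level fallback for unregistered words are illustrative assumptions, not the patent's method:

```python
# Illustrative pronunciation dictionary: keyword -> phoneme sequence.
phoneme_dict = {"sakura": "s a k u r a"}

def keyword_to_phonemes(keyword: str) -> str:
    """Convert a registered keyword into a phoneme sequence; if the keyword
    is unknown, register a new entry (dictionary registration in [0070])."""
    if keyword not in phoneme_dict:
        # Hypothetical grapheme-to-phoneme fallback for unregistered words.
        phoneme_dict[keyword] = " ".join(keyword)
    return phoneme_dict[keyword]

def phonemes_to_keywords(phonemes: str) -> list[str]:
    """Reverse lookup: keywords whose registered phoneme sequence matches a
    sequence recognized from a natural utterance."""
    return [k for k, p in phoneme_dict.items() if p == phonemes]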
[0071] Then, by performing indexing, search, detection, and learning with an evaluation function constructed from a co-occurrence matrix of identifiers acquired by a plurality of recognition means, search based on the co-occurrence relations between emotion-laden unspecified words and image features becomes possible, which was previously impossible. Scene searches matched to the excitement of the content, such as a scene of violent image change accompanied by screams or a dark scene accompanied by crying, which could not be found by conventional simple methods such as word detection or image-change detection, become possible, and a piece of content, a position on its time axis, a position on the display screen, or a position in read-aloud text can be specified by search. In addition, a new evaluation function for identification can be constructed from the information used for indexing through learning based on the co-occurrence information.
[0072] That is, rather than searching by mutually converting phonemes and index character strings as in the prior art, the present invention performs search and detection through mutual conversion between identifier sequences consisting of phonemes or phoneme segments and the identifiers obtained by the evaluation functions used for recognition; constructs evaluation functions from co-occurrence matrices of phoneme or phoneme-segment identifiers and identifier sequences with other identifiers, identifier sequences, and feature quantities such as emotions and video, and performs indexing, search, detection, and learning with them; and, by carrying out this indexing, search, detection, and learning automatically or recursively on the basis of user instructions, constructs new evaluation functions or updates existing ones. Indexing, search, detection, and learning that reflect the user's intention are thereby realized, solving the problems.
[0073] The identifiers used are not limited to the phonemes and color information described above. The various identifiers, such as "emotion identifiers", "musical-scale identifiers", "environmental-sound identifiers", characters obtained by image recognition, and the "person identifiers" and "object identifiers" accompanying image recognition, denote symbols discriminated from audio or video by probability, likelihood, or distance using evaluation functions or HMMs, with feature quantities suited to each purpose: "emotion identifiers" from emotion recognition, "environmental-sound identifiers" from environmental-sound recognition, "characters" from image recognition, "person identifiers", "facial-expression identifiers", and "object identifiers" from face detection and image recognition, and "motion identifiers" of moving images. Techniques such as video segmentation, still-image segmentation, and audio segmentation may be used for this recognition, and program information, text information, sensor information, and the like may be combined.
[0074] An evaluation function that takes time-series changes into account may also be constructed from the transition probabilities of the co-occurrence probabilities, of the covariance matrices, or of the co-occurrence matrices based on such identifiers and feature quantities. Feature information and identifiers whose time-series changes span multiple frames can be represented in a single matrix space, and an evaluation function accounting for transition probabilities can be constructed by obtaining the eigenvalues and eigenvectors of that matrix space; alternatively, the same evaluation function can be applied to co-occurrence information based on frames at different points in the time series and the evaluation results used after multivariate analysis, so the problems can be solved more efficiently.
[0075] Explaining the solutions to the aforementioned related problems according to this problem-solving approach, with varying combinations of the necessary recognition means: names of entertainers, products, and the like in television programs often involve distinctive proper nouns, so conversion efficiency and accuracy are poor when voice input is converted into words, and character input by keys is difficult on portable information terminals. In view of these situations, rather than performing word-level speech recognition, which is prone to misrecognition, symbol sequences closer to the speech waveform, such as speech features and phoneme features, that is, "phoneme sequences" and "phoneme-segment sequences", are used as the speech information for detecting specific proper nouns. Combining this with the detection of emotional features in the content using the "emotion identifiers" of emotion recognition, and using the co-occurrence information of identifiers and feature quantities based on the related video and audio, realizes efficient information search.
[0076] Furthermore, names can be given to the identifiers and feature quantities obtained from emotional changes (the excitement of a scene in a movie, the audience's reception in comedy, the emotional ups and downs of customers at a consumer consultation desk), from environmental sounds such as explosions and wind, from the prosody of the music being played, from the features and change-features of synchronously displayed images, and from the character strings obtained as recognition results of image features, none of which could be handled by conventional index search using phonemes or phoneme segments. By searching for such a name with a phoneme sequence or phoneme-segment sequence, search, recording, collection, distribution, and reuse of information matched to people's tastes and preferences become possible, overcoming previously insoluble problems.
[0077] In the present invention, content information is automatically indexed with identifiers and feature quantities for emotions, images, and the like together with phonemes and phoneme segments, and searches are performed with combinations of those identifiers. For a comedy routine, for example, it becomes possible to detect places where a feature quantity identifiable as "laughter" appears among the surrounding feature quantities and where the phoneme or phoneme-segment sequence of a specific line appears. This provides a search device that conventional video search systems cannot realize, and realizes an information processing device that can automatically record programs having such characteristic tendencies and deliver e-mail upon detection. A "laughing state" identifier or discriminant function may also be constructed by performing face detection and facial feature extraction simultaneously with the laughter emotion identifier and learning the co-occurrence information of the identifiers and feature quantities.
[0078] Further, by providing means for constantly extracting features from the voices of consumers and operators at a consumer consultation desk, performing phoneme recognition, and identifying the product from the recognized phonemes, together with means for recording the detected emotion along with the identified product name, the users' emotional evaluations of a specific product can be recorded and used for analyzing product quality, or the manual of the relevant product can be displayed on the operator's terminal screen in response to the utterance of the identified product name, thereby solving the problems.
[0079] Further, by combining scale features, phoneme features, and emotion features, music can be searched using the scale of the "hook" recognized from the song's vocals or from the user's own singing, the phoneme sequence of the lyrics, and emotion identifiers; an input character string can be expanded into a phoneme symbol sequence; and songs with high similarity can be found by comparing scale transition states and the appearance frequencies of emotional features. This makes possible music searches matched to one's tastes that did not previously exist, solving the problems.
[0080] Further, the user's utterance is converted into a phoneme sequence, actor names appearing in EPG, BML, RSS, or teletext are likewise converted into phoneme sequences, an actor-name phoneme sequence matching the user's uttered phoneme sequence is searched for, and the role (cast) name associated with the actor name of the matched phoneme sequence is detected. Here, the phoneme sequence may also be obtained by expanding a word or keyword entered as text.

[0081] Then, while phoneme recognition is performed on the audio synchronized with the distributed moving image, a phoneme-sequence index is constructed, and locations matching the phoneme sequence of the role name derived from the actor name detected in the EPG, BML, RSS, or teletext are searched for. At this time, the emotional features contained in the audio signal accompanying the role name, and the program genre, may also be evaluated.
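The matching in paragraphs [0080] and [0081] could be sketched as follows; the dictionaries and the exact sub-sequence match are simplifying assumptions (a practical system would convert names to phonemes with a grapheme-to-phoneme module and match approximately against the recognition output):

```python
def contains(seq: list[str], sub: list[str]) -> bool:
    """True if the phoneme sequence `sub` occurs contiguously in `seq`."""
    return any(seq[i:i + len(sub)] == sub
               for i in range(len(seq) - len(sub) + 1))

# Illustrative data derived from EPG/BML/RSS/teletext: actor-name phoneme
# sequences and the role (cast) name associated with each actor.
ACTOR_PHONEMES = {"yamada tarou": "y a m a d a t a r o u".split()}
ACTOR_ROLES = {"yamada tarou": "kachou"}

def role_from_utterance(uttered: list[str]) -> str | None:
    """[0080]: find the actor name matching the user's uttered phoneme
    sequence and return the associated role name."""
    for actor, phonemes in ACTOR_PHONEMES.items():
        if contains(uttered, phonemes):
            return ACTOR_ROLES[actor]
    return None

def search_phoneme_index(index: list[tuple[float, str]],
                         role: list[str]) -> list[float]:
    """[0081]: scan a (time, phoneme) index built from the broadcast audio
    and return the times at which the role-name phoneme sequence occurs."""
    times = [t for t, _ in index]
    symbols = [p for _, p in index]
    n = len(role)
    return [times[i] for i in range(len(symbols) - n + 1)
            if symbols[i:i + n] == role]
```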
[0082] After this processing, recording is started upon detecting that the phoneme sequence based on the role name coincides with the user-specified emotional feature, playback proceeds while skipping to only the target ranges, or a ranking based on the degree of match is compiled into a list and output as a search result prompting the user's operation, realizing a highly convenient search and solving the problems.
[0083] The problems may also be solved by classifying, with multivariate analysis methods, the symbol sequences of phonemes or phoneme segments recognized from feature quantities obtained from audio, identifiers such as emotions, musical scales, instrument sounds, and environmental sounds, and/or identifiers such as shapes, colors, characters, and motions recognized from feature quantities obtained from video, and using the results as new identifiers in the present invention.
[0084] Further, the feature quantities of information that the user frequently records or skip-plays may be learned, and upon detection of the learned feature quantities, recording may be started automatically, skip playback may be started, or arbitrary processing such as delivering e-mail or RSS upon detection may be carried out to solve the problems.
[0085] On this basis, the present invention is characterized in that indexing and search are performed by combining not only the identifiers conventionally associated with speech, but also the identifiers and feature quantities extracted from video and audio (emotion identifiers, environmental-sound identifiers, and instrument identifiers recognized from audio, together with video identifiers, motion identifiers, and shape identifiers) to obtain search results; that the co-occurrence states of the identifiers and feature quantities in those processes are learned; that information on phonemes, phoneme segments, emotion identifiers, and the other identifiers described in this embodiment is distributed; and that search and detection are performed on the basis of the distributed information.
[0086] Also, unlike systems such as those of Non-Patent Documents 12 and 13, this system does not syntactically parse text information but evaluates only co-occurrence states based on simple word appearance frequencies, and therefore does not discriminate parts of speech. Even when co-occurrence information between words is used, the search employs co-occurrence states not at the semantic level of, say, kanji, but at the level of phonetic symbols expanded into phonemes or phoneme segments; syntactic analysis may be applied to the search results obtained.
[0087] Thus, unlike conventional search techniques, which merely searched for the character strings associated with the audio and images expressed in content information while converting between them, the present invention combines the various feature-extraction and recognition techniques described in the prior art to perform indexing that combines symbols, identifiers, characters, and the like based on the recognition of multiple audio, video, image, and text feature quantities; performs search and detection based on their co-occurrence information, together with arbitrary processing accompanying detection; and performs learning of identifiers and reconstruction of feature quantities using the co-occurrence states of the various identifiers according to the user's selection results and reuse status. This makes possible search processing over complex, highly individual expressions of content information that take human subjectivity and emotion into account, which was previously impossible; enables abstract searches related to words, including the adjectives and adverbs contained in utterances and character strings; and solves the problems by reducing the complexity of using information processing devices that underlies the digital divide.
Effects of the Invention
[0088] In this way, arbitrary identifiers and feature quantities that were difficult to handle in the prior art (emotions including proper nouns, environmental sounds, image features, and motion features) are recorded, learned, and discriminated through association based on co-occurrence states; the co-occurrence states are learned by associating phoneme sequences, phoneme-segment sequences, and emotion identifiers with those identifiers; each identifier or feature quantity can be specified in search conditions by phonemes, phoneme segments, or character strings; and indexing is applied to the content information. This not only realizes search, recording, distribution, and reception of information under complex subjective conditions, but also accommodates international differences in pronunciation and provides users with simple means of searching for and obtaining information through HDD recorders, personal computers, portable terminals, car navigation systems, robots, and the like. Information distribution devices, information terminals, and information processing devices that improve the convenience of information procurement in daily life are thus realized, reducing the problems associated with the digital divide.
[0089] Further, by indexing content information that expresses hard-to-verbalize adjectives and adverbs on the basis of the feature quantities of various kinds of recognition and/or the co-occurrence information of identifiers, meta co-occurrence retrieval (Meta-occur Retrieval), or abstract co-occurrence retrieval (Abstracts Co-occur Retrieval), is realized. By constructing annotation information through the extraction of ontologies and semantics based on the co-occurrence information of identifiers from various kinds of recognition (the images and video, the audio and acoustics, the phoneme sequences and phoneme-segment sequences, and the emotions of the content information), grounding of content information based on multidimensional identifiers centered on phoneme sequences, phoneme-segment sequences, and emotions is achieved, and by reusing these, knowledge sharing about information search methods can be realized.
Brief Description of Drawings
[Fig. 1] A diagram showing a basic configuration example of the device in this embodiment.
[Fig. 2] A diagram showing the basic indexing procedure.
[Fig. 3] A diagram showing the operation of identifier generation by feature-quantity-to-identifier conversion.
[Fig. 4] A diagram showing a configuration example of video index data.
[Fig. 5] A diagram showing a configuration example of video index data in the unit-time designation method.
[Fig. 6] A diagram showing the operation of index co-occurrence state learning.
[Fig. 7] A diagram showing the procedure of an example of learning from an index.
[Fig. 8] A diagram showing an example of a co-occurrence matrix of emotions, phonemes, and video.
[Fig. 9] A diagram showing an example of a covariance matrix of emotions, phonemes, and video.
[Fig. 10] A diagram showing the basic search procedure.
[Fig. 11] A diagram showing the operation of the identifier-to-feature-quantity conversion unit.
[Fig. 12] A diagram showing an example of learning from basic search conditions.
[Fig. 13] A diagram showing the operation of the basic detection procedure.
[Fig. 14] A diagram showing a configuration example of the index information generation device.
[Fig. 15] A diagram showing a configuration example of the search device.
[Fig. 16] A diagram showing the operation of the indexing method.
[Fig. 17] A diagram showing the operation of the search method.
[Fig. 18] A diagram showing the operation procedure of a basic character-string search request and its execution method.
[Fig. 19] A diagram showing an example of search processing.
[Fig. 20] A diagram showing an example of the usage environment in this embodiment.
[Fig. 21] A diagram showing an example of the processing procedure on the transmitting side.
[Fig. 22] A diagram showing an example of the processing procedure on the receiving side.
[Fig. 23] A diagram showing the state transitions of search processing.
[Fig. 24] A diagram showing an example of the configuration of the control dictionary.
[Fig. 25] A diagram showing an example of a basic procedure for acquiring external information.
[Fig. 26] A diagram showing an example of a search and arbitrary-processing method using EPG information.
[Fig. 27] A diagram showing the state transitions in a product-reliability survey application based on consumer emotion.
[Fig. 28] A diagram showing an example of a search procedure for language phoneme symbols.
[Fig. 29] A diagram showing an example of a phoneme-symbol search procedure for language-specific character strings.
[Fig. 30] A diagram showing a configuration example of a symbol conversion function.
[Fig. 31] A diagram showing an example of a conversion procedure for international phoneme symbols.
[Fig. 32] A diagram showing an example of a Japanese-phoneme-to-international-phoneme-symbol conversion dictionary.
[Fig. 33] A diagram showing an example of conversion from international phonemes to Japanese phonemes.
[Fig. 34] A diagram showing an example of conversion from phonemes to phoneme segments.
[Fig. 35] A diagram showing an example of conversion from phoneme segments to phonemes.
[Fig. 36] A diagram showing an example of a search procedure for international phoneme symbols.
[Fig. 37] A diagram showing an example of a search procedure for international phoneme symbols.
[Fig. 38] A diagram showing an example of a search procedure for international phoneme symbols.
Explanation of Reference Numerals
10 Information processing unit
102 Index search evaluation unit
104 Co-occurrence information learning unit
106 Dictionary extraction unit
108 Index information generation unit
110 Index symbol string synthesis unit
112 Control unit
114 Meta-symbol extraction unit
116 Feature quantity extraction unit
118 Identifier-to-feature-quantity conversion unit
120 Feature-quantity-to-identifier conversion unit
122 Evaluation list output unit
20 Storage unit
22 Information recording and accumulation unit
202 Content information storage unit
204 Evaluation function storage unit
206 Index information storage unit
208 Feature quantity storage unit
210 Program storage unit
212 Co-occurrence learning storage unit
214 Dictionary information storage unit
216 Advertisement information storage unit
30 Information input unit
40 Information output unit
50 Communication line unit
BEST MODE FOR CARRYING OUT THE INVENTION
[0092] Next, an example of an information search device to which the present invention is applied will be described.
[Configuration]
First, a specific configuration example of the device according to the present invention is described. As in the basic configuration example of Fig. 1, the device comprises an information processing unit 10, a storage unit 20, an information input unit 30, an information output unit 40, and a communication line unit 50. The device may incorporate a display device such as a television or monitor, or use an external one.
[0093] The communication line unit 50 communicates with other information processing devices, whether by wire or wirelessly, and is configured to allow mutual communication and control with them. For example, devices embodying the present invention may search, browse, and provide information to one another via communication lines.
[0094] The communication line unit 50 has the function of acquiring and distributing arbitrary information. More specifically, it is configured by combining, as needed, facilities such as Ethernet (registered trademark), ATM (Asynchronous Transfer Mode), Fibre Channel, wireless LAN, and infrared communication, and arbitrary communication protocols such as IP, TCP, UDP, and the IEEE 802 family can be used.

[0095] The information input unit 30 is configured by devices enabling the input of information, such as a keyboard, a pointing device, a moving-image capture device, a television-broadcast-related information receiving circuit, and a microphone input, and may have the functions of storing quantized information in the storage unit according to instructions from the information processing unit and of outputting information to the information output unit based on the processing and instructions of the information processing unit.
[0096] The information input unit 30 may further include, combined as necessary, other input devices such as a motion capture device, camera, RFID reader, barcode reader, image scanner, switch panel, OCR, card reader, and the sensors described later, or terminals for connecting such input devices.
[0097] The information output unit 40 is configured by devices enabling the output of information, such as an image display device and a speaker output, and may have the functions of storing quantized information in the storage unit and reproducing it according to instructions from the information processing unit, and of outputting information as processed or instructed by the information processing unit.
[0098] The information output unit 40 may also include, combined as necessary, other output devices such as a printer, an arbitrary drive machine, a shaping device, or a milling machine, or terminals for connecting such output devices; a poster may be printed by outputting information based on search results, or a resin product may be produced as shaped output.
[0099] The information processing unit 10 is configured by arithmetic circuits based on electronic circuits such as a CPU, and processes the information acquired from the information input unit 30 and the storage unit 20. The processed results are stored in the storage unit 20, reproduced, or processed and output to the information output unit 40 or the storage unit 20; transmission and reception for exchanging information with other information processing devices are performed via the communication line unit 50, and information is received and distributed. As shown in Fig. 1, the information processing unit 10 may also be configured by program module code realizing the various processes necessary for search, or by dedicated electronic circuits for executing them.
[0100] The information processing unit 10 is typically configured by a combination of a DSP, reconfigurable processor, FPGA, ASIC, or the like, and the storage unit 20 is known to be configured by RAM, ROM, flash memory, hard disks, optical disks, removable disks, and the like.

[0101] The information processing unit 10 comprises: an index search evaluation unit 102 that searches by evaluating the degree of match between index information and search conditions consisting of feature quantities and index identifiers; a co-occurrence information learning unit 104 that learns the co-occurrence information obtained from feature quantities, search conditions, and search results; a dictionary extraction unit 106 that extracts the information for a target conversion from the dictionary information storage unit; an index information generation unit 108 that determines identifiers from extracted feature quantities by recognition processing and performs indexing; an index symbol string synthesis unit 110 that synthesizes index information with content information; a control unit 112 that controls each functional unit; a meta-symbol extraction unit 114 that extracts the commands, variables, and attributes in arbitrary symbol information after acquiring index information such as MPEG-7 from content information, acquiring markup-language information such as RSS or XML from the communication line unit, or acquiring EPG information from broadcast waves received by the information input unit; a feature quantity extraction unit 116 that extracts feature quantities from content information processable by the device, such as natural information obtained from outside via the information input unit and the video, images, and audio acquired from the communication line unit or the storage unit; an identifier-to-feature-quantity conversion unit 118 that converts identifiers (those recognized by the user, those acquired from outside via storage media or communication, and those extracted internally from content) into the standard feature quantities of those identifiers; a feature-quantity-to-identifier conversion unit 120 that converts feature quantities acquired from content information or user input into identifiers; and an evaluation list output unit 122 that outputs search results as an evaluation list. Search, detection, and indexing are performed by combining these as needed.
[0102] Content information may include music as audio information, meta information attached to the content, documents as text information and EPG or BML as program information, musical scales as score information, general still images and moving images, polygon data, vector data, texture data, and motion data as three-dimensional information, still images and moving images of visualized numerical data, and content information intended for promotion and advertising. It is composed of visual information, auditory information, text information, and sensor information, and a position within it may be a point in a time series, coordinate information on a display, a read-aloud position within a text, a recording order or identification-number order of figures and tables, or spatio-temporal coordinates based on positions or coordinates calculated from visual or auditory information; co-occurrence information may be constructed from the neighborhood of such a position.

[0103] The storage unit 20 includes an information recording and accumulation unit 22 for accumulating and recording each kind of information under the control of the information processing unit 10. The information recording and accumulation unit 22 may be configured with semiconductor storage devices such as RAM or flash memory, with external hard disks, optical disks, or magnetic disks via an arbitrary interface, or with exchangeable storage media.
[0104] As shown in Fig. 1, the storage unit 20 secures areas for: a content information storage unit 202 that stores the moving images, still images, audio, and documents to be searched; an evaluation function storage unit 204 that stores recognition templates such as HMMs, Bayesian discriminant functions, and arbitrary distance functions as the evaluation functions associated with identifiers; an index information storage unit 206 that stores the identifiers and arbitrary symbol strings serving as indexes for searching content information; a feature quantity storage unit 208 that stores the feature quantity information extracted from content information; a program storage unit 210 that stores the code and parameters of the program modules realizing the various processes necessary for search; a co-occurrence learning storage unit 212 that stores HMMs and evaluation functions such as the recognition templates of identifiers learned by the co-occurrence information learning unit and the recognition templates of identifiers relearned using the present invention; a dictionary information storage unit 214 that stores dictionary information constituting conversion tables for mutually converting an arbitrary identifier or feature quantity and another arbitrary identifier or feature quantity; and an advertisement information storage unit 216 that stores advertisement information to be presented, for instance during content information search, according to instructions from the information processing unit.
[0105] Target content information is described in more detail in 'Examples of content information'; the feature quantities and identifiers used, in 'Examples of feature quantities and identifiers'; and the dictionaries used for the mutual conversion of identifiers and feature quantities, in 'Examples of dictionary configuration'. To use the information processing device 1 as a search device, the following are generally necessary: a step of inputting content information into the device and performing indexing; a step of constructing, from the user's input, the query identifier sequence (query) used for search; a step of narrowing down results by consulting the index according to the query identifier sequence (query); and a step of outputting a list of results based on the search. The functions these require are detailed in 'Basic indexing processing example' and 'Basic search processing example', and the procedure for learning the co-occurrence states of the index information used for such searches is detailed in 'Example of co-occurrence state learning processing procedure'.

[0106] A server-client model may also be introduced: arbitrary processing units and storage units may be divided between a server and clients connected by communication, and equivalent services, infrastructure, search, indexing, detection, and arbitrary processing accompanying detection may be provided by exchanging information between server and client, as detailed in 'Procedure examples of information processing devices used in terminals and base stations'.
[0107] Although parts of this embodiment are implemented in hardware, it is well known that the same effects can be obtained with software; programs performing the same processing as each processing unit may be executed by the CPU or DSP of the information processing unit, or the functions and devices may be divided into arbitrary parts and implemented by linking a plurality of information processing devices through communication.
[0108] [Operation Examples]
«Basic indexing processing example»
First, the basic operation (processing procedure) of the indexing means is outlined along the operation flow of Fig. 2. When various information, such as natural information (video or audio) based on content information, text information entered by the user, text information extracted from the index information or meta information related to the content information, and program information or sensor information received from outside, is input from the information input unit 30 or acquired from the communication line unit 50 or the content storage unit 202 (step S0201), feature quantity extraction processing (S0202) is executed by the feature quantity extraction unit 116 to extract the feature quantities of the natural information and text information based on the input visual information, auditory information, and sensor information.
[0109] Here, natural information means auditory information, visual information, and sensor information. It is acquired as content information or advertisement information through external devices connected to the information input unit 30, through external information distribution devices via the communication line unit 50, or through exchangeable external storage media, and is also provided as the content information stored in the content information storage unit 202 and the advertisement information stored in the advertisement information storage unit 216.
[0110] The feature quantity extraction processing (step S0202) extracts feature quantities from the input natural information: for example, when audio is input, processing such as an FFT is applied, and for an image, feature quantities are extracted by quantizing the color space within the image. Since feature extraction can take many forms, as noted separately, the method may depend on the implementation, as described later.
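As one concrete reading of step S0202, the sketch below extracts a magnitude spectrum from an audio frame via FFT and a 216-bin color histogram by quantizing each RGB channel to six levels, matching the web-safe color space mentioned elsewhere in this description; the window and frame sizes are assumptions:

```python
import numpy as np

def audio_features(samples: np.ndarray) -> np.ndarray:
    """Magnitude spectrum of one audio frame (e.g. 512 samples) via FFT."""
    windowed = samples * np.hanning(len(samples))
    return np.abs(np.fft.rfft(windowed))

def color_features(image: np.ndarray) -> np.ndarray:
    """Normalized histogram over an RGB color space quantized to 6 levels
    per channel, i.e. the 216 bins of the web-safe color space."""
    q = np.clip(image.astype(int) // 43, 0, 5)           # 0..255 -> 0..5
    bins = q[..., 0] * 36 + q[..., 1] * 6 + q[..., 2]    # 6*6*6 = 216 bins
    hist = np.bincount(bins.ravel(), minlength=216)
    return hist / hist.sum()
```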
[0111] Next, the feature-quantity-to-identifier conversion unit 120 executes identifier generation processing by feature-quantity-to-identifier conversion (step S0203): the extracted feature quantities are supplied to a plurality of evaluation functions so as to evaluate particular identifiers among the identifiers of the same field, and the identifier with the highest similarity within that field is selected. The feature-quantity-to-identifier conversion used for identifier generation is described later with reference to Fig. 3.
[0112] The identifier generation processing (step S0203) may also be executed without evaluation functions, by directly using as identifiers the character strings of the meta information attached to the content or the text information of program information such as BML or EPG, or by converting character strings into IDs with the dictionary function formed by the dictionary information storage unit 214 and the dictionary extraction unit 106.
[0113] As for identifiers of the same field: taking phoneme recognition as an example, the same field of phoneme identifiers includes vowels, consonants, and silence; more specifically, vowels can be classified into identifiers such as "a/i/u/e/o", and roughly 30 kinds of phoneme identifiers are generally known for Japanese.
[0114] Within a single field there may be several thousand identifiers in the case of phoneme segments; for character recognition there are identifiers for each character and for each character component; for face recognition there are as many identifiers as registered persons; and for musical instruments, environmental sounds, figures, and motions there are as many identifiers as are registered in the dictionary information.
[0115] As described above, in order to recognize several different kinds of information such as phonemes, phoneme segments, characters, images, faces, musical instruments, environmental sounds, figures, and motions, these identifiers are classified by the field to be recognized, with several different feature quantities extracted according to the purpose.
[0116] Based on the identifiers converted by the feature-quantity-to-identifier conversion unit 120 in this way, the index information generation unit 108 executes indexing processing, which indexes the content information in time series to generate an index (step S0204). The indexing processing may associate and record not only the identifiers and feature quantities obtainable from audio and video, but also the aforementioned text information entered by the user, the index information and meta information related to the content information, the extracted text information, the program information and sensor information received from outside, other content information, advertisement information, and the like.

[0117] Based on the generated index, a record is then made in a database (step S0205a), an MPEG file is modified (step S0205b), or index information is recorded (step S0205c).
[0118] Next, the feature-quantity-to-identifier conversion executed by the feature-quantity-to-identifier conversion unit 120 is described with reference to Fig. 3. First, when an extracted feature quantity is input (step S0301), evaluation function processing is executed (step S0302). Evaluation function processing evaluates the likelihood of the input feature quantity with an evaluation function such as a distance function. It is then determined whether all target evaluation functions have been evaluated for the feature quantity (step S0303); if evaluation functions remain, evaluation function processing is executed for the remaining ones (step S0303; No → step S0302).
[0119] When evaluation with all the target evaluation functions has finished (step S0303; Yes), the identifier with the highest likelihood among the evaluation results is selected (step S0304). By then executing the symbol identifier output step (step S0305), which outputs the selected identifier, the optimal identifier can be obtained as the evaluation result of the plurality of evaluation functions.
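Steps S0302 to S0305 amount to scoring the feature quantity with every evaluation function of one field and emitting the identifier with the highest likelihood. A minimal sketch, assuming each evaluation function is a callable returning a likelihood-like score (here, negative distance to a stored template):

```python
from typing import Callable, Mapping
import numpy as np

def to_identifier(feature: np.ndarray,
                  evaluators: Mapping[str, Callable[[np.ndarray], float]]) -> str:
    """Evaluate all evaluation functions of one field (e.g. the ~30 Japanese
    phoneme identifiers) and select the most likely identifier
    (steps S0302-S0304); returning it corresponds to step S0305."""
    scores = {ident: f(feature) for ident, f in evaluators.items()}
    return max(scores, key=scores.get)

# Usage sketch with distance-based evaluators built from stored templates:
templates = {"a": np.ones(8), "i": np.zeros(8)}          # illustrative templates
evaluators = {ident: (lambda x, t=tmpl: -float(np.linalg.norm(x - t)))
              for ident, tmpl in templates.items()}
print(to_identifier(np.full(8, 0.9), evaluators))        # -> "a"
```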
[0120] An index based on identifier information recorded in association through the audio and video identifier recognition processing can be recorded, for example, by setting an appropriate unit time and recording an identifier for each unit time, or by grouping identifiers to some extent and storing the occurrence time and disappearance time of each identifier. As in Fig. 4, utterance phonemes and image identifiers and feature quantities can be recorded in association with the time axis and scene names of the content information; as in Fig. 5, whistle sounds and explosion sounds recognized as environmental sounds occurring within a scene, as well as utterance phonemes, can be indexed following the changes in the video, and feature quantities can be indexed as well. Further, by using the aforementioned text information entered by the user, the index information and meta information related to the content information, the extracted text information, and the program information and sensor information received from outside, a search index for specifying positions within the content information can be constructed.
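The two recording schemes mentioned above (one identifier per unit time, or one record per identifier holding its occurrence and disappearance times) could be represented as follows; the field names are illustrative, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    kind: str         # identification type: "phoneme", "emotion", "color", ...
    identifier: str   # e.g. phoneme symbol "a", emotion "joy", "explosion"
    start: float      # occurrence time within the content (seconds)
    end: float        # disappearance time within the content (seconds)

def merge_unit_times(kind: str, per_frame: list[str],
                     frame_sec: float = 1 / 30) -> list[IndexEntry]:
    """Collapse per-unit-time identifiers into (identifier, start, end)
    records, the second recording scheme described above."""
    entries, run_start = [], 0
    for i in range(1, len(per_frame) + 1):
        if i == len(per_frame) or per_frame[i] != per_frame[run_start]:
            entries.append(IndexEntry(kind, per_frame[run_start],
                                      run_start * frame_sec, i * frame_sec))
            run_start = i
    return entries

# e.g. merge_unit_times("phoneme", ["a", "a", "k", "k", "k"]) yields an
# entry for "a" covering frames 0-1 and an entry for "k" covering frames 2-4.
```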
[0121] More specifically, the feature quantities extracted in the feature quantity extraction processing (step S0202) and the identifiers generated from them in the identifier generation processing (step S0203) are acquired for both video and audio, and the identifiers obtained by the indexing of the indexing processing (step S0204) are stored in the recording steps (S0205a, S0205b, S0205c). In the example of Fig. 4, the phoneme symbols and phoneme recognition feature quantities are recorded in the row whose identification-type item is "phoneme", in association with the time axis information of the content information; in the example of Fig. 5, phoneme identifiers are recorded under the phoneme symbols and the feature quantities for phoneme recognition under the audio feature quantities, likewise associated with the time axis information. In this way, "index co-occurrence information" is recorded as index information of positional neighborhoods within the content information, based on the plural identifiers and feature quantities accompanying the recognition and feature extraction of the auditory, visual, and emotion information of the content.
[0122] At this time, the identifiers recognized for video and emotion, the feature quantities used for their recognition, the aforementioned text information entered by the user, the text information extracted from the index information and meta information related to the content information, and the program information and sensor information received from outside can also be recorded in association with the time axis information of the content information, so that they can be used as index information for search; and by combining the aforementioned phonemes with the emotion, visual, and auditory information, the "index co-occurrence information" used in the learning described later can be generated and recorded.
[0123] These pieces of index information can be realized by writing them into a file as text character strings; in the MPEG modification (step S0205b), the index symbol synthesis unit 110 may synthesize the index information into the meta information description area extracted from the MPEG file by the meta-symbol extraction unit 114. The index information need not be character string information: it may be a hash ID formed from a character string, or a numeric ID in one-to-one correspondence with it, such as an ASCII code converted from the character string.
[0124] «Example of co-occurrence state learning processing procedure»
Next, the processing procedure for indexing by learning co-occurrence states is described with reference to Fig. 6. In the co-occurrence state learning procedure, the co-occurrence states of the identifiers recorded in association are learned according to Fig. 6, co-occurrence information is constructed as in Fig. 8 and Fig. 9 described later, and an evaluation function based on the co-occurrence information is created and used to index the content information. Although the number of frames aggregated to construct this co-occurrence matrix is fixed in this embodiment, it may take any value; the unit time of aggregation may be chosen freely with reference to the vicinities of 12 Hz, 24 Hz, 60 Hz (16 ms), and 110 Hz (9 ms), which are said to affect human perception, or to synchronization signals such as those of television.

[0125] Fig. 6 shows the basic processing procedure of the index co-occurrence state learning processing. By extracting, over positional neighborhoods, the index information applied to the content information by the plural recognition methods of the aforementioned indexing, "index co-occurrence information" can be acquired as index information of positional neighborhoods constructed from the plural identifiers and feature quantities accompanying the recognition and feature extraction of the auditory, visual, and emotion information of the content.
[0126] More specifically, an index of auditory information is extracted using the phoneme identifiers, consisting of phoneme symbols, that the indexing means recorded frame by frame (step S0601). Next, an index of visual information is extracted by deriving color identifiers from the feature quantities of the image data of the same frame as the detected phoneme (step S0602). Further, an index of emotion information is extracted from the emotion identifiers of the emotion recognition of the same frame (step S0603). Then, from each piece of extracted index information, the per-frame co-occurrence matrix (Fig. 8) constituting the co-occurrence information is constructed (step S0604). This yields the "index co-occurrence information" as index information of positional neighborhoods composed of plural identifiers and feature quantities. The width of the extracted frames may be specified arbitrarily; index co-occurrence information may be constructed around the boundary values of roughly 14 Hz, 27 Hz, 55 Hz, and 110 Hz, at which humans perceive continuity, and the text information used in ordinary text-based search may be included in the index co-occurrence information.
[0127] Then, from the co-occurrence matrix constructed from the index information in step S0604, "index co-occurrence information" based on the identifiers and feature quantities of the positional neighborhood is formed, and a learning process using this "index co-occurrence information" is executed (step S0605). The aggregation of the feature quantities and identifiers used for learning (steps S0605a, S0605b) may be executed at a fixed interval, for example every 90 frames (3 seconds) of a 30 frames-per-second moving image as an example neighborhood; it may be executed up to the point where a statistical test finds a fixed deviation from the past average; it may be performed over a range, or at boundaries, where information detected by a known detection technique is constant; or it may be performed over a range in which designated teacher information is identical. When the aggregation range ends, an evaluation function is constructed (steps S0605c, S0605d). Through this learning process an evaluation function is generated or reconstructed, and the resulting evaluation function is stored as learning information in the co-occurrence learning storage unit 212 (step S0606).
[0128] The learning process (step S0605) is now described in detail. First, the co-occurrence information of the identifiers is aggregated for each frame (step S0605a). As the time width for aggregation, the co-occurrence information is totaled every predetermined number of frames or span of time; for example, the identifier co-occurrence information is totaled every 90 frames (3 seconds) to generate inter-frame co-occurrence information (step S0605b). Next, a covariance matrix is generated from the inter-frame co-occurrence information, and the eigenvalues and eigenvectors of the co-occurrence matrix are calculated from that covariance matrix to produce learning information (step S0605c). A standard template for the evaluation function is then generated from the calculated eigenvalues and eigenvectors, yielding the learning result (step S0605d). Executing these steps constructs the evaluation function.
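A minimal numerical sketch of steps S0605a through S0605d, assuming the per-frame co-occurrence rows have already been flattened into feature vectors (NumPy; the window length, the number of retained eigen-axes, and all names are assumptions, not taken from the specification):

```python
import numpy as np

def learn_standard_template(frame_vectors, window=90):
    """Aggregate per-frame co-occurrence vectors over a window (e.g. 90 frames
    = 3 s at 30 fps), then derive an eigen-basis as a 'standard template'."""
    templates = []
    for start in range(0, len(frame_vectors) - window + 1, window):
        block = np.asarray(frame_vectors[start:start + window])  # S0605a/b
        cov = np.cov(block, rowvar=False)                        # S0605c: covariance
        eigvals, eigvecs = np.linalg.eigh(cov)                   # S0605c: eigen-decomposition
        order = np.argsort(eigvals)[::-1]
        # S0605d: keep the mean and the leading eigen-axes as the template.
        templates.append({
            "mean": block.mean(axis=0),
            "basis": eigvecs[:, order[:8]],      # 8 leading axes: arbitrary choice
            "variances": eigvals[order[:8]],
        })
    return templates

# Usage: 300 frames of 10-dimensional toy co-occurrence vectors -> 3 templates.
rng = np.random.default_rng(0)
frames = rng.poisson(2.0, size=(300, 10)).astype(float)
print(len(learn_standard_template(frames)))  # 3
```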
[0129] The width of the aggregated frames and the time length of one frame may be specified arbitrarily according to the device configuration; the co-occurrence information may be constructed around the boundary values of 14 Hz, 27 Hz, 55 Hz, and 110 Hz at which humans perceive continuity, and the aggregated inter-frame information may itself be used as "index co-occurrence information".
[0130] The standard template (function parameters) of the constructed evaluation function is then saved to a storage medium so that it can be reused (step S0606). Specifically, the evaluation function and related data generated in step S0605d are stored in the co-occurrence learning storage unit 212. By using the evaluation function constructed in this way for the feature-quantity-to-identifier conversion of step S0203 in FIG. 2, the indexing procedure can be carried out again to re-index the content, so that the evaluation function based on the co-occurrence information becomes usable for indexing the content information.
[0131] The co-occurrence information based on the index information used for this learning is described concretely with reference to FIG. 8 and FIG. 9. The identifiers consist of 30 phonemes (5 vowels, 24 consonants, 1 silence), 4 emotions (joy, anger, sorrow, pleasure), and identifiers indicating the number of displayed pixels of each of the 216 Web Colors (also called "web-safe colors" or "browser-common colors"); combining these yields a 250-by-250 co-occurrence matrix and covariance matrix.
[0132] This configuration may be extended as needed: items may be added to the co-occurrence matrix according to the types of sensor information, in order to use sensor information based on sensor inputs associated with the content in time series; items may be added according to the character information in the index information or meta information related to the content; and that character information may be used as the name when a standard pattern of the evaluation function constructed from the co-occurrence information is set as a search condition.
[0133] FIG. 8 shows an example of co-occurrence information. The same elements appear on the horizontal and vertical axes, and at each intersection is the number of appearances relating to the image and audio within a frame of the moving image. The appearance count is a value indicating how many times a given identifier appears in a frame, evaluated by how many instances of a given phoneme, pixel, or emotion identifier occur within the short time span of one frame.
[0134] For example, in the matrix shown, the co-occurrence count of the emotion "joy" and the vowel "A" is "0", while the count of the emotion "joy" with red as an image identifier is "6". Since these values are extracted from content information, they are not necessarily constant; the appearance counts of the identifiers recognized within a frame may be normalized per identifier type into probability values, and a probability transition matrix between frames may be constructed based on the within-frame appearance probabilities.
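A small sketch of the per-type normalization and inter-frame transition construction mentioned above, assuming a flat count vector over the identifier axis with known type slices (the layout is an assumption for illustration):

```python
import numpy as np

def normalize_by_type(counts, type_slices):
    """Normalize a frame's identifier counts into probabilities per type.

    counts: 1-D array of appearance counts over the full identifier axis.
    type_slices: e.g. {"phoneme": slice(0, 4), "emotion": slice(4, 6)}.
    """
    probs = np.zeros_like(counts, dtype=float)
    for sl in type_slices.values():
        total = counts[sl].sum()
        if total > 0:
            probs[sl] = counts[sl] / total
    return probs

# Two consecutive toy frames over a 6-entry axis (4 phonemes + 2 emotions).
slices = {"phoneme": slice(0, 4), "emotion": slice(4, 6)}
f1 = normalize_by_type(np.array([2, 0, 1, 0, 1, 0]), slices)
f2 = normalize_by_type(np.array([0, 3, 0, 0, 0, 2]), slices)
# Outer product of successive frame probabilities: one building block of an
# inter-frame probability transition matrix.
transition = np.outer(f1, f2)
print(transition.shape)  # (6, 6)
```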
[0135] FIG. 9 shows an example of a covariance matrix over emotion features, phoneme features, and video features. Here the horizontal and vertical axes carry the names of the respective feature quantities, and the matrix shows how much the feature quantities acquired from several seconds' worth of frames of the moving image scatter around their average over those frames. For example, for the emotion features it shows how much variance each of the four emotions (joy, anger, sorrow, pleasure) has, and for the phonemes and images it likewise shows how far each distance evaluation result deviates from its average.
[0136] In this example, the covariance of the fourth and first emotion parameters is "0.42", and the correlation between changes in the first video parameter and the first emotion parameter is "0.32"; since these values are extracted from content information, they are not necessarily constant.
[0137] Thus, a characteristic of the present invention is to take the co-occurrence conditions a person specifies for search, the co-occurrence information detected at indexing time, and the co-occurrence information in results that users frequently select, and from them construct a co-occurrence matrix over identifiers of plural differing natures, co-occurrence probabilities based on that matrix, and a covariance matrix based on feature quantities of plural differing natures, building an evaluation function for use in search and detection. Since the example matrices in the figures are assumed to be square symmetric matrices, the lower triangular portion is omitted in the figures.
[0138] In constructing the evaluation function, a standard pattern may be extracted by training on feature quantities from inputs of natural information whose identifiers have been specified in advance, with the evaluation function constructed from the extracted standard pattern; alternatively, a standard pattern may be extracted using identifiers constructed by self-organization through multivariate analysis. The obtained standard pattern is stored in the evaluation function storage unit 204 as needed, and standard pattern dictionary information is stored in the dictionary information storage unit 214 as association information for converting between identifiers and standard patterns.
[0139] The standard pattern is used in combination with the evaluation function to identify identifiers. It is composed of the mean and variance of a population formed from sample feature quantities whose identifiers are unspecified at input together with feature quantities attributed to a specific identifier; the evaluation function is used to evaluate Euclidean distance or Mahalanobis distance, and the standard pattern is sometimes called a standard template, standard parameters, or evaluation function parameters.
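The following sketch shows this kind of distance evaluation, under the simplifying assumption of a standard pattern reduced to a per-dimension mean and variance (diagonal covariance; the pattern values are hypothetical):

```python
import numpy as np

def euclidean(x, mean):
    return float(np.linalg.norm(x - mean))

def mahalanobis_diag(x, mean, var, eps=1e-9):
    # Mahalanobis distance under a diagonal-covariance standard pattern.
    return float(np.sqrt(np.sum((x - mean) ** 2 / (var + eps))))

# A hypothetical standard pattern for an "explosion sound" identifier.
pattern = {"mean": np.array([0.8, 0.1, 0.6]), "var": np.array([0.05, 0.02, 0.1])}

sample = np.array([0.75, 0.15, 0.5])
print(euclidean(sample, pattern["mean"]))
print(mahalanobis_diag(sample, pattern["mean"], pattern["var"]))
```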
[0140] The standard pattern may also be produced by generating parameters for the evaluation function from input feature quantities by methods such as multivariate analysis, and any identifier evaluation function, such as an HMM, a Bayes discriminant function, Mahalanobis distance, or Euclidean distance, may be used based on the generated parameters. Since it is generally known that the parameters composing these evaluation functions are constructed by mathematical methods such as multivariate analysis, the extraction and learning methods depend on the implementation.
[0141] At this time, by performing multivariate analysis with the evaluation functions and classifying through self-organization so as to provide plural evaluation functions, a name or identifier may be given manually to each classified evaluation function, or a character string contained in content information that co-occurs with high probability with an evaluation function obtained by multivariate analysis may be given as the function's name or identifier, so that users can employ evaluation functions for search and detection by specifying them by name.
[0142] The identifier information recorded in association may be: symbols of phonemes or phoneme pieces; names or designations, identifiers, or identifier strings given to the populations used to construct the identifier evaluation functions; representative feature averages themselves; not only phonemes and phoneme pieces but also the separately described identifiers and feature quantities relating to images, audio, and emotion, or combinations thereof; or character information input by the user, character information extracted from index information or meta information related to the content, or externally received program information and sensor information.
[0143] Then, at any point where the identifiers or feature quantities of content information and those of advertisement information indexed by the same method are evaluated as similar by the aforementioned evaluation function, a step of associating the advertisement may be executed, or advertising may be performed during indexing; an arbitrary advertisement, or one associated by the evaluation function, may be played only while playback of the content information is paused; and these evaluation functions may be reconstructed using the "Example of identifier reconstruction" and "Identifier learning based on search, detection, and indexing" described later.
[0144] Meta information or EPG information recorded with the content may also be used as identifiers in order to construct and search an evaluation function that evaluates the co-occurrence state through acquisition of the identifiers used for the index. As shown in FIG. 7, a process for acquiring program information such as EPG or BML (step S0701) may be added, and the co-occurrence state may be constructed, and indexing performed, using the EPG or BML program information acquired during broadcasting.
[0145] Since this figure differs from FIG. 6, a supplementary note: the EPG acquired as character information in step S0701 is used directly as a program information identifier, while the other identifiers and feature quantities undergo processing equivalent to steps S0601 through S0603; with the program information as an identifier, the co-occurrence matrices of FIG. 8 and FIG. 9 are constructed using the other identifiers and feature quantities that stand in a co-occurrence relation within the same program information. At this time, by using a program-genre name string based on the character information or program information as the name of the evaluation function, indexing that associates the character information and program information may be performed.
[0146] As a result, the learning process of steps S0703 through S0705, corresponding to step S0605, is executed based on the acquired co-occurrence information; an evaluation function for constructing identifiers can be built, and, if necessary, the content may be indexed again using the acquired function.
[0147] «Example of basic search processing»
Next, the search procedure is described with reference to FIG. 10. First, when a search condition such as a captured image, uttered speech, or a character string is input by the user (step S1001), query generation processing is executed based on the input search condition (step S1002) and a query is generated. For example, for speech, a query is generated from a phoneme string or phoneme-piece string based on phoneme recognition or phoneme-piece recognition of the user's utterance; for a character string, a query is generated from the phoneme string or phoneme-piece string converted from the text input; and for a captured image, a query is generated through image recognition. In this way a search condition is generated by each recognition method, as sketched below for the text case.
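A minimal sketch of step S1002 for the text-input case, assuming a small hypothetical pronunciation dictionary mapping words to phoneme strings (the patent's dictionary units are not specified at this level of detail):

```python
# Hypothetical pronunciation dictionary: word -> slash-delimited phoneme string.
PRONUNCIATION = {
    "explosion": "e/k/s/p/l/o/u/zh/a/n",
    "rain": "r/e/i/n",
}

def text_to_phoneme_query(text: str):
    """Convert a text search condition into phoneme-string query terms."""
    terms = []
    for word in text.lower().split():
        if word in PRONUNCIATION:
            terms.append(PRONUNCIATION[word].split("/"))
    return terms

print(text_to_phoneme_query("explosion rain"))
# [['e', 'k', 's', 'p', 'l', 'o', 'u', 'zh', 'a', 'n'], ['r', 'e', 'i', 'n']]
```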
[0148] At this time, the search condition is composed of identifiers acquired by plural recognition methods applied to the input character strings, visual information, and auditory information. By constructing a co-occurrence matrix from the co-occurrence relations of the simultaneously specified identifiers and character strings, "search condition co-occurrence information" is constructed from the search condition in the same way as the "index co-occurrence information", and can be used in the query as the "search condition co-occurrence information" for similarity evaluation against the "index co-occurrence information" of the present invention.
[0149] Also, by converting the input character strings and the identifiers obtained from each recognition result into associated character strings, identifiers, or standard patterns through the dictionary function based on the dictionary information storage unit and the dictionary extraction unit, "search condition co-occurrence information" can be constructed based not only on the co-occurrence relations within the search condition entered by the user, but also on the co-occurrence relations between "information associated with information recognized from the search condition" and "information entered as the search condition", between "information recognized from the search condition" and "information entered as the search condition", and between "information recognized from the search condition" and "information associated with the information entered as the search condition". This can be used as a query and can also serve as the "index co-occurrence information" used in the learning of the present invention.
[0150] In entering this query, character information input by the user, character information extracted from index information or meta information related to the content, externally received program information, or sensor information may be used; symbol strings such as character strings, phoneme strings, or phoneme-piece strings indicating emotion identifiers, image identifiers, and the like may be entered by text input, menu selection, or voice; and these symbol strings may be converted, based on dictionary information, into other identifiers, feature quantities, or identifier strings to perform the search and specify a position within the content information.
[0151] Then, among the content information stored in the content storage unit 202, the search is repeated over the content information to be searched, and a search process that evaluates the match between index and query is executed for all the content information (step S1003). Through execution of this search process, the "index co-occurrence information" based on the identifiers or feature quantities of the content information being searched is compared with the "search condition co-occurrence information", and search results are obtained.
[0152] In this comparison, the match between the "index co-occurrence information" and the "search condition co-occurrence information" may be evaluated by DP or by a distance function; similarity, identity, or degree of match may be compared by evaluating each set of co-occurrence information with an evaluation function and assessing the match of the obtained identifiers or the nearness of the distances; or, instead of evaluating all identifiers and feature quantities, similarity, identity, or degree of match may be compared and evaluated using only a subset of identifiers of the same kind.
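One simple realization of such a comparison, a cosine-style similarity between flattened co-occurrence matrices with an optional same-kind row subset, is sketched below; the patent leaves the concrete function open (DP, distance functions, or per-type partial comparison), so this is only one possible choice:

```python
import numpy as np

def cooccurrence_similarity(index_matrix, query_matrix, rows=None):
    """Compare index and search-condition co-occurrence matrices.

    rows: optional list of axis indices, to compare only a subset of
    same-kind identifiers as the text allows; None compares everything.
    """
    a, b = np.asarray(index_matrix, float), np.asarray(query_matrix, float)
    if rows is not None:
        a, b = a[rows], b[rows]
    a, b = a.ravel(), b.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Usage: rank stored scenes by similarity to the query's co-occurrence matrix.
rng = np.random.default_rng(1)
query = rng.random((7, 7))
scenes = [rng.random((7, 7)) for _ in range(5)]
ranked = sorted(range(5), key=lambda i: -cooccurrence_similarity(scenes[i], query))
print(ranked)
```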
[0153] Based on the obtained search results, the degree of match of the search evaluation results is evaluated and the results are ranked (step S1004). Further, an evaluation result list display process (step S1005) is executed that creates and displays a list of evaluation results based on the ranked search evaluation results. At this time, advertisement information held in the storage unit may be displayed to the user, advertisements acquired via a communication line may be presented, or advertisement content associated through the earlier indexing may be obtained from the storage unit or communication line unit and presented to the user.
[0154] If, at this time, content that is being distributed in real time and has not yet been indexed with "index co-occurrence information" is to be used, then, as shown in FIG. 13, the following are executed: step S1301, which acquires the content in a time-divided manner; step S1302, which checks whether content acquisition has finished; step S1303, which performs indexing while extracting feature quantities and generating identifiers as the content is acquired; step S1304, which compares the "search condition co-occurrence information" with the "index co-occurrence information" and detects matching locations; step S1305, which branches according to the detection; and step S1306, an arbitrary process as described later, such as starting a recording, switching channels, issuing a notification, delivering mail, or changing a robot's behavior.
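A schematic loop for FIG. 13, with the acquisition, indexing, similarity, and action steps left as caller-supplied stubs (all function names and the threshold are hypothetical):

```python
def monitor_stream(get_chunk, build_index, query_cooccurrence, similarity,
                   on_detect, threshold=0.8):
    """Watch a real-time stream and trigger an action on matches (FIG. 13).

    get_chunk: returns the next time-divided content chunk, or None at end.
    build_index: chunk -> index co-occurrence information        (S1303)
    similarity: (index, query) -> match score                    (S1304)
    on_detect: arbitrary action, e.g. start recording, send mail (S1306)
    """
    while True:
        chunk = get_chunk()                             # S1301
        if chunk is None:                               # S1302: acquisition done
            break
        index = build_index(chunk)                      # S1303
        score = similarity(index, query_cooccurrence)   # S1304
        if score >= threshold:                          # S1305: branch on detection
            on_detect(chunk, score)                     # S1306

# Usage with trivial stubs:
chunks = iter([0.3, 0.9, 0.5])
monitor_stream(lambda: next(chunks, None),
               build_index=lambda c: c,
               query_cooccurrence=1.0,
               similarity=lambda i, q: i * q,
               on_detect=lambda c, s: print("detected", c, s))
```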
[0155] As a result, whereas a conventional search cannot perform co-occurrence-based search, since it performs neither the co-occurrence-based indexing described in the indexing section nor indexing by an evaluation function constructed from co-occurrence states, in the present invention the input search condition is matched against the "index co-occurrence information" applied to content by the method described above. An input character string may be converted, by consulting a dictionary, into feature quantities and identifiers related to the string for use in the search; phoneme strings, phoneme-piece strings, and other feature quantities and identifiers generated from input speech may be used in the search directly, or converted via a dictionary into related feature quantities and identifiers; and feature quantities and identifiers extracted or generated from input images, video, or sensors may likewise be used directly, or converted via a dictionary into related feature quantities and identifiers. By constructing "search condition co-occurrence information" from the identifiers and feature quantities thus converted from the input search condition, the index given by the "index co-occurrence information" on the content information can be compared with the search condition given by the user's "search condition co-occurrence information", making it possible to search for and detect targets where the two match, and to specify the content and a position on its time axis, a position on the display screen, or a position in reading aloud.
[0156] As methods for evaluating the degree of match of search evaluation results, it is well known to use an HMM; to use probabilities or distances such as a Bayes discriminant function; to evaluate the degree of attribution to a population clustered by multivariate analysis; or to use symbol-string matching methods such as DP or CDP. Details are given in "Examples of methods for evaluating the match between feature quantities and between identifier strings".
[0157] When a search is performed on feature quantities, the identifiers used in a query generated from an input character string, input speech, or input image are converted into feature quantities by the identifier-to-feature-quantity conversion process executed by the identifier feature quantity conversion unit 118. This conversion process is described with reference to FIG. 11.
[0158] First, when an identifier (or identifier string) to be converted into feature quantities is input (step S1101), target symbol extraction processing is executed (step S1102). The target symbol extraction process selects and extracts, using dictionary information, the feature quantities associated with the input identifier (identifier string) in order to convert the identifier into feature quantities.
[0159] At this time, it is determined whether subdivision of the identifier, such as dividing phonemes into phoneme pieces, is necessary (step S1103). If it is determined that further subdivision is needed (step S1103; Yes), symbol subdivision processing is executed (step S1104), and after subdivision the target symbol extraction process is executed again. For example, when the identifier is a phoneme, feature quantities suited to the subdivided information can be obtained by further subdividing it into phoneme pieces and then executing the target symbol extraction process.
[0160] If it is determined that subdivision is not necessary (step S1103; No), the feature quantities are output, based on the selected feature quantities, so that distance evaluation between feature quantities can be performed according to the identifier (step S1105). By executing the identifier-to-feature-quantity conversion process described above, input identifiers and identifier strings are converted into feature quantities, enabling search by feature quantity.
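A sketch of the FIG. 11 flow, under the assumption of a two-level dictionary (phoneme-piece feature vectors plus a phoneme-to-piece subdivision table; all tables are hypothetical):

```python
# Hypothetical dictionaries: feature vectors per phoneme piece, and a
# subdivision table from phonemes to phoneme pieces.
PIECE_FEATURES = {"a1": [0.9, 0.1], "a2": [0.7, 0.3], "k1": [0.2, 0.8]}
SUBDIVIDE = {"a": ["a1", "a2"], "k": ["k1"]}

def identifier_to_features(identifier: str):
    """Convert an identifier into feature vectors (FIG. 11).

    S1102: look the symbol up directly; S1103/S1104: if absent, subdivide
    (phoneme -> phoneme pieces) and extract again; S1105: output features.
    """
    if identifier in PIECE_FEATURES:              # S1102 hit
        return [PIECE_FEATURES[identifier]]
    if identifier in SUBDIVIDE:                   # S1103: Yes -> S1104
        feats = []
        for piece in SUBDIVIDE[identifier]:
            feats.extend(identifier_to_features(piece))
        return feats
    return []                                     # unknown identifier

print(identifier_to_features("a"))  # [[0.9, 0.1], [0.7, 0.3]]
```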
[0161] Furthermore, by executing the index co-occurrence state learning method of FIG. 6 using such search conditions and search results, the biases of a user's tastes and preferences can be statistically analyzed and extracted. As shown in FIG. 12, by adding to the normal search procedure a process that learns the co-occurrence state of the search conditions (step S1202), a process that learns the co-occurrence state of the search results (step S1206), and a process that learns the co-occurrence state of the search results the user selects (step S1209), it becomes possible to learn the co-occurrence information accompanying searches that reflect the user's intentions and tastes, and an indexing evaluation function tailored to the user can be constructed.
[0162] In the process of FIG. 12, a search condition is first entered by the user through voice input, character string input, or image input (step S1201). Then, co-occurrence information is acquired and learned (step S1202) from the input character string as search condition information, from the phoneme string or phoneme-piece string obtained from the utterance and the feature quantities and identifiers obtained from the image, and from the related feature quantities and identifiers extracted by the dictionary information extraction unit 106 from the dictionary information storage unit 214 based on that search condition information. An evaluation function is constructed based on the learned co-occurrence information, and the evaluation function is saved (step S1203).
[0163] More concretely, when the utterance recognized from the utterance phoneme string is "search, bokaan, explosion", the search process is selected via the command dictionary by the keyword "search"; "bokaan" sets the onomatopoeic phoneme string "b/o/k/a/a/a/a/n/n/n/n/n" as the search condition phoneme string; and "explosion" sets as search conditions the identifier of an explosion-sound evaluation function built from collected explosion-sound feature quantities, together with the image feature "an image in which the warm-color area increases over time". In this way a co-occurrence state of plural identifiers and feature quantities can be constructed. [0164] If, as above, the user's input is "bokaan", the search condition may also be constructed so that related identifiers, feature quantities, and identifier strings, and not merely related onomatopoeia, can be searched, for example by adding "d/o/k/a/a/a/a/n" for the similar explosion onomatopoeia "dokaan"; alternatively, the input may be converted through a dictionary into associated feature quantities, identifiers, or identifier strings from different recognition methods and added as search conditions.
[0165] Then, "search condition co-occurrence information" analogous to the "index co-occurrence information" is constructed from the phoneme string, the identifier of the explosion-sound evaluation function, and the image feature, and by constructing an evaluation function according to the aforementioned "co-occurrence state learning procedure", the search condition can be learned. Note that the feature "the warm-color area spreads" can be measured by evaluating whether the on-screen area occupied by warm colors such as red and yellow increases over time.
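The warm-color measurement just described can be sketched as counting the fraction of reddish/yellowish pixels per frame and checking that it grows over time; the RGB thresholds below are an assumption, not taken from the specification:

```python
import numpy as np

def warm_fraction(frame_rgb):
    """Fraction of pixels judged warm (red/yellow dominant over blue)."""
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    warm = (r > 128) & (b < r) & (b < np.maximum(g, 1))
    return float(warm.mean())

def warm_area_increasing(frames):
    """True if the warm-color area grows monotonically across the frames."""
    fractions = [warm_fraction(f) for f in frames]
    return all(x < y for x, y in zip(fractions, fractions[1:]))

# Toy frames: black -> half warm -> fully warm.
h = w = 4
black = np.zeros((h, w, 3), dtype=np.uint8)
half = black.copy(); half[:, :2] = [255, 200, 0]
full = np.full((h, w, 3), [255, 200, 0], dtype=np.uint8)
print(warm_area_increasing([black, half, full]))  # True
```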
[0166] At this time, if the character string or phoneme string / phoneme-piece string input in step S1201 is stored in the dictionary information storage unit 214, it may be converted into other identifiers or feature quantities based on the information extracted from the dictionary information storage unit 214 by the dictionary extraction unit 106 before being used as co-occurrence information for learning, or the identifier feature quantity conversion unit 118 may be used to convert identifiers into feature quantities for use in the search.
[0167] Subsequently, step S1204 is executed as a search based on the co-occurrence information specified as the search condition, and search results with a high degree of match to the search condition are obtained. Then, for example for search results whose match rate exceeds 80%, step S1206 is executed to learn the co-occurrence information of the feature quantities and identifiers used in the index information attached to the target scenes obtained from the content information. The result is saved as a learning result in step S1207. Conditions such as being within the top 10 results, or having a match rate of 90% or higher, may be imposed on the co-occurrence information to be learned.
[0168] Next, when the user selects a search result (hereinafter, a search result selected by the user is called a "selected search result") (step S1208; Yes), co-occurrence information is learned based on the selected search result (step S1209). The evaluation function is then reconstructed based on the co-occurrence information learned in step S1209 and saved (step S1210). When the user again selects a search result to use from among the search results (step S1211; Yes), processing is executed again from step S1209.
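The selection-driven relearning loop of steps S1206 through S1210 can be sketched as follows, using the match-rate and top-N conditions from the text and hypothetical learn/reconstruct helpers supplied by the caller:

```python
def relearn_from_results(results, learn, reconstruct,
                         min_match=0.8, top_n=10):
    """results: list of (co_occurrence_info, match_rate, user_selected).

    S1206/S1207: learn from well-matching results (e.g. match rate > 80%,
    optionally only the top N); S1209/S1210: learn from user selections
    and reconstruct the evaluation function.
    """
    candidates = sorted(results, key=lambda r: -r[1])[:top_n]
    for cooc, rate, _ in candidates:
        if rate > min_match:
            learn(cooc)                      # S1206, saved in S1207
    for cooc, _, selected in results:
        if selected:
            learn(cooc)                      # S1209
            reconstruct()                    # S1210

# Usage with printing stubs:
relearn_from_results(
    [("sceneA", 0.92, True), ("sceneB", 0.7, False)],
    learn=lambda c: print("learn", c),
    reconstruct=lambda: print("reconstruct evaluation function"),
)
```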
[0169] The content information searched in this way may be a classification such as a content genre or category, a single image as content, a photo collection of gathered images, a piece of music, the chorus section of a piece of music, a work such as a movie or video, a single scene within a work, or a range having image and audio features common to works in a specific field. Since search results can be obtained based on the co-occurrence tendencies of specific identifiers and feature quantities in the content information, scene search and title search of content by user instruction become possible.
[0170] Then, it is selected whether the user will input another search, that is, whether to end the process. If an operation that enters a search condition again is performed (step S1212; No), processing transitions to step S1201 and is executed. If no search condition is entered (step S1212; Yes), this process ends.
[0171] As a result, by recording, searching, and detecting the identifiers and feature quantities determined by evaluation functions based on the co-occurrence information of the constructed plural feature quantities and plural identifiers, in association with content information, searches matched to tastes, preferences, and interests more complex than before can be realized, and the convenience of information search using the co-occurrence states of phonemes, phoneme pieces, and/or emotions, and/or other identifiers, and/or their feature quantities can be improved.
[0172] Then, using the search results, search conditions, and index-based co-occurrence information obtained in this way, or the evaluation functions learned from the co-occurrence information, advertisements and promotions may be associated; billing conditions may be changed according to the association conditions or evaluation functions; the indexes of advertisement information stored in the advertisement information storage unit 216 may be evaluated to present highly similar advertisements; feature quantities and identifiers associated with keywords or keyword phoneme strings may be used as advertising conditions; advertising fees for frequently used search conditions may be set higher; and the shape data, texture data, or position of displayed two-dimensional or three-dimensional images may be changed.
[0173] Co-occurrence information is constructed from the phoneme symbol strings, phoneme-piece symbol strings, and various identifier strings obtained as search results or search conditions, in the same manner as at indexing time; a match evaluation between identifier strings may be performed to search for content information of high similarity according to the search condition. Alternatively, combined index information may be constructed in the same manner as the aforementioned indexing, using the co-occurrence information of the user's uttered phonemes, the emotion identifier selected by an emotion-word phoneme string registered in the speech recognition dictionary based on the user's uttered phonemes, and the color identifier selected by a color-word phoneme string registered in the speech recognition dictionary based on the user's uttered phonemes.
[0174] Identifiers may also be constructed by analyzing, classifying, and learning, with multivariate analysis techniques, the symbol strings of phonemes and phoneme pieces recognized from feature quantities obtained from audio, identifiers such as emotion, musical scale, instrument sounds, and environmental sounds, and/or identifiers such as shape, color, characters, and motion recognized from feature quantities obtained from video, together with the feature quantities associated with the identifiers described above and below; new identifiers may be constructed and used in the course of doing so. Details are given in "Example of identifier reconstruction".
[0175] To convert an input character string or input phoneme string into other identifiers or feature quantities, mutual information conversion may be performed as detailed in "Example of dictionary configuration"; conversion from identifiers to feature quantities and from feature quantities to identifiers can be configured arbitrarily, as described in the respective sections below.
[0176] The discriminant function information configured in this way can also be exchanged, distributed, or shared based on the "Example of an information sharing procedure between users", and convenience may be improved by users reusing one another's evaluation functions. As detailed in "Example of procedures for an information processing device used in terminals and base stations", processing may be divided between server and client under a server-client model, with the devices connected by communication and information exchanged between server and client, to implement equivalent services and infrastructure, search, indexing, detection, and the arbitrary processing accompanying detection.
[0177] If sensor information is used, for example, a temperature sensor may be attached to a surveillance camera or the like to detect ambient temperature changes together with changes in image features. In the co-occurrence information described above, the phoneme string recognized when an explosion occurs serves as phoneme identifiers, the increase in the number of warm-color pixels on screen serves as a feature quantity, and the temperature rise is added to the co-occurrence matrix as temperature sensor information and recorded, so that the co-occurrence information accompanying an explosion can be learned, indexed, and searched.
[0178] Video input, audio input, and sensor input may each come from plural channels. Using the offsets between the inputs from each channel, feature quantities and identifiers based on stereo images or stereo audio may be constructed to estimate position or motion; and the fact that one event always occurs in connection with another, even with a time-series lag, may be detected by evaluating co-occurrence relations between the identifiers and feature quantities of different channels over a time-series width of several seconds to several minutes or more.
[0179] Thus, the first feature of the present invention lies in the indexing of content information described later: performing diverse indexing by combining phoneme information and phoneme-piece information and/or emotion information and/or auditory information, visual information, character information, program information, sensor information, and the like; learning co-occurrence information based on the attached indexes; and performing searches based on the co-occurrence information. The second feature, as in the search processing example of the present invention, lies in executing searches on voice input, image input, and character string input using a dictionary that assigns phoneme strings and phoneme-piece strings to the identifiers and feature quantities used in the present invention, based on their respective names.
[0180] In the present invention, "continuous phonemes" and "continuous phoneme pieces" — information indicating how these elements change, as information representing the continuous state of phonemes and phoneme pieces — may also be considered. A "phoneme string" or "phoneme-piece string" refers to an information sequence in which these phonemes or phoneme pieces are arranged as symbols or identifiers, and may also be written "phoneme identifier string" or "phoneme-piece identifier string"; for the various identifiers, match evaluation as identifier strings can likewise be considered.
[0181] For this reason, the identifier recognition methods, feature extraction methods, identifier-string match evaluation methods, information classification methods, information learning methods, communication procedures, types of storage media, types of communication media, configurations of the information processing device, configurations of terminals and distribution base stations, device shapes, device sizes, installation locations, and sensors used in the device may be combined arbitrarily as needed to build the device or implement the program. By using in combination the co-occurrence states of phonemes and phoneme pieces, image-related feature quantities, emotion identifiers, and acoustic information, which were handled independently in conventional search, the characteristic of the information processing device based on the present invention is search with improved convenience in search and detection, achieved by learning those co-occurrence states, indexing, relearning using search results, and constructing identifier co-occurrence dictionaries and mutual conversions. Content information is indexed by plural recognition methods and plural feature extraction methods according to the time axis or the browsing order of the information, and, using the co-occurrence states of the emotions detected from utterances, the phoneme strings and phoneme-piece strings, and their identifiers and feature quantities, search and detection, learning of co-occurrence states using multivariate analysis, and the application of learning results to search and detection can be performed.
[0182] Advertisements and promotions implemented based on the present invention may be combined arbitrarily with conventional inventions; fees may be changed according to the frequency of access to advertisements, the frequency of content use, or the quality, size, and duration of advertisements; prizes may be offered through quizzes or questionnaires; and advertisement results for targets detected using the present invention can be processed statistically to execute interactive advertising.
[0183] Also, by using the search specification function, which searches for and specifies content according to search conditions, to detect information matching the conditions from content distributed in real time, it is possible to deliver mail, switch to a channel matching the conditions, start recording or playback, have a robot or agent begin speaking, play back recorded content of another channel going back to the detection time, change device settings, construct shortcuts containing links to detection results, or aggregate the detected information and present it to the user.
[0184] The present invention further lies in: executing indexing using the other feature quantities and identifiers described later; learning new identifiers and reconstructing identifiers based on the co-occurrence states of those indexes; setting search conditions that use co-occurrence states; learning new identifiers and reconstructing existing identifiers based on search conditions specified by the user; learning new identifiers and reconstructing existing identifiers based on the co-occurrence states in the search results obtained according to the search conditions; and performing search and detection based on co-occurrence information combining new identifiers and reconstructed existing identifiers, as well as search and detection based on identifiers and feature quantities constructed through multivariate analysis and learning of the co-occurrence information.
[0185] Also, as in the search processing example of the present invention, by using phoneme strings, phoneme-piece strings, or emotion identifiers based on the names of the identifiers and feature quantities used in the present invention, or hash values of those symbol strings, conversion dictionaries and/or co-occurrence dictionaries are constructed between the internal IDs and name strings associated with the identifiers and feature quantities obtained from recognition of input character strings, input speech, and input images given as search or detection conditions, and the symbol strings of phoneme strings and phoneme-piece strings used for recognition. After extracting identifiers and feature quantities based on the input speech, input character string, or input image given as the search or detection condition, the necessary targets are selected using the identifier conversion dictionary, the co-occurrence dictionary, evaluation functions based on the covariance matrix of the feature quantities, and the like. The invention thus lies in search, detection, indexing, learning of co-occurrence information, and identifier reconstruction using: condition generation processing with co-occurrence information that combines identifiers and feature quantities other than character strings, drawing on the identifiers and feature quantities obtained from recognition of the input character strings, input speech, and input images used as the aforementioned search and detection conditions; image identifiers and image feature quantities, environmental sound identifiers and environmental sound feature quantities, and emotion identifiers and emotion feature quantities associated with the uttered phoneme strings and/or uttered phoneme-piece strings obtained through phoneme recognition or phoneme-piece recognition; program identifiers based on distributed program information; program identifiers based on the co-occurrence states of program identifiers, image identifiers, and acoustic identifiers; and program feature quantities based on the co-occurrence feature quantities of image feature quantities and acoustic feature quantities.
[0186] As methods for saving and recording the identifiers and feature quantities in association with content: they may be recorded together with time information in a dedicated database; saved to an index file used as a separate file alongside the video and audio information; inserted into a video stream such as an MPEG file, updating the MPEG file's free areas, comment areas, and meta information description areas; or distributed using program information in a markup language such as EPG or BML, or teletext, then received by the user and saved to a storage medium by the methods described above, so that the index information according to the present invention can be used.
[0187] <Application examples of the present invention>
The ranges and technologies to which the present invention can be applied are described below as modifications: "Examples of content information" as the target content information; "Examples of feature quantities and identifiers" usable for co-occurrence information; "Example of dictionary configuration" for converting identifiers and feature quantities into phoneme or phoneme-piece symbol strings and for converting identifiers into one another; "Example of methods for converting natural information into feature quantities" and "Example of methods for converting feature quantities into identifier strings", for constructing dictionaries and converting content information into identifiers; "Example of information indexing methods", for indexing using various kinds of recognition; "Example of methods for converting identifier strings into feature quantities", for performing feature-quantity search based on identifiers; "Example of methods for evaluating the match between feature quantities and between identifier strings", for evaluating similarity in order to detect target ranges in a search; "Example of information search methods" based on the present invention; "Example of arbitrary processing accompanying identifier detection", which performs processing according to information detected by the search function of the present invention; "Example of identifier learning based on search, detection, and indexing", which performs learning using search results and indexes; and "Example of identifier reconstruction", which uses that learning.
[0188] <<Examples of Content Information>>
First, the content and content information subject to the search and indexing performed using the present invention are described. It is generally well known that "content" refers to movies, dramas, photographs, news reports, animation, illustrations, paintings, music, promotional videos, novels, magazines, games, papers, textbooks, dictionaries, books, comics, catalogs, posters, broadcast program information, and the like; in the present invention, however, it may also include public information, map information, product information, sales information, advertising information, reservation status, viewing status, road conditions, questionnaires, surveillance camera video, satellite photographs, blogs, models, dolls, and the camera and microphone inputs of robots.
[0189] The content may also be text that anticipates time-series changes in video, time-series changes in audio, or time-series changes in the reader's reading-aloud position, electronic information written in a markup language such as HTML, or search index information generated from them; the reading-aloud position may be interpreted as a time axis, and clauses, sentences, and passages may be treated as frames.
[0190] The content may further include meta-information attached to the content; EPG and BML as document information and program information based on character information; musical scales as score information; general still images and moving images; polygon data, vector data, texture data, and motion data as three-dimensional information; still images and moving images based on visualized numerical data; and content information intended for promotion and advertising. It is composed of natural information including visual information, auditory information, character information, and sensor information.
[0191] <<Examples of Features and Identifiers>>
Next, the identifiers and features considered as modifications of the present invention are described. The features and identifiers used in the present invention are defined mainly over auditory information, visual information, and sensor information as natural information; indexing is performed by associating phonemes, phoneme pieces, and emotions with the auditory, visual, and sensor information, and search is performed by evaluating the co-occurrence state of that information. [0192] First, for features and identifiers based on auditory information, the following can be used: frequency and volume features such as the FFT, cepstrum, mel-cepstrum, and directional patterns used for speech and sound; changes in those features over time; and features obtained by known feature extraction methods such as differences in the volume, phase, and frequency components of sound recorded at different positions. Identifiers recognized from such features include phonemes and phoneme pieces; emotion identifiers indicating joy, anger, sorrow, and pleasure; person identifiers based on voice quality; scale identifiers; instrument identifiers distinguishing pianos and guitars; and environmental sound and sound effect identifiers for explosions, rain, the sound of a pachinko parlor, wind, waves, machine sounds, and noise. The features extracted from speech waveforms are classified into populations collected for each name, and evaluation functions such as distance functions obtained by multivariate analysis or HMM functions obtained by learning are constructed from the classified populations. Identifiers based on audio information, such as environmental sound identifiers, noise identifiers, and machine sound identifiers, can then be constructed from the name phoneme sequences, name phoneme-piece sequences, character string IDs, and numeric IDs associated with the evaluation functions.
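As a sketch of how such per-name populations and their evaluation-function statistics might be prepared, labeled feature vectors can be grouped per identifier name and reduced to the mean and covariance from which a distance function is later built. The NumPy code and all names below are assumptions for illustration; the real feature pipeline is implementation-dependent.

    import numpy as np

    def build_populations(samples):
        """Group (name, feature-vector) pairs into per-identifier populations
        and compute the mean and covariance used by a distance function."""
        groups = {}
        for name, vec in samples:
            groups.setdefault(name, []).append(vec)
        stats = {}
        for name, vecs in groups.items():
            arr = np.asarray(vecs, dtype=float)
            stats[name] = (arr.mean(axis=0), np.cov(arr, rowvar=False))
        return stats

    # Hypothetical 2-D features for two environmental-sound identifiers.
    samples = [("rain", [0.2, 1.1]), ("rain", [0.3, 0.9]), ("rain", [0.25, 1.0]),
               ("wind", [0.8, 0.2]), ("wind", [0.7, 0.3]), ("wind", [0.9, 0.25])]
    print(build_populations(samples)["rain"][0])  # mean vector of "rain"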
[0193] Next, for features and identifiers based on visual information, identifiers recognized using known moving-image and still-image features such as the luminance differences, color differences, and motion vectors used for images and video include: landscape identifiers indicating urban areas, green spaces, coasts, mountains, deserts, weather, facial expressions, and the way daylight changes with time and season; object identifiers indicating objects such as cars, people, faces, flowers, animals, and plants; image identifiers indicating image features such as luminance, color, and contour; motion identifiers indicating actions such as the movement speed of an object, changes in its motion, and changes in state accompanying its behavior; and display position identifiers indicating where image-related identifiers appear within the image area. The features extracted from moving and still images are classified into populations collected for each name, and evaluation functions such as distance functions obtained by multivariate analysis or HMM functions obtained by learning are constructed from the classified populations. Identifiers based on moving and still images, such as landscape identifiers, object identifiers, and motion identifiers, can then be constructed from the name phoneme sequences, name phoneme-piece sequences, character string IDs, and numeric IDs associated with the evaluation functions.
[0194] As for emotion identifiers serving as emotion information recognized from auditory, visual, or character information, not only general emotions such as joy, anger, sorrow, and pleasure inferred from facial expressions and tone of voice, but also words indicating the emotions and mental states described in psychology-related books may be detected and recognized and used for identification.
[0195] These identifiers and features are not limited to the color appearance frequencies and phonemes within a single frame as in the preceding embodiments; they may be identifiers and features spanning multiple frames, identifiers and features based on the transition information of identifiers and features across multiple frames, features and identifiers carrying coordinate information within the display screen, features and identifiers carrying coordinate information in a spatial coordinate system obtained by arithmetic position estimation using visual or auditory information, or features and identifiers extracted in association with the time axis. They may also use the depth restored from detected features by arithmetic spatial calculation or the depth given as coordinate information of three-dimensional image information; the area, mass, volume, and velocity calculated from the restored depth or from the depth given as coordinate information, or the area, mass, volume, and velocity given as numerical information; and the weight and mass estimated from the calculated or numerical area, mass, volume, and velocity, or the weight and mass given as attribute information.
[0196] Accordingly, identifiers may be expressed as identifiers that associate character string IDs or numeric IDs with evaluation functions based on the features of speech, moving images, or still images, or as arbitrary character strings recognized from speech, moving images, or still images; identifiers may be combined and used as identifier strings; combinations of arbitrary evaluation values obtained by recognizing identifiers with evaluation functions, such as identifier co-occurrence probabilities, feature covariance matrices, HMM output probabilities, HMM transition probabilities, distance values of distance functions, and DP match evaluation values, may be used as features for reconstructing HMMs and evaluation functions; and identifier strings may be constructed from the time-series changes of identifiers.
[0197] Identifiers may also be assigned to populations learned by self-organization in multivariate analysis and used for search, recognition, detection, and indexing; such identifiers may be used as search conditions, or the features of multiple images, videos, and sounds used in the multivariate analysis may be combined and used as features for evaluating the identifiers of the learned populations.
[0198] Hash values obtained arithmetically from the feature averages and variances of an arbitrary identifier, its name string, or the phoneme or phoneme-piece sequences of its name may also be used for indexing. Symbols and identifiers carrying length information, such as "discernment-long" and "discernment-short", may be used based on the number of consecutive occurrences or duration of a phoneme and length information classified into several types such as "long, medium, short"; symbols and identifiers carrying position information, such as "phoneme-front" and "phoneme-rear", may be used based on position within the range of a single phoneme; these identifiers and symbols may be combined into symbol strings and identifier strings to construct new identifiers; and the average of the distances or likelihoods output by the evaluation functions over intervals where phonemes, phoneme pieces, and other identifiers continue as evaluation results may be used as the length information for classifying the identifiers described above or as weight information according to identifier length.
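One way to realize the length-tagged symbols mentioned above (for example "discernment-long" and "discernment-short") is to bucket each phoneme's duration into a few classes and append the class to the symbol. The following sketch assumes hypothetical duration thresholds and names.

    def length_tag(duration_sec, short=0.08, long=0.16):
        """Classify a duration into short/medium/long (thresholds are assumptions)."""
        if duration_sec < short:
            return "short"
        if duration_sec > long:
            return "long"
        return "medium"

    def tag_phonemes(timed_phonemes):
        """Turn (phoneme, duration) pairs into length-annotated symbols."""
        return ["%s-%s" % (p, length_tag(d)) for p, d in timed_phonemes]

    print(tag_phonemes([("a", 0.05), ("n", 0.12), ("o", 0.20)]))
    # -> ['a-short', 'n-medium', 'o-long']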
[0199] In this case, using the features described in the prior art mentioned above, the features cited in those documents, or any image recognition technology, features may be extracted from the faces of television performers such as celebrities and recognized on the basis of the extracted features. By associating the recognized identifiers with character information used as program information, such as EPG, BML, RSS, teletext, and subtitles, with phoneme symbol strings from the utterances of users or the audio accompanying the video, or with phoneme and phoneme-piece sequences converted from character information, identifiers for discriminating displayed persons and displayed contents can be provided; identifiers can likewise be provided for displayed objects, and the present invention may be modified in this way.
[0200] Text and character information based on character strings may be combined with any document processing method, and may be realized by combining the feature extraction methods for character strings found in the related patents and literature; the information evaluation methods using co-occurrence states, such as the "embodiment of search and arbitrary processing with multiple identifiers and multiple search conditions" described later, may also be used. Character spacing, character counts, character appearance frequencies, and character co-occurrence frequencies in character string information, word spacing, word counts, word appearance frequencies, and word co-occurrence frequencies in text information, and symbol spacing, symbol counts, symbol appearance frequencies, and symbol co-occurrence frequencies in text information may be combined into features, and identifiers obtained by text analysis and recognition based on those features may be used.
[0201] As an application of the present invention, environmental sounds may be processed as onomatopoeia and evaluated on the basis of phoneme or phoneme-piece recognition; after constructing a co-occurrence matrix between environmental sound features, environmental sound identifiers, or sound effect identifiers and phoneme or phoneme-piece identifiers, the features may be learned to construct new features and new identifiers as onomatopoeic features and onomatopoeic identifiers. Person identifiers based on voice quality and its changes, or emotion identifiers based on voice quality and its changes, may also be used to switch the acoustic model used for recognition and thereby improve the recognition rate.
[0202] Identifiers specified in an arbitrary protocol may also be used in association with the names of the articles they identify. For example, in interface standards such as MIDI, the scheme called General MIDI associates IDs directly with instruments, which can be used to map instrument numbers to instrument names, and a co-occurrence matrix of those IDs and features may be constructed. Similarly, since JAN codes can uniquely fix a target by manufacturer code and item code, ordinary barcodes, two-dimensional barcodes, RFID tags, teletext, closed captions, EPG, subtitles, BML, RSS, and the like may also be used.
[0203] Co-occurrence information based on features and identifiers constructed from co-occurrence states is information based on the fact that arbitrary identifiers, arbitrary features, or the distance and probability information output as recognition results for arbitrary identifiers occurred simultaneously within a specified time range. For example, when a "smile" is recognized in an image, it is a feature newly expressed by the co-occurrence probabilities of the "laughter" phoneme identifier, the "laughing" emotion identifier, and the "laughing" motion identifier recognized in its temporal vicinity and by the co-occurrence covariance matrix of the motion features; based on such newly expressed features, a "laughing state" identifier and its evaluation function or evaluation HMM may be constructed.
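A minimal sketch of gathering such co-occurrence information follows: identifiers detected within a shared time window are counted into a co-occurrence table, which could later back a "laughing state" style composite identifier. The event format, window size, and identifier names are assumptions for illustration.

    from collections import Counter
    from itertools import combinations

    def cooccurrence_counts(events, window_sec=2.0):
        """Count pairs of identifiers whose detection times fall within
        window_sec of each other. events: list of (time_sec, identifier)."""
        counts = Counter()
        for (t1, a), (t2, b) in combinations(sorted(events), 2):
            if abs(t1 - t2) <= window_sec:
                counts[tuple(sorted((a, b)))] += 1
        return counts

    events = [(10.0, "SMILE"), (10.5, "LAUGH_PHONEME"),
              (11.0, "EMOTION_LAUGH"), (40.0, "SMILE")]
    print(cooccurrence_counts(events))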
[0204] In this case, rather than a general time expression, the specified time range may be determined on the basis of units that take into account time-series transitions, such as the number of frames (fields) in a time-series moving image, the degree of deviation of a feature from the average of neighboring frames, or, for reading text aloud, the number of characters, character position, number of words, number of sentences, number of passages, and number of chapters or pages; the character information may include text information and program information.
[0205] Solid (stereoscopic) information may also be generated from 2.5-dimensional and three-dimensional features based on image features obtained from multiple imaging inputs, and the degree of match may be evaluated by distance evaluation between the generated solid information and solid information consisting of polygon information and texture information, by 3D shape match evaluation in stereoscopic image and video search, or by distance evaluation using the coordinate positions relative to the centroid of pseudo-three-dimensional or three-dimensional information and the eigenvalues and eigenvectors of the coordinate groups.
[0206] The scale type is scale information such as "do re mi fa sol la si do" and may include octave information; it may also include tempo, rhythm, and chord information derived from the transitions of scale identifiers along the time axis. It is known that instrument types can be recognized by learning the acoustic information of instruments collectively, and according to the known literature, single-tone recognition accuracy exceeding 90% is achievable.
[0207] Environmental sound types can be recognized from frequency and volume features such as the FFT, cepstrum, mel-cepstrum, directional patterns, and formant extraction; changes in those features over time; audio features based on differences in the volume, phase, and frequency components of sound recorded at different positions; sound source positions based on left-right phase and volume differences; and timbre based on frequency distribution characteristics and pitch transitions. As with instrument identification, sounds such as waves and wind can be recognized by grouping biased information and applying evaluation functions, and as an application this can also be used for machine sound types. More specifically, features and identifiers can be constructed from information such as engine sounds, exhaust sounds, the sound of a steam locomotive, the sound of running on rails, wind, animals, insects, birdsong, waves, trees, horns, screams, shouts, crying, laughter, ground rumbling, and other natural sounds, machine sounds, sounds made by living things, and explosions. Acoustic identifiers may include scale identifiers, volume identifiers, timbre identifiers, and chord identifiers; sound position and sound source direction identifiers can discriminate whether a sound comes from above, below, left, or right; echo state identifiers can discriminate the size of a room from the speed of indoor reflections; instrument types can discriminate trumpet and piano sounds; machine sound identifiers may cover machine sounds, engine sounds, tappet sounds, screw sounds, exhaust sounds, tool sounds, furniture sounds, flight sounds, and noise; natural sound identifiers may cover wind, waves, roars, and explosions, along with environmental sound and sound effect identifiers; and speech identifiers may include language identifiers, speech rate identifiers, exclamation identifiers, cheer identifiers, and jeer identifiers. Combinations of features specialized for these identifiers are conceivable.
[0208] Image type identifiers begin with features such as contours based on luminance differentiation, hue differences, color densities, and their differences, and include: face types for recognizing human faces; expression types and person types recognized from the shape of a face; person types discriminated from gait, clothing, and physique; landscape types detected from the color and shape components of an image that can distinguish deserts, seas, and cities; action types based on sign language, gestures, dance, animal behavior, and machine operation extracted from the temporal transitions of image features; and display position types indicating where those features lie within the display area, so that, for example, a numeral in the upper right of the screen is detected as time information because of its high co-occurrence with time character-image information, or the position is related to the orientation of an object in the display area. Features and identifiers can be constructed from such information: image identifiers such as luminance, saturation, hue, contour, motion, image position, speed, and movement direction identifiers; object identifiers such as animal, plant, machine, tool, furniture, person, material, sign, and landscape identifiers; shape identifiers such as face, expression, mouth shape, clothing, hairstyle, skin, body shape, posture, and waveform shape identifiers; and character identifiers such as language, font, character size, and symbol type identifiers. Combinations of features specialized for these identifiers are conceivable.
[0209] These identifiers and features are processed in the feature extraction units and feature-identifier conversion units in Figs. 14 and 15; although not all identifiers and features are listed for the feature-identifier conversion, character-string/feature-identifier conversion, search result generation, and feature extraction units, they may be regarded as implemented. Features are extracted from natural information according to the feature extraction methods, identifiers are determined from the extracted features using evaluation functions, and features and identifiers are obtained by conversion from input phonetic symbol strings; these can be used for indexing, search, and learning. Since the identifiers and features used together with phonemes, phoneme pieces, and emotion identifiers can be changed according to the implementation, the individual recognition technologies are not the subject of the present invention, whose purpose is the indexing, learning, and search of co-occurrence states accompanying image, audio, and emotion recognition results, the use of search results, and the generation of search conditions from the phonemes and phoneme pieces of the names of the various identifiers.
[0210] Features and identifiers can also be constructed from: character and symbol types distinguished by the meanings and sounds given to symbols; sign types whose meanings are discriminated as graphical symbols; shape types that discriminate corners, curves, and contours as elements of the image features described above; graphic symbol types that discriminate figures and figure elements whose combined meanings are fixed to some extent; and program types for broadcast programs and distributed content, such as guides using EPG, teletext, BML, and RSS for discriminating the contents of broadcast programs as program information. With EPG the program content can be obtained as character strings, and with BML not only the program content but also changes in the situation as the program progresses.
[0211] As for sensor information, temperature sensors, gas sensors, and motion sensors may be added to the present invention, and identifiers may be constructed by classifying whether the input information from those sensors indicates possible danger to human life; co-occurrence information between such identifiers and images and sounds may be collected and used for protective evaluation for human safety by robots and for the safety evaluation of the apparatus itself. Heart rate sensors, electroencephalographic sensors, muscle current sensors, and skin resistance sensors may be combined to constitute a medical psychoanalysis apparatus. In combination with inventions related to pedestrian navigation and car navigation, position identifiers may be acquired from position information such as GPS and associated to perform searches or to learn co-occurrence states, and services and apparatuses based on the co-occurrence states of these features and identifiers may be configured using multilayer Bayes models, multilayer HMMs, multilayer neural networks, and the like for recognition, classification, discrimination, and evaluation.
[0212] Identifiers may also be constructed from different characteristics, such as sounds collecting only a specific noise, sounds collecting only a specific instrument, pianos and drums, dogs and cats, the machine sounds of automobiles and factories, cheers, and scales. Feature extraction and recognition may likewise be applied to video input from outside the apparatus according to the present invention to identify persons and facial expressions from faces, to identify articles, characters, figures, symbols, and signs from shapes and colors, and to identify actions from inter-frame differences and changes in sound source position; these may be recorded in association with the video and audio and used for indexing, and in the future, indexing may be performed on environments, chemical components, and physical properties such as smell, taste, temperature, humidity, weight, hardness, viscosity, density, and size.
[0213] The information processing unit includes: a feature extraction unit that extracts features from the natural information obtained from outside via the information input unit and from content information processable by the information processing apparatus, such as video, images, and audio acquired from the communication line unit or the storage unit, music as audio information, documents, music as score information, still and moving images, polygon data and vector data, and still and moving images based on numerical data; a co-occurrence information learning unit that learns the co-occurrence information obtained from features, search conditions, and search results; an index information generation unit that determines identifiers from the extracted features by recognition processing and performs indexing; an index search evaluation unit that performs searches by evaluating the degree of match between index information and search conditions consisting of features and index identifiers; an evaluation list output unit that outputs search results as an evaluation list; a feature-identifier conversion unit that converts features acquired from content information and user input into identifiers; an identifier-feature conversion unit that converts identifiers recognized by the user, identifiers acquired from outside via storage media or communication, and identifiers extracted internally from content and the like into the standard features of those identifiers; a dictionary extraction unit that extracts information for the desired conversion from the dictionary information storage unit; and a meta-symbol extraction unit that acquires index information such as MPEG7 from content information, acquires information in markup languages such as RSS and XML from the communication line unit, or acquires EPG, BML, RSS, and teletext information from broadcast waves received by the information input unit, and then extracts the commands, variables, and attributes in arbitrary symbol information. Search, detection, and indexing may be performed by combining these as necessary.
[0214] For this reason, identifiers may be constructed from different characteristics, such as sounds collecting only a specific noise, sounds collecting only a specific instrument, pianos and drums, dogs and cats, the machine sounds of automobiles and factories, cheers, and scales; feature extraction and recognition may likewise be applied to video input from outside the apparatus according to the present invention to identify persons and facial expressions from faces, to identify articles, characters, figures, symbols, and signs from shapes and colors, and to identify actions from inter-frame differences and changes in sound source position; these may be recorded in association with the video and audio and used for indexing, and in the future, indexing may be performed on environments, chemical components, and physical properties such as smell, taste, temperature, humidity, weight, hardness, viscosity, density, and size.
[0215] Methods for storing and recording identifiers and features in association with content include: recording them together with time information in a dedicated database; saving an index file as a separate file used together with the video and audio information; inserting them into a video stream such as an MPEG file to update the free areas, comment areas, and meta-information areas of the MPEG file; and distributing them using a markup language such as EPG, BML, RSS, or teletext so that the user receives them and stores them by the methods described above.
[0216] Discriminant functions and HMMs may also be constructed by arbitrarily combining: phonemes and/or phoneme pieces and/or emotion identifiers and/or scale symbols and/or instrument identifiers and/or environmental sound identifiers extracted from the audio information obtained from search conditions and search results; moving-image features and/or face identifiers and/or person identifiers and/or object identifiers and/or expression identifiers and/or motion identifiers and/or display position identifiers extracted from moving images and/or still images; and character strings and character-string co-occurrence information extracted from EPG, teletext, BML, RSS, and websites related to the content. Identifiers corresponding to the constructed HMMs and discriminant functions may be constructed, and using the distances, degrees of match, and HMM output probabilities obtained as discrimination results as features, the learning of co-occurrence information and the construction of identifiers described in the "Examples of Identifier Learning Based on Search, Detection, and Indexing" and the "Examples of Identifier Reconstruction" may be performed; the various features described above may also be combined, classified by multivariate analysis, and given identifiers to construct arbitrary classification evaluation functions.
[0217] In this case, templates for recognizing information such as phonemes, phoneme pieces, emotion identifiers, scale identifiers, instrument identifiers, and environmental sound identifiers, as well as feature extraction algorithms, symbol string match evaluation algorithms, and symbol recognition algorithms, may be acquired and distributed via communication lines using protocols and markup languages such as HTML, XML, RSS, and CGI described later, scripts, programming languages, and binary code; this is detailed in the "Example Procedure for Information Sharing between Users".
[0218] <<Examples of Dictionary Configuration>>
Next, the dictionary function that mutually converts the identifiers and features described above is explained with reference to the dictionary information storage unit 214 of the storage unit 20 and the dictionary extraction unit 106 of the information processing unit 10. These dictionaries can be implemented with general-purpose programs such as databases and with information processing and storage methods based on common algorithms such as hash buffers and map buffers; it is generally well known that the dictionary information used by the dictionary function can be a group of information items related by an index stored on a storage medium, and since this can be implemented arbitrarily by known methods, it depends on the implementation.
[0219] More specifically, a dictionary can be configured by: a step of inputting an identifier as described above and a step of selecting and outputting another identifier associated with the input identifier; a step of inputting an identifier and a step of selecting and outputting a discriminant function associated with the input identifier; a step of inputting an identifier string and a step of selecting and outputting an identifier associated with the input identifier string; a step of inputting an identifier string and a step of selecting and outputting an identifier string associated with the input identifier string; or a step of inputting an identifier and a step of selecting and outputting the standard pattern of another identifier associated with the input identifier or the average value of the identifier group used for the standard pattern. Any of these can be implemented using the method known as an associative array, and by combining them, information conversion between an arbitrary identifier and its associated identifiers, identifier strings, identifier groups, and standard patterns becomes possible.
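Each of the steps above reduces to an associative-array lookup. A sketch chaining two such dictionaries, identifier to identifier and identifier to standard pattern, follows; the dictionary contents and names are hypothetical.

    # Hypothetical associative arrays: each dictionary step is one lookup.
    id_to_id = {"MOTION_NOD": "STATE_NODDING"}           # identifier -> identifier
    id_to_pattern = {"STATE_NODDING": [0.1, 0.9, 0.4]}   # identifier -> standard pattern

    def convert(identifier):
        """Chain lookups: input identifier -> associated identifier -> its pattern."""
        related = id_to_id.get(identifier)
        return related, id_to_pattern.get(related)

    print(convert("MOTION_NOD"))  # -> ('STATE_NODDING', [0.1, 0.9, 0.4])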
[0220] An identifier is information for classifying the information recognized from features by an evaluation function; an identifier string is information in which identifiers of the same family are arranged in time series; and an identifier group is information in which a plurality of arbitrary identifiers are gathered, preferably in a co-occurrence relation.
[0221] First, these dictionaries are indexed by arbitrary keywords, IDs, and the like. More specific examples are composed of symbols, identifiers, variables, and features, as in the control dictionary and the Japanese-phoneme-to-international-phonetic-symbol conversion dictionary. Arbitrary combinations with the identifiers and features described above are also conceivable, such as a Japanese word-to-phoneme-sequence conversion dictionary, a motion-identifier-name-to-phoneme-sequence conversion dictionary, and a face-image-identifier-name-to-phoneme-sequence conversion dictionary. A Japanese word-to-phoneme-sequence conversion dictionary converts the character string "日本語" into the phoneme sequence "n/i/h/o/n/g/o"; a motion-identifier-name-to-phoneme-sequence conversion dictionary converts between an identifier indicating a "nodding motion" and the phoneme sequence symbol "u/n/a/z/u/k/u"; and a face-image-identifier-name-to-phoneme-sequence conversion dictionary converts between an identifier indicating "Taro's face" and the phoneme sequence symbol "t/a/r/o/u".
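The examples above can be written directly as lookup tables; the sketch below copies the entries named in this paragraph, while the identifier names and the reverse-lookup step are assumptions for illustration. The reverse lookup lets phoneme sequences recognized from speech be mapped back to identifiers (many-to-one mappings would need lists of identifiers per key).

    # Word-to-phoneme-sequence dictionary (entry from the example above).
    word_to_phonemes = {"日本語": "n/i/h/o/n/g/o"}

    # Motion- and face-identifier name dictionaries from the same examples.
    motion_to_phonemes = {"MOTION_NOD": "u/n/a/z/u/k/u"}
    face_to_phonemes = {"FACE_TARO": "t/a/r/o/u"}

    # Reverse lookup: phoneme sequence -> motion identifier.
    phonemes_to_motion = {v: k for k, v in motion_to_phonemes.items()}

    print(word_to_phonemes["日本語"])            # -> n/i/h/o/n/g/o
    print(phonemes_to_motion["u/n/a/z/u/k/u"])  # -> MOTION_NOD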
[0222] In this way, one-to-one, one-to-many, and many-to-one correlations are quantitatively recorded and stored, and processing such as conversion between identifiers and conversion between identifier strings and identifiers becomes possible based on the stored information. These dictionaries are composed of groups of reference information for conversion; when a dictionary is many-to-one, the dictionary information may be composed of association information based on co-occurrence information; dictionaries of evaluation functions and identifiers may be composed using the eigenvalues and eigenvectors of co-occurrence information as in the "Examples of Identifier Reconstruction"; dictionaries may be configured so that features and identifiers can be converted via phoneme sequences, phoneme-piece sequences, numeric IDs, and character string IDs; and dictionaries may be configured to convert phoneme sequences, phoneme-piece sequences, numeric IDs, and character string IDs into evaluation functions and identifiers.
[0223] Of course, as an arbitrary dictionary combining the identifiers and features described above, for example, a motion-identifier-to-Japanese conversion dictionary can convert the "nodding motion" identifier into the Japanese name "unazuku", after which the Japanese phoneme dictionary can be consulted to obtain "u/n/a/z/u/k/u", and by observing the co-occurrence state of these features and identifiers an original "nodding state" identifier can be constructed. Then, by evaluating the co-occurrence state of the information of the "Taro's face" identifier obtained by face image recognition and the "nodding state" identifier, a new "Taro nods" identifier may be constructed.
[0224] In this way, by constructing a dictionary associating identifiers with language-dependent words for each of the features and identifiers described above, a dictionary combining arbitrary identifiers and features may be built, and it may be rebuilt again using previous reconstruction results. By associating them with abstract words, adverbs, adjectives, and unknown nouns, the co-occurrence states of features for those words and phoneme sequences can be learned and used as identifiers and features for search. Hash values may be computed by arithmetic processing such as MD5 or CRC from the phoneme and phoneme-piece sequences associated with those identifiers, and the information recorded in the database may be stored in association with the phoneme and phoneme-piece sequences and the hash values, enabling efficient search within the dictionary by identifiers and features related to phonemes and phoneme pieces. The dictionary may be configured to allow mutual conversion between different identifiers, between identifiers and features, between identifiers and phoneme or phoneme-piece sequences, between features and phoneme or phoneme-piece sequences, and between phoneme or phoneme-piece sequences and phonemes, phoneme pieces, or other phoneme or phoneme-piece sequences, and DP that evaluates hash values against one another may also be used.
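For the DP matching mentioned above, a standard edit-distance dynamic program over phoneme symbols is one concrete realization. The following is a sketch; the unit cost scheme is an assumption.

    def dp_distance(seq_a, seq_b):
        """Edit distance between two phoneme sequences by dynamic programming;
        unit costs for insertion, deletion, and substitution are assumed."""
        n, m = len(seq_a), len(seq_b)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[n][m]

    print(dp_distance("unazuku", "unaziku"))  # -> 1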
[0225] Further, by using the indexes employed in the present invention, which associate moving images, still images, and audio with the co-occurrence information of detected identifiers and with the names of identifiers written as phonemes or phoneme pieces, as dictionary indexes, a dictionary can be constructed from the correlations of arbitrary features and identifiers extracted from video and audio. By combining dictionaries, new dictionary information can be constructed as conversion tables that convert between image, acoustic, and emotion identifiers and phoneme sequences, phoneme-piece sequences, and character strings, or that evaluate the co-occurrence states of identifiers and features related to images, sound, speech, and emotion. In this example, phoneme sequences are used for simplicity of notation, but a dictionary structure using phoneme-piece sequences is also possible; since these co-occurrence dictionaries depend on combinations of features and identifiers extracted by known techniques from video, audio, and other sensors, they depend on the implementation.
[0226] Then, based on the phoneme sequences, phoneme-piece sequences, and emotion identifiers obtained by recognizing natural utterances, an arbitrary word string can be selected and converted by the conversion dictionary into an arbitrary identifier or feature associated with that word string for search; a search can be performed using an identifier evaluation function associated by the conversion dictionary with an arbitrary phoneme sequence, phoneme-piece sequence, or identifier; speech can be searched directly by phoneme sequences, phoneme-piece sequences, and emotion identifiers; keywords registered in the dictionary can be converted into phoneme or phoneme-piece sequences and used in utterance phoneme recognition; phoneme sequences, phoneme-piece sequences, and identifiers not registered in the conversion dictionary can be registered in it; and conversion dictionaries can be constructed from their co-occurrence information.
[0227] These dictionaries are not limited to phonemes and phoneme-piece sequences; they may be co-occurrence dictionaries constructed from the co-occurrence states of the arbitrary identifiers and features described elsewhere. By having the user assign an arbitrary name to a co-occurrence state, conversion from the arbitrary name to co-occurrence information, or conversion into phoneme or phoneme-piece sequences based on words of an arbitrary language associated with the co-occurrence information, can be performed. Speech may be synthesized from recognized phoneme or phoneme-piece sequences, searches may be performed based on phoneme or phoneme-piece sequences, and the phonetic characters and words associated with phoneme or phoneme-piece sequences may be displayed to the user to ask for the user's judgment.
[0228] Note that the conversion between identifiers and features may be performed in both directions using the "Example Methods for Converting Natural Information into Features" and the "Example Methods for Converting Features into Identifier Strings".
[0229] <<Example Methods for Converting Natural Information into Features>>
Next, the feature extraction function that converts natural information into the features required for indexing and search is explained with reference to the feature extraction program stored in the program storage unit 210 of the storage unit 20 and the feature extraction unit 116 of the information processing unit 10. These feature extraction functions can be implemented by general-purpose programs based on a variety of known, common algorithms, and are essentially implementation-dependent. [0230] For moving and still images, motion features are extracted from features used in character and image recognition, such as luminance distributions, hue extraction, mesh extraction such as spider-net extraction, displacement patterns of local autocorrelation between frames, changes in image shape and moving images, and inter-frame differences; these are extracted as video and image features and can be combined with autocorrelation coefficient extraction, higher-order autocorrelation extraction, and the like. For audio features, frequency and volume features and their change features can be extracted using the FFT, cepstrum, mel-cepstrum, directional patterns, formant extraction, rhythm extraction, harmonics extraction, autocorrelation coefficient extraction, and higher-order autocorrelation extraction.
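As one concrete instance of such frequency-feature extraction, short frames of a waveform can be reduced to log power spectra whose frame-to-frame differences give the change features mentioned above. The NumPy sketch below is illustrative only; the frame length, hop size, and windowing choice are assumptions.

    import numpy as np

    def log_power_frames(signal, frame_len=256, hop=128):
        """Split a waveform into frames and compute log power spectra (FFT)."""
        frames = [signal[i:i + frame_len]
                  for i in range(0, len(signal) - frame_len + 1, hop)]
        window = np.hanning(frame_len)
        spectra = [np.log(np.abs(np.fft.rfft(f * window)) ** 2 + 1e-10)
                   for f in frames]
        return np.array(spectra)

    # Synthetic test signal: a 440 Hz tone sampled at 8 kHz.
    t = np.arange(8000) / 8000.0
    feats = log_power_frames(np.sin(2 * np.pi * 440 * t))
    deltas = np.diff(feats, axis=0)  # simple time-transition (change) feature
    print(feats.shape, deltas.shape)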
[0231] For audio, usable features include frequency components, frequency distributions, volume, sound source direction, their differences, and multi-order difference features such as differences of differences, as well as values based on the mean, variance, and standard deviation of this information and the exponent parts of those values. For images, they include color distributions, luminance distributions, saturation distributions, color differential and integral values, luminance differential and integral values, saturation differential and integral values, similarly analyzed RGB values, HSV values, YR-YB-Y values, and YCM values, the frequency component distributions of each, multi-order difference features such as differences of color, luminance, and frequency and differences of differences, and values based on the mean, variance, and standard deviation of this information and the exponent parts of those values; features based on the display position of an object within the image area for recognized image-related identifiers and features; for moving images, the time-axis transitions of the image features mentioned above; and for stereoscopic images, 2.5-dimensional features, the various three-dimensional features restored from 2.5-dimensional features, three-dimensional image coordinate information used in CG, three-dimensional texture information, three-dimensional motion information, three-dimensional color change information, three-dimensional light source change information, and three-dimensional hardness texture information, arbitrary image recognition and 2.5-dimensional image feature extraction, and combinations of such feature extraction methods. Identifiers such as time information, weather information, season information, regional information, and cultural information recognized from those features can also be used.
[0232] Many methods have been proposed for capturing changes in information along the time axis, the spatial axis, physical quantities, visual changes, auditory changes, and human subjective axes and observations, and using them as features and identifiers; the features described in the prior art mentioned above and combinations of the various features cited in those documents can also be used, depending on the implementation. This processing of converting natural information into features corresponds to a step of inputting natural information and a step of converting it into features.
[0233] «Example of a method for converting feature quantities into identifier strings»
Next, the feature-quantity/identifier conversion function (or recognition function), which evaluates the similarity between a feature quantity and an arbitrary identifier by probability and/or distance or likelihood, as required for indexing and searching, will be described with reference to the feature-quantity/identifier conversion program stored in the program storage unit 210 of the storage unit 20 and the feature-quantity/identifier conversion unit 120 of the information processing unit 10. These functions can likewise be implemented by general-purpose programs embodying a variety of well-known, general algorithms, and are essentially implementation-dependent.
[0234] A number of such methods have previously been proposed. For example, feature quantities classified under the same identifier may be given to an HMM, whose transition probabilities and output probabilities are trained so that the HMM can be used as an evaluation function. Alternatively, distance functions may be used: after obtaining the mean, variance, and covariance matrix of the feature quantities classified under the same identifier, the eigenvalues and eigenvectors are computed to construct a distance function, and a Bayes discriminant function or a Mahalanobis distance function is used to obtain the distance between the centroid of the identifier's information group and an input sample; or simply a Euclidean distance function between the input sample and the mean vector of the identifier group may be used. Since these procedures are implementation-dependent, any such method can be employed.
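As a minimal sketch of the distance-function approach just described, the following Python fragment constructs one Mahalanobis evaluator per identifier from its classified samples and selects the identifier with the nearest centroid; the data in training_sets and input_feature are hypothetical placeholders, not material from the disclosure.

```python
import numpy as np

def make_mahalanobis_evaluator(samples):
    # samples: (n, d) array of feature vectors classified under one identifier
    mean = samples.mean(axis=0)                         # centroid of the identifier group
    inv_cov = np.linalg.pinv(np.cov(samples, rowvar=False))  # pseudo-inverse for stability
    def distance(x):
        d = np.asarray(x) - mean
        return float(np.sqrt(d @ inv_cov @ d))          # distance from the centroid
    return distance

# Hypothetical training data: two identifiers with 2-D feature samples.
rng = np.random.default_rng(0)
training_sets = {
    "phoneme_a": rng.normal([0.0, 0.0], 0.5, (100, 2)),
    "phoneme_i": rng.normal([3.0, 3.0], 0.5, (100, 2)),
}
evaluators = {k: make_mahalanobis_evaluator(v) for k, v in training_sets.items()}

input_feature = [2.8, 3.1]
# The identifier whose centroid is nearest is taken as the recognition result.
print(min(evaluators, key=lambda k: evaluators[k](input_feature)))
```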
[0235] With such evaluation functions, the similarity between a feature quantity extracted from the input natural information and a symbol or identifier can be evaluated numerically. The identifier associated with the function whose centroid is closest to the feature quantity of the input sample, the identifier associated with the HMM of highest likelihood, the identifier associated with whichever distance function indicates the shortest distance, or the identifier associated with the population to which the sample most probably belongs, is recognized as the evaluation result of the evaluation function. In this way, recognition of words, phonemes, phoneme pieces, objects, characters, faces, visemes and facial expressions, emotions, sounds, musical instruments, and motions is carried out, and an identifier symbol string can be obtained that tracks the time-series changes of these recognition results.
[0236] With such a method, selecting the correct identifier for an input feature quantity from among a plurality of identifiers is a similarity evaluation in which the distance between the evaluation function of the identifier to be selected and the input feature quantity being compared becomes minimal, or the output probability becomes maximal. For an input feature quantity V that is known in advance to have identifier X, similarity evaluation is performed with identifier evaluation functions X, Y, and Z; if the evaluation function that outputs the value judged most similar is X, the recognition of the identifier can be judged successful.
[0237] As recognition methods using probabilistic models, the following can be used: an uncorrelated normal probability distribution (diagonal normal probability distribution) that considers only the diagonal components of the covariance matrix; a fully correlated normal probability distribution that considers all components of the covariance matrix, which is exact but whose model parameters are difficult to estimate accurately when data are scarce; a mixture normal probability distribution (uncorrelated or fully correlated) that uses a model expressed as a sum of several normal distributions; and a discrete probability distribution that partitions the feature-vector space using vector quantization (VQ). With these, the distance from the centroid of each population or the membership probability is evaluated for the input feature quantity, and an identifier is obtained as the recognition result.
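A minimal sketch of the diagonal (uncorrelated) normal model mentioned above might look as follows; the variance floor and the demo data are illustrative assumptions, and a mixture model would combine several such weighted components per identifier.

```python
import numpy as np

def make_diag_gaussian_scorer(samples):
    # Uncorrelated (diagonal) normal model: only per-dimension mean and variance,
    # i.e. the diagonal components of the covariance matrix.
    mean = samples.mean(axis=0)
    var = samples.var(axis=0) + 1e-9          # variance floor avoids division by zero
    def log_likelihood(x):
        d = np.asarray(x) - mean
        return float(-0.5 * np.sum(d * d / var + np.log(2 * np.pi * var)))
    return log_likelihood

# The identifier whose model yields the highest log-likelihood is recognized.
scorer = make_diag_gaussian_scorer(np.random.default_rng(1).normal(0.0, 1.0, (200, 3)))
print(scorer([0.1, -0.2, 0.3]))
```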
[0238] Phonemes, phoneme pieces, emotion identifiers, and other arbitrary identifiers are generally evaluated using evaluation functions that obtain a likelihood via the various distance functions and probability functions described above. Indexing can then be performed by segmenting the content along its time axis or display positions and evaluating each segment in turn to assign an identifier, or by dividing the time axis into arbitrary unit times and evaluating each frame in turn so that time-series feature-quantity identifiers are assigned; in this way the feature quantities for indexing are converted into identifiers.
[0239] Here, for audio, the data of one frame of FFT, cepstrum, mel-cepstrum, or directional patterns may be a vector of any dimensionality; for image features of moving or still images, one frame may consist of any pixel size. Inter-frame error vectors and inter-pixel error vectors may be given in any dimensionality, and frame-difference feature quantities of any width, or feature quantities obtained by accumulating frame differences, may be used. How the feature quantities are taken at this point is implementation-dependent, so any method may be used.
[0240] Of course, besides the Euclidean distance, distance calculation may use any measure that can serve as a distance: the Mahalanobis distance, the output of a Bayes discriminant function, probability values based on the reciprocal of a probability or on a base such as the natural logarithm, the exponent part of values to such a base, the city-block distance, the chessboard distance, the octagonal distance, the Hetas distance, and the Minkowski distance, as well as similarity measures and distances obtained by weighting such distances, distance calculation methods using combinations of eigenvalues and eigenvectors, and combinations of eigenvalue/eigenvector norms, maximum eigencomponents, and the like.
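By way of illustration, a few of these interchangeable distance functions could be sketched as follows (assuming vector inputs; the parameter choices are arbitrary):

```python
import numpy as np

def city_block(a, b):                 # L1 distance
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))

def chessboard(a, b):                 # L-infinity (Chebyshev) distance
    return float(np.max(np.abs(np.asarray(a) - np.asarray(b))))

def minkowski(a, b, p=3.0):           # generalizes city block (p=1) and Euclidean (p=2)
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p))

def weighted_euclidean(a, b, w):      # per-dimension weighting of a distance
    d = np.asarray(a) - np.asarray(b)
    return float(np.sqrt(np.sum(np.asarray(w) * d * d)))

print(city_block([0, 0], [3, 4]), chessboard([0, 0], [3, 4]), minkowski([0, 0], [3, 4], 2.0))
```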
[0241] More specifically, in the step in which natural information is input, the output of, for example, a sensor device with A/D conversion for audio or images is received. Next, in the step of extracting feature quantities from the natural information, feature quantities are extracted by the method best suited to the identifier: for audio, FFT, cepstrum, mel-cepstrum, directional patterns, and the like; for images, delta information of luminance and saturation, contour information, delta information from time-axis differences, and the like.
[0242] Next, a step of evaluating the feature quantities by recognition using Bayes classifiers, HMMs, or distance functions is carried out, and on the basis of the evaluation, a step of selecting the identifier with the highest probability or the shortest distance is carried out. By outputting the selected symbol or identifier as the recognition result, phoneme and phoneme-piece symbols, emotion identifiers, image identifiers, face IDs, recognized characters, environmental-sound IDs, mechanical-sound identifiers, landscape identifiers, musical-scale identifiers, and the like are obtained and used for indexing. These procedures are carried out as steps in which identifiers are evaluated, selected, and output using a plurality of evaluation functions, through an evaluation-function processing step and a step of confirming that the evaluation functions have been exhausted.
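One possible shape for this loop over a plurality of evaluation functions is sketched below; the registry layout mapping identifier types to scoring functions is an assumption for illustration, not the patent's fixed structure.

```python
# Hypothetical registry layout: identifier type -> {identifier: distance function}.
# Each distance function (e.g. the Mahalanobis evaluators sketched earlier)
# returns a smaller value for a more similar feature quantity.
def recognize_all(feature, evaluator_registry):
    results = {}
    for id_type, evaluators in evaluator_registry.items():   # e.g. "phoneme", "emotion"
        results[id_type] = min(evaluators, key=lambda ident: evaluators[ident](feature))
    return results            # e.g. {"phoneme": "a", "emotion": "joy"}

demo_registry = {"phoneme": {"a": lambda f: abs(f - 1.0), "i": lambda f: abs(f - 5.0)}}
print(recognize_all(2.0, demo_registry))   # {'phoneme': 'a'}
```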
[0243] Here, if the processor can handle analog values, analog values may be input directly; if the processor can evaluate analog values, evaluation calculations, matching, and the associated recognition and search processing may be performed on the analog values as they are, or digital values may be converted into analog values for the evaluation calculations.
[0244] For indexing by audio-related identifiers, it is possible to use instrument-type identifiers and feature quantities that measure the distance to populations of sounds collected for each instrument; mechanical-sound-type identifiers and feature quantities that measure the distance to populations of sounds collected for each mechanical sound, such as engine sounds, exhaust sounds, and door sounds; and environmental-sound-type identifiers and feature quantities that measure the distance to populations of sounds collected for each environmental sound, such as the sound of wind, the sound of waves, and the cries of birds and animals.
[0245] For indexing by image-related identifiers, based on the image type, identifiers may be used for indexing by employing: person types based on face type, clothing, and build, and motion types underlying gestures and facial expressions, in order to distinguish the people in a video; landscape types and/or image-position types; character/symbol types obtained by recognizing characters written on signboards or building surfaces; sign types for distinguishing traffic restrictions indicated by road signs; shape types such as cars, ships, desks, and telephones; and graphical-symbol types such as toilets and emergency exits.
[0246] «Example of an information indexing method»
Next, indexing by an apparatus according to the present invention will be described. As an indexing method, one could index by recognition at every search, but since index information, once constructed, can be reused any number of times as long as the content does not change, indexing may be carried out at any timing: when the content is first registered in the storage unit, when it first becomes a search target, or when the frequency of external use of the apparatus itself drops after initial registration. The content information may also be made to appear registered for handling by external apparatuses only after its indexing has finished.
[0247] This indexing is not limited to indexing recorded content by performing recognition-based indexing at an appropriate unit time (for example, every 16 milliseconds) at the time of recording; index information may also be distributed in real time, indexing a live broadcast program simultaneously with its broadcast.
[0248] First, the indexing apparatus according to the present invention executes the audio/video input step (S0201), thereby acquiring content information from outside. The content acquired here is not limited to video and audio as described above; it may be arbitrary content information such as still images, document information, BML, EPG, recognized subtitles, and character strings contained in video.
[0249] Next, the indexing procedure will be described. Content information acquired through the information input unit 30, the communication line unit 50, or a storage unit using an exchangeable storage medium is subjected to the feature-quantity extraction step S0202, in which the feature-quantity extraction unit 116 converts it into numerical data serving as feature quantities.
[0250] For the feature quantities used in this conversion step S0202, extraction methods for moving images, still images, audio, and text have been proposed, as described above in 'Examples of converting natural information into feature quantities', 'Examples of feature quantities and identifiers', and 'Prior art'. In the feature-quantity extraction unit 116, feature quantities are extracted by feature classification and feature extraction methods such as a still-image feature extraction unit and a moving-image feature extraction unit based on visual information; an emotion feature extraction unit, a phoneme feature extraction unit, and a phoneme-piece feature extraction unit based on auditory information; and a program-information extraction unit based on character information.
[0251] More specifically, the feature quantities may be cepstra for speech waveforms, delta signals of luminance or hue for image features, co-occurrence probabilities of characters or words for text, or symbol strings of phonemes or phoneme pieces expanded from EPG or BML data, and any known feature-quantity extraction method may be used.
[0252] Next, the feature quantities are symbolized and assigned identifiers by the feature-quantity/identifier conversion unit 120 in step S0203. By recognizing the feature quantities as described above in 'Example of a method for converting feature quantities into identifier strings', step S0204 is executed, which indexes the content by associating arbitrary feature quantities, or identifiers recognized using them, of the conventional kinds listed in 'Examples of feature quantities and identifiers' and 'Prior art', with the time series of the content information, thereby constructing the index information.
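A minimal sketch of such a time-series indexing step might look as follows, reusing the hypothetical recognize_all helper sketched earlier; the 16-millisecond frame interval follows the example given above for recording-time indexing.

```python
# Minimal sketch of step S0204: associate recognized identifiers with the
# content time series. FRAME_SECONDS and recognize_all (sketched earlier)
# are illustrative assumptions, not the patent's fixed design.
FRAME_SECONDS = 0.016                      # e.g. one recognition per 16 ms frame

def build_index(frame_features, evaluator_registry):
    index = []                             # list of (time in seconds, {type: identifier})
    for i, feature in enumerate(frame_features):
        index.append((i * FRAME_SECONDS, recognize_all(feature, evaluator_registry)))
    return index
```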
[0253] More specifically, the identifiers may be phonemes, phoneme pieces, emotion identifiers, and environment identifiers for speech waveforms; shape identifiers, face identifiers, facial-expression identifiers, character identifiers, object identifiers, and motion identifiers for moving images and still images; and word identifiers or word co-occurrence-state identifiers for text. Together with the process of associating content with such identifiers, advertisements whose feature quantities or identifiers are similar may also be associated.
[0254] The constructed index information is then made available for the user's searches by having the index symbol-string synthesis unit 110 record it in the MPEG information as an additional stream or as an addition or modification to existing MPEG-7 information, by recording the index information as a separate file in the information recording/accumulation unit 22, or by recording it in a dedicated database composed of the information recording/accumulation unit 22 and the information processing unit.
[0255] Through such indexing processing, symbol strings based on several types of feature quantities and identifiers are generated in association with the content and can be organized as 'index co-occurrence information', and content information accompanied by metadata using this 'index co-occurrence information' can be constructed.
[0256] A figure showing the feature-quantity/identifier symbol conversion unit in more detail reveals that a plurality of identifiers and feature quantities are evaluated in association with one another. That is, the co-occurrence information in the present invention recognizes, in association with each other, changes of emotion accompanying audio, changes of sound accompanying changes of video, changes of emotion accompanying changes of video, and changes of subtitles, EPG, BML, RSS, and teletext accompanying changes of video and audio. The content is indexed by phonemes and/or phoneme pieces and/or emotion identifiers, and likewise by identifiers such as other musical scales, environmental sounds, recognized character strings, and image identifiers. The invention is characterized in that this information is constructed on the basis of correlated changes in the content and is used to perform searches and to learn feature quantities extracted from search conditions and search results so as to construct new identifiers.
[0257] In the step of learning co-occurrence states, the co-occurrence states of the various identifiers and feature quantities may be learned while indexing the content, classified autonomously by methods such as quantification analysis type IV, and indexed cluster by cluster; the user may then assign an arbitrary character string or phoneme/phoneme-piece string to each classified cluster and use it for searching.
[0258] «Example of a method for converting identifier strings into feature quantities»
Next, a method for converting the identifiers required for searching and dictionary construction into feature quantities will be described.
[0259] First, a step is carried out in which symbol strings or identifier strings requiring conversion are input by the user or within the apparatus. A target extraction step is then executed that extracts, from that information, the necessary tags and attributes in the case of a markup language, or, in the case of an ordinary input character string, phoneme strings, phoneme-piece strings, or arbitrary identifiers via a conversion dictionary based on the input words.
[0260] Next, if necessary, identifier subdivision processing is executed that converts the obtained identifiers from phonemes into phoneme pieces and from images into image elements. An image element here is a partial element of an image: taking a face image as an example, the face image represents the whole face, whereas face image elements are elements assigned identifiers based on a classification in which arbitrary image tendencies, such as the parts composing the face (eyes, nose, mouth), are separated out as components.
[0261] Next, to convert an identifier into a feature quantity, an identifier-average setting step is carried out using the sample mean of the corresponding identifier, and a feature quantity constituted by that mean value is output. Since the feature quantity given by the mean value converted in accordance with the identifier is always a value representing the centroid of the population, giving it to the identifier evaluation function always yields a distance of zero between the identifier's centroid and the feature quantity, so it is recognized correctly. [0262] This conversion makes it possible to evaluate the distance between a feature quantity Y converted from an arbitrary identifier X and a feature quantity W converted from a different arbitrary identifier V, realizing a distance evaluation that goes beyond mere symbol matching between identifiers. It thus becomes possible to evaluate the distance between identifiers that use the same feature quantities, and to construct a conversion dictionary from identifiers to feature quantities.
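A minimal sketch of such an identifier-to-feature-quantity conversion dictionary, under the assumption that each identifier's classified samples are available, might be:

```python
import numpy as np

# Each identifier maps to the mean vector (population centroid) of its samples,
# so that even different identifiers can be compared by distance.
def build_conversion_dictionary(training_sets):   # training_sets is hypothetical
    return {ident: np.asarray(s).mean(axis=0) for ident, s in training_sets.items()}

def identifier_distance(dictionary, ident_x, ident_v):
    # Distance between identifier X and identifier V via their mean features.
    return float(np.linalg.norm(dictionary[ident_x] - dictionary[ident_v]))

d = build_conversion_dictionary({"maru": [[1.0, 0.0], [1.2, 0.2]],
                                 "batsu": [[0.0, 1.0], [0.2, 1.2]]})
print(identifier_distance(d, "maru", "batsu"))
```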
[0263] For audio, identifiers related to arbitrary sounds, not only speech information tied to language, such as musical scales, environmental sounds, noise, laughter, and emotion features obtained from the voice, can be converted from identifiers into feature quantities by using mean values: for a scale identifier, the mean feature quantity of that scale; for an environmental sound, that of the sound type concerned, such as the sound of waves or of wind; and for an emotion identifier, that of the feature type accompanying the emotion.
[0264] For images, registered identifiers related to arbitrary shapes, not only basic figures such as circles ('maru'), crosses ('batsu'), triangles, and squares, but also road signs, human faces, fingerprint images, landscape images, vehicle types, buildings, and characters, as well as motion identifiers related to movement direction and speed, can be converted from identifiers into feature quantities by using the mean values of the feature quantities used for recognition.
[0265] Then, by selecting an image identifier from a character string such as 'maru' (circle) or 'batsu' (cross) through a conversion dictionary, converting it into the phoneme or phoneme-piece string associated with that image identifier, and turning that into an audio feature quantity, a search can be performed for the places where 'maru' or 'batsu' is uttered; or, from the image feature quantities associated with the image identifiers of 'maru' or 'batsu', a search can be performed for the places where a circle or a cross is displayed. In this way, searches that mutually convert between identifiers serving different purposes become possible.
[0266] When an arbitrary image is the search target, that image may be subdivided and identifiers such as image-element symbol strings or image-fragment symbol strings may be constructed according to the surrounding shapes and hues. An optimal identifier string may be constructed by obtaining identifier transition probabilities for the arrangement of identifiers in accordance with spatial front/back/left/right changes in the image features, or the motion identifiers may be changed into an identifier string with an optimal spatial and time-series arrangement according to the time-series changes of those features, before the feature quantities are constructed.
[0267] «Example of a method for evaluating matches between feature quantities or between identifier strings»
Next, a method for evaluating the matches between feature quantities and between identifiers, as required for searching, will be described.
[0268] First, the use of a distance function is well known as a method for evaluating feature quantities against each other; since feature quantities are generally constituted as vectors, the Euclidean distance between them is measured. More specifically, for a first input vector and a second input vector obtained by the same feature extraction method, the distance is obtained as the accumulation of the squared differences of each element of the feature vectors. Other distance functions are described separately, but in this way the inter-vector distance can be measured by giving the distance function two vectors of the same dimensionality obtained by the same feature extraction method.
[0269] In general, when measuring the distance between a feature quantity and an identifier, a standard pattern is used in which the mean vector of the feature quantities classified into the same population serves as the evaluation reference. The method of evaluating the distance to the population centroid by measuring the distance between the input feature vector under evaluation and the standard pattern serving as the evaluation reference is generally well known. Any method may be used depending on the implementation, and membership in the population may also be judged by establishing a 3σ boundary, a statistical test boundary, or an empirically determined boundary from the mean and variance of the population.
[0270] Thus, while the distance between feature quantities can easily be obtained by any known method, it cannot naively be used to judge whether the identifiers associated with the feature quantities match, so the user needs to set an arbitrary threshold: for example, if the input feature quantity under evaluation deviates from the mean feature quantity of the samples classified into the same population by more than 3σ of the standard deviation, it is judged a mismatch, and if by less, a match. In this way the match or mismatch of feature quantities, and of the identifiers accompanying them, can be determined, and the match or similarity between the 'index co-occurrence information' and the 'search condition co-occurrence information' can also be evaluated.
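One possible reading of this 3σ rule is sketched below; since the exact threshold is explicitly left to the user, the per-dimension test here is an illustrative assumption.

```python
import numpy as np

def matches_within_3_sigma(feature, samples):
    # Judge a match if the feature deviates from the population mean by no
    # more than 3 standard deviations in every dimension.
    samples = np.asarray(samples)
    mean = samples.mean(axis=0)
    sigma = samples.std(axis=0) + 1e-9        # floor avoids zero-variance dimensions
    return bool(np.all(np.abs(np.asarray(feature) - mean) <= 3.0 * sigma))
```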
[0271] Next, DP matching and the like are well known as methods for evaluating the match or mismatch between identifier strings; the correct identifier string can be selected from identifier strings of arbitrary length, made up of any combination of identifiers, by comparing distances or probabilities. More specifically, 'a,a,a,a,b,b,b,b' and 'a,a,a,a,a,a,b,b' match 100% in the symbols that appear and their order, while 'a,a,a,a,b,b,b,b' and 'a,a,a,c,c,b,b,b' are evaluated as a 75% match. For the match evaluation of identifier strings, arbitrary matching functions such as CDP, Shift-CDP, mp-CDP, RIF-CDP, and Self-applicative CDP may be used in the implementation as needed.
[0272] With DP matching (dynamic programming), the similarity can be computed efficiently while establishing correspondences (alignment) between the elements of two symbol strings, so the match rate between the searched symbol string and the search-request symbol string can be expressed as a percentage.
[0273] Here, for identifier strings consisting of a plurality of frames, the evaluation result is formed as '0' if the identifiers of a frame match and '1' if they do not, and the accumulation of the evaluation results over the frames is generated. If all frames match, the accumulated value is '0' and the mismatch degree is 0%; if all frames mismatch, the accumulated value equals the number of frames and the mismatch degree can be evaluated as 100%.
[0274] Since sample frame lengths generally vary, the difference in length can be corrected by dividing the accumulated distance resulting from DP matching by the sum of the frame counts of both strings. By sequentially matching and evaluating the sample against standard templates corresponding to arbitrary identifier types, the identifier whose matching-function result gives the smallest distance (the smallest accumulated distance), that is, the highest match rate, can be output as the recognition result.
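A minimal sketch of both evaluations follows: the per-frame mismatch accumulation of paragraph [0273] and a DP alignment whose accumulated distance is normalized by the sum of both frame counts as described here, using the example strings from paragraph [0271]. The unit insert/delete/substitute costs are an illustrative assumption.

```python
def frame_mismatch_rate(seq_a, seq_b):
    # Paragraph [0273]: 0 per matching frame, 1 per mismatching frame,
    # accumulated over equal-length identifier strings.
    assert len(seq_a) == len(seq_b)
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)

def dp_distance(seq_a, seq_b):
    # Paragraph [0274]: DP alignment with unit insert/delete/substitute costs,
    # normalized by the sum of both frame counts to correct length differences.
    n, m = len(seq_a), len(seq_b)
    d = [[i + j if i * j == 0 else 0 for j in range(m + 1)] for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[n][m] / (n + m)

a = "a,a,a,a,b,b,b,b".split(",")
b = "a,a,a,c,c,b,b,b".split(",")
print(1 - frame_mismatch_rate(a, b))   # 0.75: the 75% match of the example above
print(dp_distance(a, b))               # 0.125: two substitutions over 8 + 8 frames
```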
[0275] Here, when the identifiers output frame by frame along the time axis are the same in succession, indexing may be performed by detecting that the identifier has changed between frames in the time series and consolidating the consecutive identifiers; the number of consecutive frames may be used to weight the match evaluation, with identifiers judged to match if the difference in weight between the same identifiers is small; a match-degree evaluation function may be constructed using the transitions of a plurality of identifier distances in the time series, taking the distance from the population centroid of the time-series identifiers as a feature quantity; identifier information of one frame per 120 seconds may be reduced to one frame per 20 seconds or, conversely, increased to one frame per 240 seconds; matching may be performed using feature-quantity means and variances, or hash values of designation character strings, phoneme strings, or phoneme-piece strings, or DP between such hash values; and the distances or distance means output from the distance evaluation function over consecutive intervals of an identifier may be used to evaluate the identifier's boundaries.
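The consolidation of consecutive identical frame identifiers mentioned above could be sketched as follows, with the run length available as a weight:

```python
from itertools import groupby

def consolidate(frame_labels):
    # Consolidate consecutive identical frame identifiers into
    # (identifier, run_length) pairs; run_length can serve as a weight.
    return [(ident, len(list(g))) for ident, g in groupby(frame_labels)]

print(consolidate(["a", "a", "a", "b", "b", "a"]))  # [('a', 3), ('b', 2), ('a', 1)]
```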
[0276] That a correct identifier string is selected means that the distance between the correct identifier string X and the identifier string V being compared becomes minimal, or the probability becomes maximal. For a sample identifier string V known in advance to be the specific string X, match evaluation is performed against identifier strings X, Y, and Z with the matching function; if the identifier string selected as the best match is X, the recognition is judged successful.
[0277] Further, if the method described above, of converting identifiers into feature quantities and evaluating them by distance, is used to judge whether identifiers match, the distance between apparently different identifiers can also be evaluated, and a search can be realized by evaluating the distance between identifiers converted into continuous feature quantities through the accumulation of distances. If the distance between feature quantities is small, a value close to '0' signifies a match; conversely, a large number signifies a mismatch. Normalization becomes possible by dividing by the number of consecutive frames, enabling quantification. Of course, the evaluation method can be changed according to the implementation, for example by judging a match if the value is within 3σ of the sample mean and variance, by taking the reciprocal in the calculation, or by using a match evaluation method with the inverted logical structure in which a match yields '1'.
[0278] Generally well-known methods include DP and CDP, as well as search methods and match evaluation methods specialized for voice, music, and video. Various application examples and patent applications exist for these methods, and any of them can be selected depending on the implementation.
[0279] The time-series changes of the identifiers are then output by a match-degree evaluation procedure such as DP or CDP; using the obtained evaluation values, the degrees of match may be displayed on the screen, ranked, and presented as a list, or announced by speech synthesis.
[0280] «Example of an information search method»
Next, searching by an apparatus according to the present invention will be described.
[0281] The search apparatus according to the present invention assumes that the various contents have been indexed as described above. This indexing may be of real-time distribution information, such as a recorded television broadcast program indexed at an appropriate unit time (for example, every 16 milliseconds) at the time of recording; it may be condensed so that only the points where changes occur between frames are recorded; the index information may be distributed via EPG, BML, RSS, teletext, or the like, or written alongside and associated with DVD files; and for a text file, index information may be constructed for each word, sentence, section, or chapter. The indexed information is then searched by converting the user's input into identifiers matching those used for the indexing.
[0282] Next, the search apparatus executes the voice/character-string input step and specifies search conditions for the indexed content. The specification of search conditions falls broadly into those by voice, those by character string, and those by moving or still image. For voice search, there are: a method of recognizing phonemes, phoneme pieces, and emotion identifiers from the user's utterance or the audio used for the search and executing the search directly with the phoneme or phoneme-piece strings; a method of referring to the identifier conversion dictionary with the recognized phonemes and phoneme pieces and including other feature quantities and identifiers associated with the phoneme or phoneme-piece strings in the search conditions; and a method of referring to the command dictionary on the basis of the recognized phonemes and phoneme pieces and searching with other feature quantities and identifiers associated with the phoneme or phoneme-piece strings after removing the detected commands. Processing that takes the user's emotion into account may also be performed on the basis of the recognized emotion identifiers.
[0283] Search by a search character string comprises: a method of executing the search directly from the search character string; a method of referring to the identifier conversion dictionary with the search character string and including other feature quantities and identifiers associated with it in the search conditions; and a method of referring to the command dictionary on the basis of the search character string and searching with other feature quantities and identifiers associated with the search character string after removing the detected commands. The search character string may also be converted into phoneme or phoneme-piece strings using the identifier conversion dictionary before searching, and processing that takes the user's emotion into account may be performed on the basis of recognized emotion identifiers.
[0284] Search by moving or still images comprises: a method of recognizing the image identifiers and motion identifiers used for the search from video captured by the user or from moving or still images, and executing the search directly with those identifiers; a method of referring to the identifier conversion dictionary with the recognized image and motion identifiers and including other feature quantities and identifiers associated with them in the search conditions; and a method of referring to the command dictionary on the basis of the recognized image and motion identifiers and searching with other feature quantities and identifiers associated with them after removing the detected commands. Recognized character strings, image-related identifiers, and motion identifiers may also be converted into phoneme or phoneme-piece strings using the identifier conversion dictionary before searching, and processing that takes the user's emotion into account may be performed on the basis of recognized emotion identifiers.
[0285] What these methods of constructing search conditions have in common is that information not yet symbolized or converted into identifiers is first symbolized and converted into identifiers, then converted via the identifier conversion dictionary into other associated identifiers, and added to the search conditions. If necessary, an identifier obtained from the conversion dictionary can be converted into the mean feature quantity of that identifier and used for a search based on feature quantities. For example, by presenting a face image of Taro and performing a voice search based on the recognized name, one can find scenes where Taro is called by someone; by adding the condition that the voice calling Taro has Hanako's voice quality, one can find scenes where Hanako is calling Taro. For the conversion methods using dictionaries, refer to the sections 'Example of dictionary construction', 'Identifier-to-feature-quantity conversion', and 'Feature-quantity-to-identifier conversion' above. The search conditions acquired here are information input at the user's instruction, and feature quantities and identifiers may be constructed using not only video and audio but also information such as still images, document information, EPG, BML, RSS, and teletext.
[0286] Next, the search procedure will be described. First, step S1001 is executed, which inputs search conditions suited to the search: a search identifier string or character string acquired through the information input unit 30, the communication line unit 50, or a storage unit using an exchangeable storage medium is converted, by reference to the dictionary extraction unit, into an identifier string usable for the search, or converted into feature quantities in accordance with 'Example of a method for converting identifier strings into feature quantities' above.
[0287] For search conditions given as natural information, such as an utterance or a sample search image, features are extracted, or identifiers are recognized from the extracted feature quantities. By composing information usable for the search in step S1001, identifiers and feature quantities for the user-specified search conditions are selected on the basis of the same indices as the content-information index, and the query generation step S1002, which constructs the search conditions, is executed. Here, the various identifiers and feature quantities available for searching may be combined and converted into a search condition consisting only of an ordinary character string, with conditioning applied.
[0288] More specifically, for voice, after the voice information from an utterance or an audio file has been converted into or recognized as a phoneme or phoneme-piece string, the phoneme/phoneme-piece command conversion dictionary is consulted to extract and delete the utterance portion corresponding to a command from the search conditions, and the remaining phoneme or phoneme-piece string is used for the search. For video, an image designated from a camera or a file is converted into or recognized as image identifiers and image feature quantities and then used as search-condition information. For sentences and words, the remainder after extracting control-command words is converted into phonemes or image identifiers for searching. In this way, 'search-condition co-occurrence information' is constructed as a search condition combining different kinds of information, such as visual and auditory information, and given to the search apparatus.
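A minimal sketch of extracting a command portion from a recognized phoneme string via a command dictionary might look as follows; the dictionary contents and the romanized phoneme notation are illustrative assumptions, not data from the disclosure.

```python
# Hypothetical command dictionary: recognized phoneme prefix -> command name.
COMMAND_DICTIONARY = {
    "o n s e i k e n s a k u": "voice_search",   # "onsei kensaku" (voice search)
}

def split_command(phoneme_string):
    for phonemes, command in COMMAND_DICTIONARY.items():
        if phoneme_string.startswith(phonemes):
            query = phoneme_string[len(phonemes):].strip()
            return command, query        # remaining phonemes become the query
    return None, phoneme_string          # no command found; whole input is the query

# "bakuhatsuon" (explosion sound) remains as the search query.
print(split_command("o n s e i k e n s a k u b a k u h a t s u o n"))
```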
[0289] For a character-string search condition, when the apparatus is instructed with the character string 'sea image search', containing the string 'sea' and the command string 'image search', the 'search-condition co-occurrence information' may be constructed by: building a search condition from the co-occurrence information of color features and motion features, using the image feature quantities associated with the string 'sea' after excluding the command string; building a search condition with an evaluation function composed of the co-occurrence information of color identifiers and motion identifiers in order to detect 'sea'; or, if indexing has been performed with a 'sea' evaluation function, building the search condition by conversion into the 'sea' identifier.
[0290] For a voice search condition, when the user speaks 'voice search, I shall return, explosion sound', the phoneme/phoneme-piece string of the 'I shall return, explosion sound' portion, after excluding the uttered phoneme string 'voice search' registered in the command dictionary, can be used with conventional methods to detect and search for utterance locations accompanied by an explosion sound in the content. It is likewise possible to detect and search for lines co-occurring with the emotion of sadness, such as 'I will not die!' or 'to be or not to be'; and for a serial drama broadcast weekly, if the changes in musical scale show a tendency toward specific co-occurrence information, a comparison with the theme song may be made and a high degree of match evaluated as a highlight scene.
[0291] By using the information employed in such search conditions and the combinations of search conditions used at the same time, 'search-condition co-occurrence information' can be constructed and used as a search condition for evaluating matches and similarity with the 'index co-occurrence information'; such 'search-condition co-occurrence information' can also be collected from a plurality of users, and evaluation functions can be constructed using the collected 'search-condition co-occurrence information'.
[0292] Next, the index information is read from the information recording/accumulation unit of the storage unit, and the search step (S1003) is executed: the read index information and the search-condition information are evaluated with DP, distance functions, and the like, and a search in accordance with 'Example of a method for evaluating matches between feature quantities or between identifier strings' is performed on the basis of the stored index information, at the points where the degree of match is high, in order to select content and positions within the content.
[0293] Then, for each content item, frame locations highly similar to the search conditions and index locations highly similar to the search conditions are detected for each identifier and feature quantity, and the ranking step (S1004) is executed, which ranks the search results based on the search-result evaluation, ordering the positions within the content where a plurality of identifiers and feature quantities show high similarity according to condition settings based on sums or logical expressions in the search conditions. The similarity may be evaluated by combining the similarity evaluation methods described above, such as the match degree by DP, distance evaluation methods, and probability evaluation methods.
[0294] Conceivable forms of this evaluation include: an unranked evaluation list with no particular index; an evaluation list ranked simply by the maximum or minimum of the sum of the evaluation distances or evaluation probabilities of each identifier; an evaluation list ranked by values narrowed down and selected on the basis of logical expressions such as OR and AND expressions; and an evaluation list ranked by values computed according to logical expressions. As an example of an evaluation list based on values computed according to a logical expression, the condition '(blue or green) and video with large motion' can be expressed by the following function:
A = ((b − B) + (g − G)) × (m − M)
where:
A: in-screen feature score
b: blue feature
B: blue feature average
g: green feature
G: green feature average
m: motion feature
M: motion feature average
[0295] By expressing the co-occurrence state of in-screen image features as a formula in this way, logical structures such as AND, OR, exclusive OR, and NOT can be replaced with arithmetic: AND as multiplication, OR as addition, exclusive OR as taking the larger value, exclusive AND as taking the smaller value, and NOT as multiplication by −1. Search results can thus be ranked and presented by arithmetically evaluating the co-occurrence state: evaluating feature quantities through such formulas, obtaining and evaluating the Mahalanobis distance based on the covariance matrix of the individual feature quantities, or evaluating similarity by co-occurrence probabilities and distance functions.
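Under this encoding, the example condition '(blue or green) and large motion' reduces to ordinary arithmetic. A minimal sketch follows, with the archive-wide averages assumed to be precomputed elsewhere:

```python
# "(blue or green) and large motion": OR -> addition, AND -> multiplication.
# b, g, m are per-frame features; B, G, M are archive-wide averages.
def in_screen_score(b, g, m, B, G, M):
    return ((b - B) + (g - G)) * (m - M)   # A = ((b-B)+(g-G)) x (m-M)

# Example: a frame bluer and more active than average scores high.
print(in_screen_score(b=0.40, g=0.10, m=0.70, B=0.20, G=0.15, M=0.30))
```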
[0296] A distance evaluation function may also be constructed using the co-occurrence state as a co-occurrence matrix or covariance matrix; a probability function may be constructed based on co-occurrence probabilities; and a plurality of pieces of co-occurrence information may be combined, enabling searches that evaluate similarity based on co-occurrence information. Since similarity is considered high when a distance value is small or when a probability value is large, ranking according to a plurality of identifiers and feature quantities can be realized as the evaluation of the search results.
[0297] The blue feature in this example is the frequency of appearance, over the whole screen, of pixels whose hue falls within ±15 degrees of blue; the blue feature average may be taken as the average of the blue feature over the whole content archive, and the same applies to green and red. Since this is implementation-dependent, any method may be used. An intuitive example of the associations used in co-occurrence dictionaries of words and feature quantities, or word/feature-quantity conversion dictionaries, obtained perceptually from the words a user inputs and from image tendencies, is classifying images by representative color features associated with sensibility, following the seasonal frequency of colors in nature: pale green and cherry-blossom pink for spring, deep green and blue for summer, yellow and orange for autumn, and white and gray for winter. Combinations of feature quantities along such lines are conceivable.
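A minimal sketch of such a hue-band frequency, assuming blue sits at 240 degrees on the hue circle and pixels are given as RGB triples in the range 0 to 1 (both illustrative assumptions):

```python
import colorsys

def hue_band_frequency(rgb_pixels, center_deg=240.0, width_deg=15.0):
    # Fraction of pixels whose hue lies within +/- width_deg of center_deg.
    count = total = 0
    for r, g, b in rgb_pixels:
        h, _, _ = colorsys.rgb_to_hsv(r, g, b)      # h in 0..1
        diff = abs(h * 360.0 - center_deg)
        if min(diff, 360.0 - diff) <= width_deg:    # wrap-around on the hue circle
            count += 1
        total += 1
    return count / total if total else 0.0

print(hue_band_frequency([(0.1, 0.2, 0.9), (0.9, 0.1, 0.1)]))  # -> 0.5
```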
[0298] The motion feature may be a feature based on the time-axis delta of the video, or the magnitude of the motion vectors used in MPEG-4 and the like; it may be a feature based on image-change information occurring at an arbitrary time interval, such as ±15 frames from the current frame, together with the average of those features over the content archive. These may be normalized or corrected in any way, and since the construction of evaluation expressions based on these feature quantities is implementation-dependent, any combination may be used.
[0299] In this case, not only color but also image recognition and speech recognition technologies may be combined, and arbitrary evaluation functions may be constructed by combining the obtained face IDs, motion IDs, image IDs, and identifiers based on phonemes or phoneme pieces. Distance evaluation between identifiers is possible using the aforementioned DP matching and the like, distance evaluation between feature quantities is possible with an arbitrary distance function, and similarity evaluation between identifiers and feature quantities is possible with HMMs, distance functions, and so on, as detailed in the earlier descriptions of identifiers and feature quantities and their mutual conversion methods. Of course, performance can also be improved by performing efficient classification through combinations of various evaluation methods such as multi-layer Bayes and neural networks.
[0300] Next, an evaluation result listing step (S1005) is executed, which lists the search results obtained here in descending order of similarity, presents them to the user, and allows the user to view the similarity values as a ranking index. After the list of search results is output to the output unit and displayed on screen, or transmitted to the user terminal via the communication line unit and presented to the user, a user processing continuation confirmation step (S1006) is executed to evaluate whether the user has requested a search again.
[0301] In this way, a composite search using co-occurrence information, based on the evaluation of matches or similarity between the "index co-occurrence information" and the "search condition co-occurrence information" of the present invention, is carried out to obtain search results. At this time, the evaluation function used for the search may be constructed from co-occurrence probabilities or co-occurrence matrices that combine the "index co-occurrence information" and the "search condition co-occurrence information" with co-occurrence information based on the individual search results or on the neighboring feature quantities obtained as search results.
[0302] Learning based on the co-occurrence state in which an input character string is converted into arbitrary identifiers or feature quantities and a search is executed may learn identifiers by using the "example of identifier reconstruction" or the search results, or by using the co-occurrence information of search results and auxiliary information through association with EPG, RSS, HTML, XML, BML, or teletext as auxiliary information. A service that takes an arbitrary configuration in server-client form and executes searches by selectively using arbitrary identifiers and feature quantities may also be realized.
[0303] A character string for search may also be acquired from the broadcast receiving unit, from the information line unit connected to the Internet, or from recorded information in the storage unit, by arbitrary means such as XML, HTML, MPEG-7, RSS, teletext, BML, or EPG, and the search may be carried out by converting those character strings into the feature quantities or identifier strings that serve as search indices. This may likewise be realized as a service that takes an arbitrary configuration in server-client form and executes searches by selectively using arbitrary identifiers and feature quantities; search conditions can thus be generated from search character strings.
[0304] Search by character string is carried out by selecting and using the identifiers associated with an arbitrary character string, or the feature quantities of those identifiers, by means of the character-string-to-identifier conversion dictionary and the identifier-to-feature conversion dictionary associated with each feature extraction method; new identifiers constructed in the "example of identifier reconstruction" described later may also be used. For example, a performer's name may be converted into a phoneme sequence or phoneme-piece sequence to search content. Alternatively, from the word "action movie", the appearance frequency of explosion sounds within content classified as action movies may be obtained, the average of the explosion-sound appearance frequency over multiple action movies may be computed to construct an action movie evaluation function, and indexing may be performed with that function so that content search based on the action movie function can be carried out.
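A hedged sketch of the "action movie evaluation function" example, with invented identifier labels and running times; it averages the explosion-sound rate over known action titles and scores new content by closeness to that average:

def explosion_rate(identifiers, duration_min):
    return identifiers.count("explosion") / duration_min

# Illustrative archive: (environmental-sound identifiers, running time in minutes)
action_titles = [(["explosion"] * 42 + ["speech"] * 300, 110),
                 (["explosion"] * 55 + ["speech"] * 280, 125)]
genre_avg = sum(explosion_rate(ids, t) for ids, t in action_titles) / len(action_titles)

def action_movie_score(identifiers, duration_min):
    # 1.0 at the genre average, decaying as the rate departs from it
    return 1.0 / (1.0 + abs(explosion_rate(identifiers, duration_min) - genre_avg))

print(action_movie_score(["explosion"] * 40 + ["speech"] * 320, 105))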
[0305] There is also a step of obtaining the co-occurrence state of arbitrary feature quantities and identifiers based on the search results. This co-occurrence state can be constructed using co-occurrence probabilities, co-occurrence matrices, and covariance matrices; for example, by conditioning on co-occurrence information within the top ten results having a match rate of 70% or more under certain conditions, co-occurrence states can be selected and used for learning. When co-occurrence information constructed in this way is viewed repeatedly by a user, or is used repeatedly from outside through the information sharing method described later, that co-occurrence information is judged to have high utility value. By assigning a specific identifier to frequently used co-occurrence information, an evaluation function based on the co-occurrence state can be constructed, and the co-occurrence information and evaluation function of the new identifier and feature quantities are recorded in the co-occurrence learning storage unit and the evaluation function storage unit.
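A small sketch of the selection rule just mentioned (top ten results, match rate of at least 70%); the record layout is invented for illustration:

results = [{"rank": r, "match": m, "cooc": c}
           for r, (m, c) in enumerate(
               [(0.93, {"scream", "blood"}),
                (0.71, {"scream"}),
                (0.40, {"music"})], 1)]

# Keep co-occurrence samples only from qualifying results
training = [r["cooc"] for r in results if r["rank"] <= 10 and r["match"] >= 0.70]
print(training)  # only these co-occurrence sets are used for learning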
[0306] With a function of the co-occurrence state of the aforementioned "blue feature and green feature", videos with tendencies such as "forest and blue sky" or "sea and coast" can be obtained as search results. Taking large motion into account, searches for videos with large motion, such as "a forest with trees swaying strongly", "a forest and blue sky with fast-moving clouds", or "a coastline with a forest visible over rough waves", become possible by combining these with identifiers that use motion features such as those of MPEG-4.
[0307] When learning is performed on these search results based on the information the user selects, a bias toward "sea" in the selections produces a corresponding bias toward the "sea" feature quantities; exploiting this, the discriminant function can be reconfigured as in the "example of identifier reconstruction" and reflected in the learned co-occurrence information. For a video in which the horizon lies at the center, blue colors increase in the lower half of the image together with wave motion, so an evaluation function based on the video features of "sea" and "coast" can be constructed.

[0308] At this time, by sorting the search results based on items the user did not select or on user deletion instructions carrying a negative connotation toward the results, a discriminant function for the group of information to be excluded from the search target can be newly constructed. Excluded results can be removed from the earlier target search results, identifiers can be sought that have a high co-occurrence probability with another identifier despite a low co-occurrence probability under a given target identifier or condition, and by adding or deleting identifiers and feature quantities in the search conditions, unnecessary items can be removed from the search results and search results can be presented more efficiently.
[0309] A user interface for evaluating search results may also be provided so that the user can improve performance. Search efficiency may be improved by combining character-string search with content attributes such as content title, genre, and director; arbitrary names may be given to co-occurrence states based on search conditions, identifiers, and feature quantities so that they can be used for repeated search, detection, and instruction; and those search conditions and search expressions may be exchanged or distributed via a communication line.
[0310] As examples of using EPG, BML, RSS, and teletext: the aforementioned program genre discriminant function may be constructed from the program genre extracted from broadcast auxiliary information such as EPG, BML, RSS, and teletext, together with the feature quantities extracted from the video and audio within the program, the appearance frequency of uttered phoneme sequences, and the appearance frequency of environmental sound identifiers. Performer names may be associated with face IDs obtained by face recognition, and a co-occurrence matrix may be created by associating performer names with face IDs recognized together across different programs, so as to construct an evaluation function that detects a specific performer. An evaluation function associating a performer's name with a face image may be constructed by associating frequently appearing face IDs with the order of entries in the performer lists of EPG, BML, RSS, or teletext. Names based on the phoneme sequences or phoneme-piece sequences uttered by people may be detected from EPG, BML, RSS, or teletext to perform recording or skip playback. Arbitrary markup language information such as HTML, XML, RSS, or BML may be converted into the aforementioned identifiers, such as phonemes, environmental sound identifiers, and image identifiers, using the phoneme symbol conversion dictionary in the dictionary information storage unit of the storage unit, and arbitrary processing accompanying search and detection may be carried out. The usage situation of these may also be recorded, and identifiers may be re-learned using the co-occurrence information of frequently used search conditions according to the recorded results.
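A sketch of the face-ID-to-performer association described above, assuming face IDs from a recognizer and cast lists from the EPG; all names and IDs below are illustrative:

from collections import Counter

# (face IDs recognized in the program, cast list from the EPG)
programs = [({"face_01", "face_07"}, ["Yamada", "Suzuki"]),
            ({"face_01"},            ["Yamada"]),
            ({"face_07", "face_03"}, ["Suzuki", "Tanaka"])]

cooc = Counter()
for face_ids, cast in programs:
    for f in face_ids:
        for name in cast:
            cooc[(f, name)] += 1

# face_01 pairs with "Yamada" in every program it appears in, so that pair
# dominates the counts and can seed a performer-detection evaluation function.
print(cooc.most_common(3))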
[0311] The discriminant functions, search results, and co-occurrence state information of the search results constructed in this way may be reused using technologies such as P2P software so that other devices can browse and acquire them via the communication line unit, as in the "example of an information sharing procedure between users", or published on an arbitrary site using CGI or arbitrary Web technologies, so that arbitrary users may use them with billing, or they may be sold on storage media.
[0312] At this time, the usage fee may be varied according to the precision of the information used, the granularity of its content, the processing speed, the number of uses, the usage time, and so on; a charge may be made for the act of using search results obtained through the present invention, or the amount may be changed; and the information may be encrypted to protect its value.
[0313] Frequently reused co-occurrence state information, evaluation functions, and evaluation parameters may be stored in the storage unit of the device itself, or acquired from outside via the communication line unit as necessary, and meta-information generated using the acquired evaluation functions and identifiers may be presented to other users or sold.
[0314] Since a search takes some amount of time, advertisements that can be judged to have high similarity to the content the user habitually uses may be presented during the search, during list creation, or during search condition input, based on general advertisements, on combinations of feature quantities and identifiers the user frequently uses day to day, or on combinations of identifiers and feature quantities associated with the search keywords.
[0315] «Example of arbitrary processing accompanying identifier detection»
Next, arbitrary processing accompanying detection by the device according to the present invention will be described.
[0316] First, the user inputs a detection condition that triggers arbitrary processing, in the same manner as a search condition.
The input may be speech, video information, a character string, an identifier obtained by the present invention, or a combination of these. In accordance with this input, the present invention executes a step of constructing a co-occurrence state from combinations of feature quantities and identifiers and setting the detection condition, using the same procedure as for search.
[0317] Next, while programs obtained from broadcast waves, networks, or imaging devices are acquired based on the configured detection condition, the information is recorded in the storage unit within the device while being indexed with feature quantities and with identifiers based on those feature quantities. The indexed recorded information is then compared with the co-occurrence information of the detection condition at the same time as recording, and the degree of match is evaluated. For this evaluation, any of the aforementioned methods that evaluate the distance between identifiers, the degree of match between identifier strings, or the distance between feature quantities may be used, such as Bayes, HMM, Mahalanobis distance, Euclidean distance, or DP.
[0318] As a result of this evaluation, the registered arbitrary processing is executed on the condition that the distance from the centroid for a specific identifier, identifier string, or feature quantity based on the detection condition falls within 1σ, that the probability of being a specific identifier, identifier string, or feature quantity is 60%, or that the degree of match between identifier strings exceeds 60%. The value of 60% derives from the fact that, in phoneme recognition, emotion recognition, and image recognition, practical application can generally be considered when the recognition result is 60% or higher; it may be changed to an arbitrary rate depending on the user environment. If the recognition rate stays below 20% continuously, processing such as stopping the current process or setting a flag indicating that the material is a target for fast-forwarding or deletion may be performed.
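A sketch of the trigger logic with the thresholds named in this paragraph (1σ for distance, 60% for probability or identifier-string match, and a sustained sub-20% rate for stopping or flagging); the function names are invented:

def should_trigger(distance_sigma=None, probability=None, string_match=None):
    # any one satisfied condition starts the registered arbitrary processing
    if distance_sigma is not None and distance_sigma <= 1.0:
        return True
    if probability is not None and probability >= 0.60:
        return True
    if string_match is not None and string_match > 0.60:
        return True
    return False

def should_abort(recent_rates, floor=0.20):
    # e.g. stop processing, or flag for fast-forward/deletion, when the
    # recognition rate remains continuously below the floor
    return len(recent_rates) > 0 and max(recent_rates) < floor

print(should_trigger(probability=0.64))   # True: run the registered process
print(should_abort([0.12, 0.08, 0.15]))   # True: stop or flag the material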
[0319] Rather than only detecting scenes of interest to the user and executing processing, the detection function based on co-occurrence information may also be used to avoid scenes unpleasant for the user: by detecting the feature quantities present when internal organs or blood are displayed on screen together with screams, violent scenes in horror movies or scenes offensive to public order and morals can be fast-forwarded, or processing such as a mosaic can be applied to the video.
[0320] In this way, information acquired from broadcasting stations, networks, and imaging devices is recognized, whether or not the content information is what the user intends is detected, and by performing device control from the control unit in accordance with the detection, it becomes possible to record, play back, fast-forward, search, notify the user on another terminal, display a notification on the screen being viewed, move the device, play an announcement, deliver e-mail, generate RSS, and perform bookmarking.
[0321] Next, although described in more detail later, this will be explained as a product application example. First, the user inputs a search condition. From the input search condition, a cast name associated with the information the user entered is obtained by referring to EPG, BML, RSS, or teletext, and a phoneme or phoneme-piece search is executed by the method described above. As a result, while constant recording is carried out, content may be saved going back one hour into the past from the point where the cast name is uttered; the range subject to deletion may be determined per program, per commercial break, or per change in the screen feature quantities according to EPG, BML, RSS, or teletext; or a boundary may be placed within the content at each such change and used as an index for the user to give instructions.

[0322] In this way, by performing detection using the co-occurrence states of various identifiers in content, it becomes possible to construct a designated range from multiple detection points in the content and classify it into material to be saved and material to be deleted, to save video and audio information going back from the point of detection, to carry out skip playback of disliked scenes learned from co-occurrence information according to the user's designation, or to rewind a detected point by several seconds and play it back. This example is also explained in more detail in the later product applications.
[0323] Using the co-occurrence state detection technology according to the present invention, advertisements for works involving actors or directors obtained from EPG, BML, RSS, teletext, or MPEG-7 may be presented, advertisements may be shown during skip playback, or advertisements may be replaced, under arbitrary co-occurrence state detection conditions, with new ones or with ones appropriate to the season or the time of day.
[0324] «Example of identifier learning based on search, detection, and indexing»
Next, identifier learning based on search, detection, and indexing will be described.
[0325] Through the aforementioned device configuration and the step of learning based on the co-occurrence states in indices, search results, and search conditions, the "co-occurrence information based on search results", the "co-occurrence information extracted by indexing", and the "co-occurrence information based on user-specified detection conditions and/or search conditions", which are co-occurrence states of arbitrary identifiers and/or arbitrary feature quantities including those of the "example of identifier reconstruction", are obtained as co-occurrence matrices or covariance matrices. These are organized by methods such as probability evaluation functions based on co-occurrence probabilities, distance evaluation functions based on eigenvalues and eigenvectors, learning by HMMs, and classification and evaluation-function construction by multivariate analysis, and by defining an ID or a user-specified character string for the result, a new identifier can be learned.
[0326] First, when indexing is being carried out, a step of collecting the feature quantities and/or identifiers recorded in temporally adjacent entries of the dedicated index database, of index files, or of the index/attribute areas of content files, and a step of constructing co-occurrence probabilities, co-occurrence matrices, and covariance matrices based on the co-occurrence states of the collected feature quantities and identifiers, are executed. What counts as adjacent frames can be specified arbitrarily according to the implementation, as defined by the user: if fine granularity is needed, a single video frame of about 16 ms may serve as the unit; conversely, a time unit of 3 seconds (180 frames) may be used as the division, or the section running up to a frame in which a statistically distant feature is detected may be used. Co-occurrence information is constructed from the information acquired in this step, and after learning with HMMs or covariance matrices, or constructing distance functions, the evaluation functions obtained by the learning or the distance functions are stored in the co-occurrence learning storage unit and the evaluation function storage unit.
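A sketch of constructing an identifier co-occurrence matrix from temporally adjacent index entries, using the 3-second (180-frame) window mentioned above; the labels and per-frame data are illustrative:

import numpy as np
from itertools import combinations

ids = ["scream", "blood", "music", "speech"]
idx = {name: i for i, name in enumerate(ids)}
frames = [["speech"], ["scream", "blood"], ["scream"], ["music"]] * 45  # per-frame labels

window, C = 180, np.zeros((len(ids), len(ids)))
for start in range(0, len(frames), window):
    # identifiers present anywhere within this window co-occur
    present = {idx[label] for f in frames[start:start + window] for label in f}
    for a, b in combinations(sorted(present), 2):
        C[a, b] += 1
        C[b, a] += 1

print(C)  # row-normalizing C yields empirical co-occurrence probabilities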
[0327] When search results, search conditions, or detection conditions are used, a step is executed of collecting, as samples, the co-occurrence information of the identifiers and feature quantities in the content the user selected, with respect to the information presented as search results and the information specified as search or detection conditions. Co-occurrence information of identifiers and feature quantities is then acquired from the samples obtained by this collection. Many combinations of co-occurrence information are conceivable, as described elsewhere and in the later "embodiment of indexing, search, and arbitrary processing with multiple identifiers and multiple search conditions". Learning by Bayes or HMM is then performed based on the co-occurrence state, and the learning parameters and distance functions obtained as learning results are stored in the co-occurrence learning storage unit and the evaluation function storage unit of the storage unit. Similarly, for search and detection conditions, learning samples can be obtained by collecting the way those conditions are specified, and an evaluation function can be constructed from the learning samples.
[0328] At this time, arbitrary learning algorithms such as neural networks, fuzzy logic, genetic algorithms, chaos, and fractals may be combined; for co-occurrence information that is reused, the co-occurrence information among pieces of co-occurrence information may be used recursively to construct evaluation functions; each element of the co-occurrence matrix used in the search evaluation conditions may be weighted according to its co-occurrence probability or the magnitude of its value; and each element of the co-occurrence matrix used in search and detection conditions may be used as an enable/disable flag of a genetic algorithm.
[0329] As methods of specifying the range of a co-occurrence matrix, the image and audio feature quantities of one work or one program, of a range in which arbitrary identifiers or feature quantities co-occur, or of a range designated by segments based on the appearance of a specific identifier may be classified, analyzed, and subjected to multivariate analysis to evaluate the appearance times of those feature quantities; evaluation functions may be created by constructing co-occurrence matrices or co-occurrence probability covariance matrices from the classified information; the appearance frequency of the identifiers obtained as evaluation results may be obtained; or scene features and evaluation functions may be constructed and evaluated from histograms of the appearance of those identifiers per unit time. Any such method may be used. With respect to search results extracted by search conditions using them, methods are conceivable in which identifiers and feature quantities outside the search conditions that have a high co-occurrence probability (for example, 70% or more) or a short distance (for example, within 3σ of the distance average) are taken as new targets for learning in the co-occurrence information construction step, or conversely those with low membership probability (for example, farther than 3σ) are excluded from learning in the co-occurrence information construction step.
[0330] The feature quantities used for this identifier reconstruction are constructed arbitrarily from values such as the output value of an evaluation function, the output probability of an HMM, and the similarity between identifier strings. In this embodiment, the co-occurrence states of feature vectors, such as the aforementioned color appearance frequencies and emotion identifier appearance frequencies, image feature quantities such as human actions, gestures, gait, and facial expressions, and feature quantities such as phonemes, phoneme pieces, musical scales, and chord codes, may be combined and used as covariance matrices, or co-occurrence matrices of identifiers may be constructed. Using such methods makes it possible to realize searches like the later-described "embodiment of search and arbitrary processing with multiple identifiers and multiple search conditions".
[0331] Then, by arbitrarily labeling the co-occurrence information of the feature quantities and identifiers obtained in this way, a character string is given to the evaluation function, which is stored in the storage unit as a learning result. The character string given to an identifier or feature quantity may be used as a tag name in a markup language such as new XML; the given character string itself may be converted into an identifier symbol string such as phonemes or phoneme pieces so that speech input from the user can be handled; or an evaluation function may be constructed in association with facial expression identifiers, shape identifiers, motion identifiers, and the like so that video input from the user can be handled.
[0332] More specifically, for a search condition under which the user has repeatedly made selections from the presented result list and browsed the content, when the distance evaluation result with respect to the search condition is within 3σ as seen from the centroid of that co-occurrence information, or the probability evaluation result is 80% or more, the co-occurrence information of the indices in the target range of the selected content is treated as a co-occurrence matrix or co-occurrence probability, and a new evaluation function is constructed based on the identifiers and feature quantities used in those indices. The evaluation function may be, for example, a Bayes discriminant function, a Mahalanobis distance function, or an HMM function; from these newly constructed evaluation functions, likelihoods such as membership probabilities and evaluation distances can be obtained.
[0333] Thus, the features of the present invention lie not in the prior-art recognition of various identifiers, feature extraction methods, specification of frame widths and time widths, range selection methods, or identifier string matching methods, but in: indexing based on the co-occurrence information of phonemes and phoneme pieces, emotion identifiers, and the other acoustic and image identifiers built on them; search and detection using that indexing; processing such as recording and playback started by detection; learning of the co-occurrence information in indexing; learning of co-occurrence information based on the usage situation of search results; the new identifiers and new feature quantities obtained by that co-occurrence learning; and the identifier conversion dictionary that allows those identifiers and feature quantities to be specified as search conditions using phoneme sequences and phoneme-piece sequences.
[0334] «Example of identifier reconstruction»
Next, the identifier reconstruction method according to the present invention will be described.
[0335] To reconstruct an identifier with the device according to the present invention, several values, such as the output DP match degree, the HMM output probability, the output value of a Bayes discriminant function, the distance values of other distance functions for evaluating feature quantities, and the identifiers and feature quantities associated with those search results actually used by the user, are combined into a set of feature quantities, and new Bayes discriminant functions, HMM probability evaluation functions, distance evaluation functions, probability evaluation functions, likelihood evaluation functions, and the like are constructed. Such an identifier reconstruction method can combine, according to the implementation, arbitrary learning and recognition methods based on the above feature quantities, such as multi-layer Bayes, multi-layer neural networks, and multi-layer HMMs.
[0336] At this time, co-occurrence information obtained by associating identifiers and feature quantities may be used to construct the discriminant function, and co-occurrence information may be combined in configurations such as the following: learning that uses identifier co-occurrence probabilities as feature quantities; learning that uses the covariance matrix of feature quantities as feature quantities; learning that uses both identifier co-occurrence probabilities and the covariance matrix of feature quantities as feature quantities; learning that uses the outputs of distance functions as feature quantities; learning that uses the output probabilities of HMMs evaluating identifiers as feature quantities; and learning that uses the transition probabilities of HMMs evaluating identifiers as feature quantities. These may be combined and given as HMM learning parameters; evaluation function parameters may be learned by constructing a covariance matrix and obtaining eigenvalues and eigenvectors to build the evaluation function; or the parameters of an evaluation function used for distance evaluation may be learned by obtaining mean values. Learning based on arbitrary identifiers and feature quantities can thereby be carried out to perform reconstruction of identifiers, reconstruction of identifiers using the co-occurrence information of the identifiers and feature quantities accompanying the search and detection conditions the user frequently specifies, and reconstruction of identifiers using the co-occurrence information of the identifiers and feature quantities accompanying search results that continue to be used long after user selection.
[0337] For example, in the case of emotion identifiers and phoneme pieces, recognition results are obtained for four emotion identifiers (joy, anger, sorrow, and pleasure) and roughly 400 phoneme-piece identifiers. Next, the phoneme-piece sequence is searched by DP matching for portions uttering "k/o/r/a". As a result, the emotion identifiers occurring around the portions uttering "k/o/r/a" can be acquired and co-occurrence information can be constructed, so it becomes possible to learn the co-occurrence state of the anger emotion and the phoneme-piece sequence "k/o/r/a", or the feature quantities in that co-occurrence state, and to construct new identifiers such as an "angry [k/o/r/a]" identifier or a "joyful [k/o/r/a]" identifier. The information used for learning by reconstruction may be the DP matching rate together with a ratio of emotion features or emotion identifiers, or the likelihoods, probabilities, and distances from the evaluation function of the phoneme or phoneme-piece sequence and the evaluation function of the emotion identifiers. Combinations of associations by feature extraction method, such as video features, image features, moving-image features, still-image features, musical-scale features, and environmental-sound features, are possible here; for example, facial expression identifiers accompanied by emotion may be constructed from the facial feature quantities of a person extracted along with emotion and utterance.
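A sketch of this example: locate "k/o/r/a" in a recognized phoneme-piece sequence and gather the emotion identifiers around each hit as co-occurrence samples. Plain edit distance stands in here for the DP matching of the disclosure, and all sequences are invented:

def edit_distance(a, b):
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[-1][-1]

phonemes = ["a", "k", "o", "r", "a", "s", "u", "k", "o", "r", "a"]
emotions = ["calm"] * 4 + ["anger"] * 7            # frame-aligned emotion identifiers
query, hits = ["k", "o", "r", "a"], []
for i in range(len(phonemes) - len(query) + 1):
    if edit_distance(phonemes[i:i + len(query)], query) <= 1:  # near match
        hits.append(set(emotions[max(0, i - 2):i + len(query) + 2]))

print(hits)  # e.g. an "angry [k/o/r/a]" sample when 'anger' dominates a hit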
[0338] The range of information used for re-learning an identifier may be constructed based on specified boundary conditions such as: the boundaries of identifiers; cases where a value departs from the feature quantity average by 3σ or more; cases where the temporal and spatial change of a feature quantity departs by 3σ or more from the average temporal and spatial change at temporally or spatially different information positions; cases with significant divergence in other statistical tests; cases where information lies within 3σ of the average including the information surrounding the searched information; or cases where an arbitrary user-specified time width is used.
[0339] Examples of identifier associations include: program information with display position; program information with emotion; program information with phonemes and phoneme pieces; program information with landscape images; program information with text; program information with environmental sounds; program information with musical scale and tempo, chords and chord progressions; program information with facial expression images; program information with object images; program information with motion information; display position with emotion; display position with phonemes and phoneme pieces; display position with landscape images; display position with text; display position with environmental sounds; display position with musical scale and tempo, chords and chord progressions; display position with facial expression images; display position with object images; display position with motion information; emotion with phonemes and phoneme pieces; emotion with landscape images; emotion with text; emotion with environmental sounds; emotion with musical scale and tempo, chords and chord progressions; emotion with facial expression images; emotion with object images; emotion with motion information; phonemes and phoneme pieces with landscape images; phonemes and phoneme pieces with text; phonemes and phoneme pieces with environmental sounds; phonemes and phoneme pieces with musical scale and tempo, chords and chord progressions; phonemes and phoneme pieces with facial expression images; phonemes and phoneme pieces with object images; phonemes and phoneme pieces with motion information; landscape images with text; landscape images with environmental sounds; landscape images with musical scale and tempo, chords and chord progressions; landscape images with facial expression images; landscape images with object images; landscape images with motion information; text with environmental sounds; text with musical scale and tempo, chords and chord progressions; text with facial expression images; text with object images; text with motion information; environmental sounds with musical scale and tempo, chords and chord progressions; environmental sounds with facial expression images; environmental sounds with object images; environmental sounds with motion information; musical scale and tempo, chords and chord progressions with facial expression images; musical scale and tempo, chords and chord progressions with object images; musical scale and tempo, chords and chord progressions with motion information; facial expression images with object images; facial expression images with motion information; object images with motion information; and image information with acoustic information, as well as associations with any of the aforementioned identifiers and feature quantities. These associations are carried out by the step of learning co-occurrence states from them, are stored in the co-occurrence learning storage unit, and are subjected to distance evaluation by the Mahalanobis distance, probability evaluation by HMMs, distance evaluation by Bayes discriminant functions, and likelihood evaluation by combinations of these; they may also be used for the other discriminant functions in the identifier/feature-quantity conversion unit of the feature extraction unit, and for the other match-degree evaluation in the composite search result generation processing.
[0340] According to the evaluation results of such combinations, for example: by collecting speech sections such as screams, explosion sounds, laughter, and exclamations, a "scream discriminant function", an "explosion sound discriminant function", a "laughter discriminant function", and an "exclamation discriminant function" that identifies utterances such as "hee" can be constructed; those original discriminant functions can be combined so that phoneme recognition, moving-image features, and emotion features are indexed simultaneously and searched; a "smiling face function" or "crying face function" can be created to enable similar searches; learning of co-occurrence states from the discrimination results of those discriminant functions can be enabled; an evaluation function can be constructed that recognizes and detects a specific program from the title image features of the program's first few seconds and from program title utterance recognition by phoneme recognition; and the presence or absence of identifiers with high co-occurrence frequency based on the co-occurrence state may be used for the gene flag specification of a genetic algorithm.
[0341] It then becomes possible to analyze the image and audio tendencies associated with a program's genre from the frequency and bias of the identifiers and feature quantities appearing within one program, and by learning co-occurrence information based on the analysis results and constructing a "horror movie discriminant function", an "action movie discriminant function", a "comedy program discriminant function", and a "trivia program discriminant function", new identifiers and discriminant functions can be built, realizing unprecedented search and detection such as the later-described "embodiment of search and arbitrary processing with multiple identifiers and multiple search conditions".
[0342] Next, as a concrete method of autonomously adding to and reconstructing search conditions to raise search efficiency: when the feature quantities and identifiers input as search conditions show high similarity (for example, 80%) to the content obtained as search results, and other identifiers and feature quantities not specified in the search conditions but associated with the same content also show high similarity (for example, 80%), those other identifiers and feature quantities are recorded in the co-occurrence information storage unit together with the specified search conditions.
[0343] Then, when the accumulation of information associated on the basis of such co-occurrence states exceeds a certain value (for example, 1000 items, or n times the number of evaluation dimensions), a co-occurrence matrix is constructed from the co-occurrence information, covariance matrices and co-occurrence probabilities are obtained, and learning by distance evaluation functions or HMMs is carried out to reconstruct the evaluation function. At this time, information with large variance or low probability may be excluded from the computation to reduce the number of evaluation dimensions and raise computational efficiency. In the case of fixed phrases or specific words such as command control, the evaluation function templates for identifying phonemes and phoneme pieces may be updated using the recognized phoneme or phoneme-piece sequence, in accordance with the user's affirmative or negative response to the device's recognition, rather than the phoneme or phoneme-piece sequence expanded from the character string.

[0344] More specifically, in the case of phonemes and phoneme pieces, when explosion sounds are detected as environmental sounds within several seconds before or after an identifier sequence recognized as "waa" in 80% or more of 1000 searches, the "waa" phonemes and phoneme pieces become targets of learning as co-occurrence information, and through reconstruction of the evaluation function a phoneme sequence saying "waa" also comes to be evaluated when searching for "explosion sounds". Likewise, when radial motion features are detected as image features in 80% or more of 1000 searches, the motion feature quantities are also used as learning targets for the co-occurrence information, and a discriminant function may be constructed from the co-occurrence state of the "waa" phonemes and phoneme pieces, the "radial" motion features, and the "explosion sound" environmental sound and sound effect identifiers, to search for explosion scenes. A character string written "explosion scene" may be associated with the evaluation function to execute search requests by character string, or an identifier by phoneme or phoneme-piece sequence such as "b/a/k/u/h/a/ts/u/sh/i/i/n" may be given so that search requests can be executed by spoken utterance.
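A sketch of the accumulation-and-promotion rule in this passage, using the 1000-sample threshold and 80% co-occurrence rate named above; the identifier labels and the simulated log are illustrative:

from collections import Counter

THRESHOLD_SAMPLES, PROMOTE_RATE = 1000, 0.80
counts, n_searches = Counter(), 0

def record_search(co_occurring_ids):
    global n_searches
    n_searches += 1
    counts.update(set(co_occurring_ids))

# Simulated log: "waa" co-occurs with the explosion identifier in 90% of
# searches, the radial motion feature in only 50%.
for i in range(1200):
    ids = ["env:explosion"]
    if i % 10 != 0:
        ids.append("phoneme:waa")
    if i % 2 == 0:
        ids.append("motion:radial")
    record_search(ids)

if n_searches >= THRESHOLD_SAMPLES:
    promoted = [k for k, v in counts.items() if v / n_searches >= PROMOTE_RATE]
    print(sorted(promoted))  # 'env:explosion' and 'phoneme:waa' are promoted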
[0345] In the case of emotion identifiers, when an identifier sequence of phonemes or phoneme pieces recognized as "waa" is recognized at the same time as the emotion identifier "sorrow", a method of constructing a co-occurrence matrix of the identifiers as co-occurrence information and obtaining co-occurrence probabilities, or a method of obtaining eigenvalues and eigenvectors from the covariance matrix of the feature vectors and constructing a Bayes discriminant function or Mahalanobis distance, can be used. By constructing a likelihood evaluation function over the content information to be searched, the presence or absence of the phoneme sequence "waa" can be evaluated when searching for "sad scenes"; and even when the same "waa" utterance is detected, depending on whether or not the emotion "joy" is recognized, it can be excluded from the results as a scene different from the user's "sad scene search", so that search results differing in quality according to emotion can be provided. Here, notations called emoticons in the input search condition character string, such as "(一)" or "(;;)", may be used as emotion identification character strings for "joy" or "sorrow", converted into emotion feature quantities or emotion identifiers via the character-string-to-identifier conversion dictionary, and used for the search.
[0346] The likelihood evaluation function constructed in this way has its parameters and templates stored in the co-occurrence learning storage unit and the evaluation function storage unit, and the relationships between the specified character strings or words and the phoneme or phoneme-piece sequences based on their utterance are registered in the dictionary unit. To evaluate the utility value of search conditions, the evaluation may use, as learning samples, the frequency of third-party use of search conditions available via communication lines. The "output probabilities of various identifiers" and/or the "co-occurrence probabilities of various identifiers" and/or the "transition probabilities of various identifiers" and/or "various feature quantities" may be combined into one set of "feature quantities": eigenvalues and eigenvectors may be obtained from their covariance matrix to construct an evaluation function, they may be given to an HMM as feature quantities for learning, or clustering by various multivariate analyses may be performed to form populations and create population membership evaluation functions. In a genetic algorithm, identifiers and feature quantities with high co-occurrence probability arising during frequently used searches, in search results, or during indexing processing, identifiers and feature quantities whose divergence from the average of the distances obtained from discriminant functions exceeds 3σ, and identifiers and feature quantities whose co-occurrence probability and/or appearance probability is particularly high relative to the average probability may be used as gene flags.
[0347] Dictionary switching according to the co-occurrence state may also be performed, such as switching the phoneme and phoneme-piece recognition dictionary in accordance with recognized emotion, switching the phoneme and phoneme-piece recognition dictionary in accordance with changes in the recognized environmental sounds, switching the image recognition dictionary for displayed objects in accordance with the recognized landscape image, or switching the phoneme and phoneme-piece recognition dictionary in accordance with the recognized image. Information based on the co-occurrence relationships obtained by the present invention may also be treated as sensibility information and used for searching content information.
[0348] Since the cases given here describe examples for carrying out the present invention, searches, detection, and search results involving multiple identifiers and multiple indexing and search conditions other than the above are also contemplated; details are described separately later as arbitrary processing examples and product application examples.
[0349] <Application examples of the present invention>
As application examples of using a device based on the present invention, the following are described: "Example procedures for an information processing device used in terminals and base stations," which considers a server-client environment; "Example procedure for information sharing between users," which considers improved convenience through information exchange and sharing between users; and "Examples of user interfaces" using the present invention.
[0350] 《Example procedures for an information processing device used in terminals and base stations》
First, a server-client processing system involving a base station and terminals is described. The device and terminals are configured as shown in FIG. 20, comprising user terminals, a distribution base station, devices such as robots controlled by the terminals or base station, and the remote controllers that control them; a remote controller or robot may also be used as one form of terminal or one form of base station. The user speaks to the terminal, and the terminal or base station executes one of the following processing procedures for recognition.
[0351] In the first method, feature quantities are extracted from the speech obtained by utterance or from the captured video, and those feature quantities are transmitted to the target relay point or base station apparatus; the base station apparatus that receives them generates phoneme symbol strings and/or phoneme-piece symbol strings, emotion symbol strings, and other image identifiers according to the feature quantities. A matching control means is then selected and executed based on the generated symbol strings.

[0352] In the second method, feature quantities are extracted from the speech obtained by utterance or from the captured video, identifiers accompanying recognition, such as phoneme symbol strings and/or phoneme-piece symbol strings, emotion symbol strings, and other image identifiers, are generated within the terminal, and the generated symbol strings are transmitted to the target relay point or base station apparatus. The controlled base station apparatus then selects and executes a matching control means based on the received symbol strings.

[0353] In the third method, feature quantities are extracted from the speech obtained by utterance or from the captured video, phoneme strings and/or phoneme-piece symbol strings, emotion symbol strings, and other image identifiers are recognized based on the feature quantities generated within the terminal, control content is selected based on the recognized symbol strings, and the control method is transmitted to the base station apparatus being controlled or to the apparatus relaying information distribution.

[0354] In the fourth method, the speech waveform or image of the utterance or captured video is transmitted from the terminal as-is to the controlling base station apparatus; within the controlling apparatus, phoneme symbol strings and/or phoneme-piece symbol strings, emotion symbol strings, and other image identifiers are recognized, a control means is selected based on the recognized symbol strings, and the controlled relay point or base station apparatus executes the selected control. Emotion identifiers can likewise be extracted as features from speech and symbolized, and the same applies to features and identifiers of sounds such as environmental sounds and of video.
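The four methods above differ only in how far along the recognition pipeline the terminal proceeds before transmitting. The following is a minimal sketch of that split; the stage names and the callback functions (`extract`, `recognize`, `select_control`, `send`) are illustrative assumptions, not part of the specification:

```python
from enum import Enum

class SplitLevel(Enum):
    RAW_WAVEFORM = 1   # fourth method: send the waveform/image as-is
    FEATURES = 2       # first method: send extracted feature quantities
    IDENTIFIERS = 3    # second method: send recognized symbol strings
    CONTROL = 4        # third method: send the selected control procedure

def terminal_pipeline(audio, level, extract, recognize, select_control, send):
    """Run the pipeline on the terminal up to `level`, then transmit;
    the base station completes whatever stages remain."""
    if level is SplitLevel.RAW_WAVEFORM:
        return send(audio)
    feats = extract(audio)                # feature quantities
    if level is SplitLevel.FEATURES:
        return send(feats)
    symbols = recognize(feats)            # phoneme/phoneme-piece/emotion IDs
    if level is SplitLevel.IDENTIFIERS:
        return send(symbols)
    return send(select_control(symbols))  # control selected on the terminal
```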
[0355] Here, the terminal may simply transmit only the waveform, transmit feature quantities, transmit recognized identifier strings, or transmit processing procedures such as commands and messages associated with the identifier strings; a client-server model may be implemented by changing the configuration of the distribution base station to match the transmitted information. The transmitting side takes the configuration of FIG. 21 and the receiving side that of FIG. 22, and mutual transmission and reception between them is also possible.

[0356] The command dictionary that converts input phoneme strings or phoneme-piece strings into their associated processing procedures may reside either on the terminal side or on the distribution base station side; symbol strings such as phoneme symbol strings, image identifiers, and emotion identifiers relating to new control commands, media types, format types, and device names may be transmitted, received, and distributed using markup languages described later, such as XML and HTML, or RSS and CGI.

[0357] Next, a more specific procedure is described. First, by extracting feature quantities and identifiers and constructing evaluation functions, information is exchanged with other terminals and devices in an environment connected to an arbitrary communication line.
[0358] Next, terminal-side processing is described taking the use of phoneme pieces as an example. The user provides a speech waveform to the terminal or device by speaking. The terminal-side device analyzes the given speech and converts it into feature quantities. The converted feature quantities are then recognized and converted into identifiers by recognition techniques such as HMMs or Bayes classifiers.

[0359] Here, the converted identifiers mean phonemes, phoneme pieces, emotion identifiers, and various image identifiers; as noted elsewhere, for audio they may also be phonemes, environmental sounds, or musical scales, and for images they may be identifiers based on the image or on motion. Then, based on the obtained identifiers, a dictionary of phoneme and phoneme-piece symbol strings is consulted by DP matching to select a processing procedure, and the selected processing procedure is transmitted to the target device to execute control. The present invention thus makes it possible to use a mobile terminal as a remote controller or to have a robot control home appliances; a dialogue device for persons with disabilities may also be configured, with a display of emotion indices that detects the expression and emotion in the face and voice of the party at the other end of the communication to facilitate smooth communication, a display of utterance notation, or a braille output unit.
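The DP matching step against a phoneme/phoneme-piece command dictionary can be pictured with the following sketch; the dictionary entries, phoneme spellings, and unit edit costs are illustrative assumptions:

```python
def dp_distance(a, b):
    """Edit distance between two phoneme-symbol sequences (DP matching)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(a)][len(b)]

# Hypothetical command dictionary: phoneme sequence -> control procedure.
COMMANDS = {
    ("s", "a", "i", "s", "e", "i"): "PLAY",   # "saisei" (play)
    ("t", "e", "i", "s", "i"): "STOP",        # "teishi" (stop)
}

def select_command(recognized):
    """Pick the control procedure whose phoneme entry is closest to the
    recognized phoneme string."""
    best = min(COMMANDS, key=lambda k: dp_distance(tuple(recognized), k))
    return COMMANDS[best]
```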
[0360] Depending on the terminal's CPU performance, information processed by such a procedure may be transmitted as original natural information such as video and audio without conversion to feature quantities, transmitted after conversion only as far as feature quantities, transmitted after conversion only as far as identifiers, or transmitted after proceeding as far as selection of control information; any conversion level can be selected. The receiving side is configured as a receiving apparatus capable of processing information in any of these states, and based on the acquired information it may transmit to a distribution station or control apparatus, or carry out arbitrary processing such as search, recording, mail distribution, machine control, or device control.

[0361] Then, as shown in the state transition diagram of the search process in FIG. 23, identifier strings, character strings, and feature quantities serving as queries are transmitted as appropriate to the distribution-side base station, and information matching the query is obtained. Advertisements may be displayed during communication and search wait times; when performing control by voice, control dictionaries such as the control dictionary configuration example of FIG. 24 may be exchanged and acquired so that control items can be selected via communication.

[0362] Furthermore, by composing this control command dictionary from phonemes, phoneme pieces, emotion identifiers, or any of the aforementioned arbitrary identifiers and feature quantities together with device control information, its contents can be freely updated and reused; and by replacing or reconstructing the dictionary information for search that associates arbitrary identifiers with feature quantities, trending search keywords can be kept up to date.
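One conceivable in-memory layout for such an exchangeable control command dictionary, including version information for network updates and an infrared fallback code for legacy devices, is sketched below; every field name and value here is an assumption for illustration only:

```python
from dataclasses import dataclass, field

@dataclass
class ControlEntry:
    """One exchangeable entry associating recognition identifiers with
    device control information (all field names assumed)."""
    phonemes: tuple              # phoneme/phoneme-piece symbol string
    emotion: str | None          # optional co-occurring emotion identifier
    action: str                  # device control information, e.g. "POWER_ON"
    ir_code: bytes | None = None # infrared fallback for legacy devices

@dataclass
class ControlDictionary:
    version: str                 # checked when updating over the network
    entries: list = field(default_factory=list)

remote = ControlDictionary(version="1.0", entries=[
    ControlEntry(("d", "e", "N", "g", "e", "N"), None, "POWER_ON",
                 ir_code=b"\x12\x34"),   # "dengen" (power)
])
```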
[0363] In the control command dictionary, infrared control information for transmission to products controllable by a conventional infrared remote controller may be selected as the device control information; a series of operations may be carried out continuously, batch-style, by combining such pieces of control information; or, depending on the device's CPU performance, only feature quantity information may be transmitted to the voice-controlled information processing device without recognizing identifiers.

[0364] In this way, even for conventional devices that cannot be voice-controlled, infrared remote-control signals can be provided from voice information via the conversion dictionary by combining infrared remote control; for voice-controllable devices, commands can be recognized and controlled based on feature quantities or speech waveforms, the control dictionary can be replaced as performance improves, the control dictionary's version information can be confirmed, and the state of the device can be checked.

[0365] A server-client model may also be introduced in this manner: by dividing the processing at any step between a server and a client, connecting them by communication, and exchanging arbitrary information between server and client, equivalent services, infrastructure, search, and indexing may be implemented.

[0366] Furthermore, information acquired from a backbone server at the communication destination by client terminals such as DVD recorders, network TVs, STBs, HDD recorders, music recording/playback devices, and video recording/playback devices may be provided to mobile terminals and mobile phones via infrared communication, FM or VHF band communication, or wireless communication such as 802.11b, Bluetooth, ZigBee, WiFi, WiMAX, UWB (Ultra Wide Band), and WUSB (Wireless USB), so that EPG, BML, RSS, and teletext data broadcasts, television video, and teletext can be used on mobile terminals and mobile phones; the control content of the client terminal may be designated by voice input, character string input, or an operation of shaking the mobile terminal or phone; or the mobile terminal or phone may be used for client terminal operation as a general-purpose remote controller.
[0367] 《Example procedure for information sharing between users》
First, in an environment such as that of FIG. 20, the user selects the search condition expressions constructed on his or her own device, together with the identifiers, feature quantities, and/or function parameters used in those expressions, and provides them to third parties via a communication line and/or a storage medium. The search condition expressions and/or identifiers and/or feature quantities and/or function parameters may be sold or provided to third parties by publishing them on any server, or shared using P2P software. Search conditions and combinations of identifiers, feature quantities, and function parameters based on the tastes and values of celebrities, specialist magazines, experts, and the like may also be sold via a communication line or as magazine supplements.

[0368] As a result, by copying another person's search condition expressions and/or function parameters from a storage medium or downloading them via a communication line by the procedure shown in FIG. 25, those search condition expressions become usable on one's own device, provided the feature extraction method used for indexing and the identifiers selected by the discriminant function have the same configuration. Measures may also be taken to prevent viruses from being embedded in this distributed information.
[0369] When identifiers or feature quantities differ from device to device, information such as the evaluation functions and search conditions involved in the search may be acquired or converted, so that a user can obtain, on another device, search condition expressions by the same method as other users. In this conversion, identifiers may be converted into one another based on co-occurrence information, as in the conversion between international phoneme symbols and language-dependent phoneme symbols described later; or, to convert other identifiers into phoneme symbols, conversion in the information space may be performed using identifier co-occurrence matrices or evaluation functions such as HMMs, Bayes classifiers, and membership probabilities.
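Conversion between the identifier sets of two devices by way of co-occurrence counts might look like the following sketch; the identifier names, the counts, and the alignment material from which such counts would be gathered are all assumptions:

```python
import numpy as np

# Rows: source-device identifiers; columns: target-device identifiers.
# co[i][j] counts how often source identifier i co-occurred with target
# identifier j in shared material.
src_ids = ["phA", "phB", "phC"]              # placeholder identifiers
dst_ids = ["ph1", "ph2", "ph3", "ph4"]
co = np.array([[40,  2,  1,  0],
               [ 3, 55,  4,  1],
               [ 0,  5, 30,  9]], dtype=float)

# Normalize rows into conditional probabilities P(target | source).
p = co / co.sum(axis=1, keepdims=True)

def convert(identifier):
    """Map a source identifier to its most probable target identifier."""
    row = src_ids.index(identifier)
    return dst_ids[int(np.argmax(p[row]))]

print(convert("phB"))   # -> "ph2" under the assumed counts
```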
[0370] Here, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside either on the terminal side or on the distribution base station side; phoneme symbol strings, image features, and emotion identifiers relating to new control commands, media types, format types, and device names may be expressed using markup languages described later, such as XML and HTML, or RSS and CGI, and information so constructed may be transmitted, received, and distributed.
[0371] Next, a more specific procedure is described with reference to FIG. 20. First, terminal A, the first user's device, attempts to connect to communicable information processing devices such as another terminal C or base station B via the Internet. If connection is possible, it checks, using conventional protocols, RSS, or CGI, whether information that other devices can use for search is being distributed. If it is, a step of acquiring a list is executed.

[0372] Next, terminal A executes an evaluation function acquisition step for acquiring, via a communication line or infrared, detailed information on the target search execution method. As a result, terminal A can acquire the numerical information, identifier symbol strings, evaluation expressions, and other information needed to construct the functions and to perform the search.

[0373] When phoneme or phoneme-piece recognition is considered, the information needed for this search is as follows: for a Bayes function, numerical information and identifier symbols such as the eigenvalues, eigenvectors, mean values, and prior probabilities based on the feature quantities of each phoneme or phoneme piece; for matching by DP or the like, identifier symbol strings consisting of phonemes or phoneme pieces in the same notation symbol group, serving as search indices; and for an HMM, standard template data for each phoneme or phoneme piece. Depending on the recognition target and the identifiers, this information is changed as appropriate to image recognition templates, acoustic recognition templates, environmental sound templates, motion recognition templates, and so on, together with their respective identifier strings and evaluation functions.
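For the Bayes case, the distributed per-phoneme payload could plug into a discriminant such as the following sketch, a Gaussian log-likelihood in eigen-space plus a log prior; the payload layout is an assumption consistent with the quantities listed above:

```python
import numpy as np

def bayes_score(x, mean, eigvals, eigvecs, prior):
    """Log-posterior-style score for one phoneme class, built from the
    distributed payload: mean, eigenvalues/eigenvectors of the class
    covariance, and prior probability."""
    centered = eigvecs.T @ (x - mean)
    log_like = -0.5 * (np.sum(centered ** 2 / eigvals)
                       + np.sum(np.log(eigvals)))
    return log_like + np.log(prior)

def classify(x, payload):
    """`payload` maps phoneme symbol -> (mean, eigvals, eigvecs, prior);
    returns the phoneme symbol with the highest score for vector `x`."""
    return max(payload, key=lambda ph: bayes_score(x, *payload[ph]))
```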
[0374] Next, if the device's own storage capacity is insufficient, an evaluation function switching step is executed: infrequently used discriminant functions, DP data, and HMMs are deleted, and new evaluation functions are registered in the device's storage unit based on the information just acquired, so that they can be reused without being fetched by communication every time.

[0375] Of course, depending on the embodiment, the evaluation function may instead be acquired by communication each time, stored in the storage unit, and deleted when the service ends or the power is turned off; or it may be acquired from a distributed storage medium.

[0376] As shown in FIG. 20, the targets for information exchange are not limited to base stations and other terminals; any embodiment is conceivable as long as the device, such as a robot or remote controller using the present invention, incorporates an information processing unit, an information input/output unit, and a storage unit in a configuration related to the present invention.
[0377] 《Examples of user interfaces》
Next, application to user interfaces is described.

[0378] A control method is acquired by a method such as the above example procedures for an information processing device used in terminals and base stations, a dictionary is provided for converting between the phoneme-string symbols of the commands to be input and the control commands, and voice operation is realized by phoneme-recognizing a person's utterance so that the intended command is executed. Emotion may also be analyzed from the voice information; processing means may then be implemented such as selecting a consoling context when the detected result is the emotion "sadness" and phonemes or phoneme pieces associated with a crying utterance such as "eeen" are detected, or selecting a soothing context when the emotion "anger" is detected and phonemes or phoneme pieces associated with a scolding utterance such as "kora" are detected.
[0379] Here, when the user's emotion involves anger, an apologetic message may be presented to the user by voice or text; a camera or the like may be added and the combination of feature extraction and recognition processing described in the example procedures for an information processing device used in terminals and base stations may be employed; recognition may be performed based on any of the aforementioned identifiers and feature quantities, such as phonemes, phoneme pieces, emotion identifiers, and image identifiers, with processing selected or changed according to the combination of identifiers; and recognition results such as emotion identifiers, instrument identifiers, scale identifiers, and environmental sound identifiers may additionally be used.

[0380] Furthermore, reinforcement learning may be performed by having users themselves evaluate the preferences and subjectivity extracted by the search device according to the present invention, thereby improving the accuracy of the extracted information. For example, reinforcement learning may be carried out when the recognition result of the emotion, phoneme string, or phoneme-piece string accompanying the user's utterance at evaluation time is a phoneme or phoneme-piece symbol string of words of praise associated with positive connotations, such as "ii ne" (nice), or when identifiers of emotions associated with positive connotations, such as "joy" or "relief," are detected. Conversely, when the recognition result is a phoneme or phoneme-piece symbol string of words associated with negative connotations, such as "dame ne" (no good), or emotions associated with negative connotations, such as "sadness," "anger," or "dejection," the item may be removed from the next reinforcement learning targets, or a new feature group with negative connotations may be established and reinforcement learning performed to learn rejection targets.
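The positive/negative feedback loop described in this paragraph could be sketched as follows; the keyword lists, emotion labels, and additive update rule are illustrative assumptions:

```python
POSITIVE = {("i", "i", "n", "e"), "joy", "relief"}          # praise / emotions
NEGATIVE = {("d", "a", "m", "e", "n", "e"), "sadness", "anger", "dejection"}

def update_preference(weights, item, phonemes, emotion, lr=0.1):
    """Reinforce or suppress a search-preference weight for `item`
    according to the user's spoken reaction (assumed additive rule)."""
    if tuple(phonemes) in POSITIVE or emotion in POSITIVE:
        weights[item] = weights.get(item, 0.0) + lr     # strengthen
    elif tuple(phonemes) in NEGATIVE or emotion in NEGATIVE:
        weights[item] = weights.get(item, 0.0) - lr     # suppress / exclude
    return weights

prefs = update_preference({}, "jazz_playlist",
                          ("i", "i", "n", "e"), "joy")
print(prefs)   # -> {'jazz_playlist': 0.1}
```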
[0381] Keywords for operable processes may also be displayed on the screen so that the user can select from, or speak, a list of phoneme strings or phoneme-piece strings, and hidden commands that are not displayed may also exist. In this way, a voice user interface based on phoneme/phoneme-piece recognition accompanied by emotion, without using general-purpose speech recognition, can be realized.

[0382] Here, the dictionary that converts between phoneme strings, phoneme-piece strings, or emotion identifiers and processing procedures may reside either on the terminal side or on the distribution base station side; symbol strings such as phoneme symbol strings, image features, and emotion identifiers relating to new control commands, media types, format types, and device names may be transmitted, received, and distributed using markup languages described later, such as XML, HTML, and RDF, or RSS and CGI, and convenience can be achieved by combining them appropriately.
[0383] <Examples of combining co-occurrence information>
The procedure for using combinations of co-occurrence information based on multiple identifiers and feature quantities, which is fundamental to the present invention, is now described more concretely. First, as a framework, an example search processing procedure using multiple types of identifiers and an example of arbitrary processing based on search using multiple types of identifiers are presented, followed by concrete examples of combinations involving the respective identifiers. Combinations of these identifiers and feature quantities may consist of two or three as needed, or of four or more, or a dozen or more; by referring to a co-occurrence dictionary constructed from the co-occurrence probabilities of these identifiers and the covariance matrices of the feature quantities, and constructing search conditions according to the user's instructions, searches not previously possible are realized.

[0384] The co-occurrence state or co-occurrence information in the present invention is based on natural information consisting of auditory information, visual information, and sensor information. It is fundamentally information constructed using identifiers and feature quantities acquired from video and/or audio, and is a plurality of associated pieces of information that also use distributed character information and detected sensor information, characterized in that those identifiers and feature quantities occur simultaneously within a unit time appropriate to the use. It may be constructed along the time transitions of multiple pieces of co-occurrence information; it may be covariance matrices or co-occurrence probabilities constructed from their means and variances; and state transition models of co-occurrence information may be constructed using their probability transition matrices. It is used as "index co-occurrence information" employed in the index information of content and as "co-occurrence search condition information" constructed from the search conditions input by the user.
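Constructing index co-occurrence information over a unit-time window, as defined above, could look like the following sketch; the window length and the event stream format are assumptions:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(events, window=1.0):
    """`events` is a list of (time_sec, identifier) pairs from any
    recognizer (phoneme, emotion, image, sensor, ...). Identifiers that
    fall in the same unit-time window count as co-occurring."""
    buckets = {}
    for t, ident in events:
        buckets.setdefault(int(t / window), set()).add(ident)
    counts = Counter()
    for idents in buckets.values():
        for pair in combinations(sorted(idents), 2):
            counts[pair] += 1
    return counts

events = [(0.2, "explosion_sound"), (0.4, "warm_colors"),
          (0.7, "radial_motion"), (1.5, "wave_sound")]
print(cooccurrence_counts(events))   # pairs within the same 1-second window
```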
[0385] 《Example of a search processing procedure using multiple types of identifiers》
When specifying search conditions or detection conditions for executing search processing with multiple types of identifiers, the boundary of the range over which identifiers and feature quantities are evaluated may be a number of frames dividing the time axis, a point at which the divergence of feature quantities obtained by any discrimination method exceeds or falls below a threshold, or an identifier boundary obtained by any detection or discrimination method.

[0386] In addition to examining the bias in which identifiers co-occur within a given range, the distributed information may be indexed using EPG, BML, RSS, teletext, and characters contained in subtitles and video; the line-up of performers, the title, the names of the director and producer, and the family and personal relationships among the actors' roles may be used as identifiers; and identifiers and feature quantities may be classified to construct a co-occurrence dictionary.
[0387] Then, by converting a character string or identifier ID associated with an identifier obtained as a search result into another identifier or identifier string using a conversion dictionary, and searching for locations where that identifier or identifier string matches the content information, other identifiers and identifier strings associated, via the character string or identifier ID, with the identifier obtained as a search result can be extracted; this enables searches based on co-occurrence relationships that use identifier IDs and character strings as an intermediate code system.

[0388] More concretely, a performer's name is acquired from character information such as EPG, BML, RSS, teletext, or character strings recognized in subtitles and video; a performer's name matching a phoneme string spoken or input by the user is detected; and the locations in the video information where that name is uttered or where it is displayed in subtitles are detected.

[0389] As a result, the detected location is judged to be a scene relevant to the user's purpose, and the content information may be played back, recorded, or skipped, or recording may be started upon a specific title image feature; in such processing, methods such as narrowing down the search targets using co-occurrence matrices and co-occurrence probabilities through statistical processing may be employed, and identifiers and feature quantities may be classified to construct a co-occurrence dictionary.

[0390] Furthermore, using as identifiers the program information obtained from EPG, MPEG7, BML, RSS, XML, websites, and character strings recognized in subtitles and video, such as the line-up of performers, titles, directors, producers, sports team names, and family and personal relationships among the actors' roles, searches such as "scenes where the protagonist and the antagonist co-occur" or "scenes where the protagonist and the lover co-occur" can be handled by multivariate analysis of the image features, the emotions expressed in the scene, the phoneme strings and phoneme-piece strings accompanying the voices uttered in the scene, and the changes in video features, assigning identifiers; indexing, search, detection, and learning are then performed using the phoneme strings or phoneme-piece strings, the program information, and the image features or image identifiers.
[0391] 《Example of arbitrary processing based on search using multiple types of identifiers》
For example, a query is constructed by converting an input character string into a symbol string of phonemes or phoneme pieces, or by using symbol information based on phonemes or phoneme pieces from the user's spoken voice together with identifiers recognized from emotions, environmental sounds, and image features, and recording of broadcast content into the information storage device based on the present invention is started.

[0392] During recording, the symbol strings are evaluated simultaneously and matched against pre-registered symbol strings; when the match exceeds a certain ratio, the hour before and after that point is registered as a long-term retention target, and after a fixed time has elapsed, information not included in the long-term retention targets is deleted from the information storage unit, thereby removing unnecessary information within a finite storage capacity and realizing efficient information retention. Here too, methods such as narrowing down detection targets using co-occurrence matrices and co-occurrence probabilities through statistical processing may be used, and identifiers and feature quantities may be classified to construct a co-occurrence dictionary.
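The retention rule in this paragraph might be sketched as follows; the one-hour margin follows the text, while the 0.8 match threshold and the data layout are assumptions:

```python
def retention_ranges(match_times, match_ratios, threshold=0.8, margin=3600):
    """Return (start, end) ranges in seconds to keep: one hour before and
    after every recording position whose match ratio exceeds `threshold`
    (the threshold value itself is an assumption)."""
    return [(max(0, t - margin), t + margin)
            for t, r in zip(match_times, match_ratios) if r > threshold]

def purge(segments, keep_ranges):
    """Drop recorded (start, end) segments not overlapping any keep range,
    freeing finite storage as described in the text."""
    return [s for s in segments
            if any(s[0] < k[1] and s[1] > k[0] for k in keep_ranges)]

keep = retention_ranges([7200.0], [0.9])
print(purge([(0, 1800), (5400, 9000)], keep))   # -> [(5400, 9000)]
```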
[0393] 《Example of search by character string and identifier》
For example, input content information is indexed by the emotions and environmental sounds recognized from its audio and by the image features, motion identifiers, and object identifiers recognized from its video, and is recorded as a database according to the present invention. Next, the voice or character string input by the user is converted into a symbol string of phonemes or phoneme pieces and given to the recorded database as a query; a search is performed, and the search results detected as the target information are presented to the user.

[0394] Sounds generally called onomatopoeia, such as "wan wan" (a dog's bark) and "dokaan" (a boom), are also recognized as relatively approximate phonemes or phoneme pieces, and may therefore be used in the search as a search index supporting the environmental sound identifiers. Emotion identifiers used for the search may also be selected from character strings: for an emoticon acquired in a text-input query, such as a crying face "(; ;)", the emotion identifier may be set to "joy" or "sorrow" accordingly, and the search condition constructed from it. Through detection of these emotion identifiers, the search technology of the present invention may be used as artificial intelligence for chat, agents, and robots in dialogue between devices and humans; identifiers and feature quantities may also be classified to construct a co-occurrence dictionary.
[0395] 《Example of search involving emotion and proper nouns》
For example, by converting a proper noun into phoneme or phoneme-piece symbols and detecting that proper noun, then evaluating the emotion features and emotion identifiers near the locations where the proper noun occurs, or evaluating the appearance probability of the emotion features and emotion identifiers that the voice of the speaker who uttered the proper noun exhibits around the utterance time, the bias of the user's emotions toward a given proper noun can be evaluated from the frequency with which emotions appear alongside it, enabling searches that reflect the user's preferences.
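Evaluating the emotional bias accompanying a proper noun could be sketched as follows; the window width and emotion labels are assumptions:

```python
from collections import Counter

def emotion_bias(noun_times, emotion_events, window=5.0):
    """Count emotion identifiers detected within `window` seconds of each
    utterance of the proper noun; the normalized counts approximate the
    user's emotional bias toward that noun."""
    hits = Counter()
    for t in noun_times:
        for et, emo in emotion_events:
            if abs(et - t) <= window:
                hits[emo] += 1
    total = sum(hits.values()) or 1
    return {emo: n / total for emo, n in hits.items()}

bias = emotion_bias([12.0, 95.5],
                    [(10.0, "joy"), (96.0, "joy"), (300.0, "anger")])
print(bias)   # -> {'joy': 1.0}: the noun tends to co-occur with joy
```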
[0396] 《Example of search involving emotion and images》
By combining the image features obtained by a face detection algorithm with emotion identifiers from emotion recognition, the feature quantities of facial expressions under a specific emotion can be detected, and by statistically learning those feature quantities, searches that discriminate facial expressions can be executed. Alternatively, after converting a face to a fixed orientation and size using feature quantities based on 3D or 2.5D, regions exhibiting change or movement may be learned as separate items and given identifiers, separating parts of the face as eyes and mouth and learning changes in expression; bodies, machines, and devices for use in other searches may be classified by similar methods.

[0397] Furthermore, by searching for the co-occurrence of detection of the protagonist's face with the phoneme string of the protagonist's name and emotion identifiers, the excitement of a scene, which conventionally could be searched only by volume, can be searched based on the emotion carried by the voice calling the protagonist's name; and by performing co-occurrence-based searches, such as detecting scenes where large characters appear on screen together with the excitement conveyed by the phoneme strings or phoneme-piece strings of cheers and by excitement emotion identifiers, scoring scenes in sports and highlight scenes in movies can be found.

[0398] At this time, by associating with arbitrary tags and names in EPG, BML, RSS, and teletext, detecting from the EPG that the program is a sports program, detecting score changes from the BML, and moving the playback position to locations where excitement is detected in the emotion features around the time the score change was displayed, sports highlight scenes can be detected; by learning the image features in their temporal vicinity, such highlight scenes may then become detectable from image information alone. Videos attached to blogs may be analyzed, organized in association with the blog text, and made searchable; playback may fast-forward to the detected locations; when, through such learning, the user is found to frequently perform negative operations such as fast-forwarding over a range, that range may be treated as a disliked scene, a scene of little interest, or a scene offensive to public order and morals, its feature quantities extracted, and skip playback performed automatically; or services such as delivering notice by e-mail or RSS that the score or the described content has changed may be provided.
[0399] 《Example of search involving images and environmental sounds》
For example, suppose that when scene features varying between frames are extracted as video feature quantities, the partial motion features are large and their directions of motion are not parallel, warm-color features such as red and yellow occupy much of the screen, radial motion is detected, and an audio feature quantity identified as an explosion sound is detected; the scene is then treated as an explosion scene and index information is recorded in synchronization with the moving image. Similarly, when much of the screen is blue and the sound of waves is detected, the scene is indexed as a seaside scene; when slowly moving white masses are detected within the blue and the sound of wind is detected, it is indexed as a sky scene. With such indexing in place, the frequency with which each index appears relative to the overall length of the video is obtained, and by evaluating the similarity of those frequencies, bias in the on-screen presentation is detected; by analyzing the user's viewing behavior in the same way, searches based on the user's viewing behavior and the frequency of identifiers appearing in the content are realized. Also, by analyzing the feature quantities of score display screens through image recognition and identifying the accompanying emotional features in the audio and environmental sounds such as cheers, searches for specific scenes using co-occurrence states may be performed.
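The explosion/seaside/sky rules above reduce to conjunctions of per-window detector outputs; a sketch follows, in which the detector names and boolean thresholds are assumptions:

```python
def label_scene(d):
    """`d` maps detector name -> boolean result for one video window.
    Conjunctions of co-occurring audio and image identifiers yield the
    scene index, mirroring the rules described in the text."""
    if d["warm_colors"] and d["radial_motion"] and d["explosion_sound"]:
        return "explosion_scene"
    if d["mostly_blue"] and d["wave_sound"]:
        return "seaside_scene"
    if d["mostly_blue"] and d["slow_white_mass"] and d["wind_sound"]:
        return "sky_scene"
    return None

window = {"warm_colors": True, "radial_motion": True,
          "explosion_sound": True, "mostly_blue": False,
          "wave_sound": False, "slow_white_mass": False,
          "wind_sound": False}
print(label_scene(window))   # -> "explosion_scene"
```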
[0400] 《Example of search involving environmental sounds and program information》
For example, while recording a moving image whose genre, obtained from broadcast BML, EPG, or the like, is classified as action, video and audio feature quantities and identifiers are generated and recorded. The information is then subjected to multivariate analysis based on the recorded feature quantities and identifiers, and the appearance frequency of each identifier in action movies is obtained and analyzed. As a result, arbitrary distance evaluation functions and recognition functions such as HMMs can be constructed from the analyzed feature quantities; for example, an evaluation function for evaluating explosion sounds and rapid changes in screen features can be built. Through such feature learning, it becomes possible to combine character information from EPG, BML, RSS, and teletext, character information recognized in subtitles and video, and the evaluation results of one's own evaluation functions; based on their co-occurrence states, evaluation functions and evaluation-result thresholds matched to the user's hobbies and tastes can be set, so that arbitrary processing such as recording and playback of content information, or search, can be carried out. Here, the line-up of performers, titles, and the names of directors and producers, as well as family and personal relationships among the actors' roles, may be used as identifiers, or these may be expanded into phonemes or phoneme pieces so that the degree of identifier match can be evaluated jointly.
[0401] 《Example of search involving emotion and musical scale》
For example, music offered for sale is indexed by the various methods described above using emotion features and emotion identifiers, scale features and scale identifiers, and phoneme or phoneme-piece symbol strings, and is registered in a database; by evaluating the distance or match rate between the index information consisting of the identifiers and feature quantities obtained from music the user designates as a favorite and the index information consisting of the identifiers and feature quantities of the music registered in the database, music information can be searched based on the user's hobbies and interests.
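Matching a favorite track's index against the database reduces to a distance or match-rate computation over the index vectors; a cosine-similarity sketch follows, in which the three-component vector layout (emotion, scale, and phoneme-derived features) is an assumption:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two index vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_music(query_index, database):
    """`database` maps title -> index vector; returns titles ordered from
    most to least similar to the query index."""
    return sorted(database,
                  key=lambda t: cosine(query_index, database[t]),
                  reverse=True)

db = {"song_a": np.array([0.9, 0.1, 0.3]),
      "song_b": np.array([0.1, 0.8, 0.5])}
print(rank_music(np.array([1.0, 0.2, 0.4]), db))   # -> ['song_a', 'song_b']
```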
[0402] 《Search examples using other combinations》
Regarding instrument types, scenes and pages in which a given instrument is being played or displayed can be searched from the co-occurrence information of instrument name and acoustic features, or of instrument name and image features. To search for a movie featuring a piano, one may pronounce "piano [p/i/a/n/o]" and perform a phoneme-string search, or, based on the phoneme string, perform a co-occurrence search using an image feature evaluation function constructed from image information showing pianos and an instrument evaluation function constructed from acoustic features collected from piano sounds, and then search audio and video streams according to those features; audio and video streams detected by the search instruction may be recorded or skip-played, among other arbitrary processing. If a piano maker is mentioned in the EPG, BML, RSS, or teletext, a URL or the like may be acquired and information obtained by connecting to the Web; an instruction to switch the instrument timbre of the music being played may be issued; and identifiers and feature quantities may be classified to construct a co-occurrence dictionary.
[0403] Regarding machine-sound types, scenes like those described above can likewise be searched using automobile tappet and engine sounds or locomotive exhaust sounds, and the names of those sounds may be converted into phoneme strings or phoneme-piece strings for use in search. For the search condition "engine sound," only scenes in which an engine sound is heard may be retrieved; for an engine scene, scenes having both engine image feature quantities and engine sound may be retrieved.

[0404] Regarding environmental sound types, natural sounds such as wind and waves may be added to the several examples above; sounds that vary with the environment, such as animal and insect cries, office sounds, the sounds of drinking establishments, sports cheering, and station ticket gates, may be collected, the co-occurrence states of their feature quantities observed, and evaluation functions constructed. Noise classifications such as automobile noise and factory noise may be used for scene search in movies and dramas, as with instruments, and noise types such as white noise and pink noise may be used to generate test noise for test equipment for devices such as amplifiers; the names of those sounds may be converted into phoneme strings or phoneme-piece strings for use in search.

[0405] Regarding face types, it is possible, for example, to search for images serving as indices for facial-expression identifiers associated with emotions by associating facial feature quantities with emotion identifiers; the relevant names may likewise be converted into phoneme strings or phoneme-piece strings for use in search.
[0406] Regarding person types, images serving as indices for facial-expression identifiers associated with emotions can be searched by associating facial feature quantities with the phoneme strings or phoneme-piece strings of names; and by constructing image feature quantities from information such as clothing, physique, and hairstyle for use in a city surveillance system, video in which the name of a person being tracked was recorded can be searched. The names of those persons, clothing, and physiques may be converted into phoneme strings or phoneme-piece strings for use in search.

[0407] Regarding expression types, when expression types are based on the aforementioned face types and emotion types, associating them with person types enables scene searches that take a given person's emotional behavior into account, via keyword presentation with phoneme or phoneme-piece symbol strings; the names of those expressions and emotions may be converted into phoneme strings or phoneme-piece strings for use in search.

[0408] Regarding motion types, when expression types are based on the aforementioned face types and emotion types, associating them with person types enables scene searches that take into account a given person's emotional behavior, gestures, motions, and manner of walking. By associating motion identifiers with phoneme strings or phoneme-piece strings, processing is conceivable such as detecting sign-language information in input video and uttering it by speech synthesis, or converting an utterance into a phoneme string and displaying sign language by playing back in CG the motions associated with the phoneme string; the names of those motions may be converted into phoneme strings or phoneme-piece strings for use in search.

[0409] Regarding landscape types, natural images and urban images can be classified by the co-occurrence information of image features such as color features and the existence probability of straight lines and curves per unit area; feature quantities can be derived from phoneme strings based on scene names, and indexing and search can be performed using the phoneme strings or phoneme-piece strings of what was uttered while viewing a scene. Using position information and associating landscape types with phoneme strings, information on any region can be searched by voice, based on arbitrary image features, from large archives of movie and broadcast video; travel guides can be constructed based on the image features of the locations used in famous movie scenes, and similar landscapes can be detected. The names of those landscapes and place names may be converted into phoneme strings or phoneme-piece strings for use in search.
[0410] 表示位置種別に関しては、画面内のどの位置にどのような画像があるかを評価する と共に、その範囲を指定して表示し、利用者に名称を呼んでもらうことで、本発明の 装置が表示内容を学習するための指標にするといつた方法が考えられ、一般的な顔 検出技術を用いて顔の位置を検出したあとで、検出した複数の位置に数字を表示し 、順に「1番はだれ?」、「2番はだれ?」として、利用者に名前を呼んでもらい学習し たり、「この人は〇〇さん?」と学習した音素列や音素片列力 発話して確認をとると V、つた方法を用いたり、「わ力もな 、[w/a/k/a/r/a/n/a 」 t 、う特定の制御のための キーワードに関連付けられた音素列や音素片列が検出された場合は学習対象から はずしたり、特徴量だけ学習し名称や呼称との関連付けが保留にされたフラグを立て たりと!/、つた方法を用いて学習効率を改善しても良 、し、それらの表示位置の呼称を 音素列や音素片列に変換して検索に利用できるようにしても良 、。 [0410] With regard to the display position type, it is possible to evaluate what kind of image is in which position in the screen, specify the range, display it, and let the user call the name. When the device is used as an index for learning the display content, it is conceivable to use a method.After detecting the position of the face using general face detection technology, numbers are displayed at the detected positions. “Who is No. 1”, “Who is No. 2”, asks the user to call his name, learns, and speaks and confirms the phoneme string and phoneme string train V, using the Tatsu method, or “w / a / k / a / r / a / n / a” t, phoneme sequences and phonemes associated with keywords for specific control. When a single row is detected, it is removed from the learning target, or only the feature amount is learned and the association with the name or name is put on hold. You can improve the learning efficiency by using the above-mentioned method! It is also possible to convert it to a phoneme string or phoneme string string so that it can be used for searching.
[0411] 画像種別に関しては、前述のいくつかの例にカ卩ぇ楽器や車種、機種、動植物の種 類といったものを弁別するための名称の音素列や関連する音響特徴量と関連付けて 検索することで、前述のピアノであればピアノが表示されて 、て且つ音楽が鳴って ヽ るシーンを検索したりしてもよいし、ピアノメーカのカタログをウェブサイト経由で取得 しても良いし、それらの音の呼称を音素列や音素片列に変換して検索に利用できる ようにしても良く任意の製品や商品の呼称を用いてもょ 、。  [0411] With regard to image types, search in association with the phoneme string of names and related acoustic features for distinguishing between the above-mentioned examples such as musical instruments, car types, models, and animal and plant types. Thus, if the piano is the above-mentioned piano, the piano may be displayed and the scene where the music is heard may be searched, the catalog of the piano manufacturer may be obtained via the website, The names of these sounds can be converted into phoneme strings or phoneme string strings so that they can be used for searching, or any product or product name can be used.
[0412] 文字記号種別に関しては、認識処理により識別された文字列を音素列や音素片列 に変換し検索の対象としたり、静止画であればクリックしたり範囲指定したところの単 語に関連する音声や映像を表示したり検索したりすることが可能となるとともに、それ らの文字やフォントの呼称を音素列や音素片列に変換して検索に利用できるようにし ても良い。  [0412] Regarding the character symbol type, the character string identified by the recognition process is converted into a phoneme string or phoneme string string to be searched, and if it is a still image, it is related to the word that was clicked or range specified. It is possible to display and search the voice and video to be used, and convert the characters and font names into phoneme strings and phoneme string strings so that they can be used for the search.
[0413] With regard to sign types, signs may be used for searches based on phonemes and phoneme pieces in guides such as car navigation systems; signs detected while driving may be announced by speech synthesis using phonemes and phoneme pieces; and the meaning of foreign signs appearing in distributed news and the like may be rendered as subtitles. The names of such signs may also be converted into phoneme strings or phoneme-piece strings and made available for search.
[0414] With regard to shape types, identifying round, square, and pointed objects makes it possible to detect objects that obstruct a robot's movement or pose a danger to people, or to execute a search with the phoneme string or phoneme-piece string of an abstract keyword based on the associated image features and detect matching items. A search is also possible that associates fixed video, such as the opening title of a given program, with the phoneme string or phoneme-piece string of a fixed utterance such as an opening announcement. The names of such shapes may be converted into phoneme strings or phoneme-piece strings and made available for search, and by using waveform shape types, changes in brain waves or pulse waves extracted from multiple sites may be statistically analyzed, given identifiers, and made available for search.
[0415] With regard to graphic symbol types, graphics and symbols appearing in movie scenes may be searched and used as cues for inserting subtitles for symbols and signs when a movie is distributed in another language; graphics such as abstract circle and cross marks or correct-answer and incorrect-answer icons may be detected and used to detect scenes in quiz programs. Using these at editing time can simplify meta-information annotation work, and the names of such graphics and symbols may be converted into phoneme strings or phoneme-piece strings and made available for search.
[0416] With regard to broadcast program types, since program information such as performers, authors, hosts, and program titles can be acquired, biases in screen composition and acoustic features per program genre can be extracted and used as indicators for analyzing program tendencies; the names of such program genres and categories may be converted into phoneme strings or phoneme-piece strings and made available for search.
[0417] Furthermore, even if it becomes possible in the future to record and reproduce arbitrary sensations such as taste, smell, touch, pitch perception, humidity, and texture, their feature quantities and identifiers may be added to the index on the recording medium of the present embodiment for the convenience of the user.
[0418] As a result, detection of information based on diverse co-occurrence information, previously impossible, becomes feasible, and recording, search, skip playback, digest playback, mail delivery, messenger messages, and RSS delivery triggered by such detection become possible.
[0419] <Application examples as products>
The product examples described below show examples of products and service solutions that can be realized by combining the constituent and implementation elements of the present invention, using the implementation and configuration requirements based on the novelty described above, namely "Basic search device configuration and technology" and "Indexing, search, and arbitrary processing with multiple identifiers and multiple search conditions". For each field, a co-occurrence dictionary of identifiers may be constructed based on the term tendencies, image tendencies, acoustic tendencies, and control dictionaries described in the corresponding examples, or search conditions and detection conditions may be constructed using dictionaries that convert between identifiers and phoneme and/or phoneme-piece strings, between identifiers and character strings, and between identifiers and feature quantities.
[0420] 《Example of a broadcast recording, video recording/playback, and video search system》
As a combined application of the example of search based on images and environmental sounds, the example of search based on environmental sounds together with EPG, BML, RSS, and teletext, the example of audio/video search with multiple identifiers, and the example of arbitrary processing triggered by identifier detection, FIG. 26 is used as an illustration.
[0421] First, a video recording device such as a video camera is installed, and by extracting and analyzing audio from multiple microphones and converting it into phonemes, methods become possible such as pointing the camera in the direction from which a specific keyword was uttered, or starting recording in response to a keyword. Also, when a user hums a tune, the lyrics may be converted into phonemes while the melody is simultaneously extracted, so that specific music can be selected and recorded, or already recorded content can be played back. Furthermore, by executing a video search involving emotion, the climax of a scene may be detected, or music whose melody carries a specific emotion may be detected; a scene highly similar to a scene designated by the user with a pointing device or remote control may also be searched for and detected.
[0422] In this way, phoneme symbols and emotion symbols are indexed simultaneously at recording time, and in conjunction with services using markup languages described later such as EPG, BML, RSS, and teletext, or CGI, the recording range and search range may be determined, unnecessary portions deleted, or scenes skipped automatically during playback. For this purpose, a specific keyword is converted into phonemes, recording is performed as a temporary file while phoneme matches are checked, and when the target keyword is detected, emotion features are extracted while an index is constructed.
[0423] Also, regarding a device that uses EPG, BML, RSS, and teletext to classify and play back or record files comprising association information about file names, target moving images, still images, audio, and text, and their chronological presentation order, the convenience of the user may be improved by constructing phoneme strings or phoneme-piece strings for the target information to be designated, by distributing phoneme strings or phoneme-piece strings via EPG, BML, RSS, or teletext, or by searching, recording, and playing back recorded content and recording targets using phoneme strings or phoneme-piece strings based on received EPG, BML, or RSS.
[0424] Of course, the device executing these services may be a desktop information processing device or a portable information terminal, and the content of the present invention may be implemented via a communication base station using them; it may also be realized by calling a home device using the present invention from a portable terminal, or by mailing information recognized by the portable terminal to a home device using the present invention.
[0425] As a result, the following becomes feasible using the present invention. For example, suppose a celebrity named "有名夫 (ありなお [/a/r/i/n/a/o/])" is to appear on television and a user learns of this on the day itself. Even if the user does not know on which channel or at what time the appearance will occur, as long as the appearance has not already ended, the user can give the keywords "有名夫 (ありなお [/a/r/i/n/a/o/]), record (ろくが [/r/o/k/u/g/a/])" to a home device using the present invention. The device then starts recording all receivable channels and, while expanding into phonemes and storing the speech of the keyword excluding the command portion, executes detection on the recorded content by searching for the phoneme symbol string.
[0426] Next, in the present embodiment, where a device using the present invention detects the target keyword, the matching degree is set to 60%; content is recorded while a save-flag boundary is set every minute, and for sections where 60% is not exceeded within one minute, the recorded content information is marked for deletion one hour later. Conversely, from a point where a portion matching the keyword at 60% or more is detected, the range up to, for example, one hour before, and/or up to a program boundary given by EPG, BML, RSS, or teletext, is marked for preservation.
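This retention rule may be sketched as follows; the one-minute segments, the 60% threshold, and the one-hour preservation window follow the description above, while the segment representation and match_score() are hypothetical simplifications.

# Sketch of the retention rule of [0426]: record in one-minute segments,
# and preserve any segment with a keyword match of 60% or more together
# with the preceding hour; everything else becomes a deletion candidate
# one hour later. Segments and match_score() are hypothetical.

SEGMENT_SECONDS = 60
MATCH_THRESHOLD = 0.60
KEEP_BEFORE_SECONDS = 3600

def segments_to_keep(segments, keyword_phonemes, match_score):
    """segments: chronological list of per-minute audio chunks."""
    keep = set()
    for idx, audio in enumerate(segments):
        if match_score(audio, keyword_phonemes) >= MATCH_THRESHOLD:
            first = max(0, idx - KEEP_BEFORE_SECONDS // SEGMENT_SECONDS)
            keep.update(range(first, idx + 1))  # match plus preceding hour
    return keep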
[0427] As a result, for a broadcast in which the word "有名夫 (ありなお [/a/r/i/n/a/o/])" appears, roughly one hour around the occurrence of that word is automatically preserved, so recording can be performed automatically even without knowing on which channel or at what time it will be broadcast. Videos recorded by the present invention may be ranked according to the number of occurrences and the matching degree of the word, and displayed as a list.
[0428] At the same time, face detection may be performed and learning that associates actors' names with facial features may be repeated, so as to learn whether a specific person is on screen. In this case, by having the user indicate at playback time which of the output facial features matches the recorded name, learning efficiency can be improved and the device itself can autonomously improve its performance in automatic detection and recording. In addition, the device may learn while autonomously evaluating the degree of matching between an actor's name and that person's facial features using EPG, BML, RSS, teletext, and the like.
[0429] Also, where the celebrity (ありなお) is an actor, he may be called by a different name within a video or audio work. In this case, an in-program search can be executed, for example, by the following procedure. The actor's name in the performer list of a given program, obtained from EPG, BML, RSS, or teletext, is searched from the user's utterance using information obtained by converting the kanji, kana, or English words into a symbol string of phonemes or phoneme pieces, or the actor's name is searched by conventional text input, and the target actor's name is extracted. Next, the role name associated with the actor's name is extracted.
[0430] Next, based on the role name, a symbol string of phonemes or phoneme pieces is constructed with reference to a phoneme/phoneme-piece dictionary. Then, a search by that phoneme or phoneme-piece symbol string is executed against video and audio work information indexed by phoneme or phoneme-piece symbol strings. As a result, it becomes possible to search for scenes associated with the role name of the target actor, and searches linked to EPG, BML, RSS, and teletext, which were impossible with conventional generic phoneme or phoneme-piece search, become possible, improving the convenience of search in video and audio works.
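The lookup chain of paragraphs [0429] and [0430] may be sketched as follows; the two dictionaries stand in for EPG-derived cast data and a pronunciation dictionary, and search_index() is a hypothetical phoneme-indexed scene search.

# Sketch of the actor-name -> role-name -> phoneme-search chain of
# [0429]-[0430]. Both dictionaries and search_index() are hypothetical
# stand-ins for EPG-derived cast data, a pronunciation dictionary, and a
# phoneme-indexed scene search.

def find_role_scenes(actor_name, cast_list, pronunciations, search_index):
    role = cast_list[actor_name]           # actor name -> role name
    role_phonemes = pronunciations[role]   # role name -> phoneme string
    return search_index(role_phonemes)     # scenes indexed by phonemes

cast_list = {"有名夫": "Taro"}
pronunciations = {"Taro": "/t/a/r/o/"}
print(find_role_scenes("有名夫", cast_list, pronunciations,
                       lambda p: [("scene", p)]))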
[0431] Also, video information in which the time recorded under identifiers such as an explosion-sound identifier, an index by phoneme symbol strings associated with abusive words, intense prosody, and music appears more frequently than other identifiers such as laughter and cheers can be judged to be an action program, and these can be aggregated into an evaluation function that evaluates and searches the degree of being an action program. Similarly, a function can be created that evaluates video information as a horror program when the appearance frequency of indexes combining dark video time with phoneme or phoneme-piece symbol strings and emotion identifier strings associated with screams, relative to the total video duration, is detected to be higher than the average appearance frequency of scream-associated indexes across many other video/audio works, so that the degree of being a horror program can be evaluated and searched. Applied to the recording of meeting information, such methods enable a search device that can classify the emotional ups and downs and the changes of content in a meeting.
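One possible form of such an evaluation function is sketched below; the identifier names and weights are hypothetical examples of the aggregation described above, not fixed values of the embodiment.

# Sketch of a genre-degree evaluation function: aggregate per-identifier
# detection times into a normalized score. Identifier names and weights
# are hypothetical illustrations of the aggregation described above.

ACTION_WEIGHTS = {"explosion": 1.0, "abusive_word": 0.5,
                  "intense_music": 0.5, "laughter": -0.5, "cheering": -0.5}

def action_degree(identifier_seconds, total_seconds):
    """identifier_seconds: dict mapping identifier -> seconds detected."""
    score = sum(ACTION_WEIGHTS.get(name, 0.0) * secs
                for name, secs in identifier_seconds.items())
    return score / total_seconds   # comparable across works of any length

print(action_degree({"explosion": 120, "laughter": 30}, total_seconds=3600))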
[0432] Environmental sounds such as explosion sounds, wind, and waves can also be decomposed chronologically by the identifier reconstruction processing of the present invention to construct environmental-sound pieces. Similarly, visemes may be decomposed chronologically and treated as viseme pieces; for moving images, changes in the video may be treated as motion elements or motion-element pieces; and for image information, images may be treated as image elements or image-element pieces, thereby reconstructing new indexes for search.
[0433] Then, once features such as screams and explosion sounds have been learned, it is also possible to configure, and use for security purposes, a surveillance camera that starts recording in response to the occurrence of such danger-indicating information, or a surveillance recording system that, for recorded content older than 24 hours, deletes everything except the hour before and after a scream or explosion sound and continues recording.
[0434] In this way, whereas conventionally only information related to speech utterances was a search target via symbol strings of phonemes or phoneme pieces, using feature quantities and identifiers obtained by multiple methods as in the present invention makes it possible to realize information search that follows program content. Of course, the present invention may be implemented by a reduced-function device that applies these techniques to audio only and executes them on radio recordings. Applied to a surveillance camera, breakage of a window or door may be detected by detecting that the image-feature evaluation distance of a discriminant function identifying window and door images has deviated from its average, or crime may be prevented by detecting that a person has remained in front of a locked door for a long time making only small movements. Scene boundaries of moving images may be detected for use in a video editing machine. By using a markup language, or by converting character strings into phonemes or phoneme pieces for search, the weather may be detected from image features together with audio or other identifiers so that indoor equipment is controlled to manage ventilation and lighting, and personal authentication using names, passwords, or face recognition, and billing settlement by spoken amounts, may also be performed.
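The surveillance retention policy of paragraph [0433] may be sketched as follows, assuming scream and explosion detections are available as timestamps; the 24-hour age limit and the one-hour windows follow the description above.

# Sketch of the surveillance retention policy of [0433]: keep everything
# younger than 24 hours, and older material only within one hour of a
# scream or explosion detection. Times are in seconds; event detection
# itself is assumed to happen elsewhere.

DAY, HOUR = 24 * 3600, 3600

def should_keep(segment_time, now, event_times):
    if now - segment_time < DAY:
        return True
    return any(abs(segment_time - t) <= HOUR for t in event_times)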
[0435] In this case, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side; correction information and symbol strings such as phoneme symbol strings, image features, audio features, and emotion identifiers relating to new programs, actor names, program genres, and distribution station names may be transmitted, received, and distributed using markup languages described later such as XML and HTML, or RSS and CGI, and convenience can be achieved by combining these.
[0436] Of course, the device executing these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented via a communication base station using them.
[0437] 《Example of a product quality analysis system based on consumer emotion》
A CRM (Customer Relationship Management) system using the present invention is described with reference to FIG. 27, as an application of the above-described search example using proper nouns and emotion identifiers and the example of executing arbitrary processing by audio/video search with multiple identifiers.
[0438] First, utterances accompanied by consumer emotion are analyzed and indexed using the multiple analysis devices and identification devices of the present invention. By searching the phonemes, phoneme pieces, and emotion identifiers obtained as a result and determining their frequencies, the reputation of a product as seen by consumers can be analyzed quantitatively from the phoneme string indicating a product of a specific model number and the accompanying emotions such as anger and sadness, based on the number of occurrences of those emotion features and of the phoneme symbol strings that identify the product. The results may be displayed using a markup language described later such as HTML or XML, or CGI, or the manual of the identified product may be displayed.
[0439] More specifically, a consumer requests a dialog with a help desk operator by telephone or at a store. At this time, the feature quantities of the voices of both the operator and the consumer are extracted, and emotions, phonemes, and phoneme pieces are recognized from the extracted feature quantities.
[0440] At this time, the phonemes, phoneme pieces, and emotions recognized by the above-described method are stored in an information storage device. Next, for the stored information, the relationship is evaluated between audio information in which phonemes or phoneme pieces associated with a product name appear and audio information in which emotion identifiers of anger or sadness are recognized.
[0441] As a method of evaluating this relationship, audio information in which emotions of anger or sadness persist for a long time when a specific product model number is detected may be taken to indicate a low consumer evaluation. By evaluating the phoneme symbol strings recognized in the audio information and the distribution of emotion identifiers in this way, consumers' feelings toward a product can be evaluated quantitatively, and analysis of product reliability can be performed quantitatively.
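This relevance evaluation may be sketched as a per-product ratio of negative-emotion time to total call time, as below; the call record format is a hypothetical simplification.

# Sketch of the relevance metric above: among calls whose phoneme index
# contains a given product model number, measure the share of call time
# carrying anger or sadness identifiers. The call record format is a
# hypothetical simplification.

NEGATIVE = {"anger", "sadness"}

def negative_emotion_ratio(calls, product_phonemes):
    """calls: dicts with 'phonemes' (str), 'duration' (s), and
    'emotion_spans' as (identifier, seconds) pairs."""
    total = negative = 0.0
    for call in calls:
        if product_phonemes not in call["phonemes"]:
            continue
        total += call["duration"]
        negative += sum(s for ident, s in call["emotion_spans"]
                        if ident in NEGATIVE)
    return negative / total if total else 0.0  # higher -> lower evaluation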
[0442] As a result, the following becomes feasible using the present invention. For example, when a consumer inquiry about a product with model number "1X5 (いちえつくすご [/i/ch/i/e/cl/k/u/s/u/g/o/])" is received, the help desk operator repeats the name and executes a search on a device using the present invention. As a result, the manual of the retrieved "1X5" is displayed on the operator's screen, and the operator can answer the consumer's question. At this time, by recognizing the consumer's emotion and storing it in association in the information storage device, the emotional evaluation of a given product can be recorded quantitatively.
[0443] At this time, when the device using the present invention executes a search for the target product name, the criterion for the matching degree of the phoneme or phoneme-piece symbol string may be set to 60%, a list of products exceeding 60% constructed and displayed, and the operator may select the manual of the target product from that list.
[0444] Then, the emotion features, phoneme symbol strings, and phoneme-piece symbol strings associated with the word "1X5 (いちえつくすご [/i/ch/i/e/cl/k/u/s/u/g/o/])" can be recorded and analyzed. Product reliability can be evaluated quantitatively by analyzing the emotion appearance time across the group of audio information relating to the same recorded product number.
[0445] In this case, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side; correction information and symbol strings such as phoneme symbol strings, image features, audio features, and emotion identifiers relating to new product names and product genres may be transmitted, received, and distributed using markup languages described later such as XML and HTML, or RSS and CGI, and convenience can be achieved by combining these.
[0446] Of course, the device executing these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal; the psychological states of the operator and the customer may be analyzed to check that stress does not become excessive; and the content of the present invention may be implemented via a communication base station using them.
[0447] 《Example of web browser operation》
First, the user utters speech to the user's browser, and the feature quantities of the uttered speech are extracted. In the first method, these feature quantities are transmitted to the target device, and the device that receives them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature quantities. Then, based on the generated symbol strings, the matching control means is selected and executed.
[0448] In the second method, the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are generated within the user's browser, and the generated symbol strings are transmitted to the target device. The controlled device then selects and executes the matching control means based on the received symbol strings.
[0449] In the third method, phonemes and/or phoneme-piece symbols and emotion symbol strings are recognized based on feature quantities generated within the user's browser, the control content is selected based on the recognized symbol strings, and the control method is transmitted to the device to be controlled.
[0450] In the fourth method, the speech waveform is transmitted as-is from the user's browser to the controlling device; the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are recognized within that device, the control means is selected based on the recognized symbol strings, and the controlled device executes the selected control.
[0451] At this time, when the user's emotion is accompanied by anger, an apologetic message may be presented to the user by voice or text. Emotion identifiers can likewise be obtained from speech by feature extraction and symbolization, and the same applies to sound and video features and identifiers such as environmental sounds.
[0452] Then, a method is conceivable in which a new variable or attribute named, for example, "pronunciation" is added to the reference tag indicating a link, the speaker's utterance is converted into phonemes, the web page is searched, and the browser moves to the matching page.
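This link-matching idea may be sketched as follows, assuming each link carries a hypothetical "pronunciation" attribute holding its phoneme string; the use of difflib as the similarity measure is an illustrative placeholder for a proper phoneme-matching method.

# Sketch of matching a spoken query against links that carry a
# hypothetical "pronunciation" attribute holding a phoneme string.
# difflib is used here only as a placeholder similarity measure.

import difflib

def best_link(spoken_phonemes, links):
    """links: (href, pronunciation) pairs extracted from the page."""
    def similarity(a, b):
        return difflib.SequenceMatcher(None, a, b).ratio()
    return max(links, key=lambda link: similarity(spoken_phonemes, link[1]))

links = [("/news", "/n/y/u/u/s/u/"), ("/sports", "/s/u/p/o/o/ts/u/")]
print(best_link("/n/y/u/s/u/", links)[0])  # -> "/news"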
[0453] In this way, by using markup languages described later such as XML and HTML, or CGI, voice operation can easily be realized in systems such as RSS, blogs, and catalog sales on the web, by searching for phoneme matches without recognizing meaning or context.
[0454] In this case, matching between symbol strings, feature quantity extraction, and symbol string recognition may be performed in the browser-side information processing terminal by a background process such as a service or daemon, without the browser processing them directly.
[0455] Also, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side; correction information and symbol strings such as phoneme symbol strings, image features, audio features, and emotion identifiers relating to new tags, variables, and attributes may be transmitted, received, and distributed using markup languages described later such as XML and HTML, or RSS and CGI, and convenience can be achieved by combining these.
[0456] Of course, the device executing these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented via a communication base station using them, or realized in combination.
[0457] 《Example of a car navigation device》
As an application of the example of search based on emotion and proper nouns and the example of audio/video search with multiple identifiers, combined with information distribution technology such as VICS, multivariate analysis of the dialog between the car navigation system and the user, based on position and accompanied by phoneme symbol strings, phoneme-piece symbol strings, and emotion identifiers, makes it possible to detect that a person's tone becomes calm or emotionally heightened at a specific location. By evaluating this together with traffic accident conditions, the occurrence of traffic accidents attributable to the user's emotional state can be analyzed, and by executing searches accordingly, a service can be implemented that announces a warning to the user in advance to call attention. In this regard, emotional stability may also be detected by evaluating the variation of emotion features in the voice features of frequently uttered words, and by analyzing these, danger prediction based on the emotional tendencies of users caught in traffic congestion, or monitoring of vehicle operation status, may be performed.
[0458] Also, when the phoneme string for "事故状況 (accident situation) [/j/i/k/o/j/o/u/k/y/o/u/]" is detected in in-vehicle speech and an accident vehicle is also detected by image recognition with an in-vehicle camera, that information may be transmitted to a base station, received via VICS, a mobile phone, or any other communication means, and a service implemented that changes the route selection; information transmitted from each vehicle may also be captured by roadside equipment such as Orbis units and relayed to the base station.
[0459] In this case, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side; symbol strings such as phoneme symbol strings, image features, and emotion identifiers relating to new place names, titles, addresses, and roads may be transmitted, received, and distributed using VICS, markup languages described later such as XML and HTML, or RSS and CGI, and convenience can be achieved by combining these.
[0460] Of course, the device executing these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented by searching via a communication base station or by searching standalone.
[0461] 《Example of a karaoke song selection and music search system》
As an application of the example of search based on emotion and musical scale, the example of audio/video search by character strings and identifiers, and the example of executing arbitrary processing by audio/video search with multiple identifiers, an embodiment in a karaoke and music sales system is described.
[0462] According to the present invention, song titles and hook-line lyrics are recorded as phoneme strings, phoneme-piece strings, and scale strings, and by searching for matches among them, they can be used for title search in karaoke. In addition to a karaoke-style feature configuration, it is possible to compare the appearance frequencies of scale symbols and search for items with a high matching rate, or to compare and search appearance distribution structures and appearance position distributions.
[0463] More specifically, for "a sad song by band XX", from the list extracted from the full song list by matching the performer name against the phoneme string or phoneme-piece string of "band XX", songs with a high in-song appearance frequency of the "sad" emotion identifier are retrieved; for "a sad song in the style of band XX", songs with a high appearance frequency of the "sad" identifier are retrieved among songs whose musical features resemble those of "band XX". User preferences may also be learned from the co-occurrence information obtained by such searches: after the user selects and plays a song, repeated selection or listening to the end may be interpreted as the user affirming the search result, while a single play or an immediate move to the next song may be interpreted as a negative judgment. Here, the query "band XX" may be spoken by voice and used in natural language processing, expanded into phonemes from character string input and searched, or searched as a character string while the similarity of musical features and emotion features is evaluated.
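The first query form above may be sketched as a filter-then-rank operation, as below; the song record fields and the matches() predicate are hypothetical.

# Sketch of the "sad song by band XX" query: filter songs by performer
# phoneme match, then rank by per-song frequency of the "sad" emotion
# identifier. Song record fields and matches() are hypothetical.

def sad_songs_by(artist_phonemes, songs, matches):
    """songs: dicts with 'artist_phonemes' and 'emotion_counts'."""
    hits = [s for s in songs
            if matches(artist_phonemes, s["artist_phonemes"])]
    return sorted(hits,
                  key=lambda s: s["emotion_counts"].get("sad", 0),
                  reverse=True)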
[0464] It is also possible to compare the appearance frequencies, appearance distribution structures, and appearance position distributions of emotion identifiers and search for items with a high degree of matching. Furthermore, by comparing the appearance frequencies and appearance position distributions of phoneme symbols and phoneme pieces, music with a similar lyric composition or containing specific keywords can be searched, and a service that sells music based on these search results can also be implemented. In addition, note transition information and chord or chord-progression transition information may be used as feature quantities to evaluate the degree of matching of musical structure, or feature quantities may be extracted from such transition information to construct a discriminant function so that identifiers can be determined.
[0465] Also, taking advantage of the fact that emotion recognition results tend to differ from one piece of music to another, the tendencies of the emotion identifiers generated per piece may be extracted by statistical processing per music genre and subjected to multivariate analysis to form music genre identifiers, or the similarity of emotion identifier appearance tendencies between pieces may be evaluated as a distance. A search according to the user's sensibility parameters may then be performed based on the present invention, and music with a close sensibility tendency retrieved and presented to the user, enabling a service that recommends music matching the user's tastes.
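The distance evaluation of emotion identifier appearance tendencies may be sketched as a comparison of normalized identifier histograms; the Euclidean distance below is one illustrative choice of measure.

# Sketch of distance evaluation between the emotion identifier
# distributions of two pieces: compare normalized identifier histograms.
# Euclidean distance is one illustrative choice of measure.

import math

def emotion_distance(counts_a, counts_b):
    keys = set(counts_a) | set(counts_b)
    na = sum(counts_a.values()) or 1
    nb = sum(counts_b.values()) or 1
    return math.sqrt(sum(
        (counts_a.get(k, 0) / na - counts_b.get(k, 0) / nb) ** 2
        for k in keys))  # smaller distance -> closer sensibility tendency

print(emotion_distance({"sad": 8, "joy": 2}, {"sad": 7, "joy": 3}))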
[0466] In this way, by combining emotion identifiers, phoneme symbol strings, and phoneme-piece symbol strings with search and sales methods that conventionally used recognized scale information alone, it becomes possible to search for musical works whose lyrics, melodic tendencies, emotional tendencies, and voice quality tendencies suit the user's tastes.
[0467] In this case, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side; symbol strings such as phoneme symbol strings, scale symbol strings, and emotion identifiers relating to new song titles, lyrics, and melodies may be transmitted, received, and distributed using markup languages described later such as XML and HTML, or RSS and CGI, and convenience can be achieved by combining these.
[0468] Of course, the device executing these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented via a communication base station using them.
[0469] Note that the prior-art search by humming and lyrics is separated into an act of humming and an act of uttering lyrics, and therefore differs from the search based on co-occurrence information in the present invention.
[0470] 《Example of a product search and order system》
This is an application of voice operation according to the present invention: the user utters speech to an information terminal and/or a terminal-side browser, and the feature quantities of the uttered speech are extracted. In the first method, these feature quantities are transmitted to the target device, and the distribution device that receives them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature quantities. Then, based on the generated symbol strings, the matching control means on the distribution device side is selected and executed.
[0471] In the second method, the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are generated within the information terminal and/or terminal-side browser, and the generated symbol strings are transmitted to the target distribution device. The distribution device side then selects and executes the matching control and distribution means based on the received symbol strings.
[0472] In the third method, phonemes and/or phoneme-piece symbols and emotion symbol strings are recognized based on feature quantities generated within the information terminal and/or terminal-side browser, the control content is selected based on the recognized symbol strings, and the control method is transmitted to the distribution device that executes it. The distribution device that receives the control method performs the intended processing based on it and provides the information.
[0473] In the fourth method, the speech waveform is transmitted as-is from the information terminal and/or terminal-side browser to the controlling device; the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are recognized on the controlling distribution device side, the control means is selected based on the recognized symbol strings, and the distribution device side executes the selected control.
[0474] At this time, when the user's emotion is accompanied by anger, an apologetic message may be presented to the user by voice or text. Emotion identifiers can likewise be obtained from speech by feature extraction and symbolization, the same applies to sound and video features and identifiers such as environmental sounds, and searches may be performed by combining methods such as those of the karaoke music search example.
[0475] In this case, phoneme symbol strings may be embedded in CGI or HTML for the displayed products, and by performing search evaluation based on those symbols, the system may move to a matching page, order a product, or display product details. These search targets may be any items bearing proper nouns, such as books, AV content, digital materials, cosmetics, pharmaceuticals, foods, and industrial products such as automobiles.
[0476] A method is also conceivable in which each proper noun is uttered by multiple speakers so that the same phoneme is given multiple phoneme and phoneme-piece recognition templates, thereby improving the search rate for the phoneme strings of the pages used. An applied system such as an expert system may also be constructed using part of the processing procedure of such an ordering system.
[0477] In this case, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side, and symbol strings such as phoneme symbol strings, image features, audio features, and emotion identifiers relating to new products and product genres may be transmitted, received, and distributed using markup languages described later such as XML and HTML, or RSS and CGI.
[0478] Of course, these services themselves may be content distribution services for movies, photographs, novels, and the like, or digital material distribution services or product sales services; the device executing them may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented via a communication base station using them.
[0479] 《Example of a voice service》
For example, when executing a service that reads books aloud in connection with book sales, any line of dialog or text position can be searched by using phonemes and phoneme pieces, or by evaluating the embedded emotion with recognition-based identifiers.
[0480] In this case, when speech synthesis is used for the reading, a service may be implemented in which the utterance dictionary or template is switched to the voice of a favorite celebrity by changing the speaker's per-phoneme-piece speech synthesis template, and the utterance dictionary or template for the speech synthesis parameters of the reading may be varied with changes in emotion; combining these provides convenience.
[0481] Also, by applying this voice service, templates and parameters for speech synthesis of robots and agents may be distributed so that the user's robot or agent speaks with emotion in the voice of a celebrity matching the user's tastes, or controls home appliances; and by applying this voice service to compare the user's utterances with utterances provided by the service, a conversation learning service can also be realized.
[0482] 《Example of a remote control enabling voice operation》
First, the user utters speech to the remote control, and the feature quantities of the uttered speech are extracted. In the first method, these feature quantities are transmitted to the target device, and the device that receives them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature quantities. Then, based on the generated symbol strings, the matching control means is selected and executed.
[0483] In the second method, the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are generated within the remote control, and the generated symbol strings are transmitted to the target device. The controlled device then selects and executes the matching control means based on the received symbol strings.
[0484] In the third method, phonemes and/or phoneme-piece symbols and emotion symbol strings are recognized based on feature quantities generated within the remote control, the control content is selected based on the recognized symbol strings, and the control method is transmitted to the device to be controlled.
[0485] In the fourth method, the speech waveform is transmitted as-is from the remote control to the controlling device; the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are recognized within that device, the control means is selected based on the recognized symbol strings, and the controlled device executes the selected control.
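Taking the second method ([0483]) as an example, the matching on the controlled device side may be sketched as a control-dictionary lookup, as below; the command table, the phoneme strings, and the 0.6 threshold are hypothetical illustrations, and difflib is a placeholder matcher.

# Sketch of the second method ([0483]): the remote control sends a
# recognized phoneme string, and the controlled device matches it against
# a control dictionary. Command table, phoneme strings, and the 0.6
# threshold are hypothetical; difflib is a placeholder matcher.

import difflib

CONTROL_DICT = {"/d/e/N/g/e/N/": "power_toggle",   # "dengen" (power)
                "/o/N/r/y/o/u/": "volume"}         # "onryou" (volume)

def dispatch(phonemes, threshold=0.6):
    best, best_score = None, 0.0
    for key, command in CONTROL_DICT.items():
        score = difflib.SequenceMatcher(None, phonemes, key).ratio()
        if score > best_score:
            best, best_score = command, score
    return best if best_score >= threshold else None  # ignore poor matches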
[0486] At this time, when the user's emotion is accompanied by anger, an apologetic message may be presented to the user by voice or text. Emotion identifiers can likewise be obtained from speech by feature extraction and symbolization, and the same applies to sound and video features and identifiers such as environmental sounds.
[0487] Such remote control technology may be introduced into a robot to control home appliances, or incorporated into a car navigation system for control. In this case, arbitrary new control symbol string information may be distributed to the operated device using markup languages described later such as RSS, HTML, and XML, or CGI, and a remote control or portable terminal that uses phonemes, phoneme pieces, or speech waveforms may receive or transmit updated phoneme symbol string information via infrared or radio.
[0488] In this case, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side; correction information and symbol strings such as phoneme symbol strings, image features, audio features, and emotion identifiers relating to new functions may be transmitted, received, and distributed using markup languages described later such as XML and HTML, or RSS and CGI, and convenience can be achieved by combining these.
[0489] Of course, the device executing these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented via a communication base station using them.
[0490] 《Example of use in a portable terminal》
First, the user utters speech to the portable terminal, and the feature quantities of the uttered speech are extracted. In the first method, these feature quantities are transmitted to the target device, and the device that receives them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature quantities. Then, based on the generated symbol strings, the matching control means is selected and executed.
[0491] In the second method, the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are generated within the portable terminal, and the generated symbol strings are transmitted to the target device. The controlled device then selects and executes the matching control means based on the received symbol strings.
[0492] In the third method, phonemes and/or phoneme-piece symbols and emotion symbol strings are recognized based on feature quantities generated within the portable terminal, the control content is selected based on the recognized symbol strings, and the control method is transmitted to the device to be controlled.
[0493] In the fourth method, the speech waveform is transmitted as-is from the portable terminal to the controlling device; the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are recognized within that device, the control means is selected based on the recognized symbol strings, and the controlled device executes the selected control.
[0494] At this time, when the user's emotion is accompanied by anger, an apologetic message may be presented to the user by voice or text. Emotion identifiers can likewise be obtained from speech by feature extraction and symbolization, and the same applies to sound and video features and identifiers such as environmental sounds.
[0495] Also, when the infrared function of a portable terminal is used to control target devices such as DVD recorders, televisions, and air conditioners, and the IP address of such a device is acquired via infrared or wireless LAN so that its control information can be acquired and the device controlled via the mobile Internet or an indoor LAN, voice control from a portable terminal or mobile phone can be realized by acquiring a control list based on the present invention.
[0496] Of course, any of the following methods may be used: the portable terminal transmits its own IP address or mail address to the target device, and the target device connects to an arbitrary port based on that IP address and transmits the control information; the target device attaches the control information to a mail and sends it to the portable terminal; or the control information is acquired simply by exchanging infrared signals. A search service may also be implemented by performing phoneme recognition, phoneme-piece recognition, emotion recognition, environmental sound recognition, and scale recognition on input from the portable terminal's microphone.
[0497] Here, the dictionary that converts phoneme strings or phoneme-piece strings into processing procedures may reside either on the terminal side or on the distribution base station side, and symbol strings such as correction information, phoneme symbol strings for new content, program genres, and actor names, together with image features, voice features, and emotion identifiers, may be transmitted, received, and distributed using a markup language described later, such as XML or HTML, or using RSS or CGI; combining these improves convenience.
[0498] Furthermore, by processing calls on the mobile terminal as they occur and evaluating emotional fluctuations and utterance content, services become possible such as: when emotions such as anger or sadness, or fatigue, are frequently observed in the speech of the terminal user during a conversation, presenting the user after the call with content intended to cheer them up, for example a good neighborhood restaurant or uplifting music, illustrations, or video works; or running advertisements based on the phonemes being uttered.
[0499] The mobile terminal may also be provided with several microphones, both low-performance and high-performance ones, so that high-quality audio can be recorded for recognition; alternatively, the sampling rate at recording time may be raised for recognition while the signal is converted down to a lower sampling rate for voice-call transmission, forming the call voice information and generating the compressed voice information used for the call.
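A minimal numpy sketch of this dual-rate idea: the full-rate microphone signal is kept for recognition while a crudely low-passed, decimated copy is produced for call transmission. The rates and the moving-average filter are illustrative assumptions; an actual terminal would use a proper anti-aliasing filter and a speech codec.

```python
import numpy as np

def downsample_for_call(x, src_rate=48_000, dst_rate=8_000):
    """Crude moving-average low-pass followed by decimation."""
    factor = src_rate // dst_rate            # 6 with these example rates
    kernel = np.ones(factor) / factor        # rough anti-aliasing filter
    return np.convolve(x, kernel, mode="same")[::factor]

rng = np.random.default_rng(0)
mic = rng.standard_normal(48_000)            # one second of dummy 48 kHz audio
recognizer_input = mic                       # full rate goes to phoneme recognition
call_audio = downsample_for_call(mic)        # 8 kHz copy goes to the voice call
print(len(recognizer_input), len(call_audio))  # 48000 8000
```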
[0500] Of course, the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a mobile phone information terminal, or a wearable information terminal, and the content of the present invention may be implemented through such devices via a communication base station.
[0501] 《Examples of use in robots and agents》
For example, a robot or computer agent interface can perform detection equivalent to the detect-and-record function described above using the image recognition and voice recognition functions associated with its attached camera, microphone, and recording device. While watching television together with the user, it can execute arbitrary processing in response to a particular entertainer, a particular keyword, or a particular emotion, so that, for example, the robot laughs or cries together with the user, or controls other nearby devices in accordance with the user's preferences. As with the other processes, a method may also be used in which feature extraction and symbolization are performed and a request is then sent to a core server.
[0502] More specifically, while the user is viewing content, a robot or agent using the present invention observes the identifiers and feature quantities extracted from the content together with the phoneme, phoneme-piece, and emotion features and identifiers derived from the user's facial expressions and utterances, and can thereby observe the co-occurrence of the user's features and identifiers with those of the content. In doing so, identifiers and feature quantities relating to emotions and phonemes may be acquired from a content playback device using the present invention, or they may be extracted from the content and the user by an indexing function within the robot or agent itself.
[0503] In this way, a "comedy program user situation evaluation function" can be constructed from the feature quantities and identifiers collected, for example, during comedy programs. When the content exhibits "comedy program features", the user exhibits "delighted features", and the feature quantities and identifiers of both user and content lie close to the centroid of the feature quantities in this evaluation function, the robot or agent can be made to express the emotion "fun", producing a staged pseudo-emotion. Of course, other emotions such as joy, anger, sorrow, and pleasure may likewise be learned as situations from the co-occurrence of the feature quantities and identifiers obtained from the user and the content.
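A toy sketch of this centroid test follows. The centroid, feature dimensions, and threshold are invented placeholders; in practice they would be learned from the co-occurrence of content and user identifiers collected during comedy programs.

```python
import numpy as np

comedy_centroid = np.array([0.8, 0.7, 0.9, 0.6])  # hypothetical learned centroid
THRESHOLD = 0.5                                    # hypothetical closeness bound

def pseudo_emotion(content_features, user_features):
    """Express 'fun' when the joint observation lies near the centroid."""
    joint = np.concatenate([content_features, user_features])
    return "fun" if np.linalg.norm(joint - comedy_centroid) < THRESHOLD else "neutral"

print(pseudo_emotion(np.array([0.75, 0.72]), np.array([0.88, 0.61])))  # fun
```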
[0504] Furthermore, using an RFID tag, JAN code, or barcode attached to an arbitrary object as an identifier, the device may learn as feature quantities the image of the object from which the identifier is obtained, the sound it makes on impact, and the sound it makes when operated, and may record and learn, in association with the identifier, properties such as the object's mass and weight, whether it can be carried, whether it should be avoided in a collision, and what emotion the user shows when it is presented; in this way the device learns automatically and autonomously improves its behavior and operating efficiency. The device may also recognize a specific person in video content and, when the frequency or intensity of the emotional features that person expresses deviates from the average of the emotions expressed by other people by, for example, 3σ or more, identify that person's personality traits and use them as correction values in dialogue and communication with that person, or guess at that person's character. It may further extract facial feature quantities as emotions change and autonomously learn the types of facial expressions; record the symbol strings obtained by phoneme or phoneme-piece recognition of speech uttered at the same time and analyze the phoneme and phoneme-piece tendencies contained in emotional utterances; or simultaneously recognize environmental sounds and learn the changes in facial expression caused by the user's reaction to external sounds such as noise or explosions.
[0505] Based on information learned by such methods, the knowledge database of a virtual personality such as a CG character, robot, or agent can be used to vary its reactions according to the user's sensibilities and responses, or to change its facial expressions. Broadcast information such as television schedules may be acquired using external information such as EPG, BML, RSS, or teletext so as to provide entertainer and current-affairs information matching the user's tastes; the user's preferences may be analyzed from the playback counts and viewing times of the information recorded by the video search-and-record means described above; and, as in the mobile phone embodiment, a robot using the present invention may acquire the control methods of surrounding devices by infrared communication or wireless LAN and control those devices according to the user's voice, improving the convenience of device control, or may acquire the identifiers and feature quantities of the information currently being displayed.
[0506] By evaluating and learning information based on the co-occurrence of multiple identifiers and feature quantities in this multi-dimensional way, the information can serve as a knowledge database for robots and agents. Learning the information needed for communication with people from video and music makes it possible to realize more versatile robot and agent interfaces. The co-occurrence of the various sensor inputs, image information, acoustic information, and voice information accompanying the robot's movements may also be evaluated so that the robot learns autonomous movements, acts autonomously according to the learning results, or gives instructions to people according to the learning results; the results may likewise be used as the knowledge database of a character or NPC serving as a virtual personality within a game.
[0507] Here too, the dictionary that converts phoneme strings or phoneme-piece strings into processing procedures may reside either on the terminal side or on the distribution base station side, and symbol strings such as phoneme symbol strings, image features, voice features, and emotion identifiers relating to the robot's correction information, new information, and functions may be transmitted, received, and distributed using a markup language described later, such as XML or HTML, or using RSS or CGI; combining these improves convenience.
[0508] Of course, the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented through such devices via a communication base station.
[0509] 《Example of a medical analysis device》
Next, an example of an analysis device for medical applications will be described. In addition to the phonemes, phoneme pieces, characters, facial expressions, and gestures extracted from voice and images according to the present invention, a pulse sensor, electroencephalograph, myoelectric sensor, skin resistance sensor, scale, sphygmomanometer, and thermometer are used, and the emotions accompanying the observed person's utterances, together with the feature quantities obtained from these sensors such as brain waves and pulse, are recorded while being indexed according to the present invention.
[0510] Next, using the present invention, the co-occurrence of the feature quantities and identifiers is observed, analyzed, and learned; multivariate analysis of the electroencephalogram, blood pressure, body temperature, body weight, pulse, and facial expression tendencies accompanying a specific emotion is performed to classify them and extract biases, and experts assign identifiers for psychological states based on the tendencies learned and classified according to the present invention.
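One way such per-emotion classification of sensor tendencies might be realized is sketched below: sensor vectors (here pulse, systolic blood pressure, and body temperature, all invented dummy data) are grouped by the co-occurring emotion identifier, a mean and covariance are fitted per emotion, and new observations are scored by squared Mahalanobis distance.

```python
import numpy as np

rng = np.random.default_rng(1)
observed = {  # emotion identifier -> sensor vectors (pulse, blood pressure, temp)
    "anger": rng.normal([95, 140, 36.9], 1.0, size=(50, 3)),
    "calm":  rng.normal([65, 115, 36.4], 1.0, size=(50, 3)),
}
models = {e: (x.mean(axis=0), np.linalg.inv(np.cov(x, rowvar=False)))
          for e, x in observed.items()}

def classify(v):
    """Pick the emotion whose fitted distribution lies closest to v."""
    def sq_mahalanobis(e):
        mu, icov = models[e]
        d = v - mu
        return float(d @ icov @ d)
    return min(models, key=sq_mahalanobis)

print(classify(np.array([93.0, 138.0, 37.0])))  # expected: anger
```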
[0511] By executing the re-learning used in the present invention with indexes added on the basis of the information classified in this way, a device that performs psychological tendency analysis can be constructed. By pairing with the user on the basis of the analyzed tendencies, one can also construct a device that takes records as emotions, utterances, brain waves, pulse, skin resistance, myoelectric currents, blood pressure tendencies, body temperature tendencies, and body weight tendencies change, and that uses this information as reference material for counseling or psychoanalysis; used together with various sensors, it can evaluate co-occurrence information concerning body fluids such as urine and blood, physical condition, and components such as skin, hair, and excreta so as to organize diagnostic reference information; and a device can be constructed that observes the moans and behavior of seriously ill patients and detects changes in their condition.
[0512] Likewise, by extracting and analyzing biases in human movement with the same method, abnormalities of the spine or of gait can be detected, and the state of improvement after treatment of injuries such as fractures can be recorded and analyzed, so that a device can be constructed that quantitatively evaluates the effect of medical treatment. Such information may also serve as the knowledge information database used by the virtual personality of an information processing device that attempts counseling or psychoanalysis through dialogue with people; control information for medical devices such as prosthetic legs and prosthetic hands may be constructed and extracted from co-occurrence information on body movement using image information and acoustic information; and a health management system may be constructed based on changes in information related to the health of patients and users.
[0513] 《International phoneme symbols, language-specific phoneme symbols, and examples of symbol conversion between phonemes and phoneme pieces》
Next, in searches that evaluate such co-occurrence, phoneme feature tendencies differ from language to language when searching for content spoken in a foreign language, and a technique to compensate for this is expected to become necessary as a future issue. A phoneme symbol conversion method using co-occurrence is therefore described.
[0514] The same combinations of co-occurrence information can also be used to convert between different languages, for example by evaluating the co-occurrence of Japanese phonemes and English phonemes. To resolve the changes and biases in phoneme symbols caused by language differences, a conversion template is maintained for each language as a standard template that takes attribution probabilities into account relative to the international phoneme notation standard; by implementing these templates as HMMs or as distance evaluation functions, the phoneme information spaces can be converted into one another, solving the problem through improved convenience.
[0515] By learning co-occurrence states, by specifying the respective identifiers and feature quantities as search conditions using phonemes, phoneme pieces, or character strings, and by indexing content information, not only can information be searched, recorded, distributed, and received based on complex subjective conditions, but international differences in pronunciation can also be accommodated.
[0516] In these content searches, considering that overseas content is also included, a method of converting between foreign-language phonemes and Japanese phonemes is realized as a countermeasure against changes in the pronunciation environment. The problem is addressed not only by phoneme conversion but also by preparing search conditions and performing searches through conversions between pieces of information that co-occur for humans, such as conversion between phoneme strings and image features, between sound effects and phoneme-piece strings, and between emotions and character strings.
[0517] Of course, since co-occurrence states are used, it is also possible to combine video features with each other, motion features with each other, identifiers related to video and audio, identifiers such as chords and environmental sounds, and emotion identifiers, so as to learn, for example, the co-occurrence of the voice features produced when the user shows a troubled expression and thereby construct a new "troubled attitude" identifier; it is likewise possible to construct and use a multi-layer Bayes classifier or multi-layer HMM with a layer that acquires voice and image identifiers, a layer that processes their co-occurrence, and a layer that processes the time-series transitions of the co-occurrence states.
[0518] Then, by classifying the outputs of such diverse recognition by ethnicity or language and providing an evaluation layer based on culture, identifiers derived from recognition results with different backgrounds can be converted into one another; conversion using probabilistically biased co-occurrence information makes it possible to convert information that is similar despite arising from different backgrounds.
[0519] The characteristic of this identifier conversion lies in performing cross-language phoneme conversion of language-dependent phoneme notation, using a phoneme conversion dictionary or phoneme-piece conversion dictionary based on co-occurrence information between international phoneme symbols, or between phoneme symbols with different language characteristics obtained by recognizing phonemes and phoneme pieces under different language environments, and in performing similar co-occurrence-based identifier conversion for identifiers other than those associated with language.
[0520] As in the "example of identifier reconstruction using the present invention" described above, if, for example, notation in international phonemes is used, HMM learning based on language-specific phoneme symbols is performed using the output probabilities of the international phoneme symbol HMMs as feature quantities. Conversely, for HMMs based on language-specific phoneme symbols, the output probabilities of the language-specific phoneme symbols are learned with reference to the international phoneme symbols. Similarly, methods that learn the conversion from phoneme pieces to phonemes and from phonemes to phoneme pieces may be used; instead of HMMs, methods using distances such as a Bayes discriminant function or the Mahalanobis distance, or methods using likelihoods or probabilities, may also be used. An application example using Japanese phonemes for the attribution probabilities to international phonemes is also shown.
[0521] Alternatively, the attribution probabilities of each language's phonemes with respect to the international phoneme symbols may be determined and a correspondence table created, so that phonemes are identified from feature quantities using the international phoneme dictionary and converted into phoneme symbol strings dependent on each language; or the attribution probabilities of phonemes between different languages may be determined and evaluated in descending order of attribution probability, converting the utterance features of another language, or of a speaker whose native language is another language, into the language features used by the device. Note that not only phoneme and phoneme-piece conversion and cross-language phoneme and phoneme-piece conversion, but also conversion between image identifiers and conversion between image identifiers and phoneme or phoneme-piece strings, can be configured based on the co-occurrence of these identifiers and implemented via the dictionary function described above.
[0522] The correspondence with international phoneme symbols may be associated with UPA numbers, IPA symbols, or UCS code numbers with reference to materials such as the International Phonetic Association's handbook of the International Phonetic Alphabet, and these symbols and numbers may be used as identifiers for managing phoneme conversion. When converting phoneme symbols between different languages into international phoneme symbols, one may use an attribution probability table together with the transition probabilities between preceding and following phonemes, re-learn the output probabilities and perform the symbol conversion with an HMM or the like, or construct an evaluation function such as a Euclidean distance function or a Bayes discriminant function from the co-occurrence information of output probabilities and feature quantities and use it as a symbol conversion function.
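A minimal sketch of such table-based conversion follows. The attribution probabilities below are invented for illustration and are not the values of FIG. 32; each source-language phoneme maps to candidate international phoneme symbols with a probability assumed to have been learned from co-occurrence.

```python
# Hypothetical attribution-probability table: Japanese phoneme -> candidates.
JA_TO_IPA = {
    "a": [("a", 0.90), ("ɐ", 0.10)],
    "r": [("ɾ", 0.70), ("l", 0.20), ("r", 0.10)],
    "u": [("ɯ", 0.80), ("u", 0.20)],
}

def convert(phonemes, table, n_best=1):
    """Symbol-by-symbol conversion keeping the n best candidates,
    evaluated in descending order of attribution probability."""
    return [sorted(table[p], key=lambda c: -c[1])[:n_best] for p in phonemes]

print(convert(["a", "r", "u"], JA_TO_IPA))
# [[('a', 0.9)], [('ɾ', 0.7)], [('ɯ', 0.8)]]
```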
[0523] In doing so, mutual conversion between phonemes and phoneme pieces within the same language may be realized using phoneme-piece-to-phoneme and phoneme-to-phoneme-piece conversion tables, and conversion between regional-language phoneme symbols and international phoneme symbols, conversion between regional-language phoneme-piece symbols and international phoneme-piece symbols, and the respective phoneme-piece/phoneme conversions may also be performed. For example, this processing allows identifier conversion from Japanese phoneme symbols to English phoneme symbols via international phoneme symbols, or directly from Japanese phoneme symbols to English phoneme symbols; identifiers of phonemes and phoneme pieces with different language dependencies can thus be converted into one another, identifier conversion can be performed using the phoneme symbol conversion table described above, and searches may be executed using the converted identifier strings. These conversion templates may take temporal transitions into account, using not only monograms (unigrams) but phoneme HMMs and phoneme-piece HMMs with bigram, trigram, or general N-gram structure.
[0524] With such methods, pipelines such as the following become possible:
image → English name → English phoneme string indexing → Japanese phoneme string conversion → Japanese utterance input → Japanese phoneme string search
Japanese utterance → Japanese phoneme string → Japanese keyword → English translation → English phoneme string → English-DB phoneme string search
These allow language-dependent utterances to be searched while being converted into other languages: a database of "actor photographs" compiled in the English-speaking world can, for example, be searched with a phoneme string based on the Japanese pronunciation of an actor's name, realizing arbitrary meta-co-occurrence searches. Of course, instead of actors, something like a sales catalog of merchandise such as cars, tools, flowers, and cosmetics may be constructed, and the results may also be used for list displays for searching.
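A toy sketch of the second pipeline above is shown next. Every lookup table is an invented placeholder standing in for a real recognizer, translator, phoneme converter, or indexed database.

```python
ja_to_keyword = {"t a n a k a": "田中"}          # Japanese phonemes -> keyword
keyword_to_en = {"田中": "Tanaka"}               # keyword -> English translation
en_to_phonemes = {"Tanaka": "t ah n aa k ah"}    # English name -> English phonemes
english_db = {"t ah n aa k ah": "photo_0042"}    # phoneme-indexed English DB

def cross_lingual_search(japanese_phonemes):
    keyword = ja_to_keyword[japanese_phonemes]
    return english_db[en_to_phonemes[keyword_to_en[keyword]]]

print(cross_lingual_search("t a n a k a"))      # photo_0042
```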
[0525] Next, a more specific procedure for constructing the conversion function will be described.
[0526] First, FIG. 28 and FIG. 29 describe a simple language-dependent search procedure using speech and character strings. According to this description, an input character string or speech waveform is converted in a language-dependent manner into identifiers used for the query, and the search is performed against a database indexed with language-dependent identifiers.
[0527] However, since identifiers such as phonemes and phoneme pieces differ from language to language, their notations do not necessarily match, and identifiers in general are just as varied. To make such diverse identifiers mutually searchable, the identifier symbol strings must be converted. For this identifier conversion, the identifier evaluation functions constructed for each language environment are applied to recognize the same utterance, the co-occurrence of the respective identifiers is observed, and by learning the identifiers, output probabilities, likelihoods, distances, and feature quantities produced as recognition results, symbol conversion between the identifiers can be realized, as in the procedures of FIG. 30 and FIG. 31.
[0528] To explain in more detail: utterance information in English and utterance information in Japanese, for example, are converted into feature quantities in the same way as at indexing or search time. Next, Japanese and English phoneme and phoneme-piece recognition is performed based on the feature quantities. As a result, the speech information dependent on each language is indexed with identifiers through the recognition process dependent on each language.
[0529] Next, the resulting identifier-string indexes are observed, together with the co-occurrence of the identifiers and the transitions of the output probabilities. As a result, for each phoneme or phoneme piece, the co-occurrence of Japanese phonemes and English phonemes when English is recognized with Japanese models can be extracted; similarly, co-occurrence information for Japanese phonemes recognized with English models can be constructed. Based on the co-occurrence information thus obtained, an evaluation function such as an HMM or a Bayes discriminant function is constructed as an English phoneme recognition function for utterances in Japanese phoneme strings, and the internal constants of the discriminant function are saved to a storage medium such as a file so that they can be reused.
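The co-occurrence table described here might be estimated as sketched below, assuming frame-aligned identifier sequences from the two recognizers (the sequences are invented); the normalized table is then persisted for reuse, as the paragraph suggests.

```python
import json

ja = ["s", "s", "a", "a", "N", "N"]      # dummy Japanese-model index, per frame
en = ["s", "s", "ae", "ah", "n", "n"]    # dummy English-model index, same frames

counts = {}
for j, e in zip(ja, en):                 # count frame-wise co-occurrences
    counts.setdefault(j, {})
    counts[j][e] = counts[j].get(e, 0) + 1

table = {j: {e: c / sum(row.values()) for e, c in row.items()}
         for j, row in counts.items()}   # normalize each row to probabilities

with open("ja_en_cooccurrence.json", "w") as f:
    json.dump(table, f, ensure_ascii=False, indent=2)   # saved internal constants
print(table["a"])                        # {'ae': 0.5, 'ah': 0.5}
```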
[0530] Using the discriminant function obtained in this way, English utterance information is recognized by English-adapted Japanese phoneme identification, making phoneme symbol conversion between different languages possible. FIG. 32 shows an example of conversion between international phoneme symbols and Japanese based on co-occurrence probabilities for the co-occurrence information used in such conversion; not only English and Japanese but also Chinese, German, French, Vietnamese, Spanish, and other languages may be combined, international phoneme symbols may be used as the intermediate phonemes, and evaluation functions capable of mutual conversion may be constructed.
[0531] Similarly, by indexing an arbitrary speech waveform simultaneously with phonemes and phoneme pieces, or simultaneously with Japanese phonemes, English phonemes, and international phoneme symbols, that is, by indexing with phonemes and phoneme pieces dependent on different languages, the co-occurrence can be observed and an HMM or Bayes discriminant function for recognition can be constructed.
[0532] Furthermore, by observing the co-occurrence of phonemes and phoneme pieces, methods such as constructing a multi-layer HMM as in FIG. 33 or constructing a multi-layer Bayes function are also possible; as an application, identifier conversion across different identifier characteristics, such as phoneme to phoneme piece and phoneme piece to phoneme as in FIG. 34 and FIG. 35, becomes feasible.
[0533] This may be a method that identifies the current phoneme from the output probabilities of the phoneme HMMs or phoneme-piece HMMs, or feeds those probabilities into a conversion HMM layer; or it may be a multi-layer Bayes method that evaluates the probability-exponent outputs of multiple Bayes functions in parallel and forms the array of distance information into a feature quantity.
[0534] Specifically, in the case of FIG. 33, to construct a phoneme conversion HMM that takes as input the transition states of the Japanese output probabilities, indexing is first performed with both phoneme evaluation functions, and then the output probabilities of the source phoneme HMMs are fed into HMMs classified by international phoneme symbol and trained. Based on this training, the output probabilities are evaluated and international phoneme symbols are assigned. For this learning, co-occurrence matrices and co-occurrence probabilities may be used; alternatively, the output probability values and feature quantities may be given as sample vectors of a Bayes function, a covariance matrix constructed from the multiple sample vectors, and the eigenvalues and eigenvectors determined to form the evaluation function.
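The covariance-and-eigenvector route mentioned at the end of [0534] might look like the following sketch: output-probability vectors observed for one target symbol (random dummy data here) are treated as sample vectors, and a Gaussian log-score built from the eigendecomposition serves as the evaluation function.

```python
import numpy as np

rng = np.random.default_rng(2)
# Dummy output-probability vectors observed when the target symbol is "a".
samples = rng.normal([0.7, 0.2, 0.1], 0.05, size=(100, 3))

mu = samples.mean(axis=0)
eigval, eigvec = np.linalg.eigh(np.cov(samples, rowvar=False))  # stored constants

def log_score(v, eps=1e-9):
    """Gaussian log-density up to a constant, evaluated in the eigenbasis."""
    z = eigvec.T @ (v - mu)                       # rotate onto principal axes
    return -0.5 * float(np.sum(z**2 / (eigval + eps)) + np.sum(np.log(eigval + eps)))

print(log_score(np.array([0.72, 0.18, 0.10])))    # high: consistent with "a"
print(log_score(np.array([0.10, 0.80, 0.10])))    # much lower: different symbol
```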
[0535] In the case of FIG. 34, from the output probability of the phoneme HMM in the current frame, the output probability in the next frame, and the output probability in the previous frame, during the transition from "silence" to the utterance of "A", a frame with a high probability of silence is labeled "Pau" and a frame where the output probability of "A" is increasing is labeled "A"; arranging these symbols in time series assigns a symbol based on the phoneme transition, "Pau-A-A". Since the first and last frames lack a preceding or following frame, they are padded with the same identifier as the frame itself.
[0536] In this case, considering the second frame with a simple model, "Pau" ranks first in the past frame, so the left part of the phoneme piece is "Pau"; the center frame is "A", which maximizes the average of the preceding and following output probabilities; and "A" ranks first in the right frame, so the right symbol of the phoneme piece is "A". Based on this time-series change, the phoneme piece is formed as "Pau-A-A". In this way, conversion between phoneme pieces and phonemes can also be realized.
[0537] In the case of FIG. 35, from the output probabilities of the phoneme-piece HMM in the current frame and in the next frame, during the transition from "silence" to the utterance of "A", the portions where silence accounts for a high share of the high-probability phoneme-piece symbols are labeled "Pau" and the portions where "A" accounts for a high share are labeled "A", and symbols are assigned accordingly. For example, in the second frame, "Pau-A-A" is 60%, "A-A-A" is 20%, and others are 20%; the others are omitted from the notation.
[0538] In this case, considering the second frame with a simple model, "Pau" occupies one third of the first-ranked phoneme piece, so Pau = (60 ÷ 3)%; "A" occupies two thirds of the first-ranked phoneme piece and all of the second-ranked one, so calculating A = (60 ÷ 3 × 2) + (20 ÷ 3 × 3)% gives Pau = 20% and A = 60%, and the symbol of the second frame becomes "A". In this way, conversion between phoneme pieces and phonemes can also be realized, and the evaluation formulas based on these identifiers may be composed in any combination, for example by also considering the first-ranked phonemes of the preceding and following frames.
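This FIG. 35 arithmetic works out as follows: each three-part phoneme piece contributes one third of its probability to each constituent phoneme, and the frame takes the phoneme with the largest share.

```python
frame = {"Pau-A-A": 0.60, "A-A-A": 0.20}   # remaining 20% omitted, as in the text

share = {}
for piece, prob in frame.items():
    for phoneme in piece.split("-"):       # three sub-symbols per piece
        share[phoneme] = share.get(phoneme, 0.0) + prob / 3

print({p: round(v, 2) for p, v in share.items()})  # {'Pau': 0.2, 'A': 0.6}
print(max(share, key=share.get))                   # 'A', the second frame's symbol
```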
[0539] When the international phoneme symbol conversion table of FIG. 32 is used, several methods of using international phoneme symbols as an intermediate form in the above conversion are conceivable. As shown in FIG. 36, FIG. 37, and FIG. 38, any combination may be used: performing recognition per language, converting to international phoneme symbols, and searching with international phoneme symbols; performing recognition with international phoneme symbols, converting to the phoneme symbols of each language, and searching with language-specific phoneme symbols; performing recognition per speaker language, converting to international phoneme symbols, converting to the content language, and then searching or detecting; or converting an input character string to international phoneme symbols, searching, and then converting to phonemes for each language for presentation.
[0540] When character strings in an arbitrary language such as Japanese, English, French, Spanish, German, Korean, Chinese, Indian, Islamic, Hebrew, Aramaic, Vietnamese, or Greek are the search target, a phoneme string or phoneme-piece string may be constructed based on the pronunciation of the character string, or the string may be converted into phonetic notation in an arbitrary language, such as hiragana or katakana, or into international phonetic notation, before being converted into feature quantities, and the co-occurrence can then be checked to realize phoneme conversion between languages; alternatively, phonemes and phoneme pieces dependent on each language may be converted by the method described above using international phoneme symbols as the intermediate form.
[0541] In this way, by evaluating information based on the same environment with the evaluation functions used for identifiers of different environments, such as language conditions, image conditions, and acoustic conditions, the co-occurrence is observed, and by probabilistically capturing the interchangeability of the identifiers, evaluation functions that adapt to environmental changes can be constructed. In particular, by learning and using the co-occurrence of identifiers for the conversion between international phoneme symbols and regional phoneme symbols and between phoneme and phoneme-piece identifiers, the diversification of the search of the present invention can be realized.
[0542] The examples of identifier-to-identifier conversion given above concern conversion between phonemes and between phoneme pieces and phonemes; however, as described in the embodiments of the present invention, conversion between an environment identifier and a phoneme identifier, such as between "the sound of waves" and an onomatopoeic phoneme string like "z/a/p/p/a/a/a/a/n/n/n/n/n", and conversion between an image identifier and a phoneme identifier string serving as its name, can likewise be derived from the transitions of temporal co-occurrence information, and may be used for the evaluation of similar shapes and the accompanying identifier conversion.
[0543] <Other general matters>
[0544] また、 GUIのクリックやポインティング操作、音声入力による指示操作により検索対 象や検出対象や学習対象となる代表的なサンプル画像や音声範囲や映像部品を指 定して選択したりしてもよいし、それらの組合せにより株の売買、商品の売買、オーク シヨン、予約、アンケート、コンテンツの視聴、コンテンツと利用者の共起状態の伝達 による視聴状況調査などを実施しても良いし、特徴の抽出や識別子の認識や検索や 学習や検出は端末側や基地局側や中継局の 、ずれで行っても良 ヽし、クラスタゃグ リツドなどの分散処理を実施しても良いし、感情識別子を用いて音声認識の感情に 伴う文脈遷移係数を変更したり、感情認識により認識された感情によって分岐する選 択肢を追加したり、利用者の声力 認識される感情識別子で処理の選択範囲や分岐 範囲を与えたり、キーワードに関連付けられた識別子とキーワードに関連付けられた 広告とを関連付けて提示したりしてもよい。  [0544] In addition, by selecting and selecting representative sample images, audio ranges, and video components to be searched, detected, and learned by GUI clicks, pointing operations, and instruction operations by voice input, etc. You may also conduct stock trading, product trading, auctions, reservations, questionnaires, content viewing, viewing status surveys by communicating the co-occurrence status of content and users, etc. Feature extraction, identifier recognition, search, learning, and detection may be performed at the terminal, base station, or relay station, or distributed processing such as clustering may be performed. Use emotion identifiers to change the context transition coefficient associated with emotions in speech recognition, add options that branch according to emotions recognized by emotion recognition, or use emotion identifiers that are recognized by the user's voice. The selection range or branch range may be given, or the identifier associated with the keyword and the advertisement associated with the keyword may be presented in association with each other.
[0545] また、利用者の指示により選択 ·指定する映像などの部品は MPEG4などで用 、ら れる画像オブジェクトの画像輪郭や 3次元画像における座標情報を利用して選択範 囲の境界を特定する方法を用いても良いし、音声などの無音部や周波数の偏りから 検出される境界を利用しても良いし、画像内の表示物体を発音することにより索引付 を行ったり、選択したりしても良いし、番組中の撮影場所に関して緯度、経度といった 位置情報を用いて観光案内などの宣伝を行っても良いし、認識された識別子や抽出 された特徴量に応じて広告や宣伝を実施したり、広告や宣伝を実行するための索引 付を行ったりしても良い。  [0545] In addition, video and other parts to be selected / designated by the user's instructions are used in MPEG4 etc., and the boundary of the selection range is specified using the image outline of the image object used or the coordinate information in the 3D image. You can use the method, you can use the boundaries detected from silent parts such as voice and frequency deviation, indexing by selecting the display object in the image, and selecting it. It is also possible to advertise tourist information using location information such as latitude and longitude regarding the shooting location in the program, and carry out advertisement and promotion according to the recognized identifier and extracted feature quantity Or indexing to run advertisements and promotions.
[0546] 検索結果や本発明を用いて索引付されたコンテンツ情報に対して識別子や特徴量 をマークアップ言語のタグや属性として追加し、配信することにより利用者の操作に 応じた関連するコンテンツの提供や広告の提供や商品の販売を行っても良ぐ利用 者の意図に基づ 、たコンテンツ操作やコンテンツ編集やコンテンツ利用が実施でき る。 [0546] Identifiers and feature quantities for search results and content information indexed using the present invention Is added as a markup language tag or attribute and distributed to provide related content according to the user's operation, provide advertisements, or sell products, Content operations, content editing, and content use.
[0547] また、検索結果を用いてコンテンツに関連した情報を補足したり注釈付けたりする ァノテーシヨン処理を行っても良 ヽし、本発明に用いられる共起情報を利用してコン テンッ検索ば力りではなぐネットワーク上の情報を自立的に収集'検索するボットシ ステムなどを構成してもよ 、。  [0547] Annotation processing that supplements or annotates content-related information using search results is also acceptable. If content search is performed using the co-occurrence information used in the present invention, You can configure a bot system that collects and searches information on the network independently.
[0548] この際、音素片とは時間軸上に音素の中心部や前部、後部と複数に分解された音 素記号であったり、第一の音素と第二の音素といった音素間や音素片間の遷移状態 における第一の音素力 第二の音素に変化する位置に基づく中間特徴を持つ音素 情報であったりしてもよいし、検出された感情や環境音や人物に基づいて認識しや す 、ように音素認識辞書や音素片テンプレートを切り替えたりしてもよ 、。  [0548] In this case, a phoneme segment is a phoneme symbol that is decomposed into a central part, a front part, a rear part, and a plurality of phonemes on the time axis, or between phonemes such as the first phoneme and the second phoneme. The first phoneme force in the transition state between the pieces may be phoneme information with intermediate features based on the position where it changes to the second phoneme, or it may be recognized based on the detected emotion, environmental sound, or person. You can also switch phoneme recognition dictionaries and phoneme templates like this.
[0549] また、本発明に用いられる識別子は前述の音素や音素片を含め感情特徴から抽出 された識別子であったり、画像特徴カゝら抽出された画像識別子であったり、音響特徴 をから抽出された楽器識別子や音階識別子、環境音識別子であるような情報同士を 同時に評価し検索や検出することにより、従来になく利便性の高い任意のサービスを 実現する情報処理装置と考えてもょ 、。  [0549] Further, the identifier used in the present invention is an identifier extracted from emotion features including the above-mentioned phonemes and phoneme pieces, an image identifier extracted from image features, or an acoustic feature. Think of it as an information processing device that realizes an unprecedented and convenient service by simultaneously evaluating, searching and detecting information such as musical instrument identifiers, scale identifiers, and environmental sound identifiers. .
[0550] また、特定の感情識別子や環境音識別子、音階識別子と!/ヽつた音声関連識別子 が発生している音声情報において、音素や音素片、各種識別子の認識のための特 徴量の偏りを検出し、同一音素における感情別の偏りを学習することで、任意の音素 に伴う感情の認識や環境音を伴う音素認識と同時に行えるように特徴量の再学習を 行って音素や音素片の認識率改善を行っても良いし、コンテンツ情報のフレーム内 共起情報に基づ!、てフレーム間確率遷移行列を求めてコンテンッ情報の検索に用 V、たり、コンテンツ情報の評価関数に用いたりしても良!、。  [0550] Also, in voice information in which specific emotion identifiers, environmental sound identifiers, scale identifiers, and! / Consistent speech-related identifiers are generated, the bias of the feature amount for recognition of phonemes, phonemes, and various identifiers , And learning the bias for each emotion in the same phoneme, re-learning the features so that it can be performed simultaneously with the recognition of emotions associated with any phoneme and the recognition of phonemes with environmental sounds. The recognition rate may be improved, based on the intra-frame co-occurrence information of the content information, and the inter-frame probability transition matrix is used to search the content information V, or used as an evaluation function for the content information. OK!
[0551] また、このような主観を伴う定量ィ匕困難な情報は認識のたびに量子化されるため必 ず累積誤差が生じ確率的な再現性を得る必要があり、本発明のように多様な識別子 と特徴量を用いることで、例えば利用回数が多いとか新規に多数の項目での検索登 録をしたといった利用者の肯定的な反応や行動により、 EPGや BML、 RSS、文字放 送や画像特徴及び識別子、音声特徴及び識別子と!/ヽつた各種識別子や各種特徴 量を含めた検出情報の共起状態を評価し、利用者による指定以外の識別子や特徴 量における共起情報を検索や検出や学習に用いることで「気づき」を演出し利用者 が収録し再生する頻度の高い情報を自律的に収集したり、収集した情報の評価を音 声や文字画像により提示し利用者の主観を反映させたりしても良い。 [0551] In addition, since quantitative information with such subjectivity is difficult to quantize every time it is recognized, it is necessary to obtain a cumulative error and to obtain probabilistic reproducibility. By using unique identifiers and feature quantities, for example, the number of times of use is high or search registration with many new items is performed. Detection information including EPG, BML, RSS, text broadcasting, image characteristics and identifiers, voice characteristics and identifiers, and various identifiers and various feature quantities, depending on the user's positive responses and actions such as recording By using co-occurrence information in identifiers and feature quantities other than those specified by the user for search, detection, and learning, information that is frequently recorded and played back by the user is produced. It may be collected autonomously or the evaluation of the collected information may be presented by voice or text image to reflect the user's subjectivity.
[0552] また、音声から得た特徴量に基づき認識された音素や音素片による記号列や感情 や音階、楽器音、環境音などの識別子及び Z又は映像から得た特徴量に基づき認 識された形状や色、文字、動作などの識別子、番組情報識別子を数量化分析 I類か ら IV類を踏まえた多変量解析により分類'多変量解析し本発明に追加的に用いる新 しい識別子として利用してもよぐ平均と分散から 3 σに帰属するか、 2 σに帰属する 力 1 σに帰属するかといった形で 3段階に評価して、検索結果の指標に用いても良 い。 [0552] In addition, recognition is based on feature strings obtained from symbol strings, emotions, scales, musical instrument sounds, environmental sounds, etc., and Z or video, which are recognized based on feature quantities obtained from speech. Classifiers such as shape, color, character, action, etc., and program information identifiers are categorized by multivariate analysis based on quantification analysis from class I to class IV, and used as a new identifier that is additionally used in the present invention. However, it can be used as an index for the search result by evaluating it in three stages, whether it belongs to 3σ from the mean and variance, or belongs to force 1σ, which belongs to 2σ.
[0553] また、これらの処理における特徴量はスカラやベクトル、マトリクス、任意階のテンソ 、つた多次元配列や複素数や四元数、八元数と!/、つた多元数によって構成され ていても良い。  [0553] In addition, the features in these processes may be composed of scalars, vectors, matrices, arbitrary-order tensors, multi-dimensional arrays, complex numbers, quaternions, octal numbers and! /, And multi-numbers. good.
[0554] このような方法により、人間の感覚を記号ィ匕した任意の情報同士を任意の時間幅を 持たせて共起状態の評価が可能となり、映像や音声を伴う情報の索引付けや検索、 検出が可能となるため従来では定量ィ匕による検索や検出が困難であった情報の検 索や検出が実現でき、人にやさしいサービスやそのようなサービスを実現する装置や 情報処理システムや通信基地局や携帯端末を実現することができるため、インターネ ットなどのポータルサイトや検索サイト、販売サイト、 SNS (Social Networking Site)、 知識を共有するエキスパートシステムサイト、オークションサイト、文字放送、情報を整 理するための多変量解析システム、スクリーニングシステム、ネットワーク上の信用情 報や認証情報を取り扱う認証サイト、ァグリゲートサービス、情報処理装置のグラフィ カル'インターフェースやタンジブル 'インターフェース、エージェントインタフェース、 ロボット、仮想現実、拡張現実などにおいて RSS (RDF Site Summary)等を用いて情 報を配信する際に本発明を用いるために、 XML (eXtens¾le Markup Language)や S OA (Service Oriented Architecture)、 RDF (Resource Description Framework)、 B ML (Broadcast Markup Language)、 SMIL (Synchronized Multimedia Integration La nguage)、 MathML (Mathematical Markup Language)、 Xpath (XML Path Language )、 SML (Simple (or Stupid or Software) Markup Language)、 MCF (Meta Contents F ramework)、 DDML (Document Definition Markup Language) ^ DSSSL (Document Style Semantics and Specification Language)、 DSML (Directory Services Markup L anguage)、 DTD (Document Type Definition)、 GML (Geography Markup Language) 、 SMIL (Synchronized Multimedia Integration Language)、 SGML (Standard Genera lized Mark-up Language)、 RDF (Resource Description Framework)等のメタ表現形 式の分類指標に本発明を用いてもよぐ SOAP (Simple Object Access Protocol)や UDDI (Universal Description, Discovery, and Integration)、 WDL (Web Services D escription Language)、 SVG (Scalable Vector Graphics)、 HTML (HyperText Marku p Language)、 URI (Uniform Resource Identifier)、 WAP (The Wireless Application Protocol)、 XQL(XML Query Language)、 VML (Vector Markup Language)、 URL ( Uniform Resource Locator)、 EPG (Electronic Program Guide)、 DLNA (Digital Livi ng Network Alliance)、 BML (Broadcast Markup Language)等の各種プロトコノレゃス タリブト、マークアップ言語、スキーマといった情報処理言語の変数、属性や任意のタ グ、属性、関数といった手段を任意に組合せてサービスを実施してもよい。この際、 修正情報や新規情報は修正や新規を示すタグや変数、属性、命令を用いたりして表 現や表記、実装されても良く前述の『マークアップ言語の解釈 ·変換 ·配信 ·制御装置 の例』を組合せることで利便性を図ることが出来る。 [0554] By such a method, it becomes possible to evaluate co-occurrence states by giving arbitrary time widths to arbitrary information that symbolizes human senses, and to index and search information with video and audio. Therefore, it is possible to detect and detect information that has been difficult to search and detect with quantitative keys, and it is possible to detect human-friendly services, devices that realize such services, information processing systems, and communications. Since base stations and mobile terminals can be realized, portal sites such as the Internet, search sites, sales sites, social networking sites (SNS), expert system sites that share knowledge, auction sites, text broadcasting, information Multivariate analysis system for screening, screening system, authentication site handling credit information and authentication information on network, aggregate server In order to use the present invention when distributing information using RSS (RDF Site Summary) etc. in the graphical 'interface' and tangible 'interface, agent interface, robot, virtual reality, augmented reality, etc. 
, XML (eXtens¾le Markup Language) and S OA (Service Oriented Architecture), RDF (Resource Description Framework), BML (Broadcast Markup Language), SMIL (Synchronized Multimedia Integration Language), MathML (Mathematical Markup Language), Xpath (XML Path Language), SML (Simple (or (Stupid or Software) Markup Language), MCF (Meta Contents Framework), DDML (Document Definition Markup Language) ^ DSSSL (Document Style Semantics and Specification Language), DSML (Directory Services Markup Language), DTD (Document Type Definition), GML (Geography Markup Language), SMIL (Synchronized Multimedia Integration Language), SGML (Standard Generalized Mark-up Language), RDF (Resource Description Framework), etc. SOAP (Simple Object Access Protocol), UDDI (Universal Description, Discovery, and Integration), WDL (Web Services Description Language), SVG (Scalable Vector Graphics), HTML (HyperText Markup Language), URI (Uniform Res ource Identifier), WAP (The Wireless Application Protocol), XQL (XML Query Language), VML (Vector Markup Language), URL (Uniform Resource Locator), EPG (Electronic Program Guide), DLNA (Digital Linking Network Alliance), BML Various protocols such as (Broadcast Markup Language), information processing language variables such as markup language, schema, attributes, arbitrary tags, attributes, functions, etc. may be used in any combination to implement the service. At this time, correction information and new information may be expressed, written, and implemented using tags, variables, attributes, and instructions that indicate correction or new information. Convenience can be achieved by combining “Example of device”.
[0555] The information input from outside is not limited to audio and video. Discriminant functions may be constructed using as feature quantities the inputs from health management instruments such as pulse meters and sphygmomanometers; from environmental observation devices such as taste sensors, odor sensors, human-presence sensors, heat sensors, humidity sensors, temperature sensors, and illuminance sensors; and from various analytical instruments such as Raman spectrometers, ultraviolet, infrared, and visible spectrophotometers, laser-ablation inductively coupled plasma mass spectrometers, qualitative and quantitative analyzers, X-ray fluorescence elemental analyzers, light-scattering laser tomography devices, Fourier transform infrared spectrophotometers, soft X-ray transmission devices, colorimeters, spectrolinos, cape detectors, thermal analysis operation systems, simultaneous differential thermal and thermogravimetric analyzers, differential scanning calorimeters, thermomechanical analyzers, thermal dilatometers, evolved gas analyzers, automatic thermal-analysis sample changers, humidity generators, plasma graft polymerization devices, ultraviolet graft polymerization devices, total organic carbon analyzers, gas chromatographs, liquid chromatographs, osmometers, dynamic viscoelasticity measuring devices, ionization mass spectrometers, ICP (Inductively Coupled Plasma) emission spectrometers, fluorescence spectrometers, automatic biochemical analyzers, automatic blood transfusion testing devices, automatic chemiluminescent enzyme immunoassay devices, photoelectric photometric emission spectrometers, and mass spectrometers. These may be recorded in association with video and audio information and used as criteria, variables, and attributes for indexing and for executing arbitrary processing, or as criteria, variables, and attributes for behavioral indicators of robots and the like, and such detection may also be used to detect and predict dangers arising to the human body.
[0556] In addition, the invention may be embodied in processing systems such as artificial intelligence or so-called artificial non-intelligence (chatbots) that include an information retrieval device; in information terminals and information processing apparatuses such as robots, personal computers, car navigation systems, backbone servers, and communication base stations; or in portable terminals such as mobile phones, wristwatches, accessory-shaped terminals, remote controls, PDAs, IC cards, intelligent RFID devices, and body-implanted terminals. Since the present invention is a practical application of search and detection techniques, it can be implemented on any apparatus that has information processing functions such as an arithmetic unit and a storage unit, including arbitrary information processing apparatuses and information distribution devices on a network.
[0557] In addition, in order to provide video, audio, and text as support information equipment of a city information support system, information support based on position may be implemented by association with position information obtained from a combination of GPS and a geomagnetic position detection system; a co-occurrence matrix or feature quantities based on arbitrary identifiers, or a distance function, may also be used.
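A distance function over position information could be as simple as the following sketch, which picks the nearest support item by great-circle distance; the item list and coordinates are hypothetical.

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance in kilometers between two (lat, lon) points."""
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 6371 * 2 * math.asin(math.sqrt(a))

    # Hypothetical support items tagged with positions (e.g. from GPS).
    items = [
        ("station guide video", 35.681, 139.767),
        ("museum audio tour",   35.717, 139.775),
    ]

    def nearest(lat, lon):
        return min(items, key=lambda it: haversine_km(lat, lon, it[1], it[2]))

    print(nearest(35.690, 139.770))  # returns the closest support item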
[0558] In addition, using the search device according to the present invention, user preference information may be constructed and analyzed on the basis of the search conditions a user frequently uses, or such information may be aggregated and subjected to multivariate analysis to establish new preference categories; here too, a co-occurrence matrix based on arbitrary identifiers or a distance function using feature quantities may be employed.
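A co-occurrence matrix over identifiers appearing together in search conditions might be accumulated as in the sketch below; the identifier names and log contents are invented for illustration.

    from collections import Counter
    from itertools import combinations

    # Hypothetical search logs: each entry is the set of identifiers in one query.
    logs = [
        {"phoneme:a", "emotion:joy"},
        {"emotion:joy", "scene:concert"},
        {"phoneme:a", "emotion:joy", "scene:concert"},
    ]

    # Count co-occurrences over unordered identifier pairs.
    cooc = Counter()
    for query in logs:
        for a, b in combinations(sorted(query), 2):
            cooc[(a, b)] += 1

    # Frequent pairs hint at a preference category (e.g. "joyful concert scenes").
    print(cooc.most_common(2))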
[0559] In addition, using the search device according to the present invention, advertising may be carried out by arbitrary means using a co-occurrence matrix, co-occurrence probability, or distance function based on search conditions that combine the aforementioned arbitrary identifiers and feature quantities; the similarity of preference information with other users may be evaluated and used for preference-based compatibility fortune-telling; and advertisements may be presented not only during a search but also while the device waits for user instructions or keeps the user waiting, for example during learning or while search results are being presented.
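Preference similarity between a user and candidate advertisements (or between two users) could then be scored with an ordinary similarity function, for example cosine similarity over sparse preference vectors, as in this hypothetical sketch:

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse preference vectors (dicts)."""
        dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    user = {"emotion:joy": 5, "scene:concert": 3}
    ads = {
        "live-ticket ad": {"scene:concert": 4, "emotion:joy": 2},
        "cookware ad":    {"scene:kitchen": 5},
    }
    print(max(ads, key=lambda name: cosine(user, ads[name])))  # -> live-ticket ad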
[0560] In addition, using the search device according to the present invention, a user may perform indexing by speaking while looking at the screen; reinforcement learning may be carried out by having users themselves evaluate the extracted preference and subjectivity information, thereby improving the accuracy of the extracted information; search results may be presented as a list with small images or videos such as thumbnails; and the degree of match with the search conditions may be expressed by color depth, brightness, the number of icons, or graph drawing, or by arranging the results in order of rank.
[0561] In addition, if phoneme information, emotion information, environmental sound information, musical scale information, and musical instrument information of information distributed using the identifiers described above are associated, and further image recognition information, face information, color space information, in-image object information, and recognized character string information are associated, and these are provided to an information processing apparatus that performs management such as registering information in a database, searching the database, correcting and changing each content file, and generating attached files associated with content files, then information registration and information retrieval can be realized easily and with high accuracy. At this time, by statistically converging the audio and video information input as the target of registration and search, efficient registration of recorded information and services accompanying the browsing of the registered content can also be provided.
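The kind of index this paragraph describes can be pictured as a mapping from content positions to sets of heterogeneous identifiers, queried by subset matching. The identifiers and content names below are invented, and a real implementation would presumably live in a database rather than in memory.

    # Hypothetical in-memory index: (content_id, position_sec) -> identifiers.
    index = {
        ("news-01", 12.0): {"phoneme:ohayou", "emotion:neutral"},
        ("news-01", 95.5): {"emotion:anger", "face:anchor"},
        ("drama-07", 40.0): {"emotion:joy", "scale:C-major"},
    }

    def search(required):
        """Return (content, position) keys whose identifier sets cover the query."""
        return [key for key, ids in index.items() if required <= ids]

    # A search condition already converted into identifiers.
    print(search({"emotion:joy"}))  # -> [("drama-07", 40.0)]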
[0562] In addition, evaluation functions and HMMs that generate identifiers as described above, or that analyze correlations among identifiers to construct categories, may be constructed, and by distributing and exchanging these evaluation functions and their configuration information among users, phoneme and phoneme piece information based on the associated audio information, emotion information, environmental sound information, musical scale information, musical instrument information, and the like, as well as image recognition information, face information, color space information, in-image object information, motion information, recognized character string information, and recognized symbol information, may be associated; information may be registered in a database and search conditions for database retrieval may be set; and by providing these to other information processing apparatuses, arbitrary information registration and retrieval can be realized easily and with high accuracy.
[0563] In addition, the motion features mentioned above are not limited to video: they may be sound source movement information of audio, reflected wave change information such as echo sounding, feedback or torque information from motors and pressure sensors, or robot operation information and contact information.
[0564] In addition, by statistically converging the audio and video information input as the target of registration and search as described above, efficient registration of the recorded information, as well as exchange and sale among users, can be performed, and more efficient services accompanying the browsing of the registered content can be provided.
[0565] In addition, symbol strings and identifiers based on phonemes or phoneme pieces as described above may be transmitted to other devices to change their processing content, or symbol strings based on phonemes or phoneme pieces may be received from other devices to modify or extend the device's own processing and control means. In this case, international phonetic symbols, phoneme pieces, or the phonemes and phoneme pieces of any language may be used.
[0566] In addition, as an evaluation criterion when constructing a new identifier as described above, since a typical recognition rate is around 60%, one may evaluate to what extent existing identifiers show a match rate exceeding 60%, and on the basis of that evaluation construct probability and likelihood functions such as co-occurrence matrices, co-occurrence probabilities, Bayesian models, and HMMs, or evaluation functions such as distance functions, and use them as criteria for new identifiers; when the average match rate of multiple identifiers is around 60%, a new evaluation function or identifier may be constructed; arbitrary symbol string matching methods such as DP, CDP, and reference-interval-free CDP may be combined to measure the degree of match between symbol strings; and learning efficiency may be improved by combination with neural networks, fuzzy logic, chaos, fractals, or genetic algorithms.
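For the symbol-string matching step, a plain DP match (edit distance normalized to a match rate) is enough to make the 60% criterion concrete; the phoneme strings and the threshold usage below are illustrative.

    def dp_match_rate(a, b):
        """DP (edit-distance) match rate between two symbol strings, in [0, 1]."""
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return 1 - d[m][n] / max(m, n)

    # Two phoneme identifier strings; a rate above the 60% criterion would let
    # the existing identifier stand in for the newly observed one.
    rate = dp_match_rate(list("konnichi"), list("konichi"))
    print(rate, rate > 0.6)  # -> 0.875 True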
[0567] In addition, the information processing apparatus is assumed to be composed of a device capable of information registration and retrieval, having, for example, storage units such as a main storage unit and an auxiliary storage unit, an information processing unit that performs evaluation and arithmetic processing of information, a communication unit that exchanges information with external devices, an input unit that receives user instructions, and an output unit that presents processing results to the user; personal computers, large computers, backbone servers, communication base stations, and the like can be considered. It is more preferable that the apparatus be capable of information analysis using a program that performs statistical analysis of the information recorded in the database.
[0568] In addition, a service using the present invention may be linked with a billing system to realize an information distribution service or agent service that provides added value to users and takes user psychology and users' hobbies and preferences into consideration.

[0569] In addition, by constructing an algorithm such that, when the user responds positively to results presented by a robot or agent, reinforcement learning increases the frequently affirmed content and the evaluation functions used for search, the robot or agent is given a "desire to exist", namely to be affirmed by the user, and a learning model in which the robot or agent learns autonomously may be constructed.
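The affirmation-driven learning model of paragraph [0569] can be reduced to a very small update rule, sketched here with invented function names and learning rate: each affirmation strengthens the evaluation function that produced the presented result.

    # Hypothetical weights over the evaluation functions used for search.
    weights = {"phoneme-match": 1.0, "emotion-match": 1.0}

    def feedback(function_name, affirmed, lr=0.1):
        """Strengthen or weaken an evaluation function after user feedback."""
        weights[function_name] *= (1 + lr) if affirmed else (1 - lr)

    feedback("emotion-match", affirmed=True)
    feedback("phoneme-match", affirmed=False)
    print(weights)  # results scored by "emotion-match" now rank slightly higher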
[0570] In addition, co-occurrence information obtained from learning results that is used infrequently may be automatically deleted depending on user evaluations and the amount of free space, or saved to an external storage device or a storage device at a communication destination while the local copy is deleted; alternatively, an index or discriminant function with simplified conditions may be retained locally and the full information acquired externally over a communication line when needed.
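A pruning pass of the kind described in paragraph [0570] might look like the following sketch; the threshold and store contents are assumptions.

    # Hypothetical co-occurrence store: identifier pair -> use count.
    cooc_store = {
        ("emotion:joy", "scene:concert"): 42,
        ("phoneme:a", "scene:rain"): 1,
    }

    def prune(store, min_uses=5):
        """Evict rarely used co-occurrence entries, returning the evicted ones."""
        evicted = {k: v for k, v in store.items() if v < min_uses}
        for k in evicted:
            del store[k]  # could instead be moved to external or remote storage
        return evicted    # candidates to re-fetch over a communication line later

    print(prune(cooc_store), cooc_store)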
[0571] In addition, the portable information terminal may be any so-called portable or wearable information terminal, for example a mobile phone, PDA (Personal Digital Assistant), notebook computer, wearable computer, wristwatch computer, or in-vehicle computer such as a car navigation system; the methods, forms, and shapes of movement, wearing, and holding are not limited. More specifically, it may be a mobile phone, car navigation system, DVD recorder, HDD recorder, video recording/playback device, music recording/playback device, STB, modem, FAX, telephone, personal computer, information distribution server, information distribution base station, in-store information terminal, cash register, POS (Point Of Sales system) terminal, ATM, projector, television, video recorder, or editing machine.
[0572] In addition, these information processing apparatuses and portable information terminals include, in whatever combination is necessary for execution, a feature extraction unit, a user information input unit, an information search unit, an information storage unit, and a query information transmission/reception unit; information between these processes can be exchanged and mutually searched over communication networks such as the Internet or an intranet via wireless LAN, infrared communication, mobile phone networks, ordinary LAN, wired lines, wireless lines, and the like. If a markup language is used, a markup language transmission/reception unit and a markup language interpretation unit may be added to the information input unit and the information output unit as necessary.
[0573] In addition, advertisement information for advertising may be acquired via a communication line, or advertisements attached to the content may be presented; records of advertisement states may be kept to verify advertising effectiveness; search co-occurrence information associated with a high rate of completed advertisements may be analyzed; advertisements whose co-occurrence information is highly similar to the co-occurrence information obtained at indexing time may be presented; and these capabilities may be provided as services.
[0574] In addition, arbitrary information in the storage unit may reside in the same apparatus, or may be acquired from another apparatus via a communication line, and a content search service may be provided.
[0575] In addition, in a search system based on the present invention, the database and the index search evaluation unit may be built into the information processing apparatus or external to it; in the external case, the system can be realized by making them communicable with the information processing apparatus by some means, whether wireless or wired.
[0576] Note that the present invention is merely an example and is not necessarily bound by the descriptions in this text; performance may be improved in combination with techniques described in any patent or document.

Claims

[1] An information processing apparatus comprising:
content information acquisition means for acquiring content information;
search condition input means for inputting a search condition; and
specifying means for specifying, from the content information acquired by the content information acquisition means, content information matching the search condition input by the search condition input means, or a position within that content information;
the apparatus further comprising:
feature quantity extraction means for extracting a feature quantity from the content information;
identifier generation means for generating an identifier, using an evaluation function, from the feature quantity extracted by the feature quantity extraction means;
index information storage means for storing the feature quantity and/or the identifier as index information in association with the content or a position within the content; and
search condition conversion means for converting the search condition input by the search condition input means into a feature quantity and/or an identifier,
wherein the specifying means includes search specifying means for specifying content or a position within content by detecting a match between the index information and the search condition using the feature quantity and/or identifier converted by the search condition conversion means.
[2] An information processing apparatus comprising:
content information acquisition means for acquiring content information;
search condition input means for inputting a search condition; and
specifying means for specifying, from the content information acquired by the content information acquisition means, content information matching the search condition input by the search condition input means, or a position within that content information;
the apparatus further comprising:
feature quantity extraction means for extracting a plurality of different feature quantities from the content information;
identifier generation means for generating a plurality of different identifiers, using evaluation functions, from the plurality of different feature quantities extracted by the feature quantity extraction means;
index information storage means for storing the plurality of different feature quantities and/or identifiers as index information in association with the content or a position within the content; and
search condition conversion means for converting the search condition input by the search condition input means into a plurality of different feature quantities and/or identifiers,
wherein the specifying means includes search specifying means for specifying content or a position within content by detecting a match between the index information and the search condition using the plurality of different feature quantities and/or identifiers converted by the search condition conversion means.
[3] The information processing apparatus according to claim 1 or 2, wherein:
the index information storage means further stores co-occurrence information constructed on the basis of feature quantities and/or identifiers acquired from the content, in association with the content or a position within the content;
the apparatus further comprises search condition co-occurrence information construction means for constructing, as search condition co-occurrence information, co-occurrence information based on the feature quantities and/or identifiers converted from the search condition by the search condition conversion means; and
the search specifying means includes co-occurrence search specifying means for specifying content or a position within content by detecting a match between the search condition co-occurrence information constructed by the search condition co-occurrence information construction means and the index co-occurrence information.
[4] 前記コンテンツには文字情報が含まれており、 [4] The content includes text information,
前記識別子生成手段は、前記文字情報に基づ!、て識別子を生成することを特徴と する請求項 1から 3のいずれか一項に記載の情報処理装置。  4. The information processing apparatus according to claim 1, wherein the identifier generating unit generates an identifier based on the character information.
[5] The information processing apparatus according to claim 4, further comprising dictionary information storage means for storing the character information and identifiers in association with each other as dictionary information, wherein the identifier generation means generates an identifier from the character information included in the content by using the dictionary information.
[6] The information processing apparatus according to any one of claims 1 to 5, further comprising standard pattern dictionary information storage means for storing, in the dictionary information storage means, the identifier and a standard pattern in association with each other as standard pattern dictionary information, and identifier-to-feature-quantity conversion means for converting the identifier into a feature quantity based on the standard pattern by using the standard pattern dictionary information.
[7] The information processing apparatus according to any one of claims 1 to 6, wherein the index information storage means further stores the feature quantity and/or the identifier in association with the content or a position within the content on the basis of the real time of the content information, and the specifying means is means for detecting a match between the index information and the search condition from content distributed in real time.
[8] The information processing apparatus according to any one of claims 1 to 7, wherein advertisement information associated by co-occurrence information and/or the index information is presented during a search of the content information and/or together with a search result or detection result.
[9] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from phoneme information used in phoneme recognition of the content, or a phoneme identifier generated from the phoneme information.
[10] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from phoneme piece information used in phoneme piece recognition of the content, or a phoneme piece identifier generated from the phoneme piece information.
[11] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from emotion information used in emotion recognition from the content, or an emotion identifier generated from the emotion information.
[12] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from auditory information used in recognition based on auditory information of the content, or an identifier generated from the auditory information.
[13] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from visual information used in recognition based on visual information of the content, or an identifier generated from the visual information.
[14] 前記コンテンツには文字情報が含まれており、 [14] The content includes text information,
前記特徴量抽出手段が抽出する複数の異なる特徴量若しくは識別子生成手段が 生成する識別量のうち、少なくとも 1つは文字情報力 抽出される特徴量若しくは文 字情報から生成される識別子であることを特徴とする請求項 2に記載の情報処理装 置。  Among the plurality of different feature quantities extracted by the feature quantity extraction means or the identification quantity generated by the identifier generation means, at least one is an identifier generated from the feature information extracted from the character information power or character information. The information processing apparatus according to claim 2, wherein the information processing apparatus is characterized.
[15] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means or of the plurality of different identifiers generated by the identifier generation means is a feature quantity extracted from program information, or an identifier based on program information.
[16] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means or of the plurality of different identifiers generated by the identifier generation means is a feature quantity extracted from sensor information, or an identifier based on sensor information.
[17] The information processing apparatus according to claim 3, comprising evaluation function reconstruction means for reconstructing the evaluation function from co-occurrence information constructed on the basis of feature quantities and/or identifiers acquired from the content.
[18] The information processing apparatus according to claim 3, comprising evaluation function reconstruction means for reconstructing the evaluation function from co-occurrence information constructed on the basis of the feature quantities and/or identifiers converted from the search condition by the search condition conversion means.
[19] The information processing apparatus according to claim 3, comprising search result co-occurrence information construction means for constructing co-occurrence information based on the result of content or a position within content being specified by the co-occurrence search specifying means, and evaluation function reconstruction means for reconstructing the evaluation function from the co-occurrence information constructed by the search result co-occurrence information construction means.
[20] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, phoneme feature quantities for use in phoneme recognition extracted from the content and/or phoneme identifiers obtained by phoneme recognition, in association with emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[21] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, phoneme piece feature quantities for use in phoneme piece recognition extracted from the content and/or phoneme piece identifiers obtained by phoneme piece recognition, in association with emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[22] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, in association with one another: phoneme feature quantities for use in phoneme recognition extracted from the content and/or phoneme identifiers obtained by phoneme recognition; emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition; and first feature quantities for use in first recognition extracted from the content and/or first identifiers obtained by the first recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[23] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, in association with one another: phoneme piece feature quantities for use in phoneme piece recognition extracted from the content and/or phoneme piece identifiers obtained by phoneme piece recognition; emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition; and first feature quantities for use in first recognition extracted from the content and/or first identifiers obtained by the first recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[24] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, phoneme feature quantities for use in phoneme recognition extracted from the content and/or phoneme identifiers obtained by phoneme recognition, in association with first feature quantities for use in first recognition extracted from the content and/or first identifiers obtained by the first recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[25] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, phoneme piece feature quantities for use in phoneme piece recognition extracted from the content and/or phoneme piece identifiers obtained by phoneme piece recognition, in association with first feature quantities for use in first recognition extracted from the content and/or first identifiers obtained by the first recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[26] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition, in association with first feature quantities for use in first recognition extracted from the content and/or first identifiers obtained by the first recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[27] The information processing apparatus according to any one of claims 22 to 26, wherein the first identifier and/or first feature quantity is an identifier and/or feature quantity based on auditory information and/or visual information and/or character information and/or sensor information and/or program information.
[28] A program for a computer in an information processing apparatus comprising a content information acquisition function for acquiring content information, a search condition input function for inputting a search condition, and a specifying function for specifying, from the content information acquired by the content information acquisition function, content information matching the search condition input by the search condition input function or a position within that content information, the program causing the computer to realize:
a feature quantity extraction function for extracting a feature quantity from the content information;
an identifier generation function for generating an identifier, using an evaluation function, from the feature quantity extracted by the feature quantity extraction function;
an index information storage function for storing the feature quantity and/or the identifier as index information in association with the content or a position within the content; and
a search condition conversion function for converting the search condition input by the search condition input function into a feature quantity and/or an identifier,
wherein the specifying function realizes a search specifying function for specifying content or a position within content by detecting a match between the index information and the search condition using the feature quantity and/or identifier converted by the search condition conversion function.
[29] A program for a computer in an information processing apparatus comprising a content information acquisition function for acquiring content information, a search condition input function for inputting a search condition, and a specifying function for specifying, from the content information acquired by the content information acquisition function, content information matching the search condition input by the search condition input function or a position within that content information, the program causing the computer to realize:
a feature quantity extraction function for extracting a plurality of different feature quantities from the content information;
an identifier generation function for generating a plurality of different identifiers, using evaluation functions, from the plurality of different feature quantities extracted by the feature quantity extraction function;
an index information storage function for storing the plurality of different feature quantities and/or identifiers as index information in association with the content or a position within the content; and
a search condition conversion function for converting the search condition input by the search condition input function into a plurality of different feature quantities and/or identifiers,
wherein the specifying function realizes a search specifying function for specifying content or a position within content by detecting a match between the index information and the search condition using the plurality of different feature quantities and/or identifiers converted by the search condition conversion function.
PCT/JP2006/320557 2005-10-14 2006-10-16 Information processing device, and program WO2007043679A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2007540220A JPWO2007043679A1 (en) 2005-10-14 2006-10-16 Information processing apparatus and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2005-300674 2005-10-14
JP2005300674 2005-10-14

Publications (1)

Publication Number Publication Date
WO2007043679A1 true WO2007043679A1 (en) 2007-04-19

Family

ID=37942896

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2006/320557 WO2007043679A1 (en) 2005-10-14 2006-10-16 Information processing device, and program

Country Status (2)

Country Link
JP (1) JPWO2007043679A1 (en)
WO (1) WO2007043679A1 (en)



Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08286693A (en) * 1995-04-13 1996-11-01 Toshiba Corp Information processing device
JP3607228B2 (en) * 1998-12-17 2005-01-05 松下電器産業株式会社 VIDEO SEARCH DATA GENERATION DEVICE, VIDEO SEARCH DATA GENERATION METHOD, VIDEO SEARCH DEVICE, AND VIDEO SEARCH METHOD

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000090133A (en) * 1998-09-08 2000-03-31 Fujitsu Ltd Three-dimensional pt plate model generating device
JP2000284793A (en) * 1999-03-31 2000-10-13 Sharp Corp Voice summary device, recording medium recording voice summary program
JP2001243185A (en) * 2000-03-01 2001-09-07 Sony Corp Advertisement information display method, advertisement information display system, advertisement information display device, and recording medium
JP2002007432A (en) * 2000-06-23 2002-01-11 Ntt Docomo Inc Information retrieval system
JP2003167914A (en) * 2001-11-30 2003-06-13 Fujitsu Ltd Multimedia information retrieving method, program, recording medium and system therefor
JP2005182703A (en) * 2003-12-24 2005-07-07 Triax Inc Image analysis system, image analysis method and strap of portable communication terminal used therefor

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
HAGITA N. ET AL: "Multimedia computing [VI. Kan]", JOURNAL OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, vol. 80, no. 10, 1997, pages 1050 - 1055, XP003011789 *
IDE I. ET AL: "Gengo Joho o Tomonau Gazo no Gazoteki Tokuchoryo to Gogi no Tokeiteki Taiozuke", INFORMATION PROCESSING SOCIETY OF JAPAN KENKYU HOKOKU, vol. 99, no. 3, 1999, pages 137 - 143, XP003011791 *
OIKAWA S. ET AL: "Onsei Media Data o Taisho to shita Metadata Jido Chushutsu Hoshiki ni Kansuru Kenkyu", IEICE TECHNICAL REPORT, vol. 104, no. 176, 2004, pages 133 - 137, XP003011792 *
ONO A. ET AL: "Jotai Kan'i Model to Scene Kijutsu Gengo ni yoru Jido Keyword Fuyo Kino o Motsu Gazo Database to Sono Hyoka", TRANSACTIONS OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, vol. J79-D-II, no. 4, 1996, pages 476 - 483, XP003011790 *
OTAKE T. ET AL: "Tahenryo Kaiseki o Mochiita Dogazo Tokuchoryo Chushutsu Hoshiki ni yoru Bangumi Shikibetsu Jikken", ITE TECHNICAL REPORT, vol. 28, no. 48, 2004, pages 1 - 6, XP003011793 *

Cited By (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009265774A (en) * 2008-04-22 2009-11-12 Canon Inc Information processor and information processing method
JP2010017274A (en) * 2008-07-09 2010-01-28 Fuji Xerox Co Ltd Image processor and image processing program
JP2010055174A (en) * 2008-08-26 2010-03-11 Nippon Telegr & Teleph Corp <Ntt> Context extraction server, context extraction method, and program
JP2010267012A (en) * 2009-05-13 2010-11-25 Hitachi Ltd System and method for voice retrieving data
US9886709B2 (en) 2011-02-22 2018-02-06 Sony Corporation Display control device, display control method, search device, search method, program and communication system
JP2012174029A (en) * 2011-02-22 2012-09-10 Sony Corp Information processor, information processing method, and program
US9430795B2 (en) 2011-02-22 2016-08-30 Sony Corporation Display control device, display control method, search device, search method, program and communication system
JP2012221035A (en) * 2011-04-05 2012-11-12 Nippon Telegr & Teleph Corp <Ntt> Electronic information access system, method and program
KR101696988B1 (en) * 2012-03-14 2017-01-16 제너럴 인스트루먼트 코포레이션 Sentiment mapping in a media content item
KR20140139549A (en) * 2012-03-14 2014-12-05 제너럴 인스트루먼트 코포레이션 Sentiment mapping in a media content item
JP2014071328A (en) * 2012-09-28 2014-04-21 Xing Inc Karaoke device and computer program
JP2014126946A (en) * 2012-12-25 2014-07-07 Korea Inst Of Industrial Technology Artificial emotion generator and method
JPWO2014171046A1 (en) * 2013-04-17 2017-02-16 パナソニックIpマネジメント株式会社 Video receiving apparatus and information display control method in video receiving apparatus
JP2015177416A (en) * 2014-03-17 2015-10-05 株式会社ニコン Content reaction output device, content reaction output system, and content reaction output program
JP2016038796A (en) * 2014-08-08 2016-03-22 東芝テック株式会社 Information processor and program
JP2016118999A (en) * 2014-12-22 2016-06-30 カシオ計算機株式会社 Speech retrieval device, speech retrieval method, and program
WO2016103651A1 (en) * 2014-12-22 2016-06-30 日本電気株式会社 Information processing system, information processing method and recording medium
JP2016119000A (en) * 2014-12-22 2016-06-30 カシオ計算機株式会社 Speech retrieval device, speech retrieval method, and program
CN105719643A (en) * 2014-12-22 2016-06-29 卡西欧计算机株式会社 VOICE RETRIEVAL APPARATUS and VOICE RETRIEVAL METHOD
WO2017098760A1 (en) * 2015-12-08 2017-06-15 ソニー株式会社 Information processing device, information processing method, and program
US11288723B2 (en) 2015-12-08 2022-03-29 Sony Corporation Information processing device and information processing method
JPWO2017098760A1 (en) * 2015-12-08 2018-09-20 ソニー株式会社 Information processing apparatus, information processing method, and program
JP2017123579A (en) * 2016-01-07 2017-07-13 株式会社見果てぬ夢 Neo medium generation device, neo medium generation method, and neo medium generation program
JP2017182261A (en) * 2016-03-29 2017-10-05 大日本印刷株式会社 Information processing apparatus, information processing method, and program
US11283894B2 (en) 2017-06-06 2022-03-22 International Business Machines Corporation Edge caching for cognitive applications
JP2020522920A (en) * 2017-06-06 2020-07-30 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Edge caching for cognitive applications
JP7067845B2 (en) 2017-06-06 2022-05-16 インターナショナル・ビジネス・マシーンズ・コーポレーション Edge caching for cognitive applications
US11551219B2 (en) * 2017-06-16 2023-01-10 Alibaba Group Holding Limited Payment method, client, electronic device, storage medium, and server
JP2019008607A (en) * 2017-06-26 2019-01-17 Jcc株式会社 Video management server and video management system
JP2019030949A (en) * 2017-08-09 2019-02-28 日本電信電話株式会社 Robot control device, robot control method, and robot control program
CN110096938B (en) * 2018-01-31 2022-10-04 腾讯科技(深圳)有限公司 Method and device for processing action behaviors in video
CN110096938A (en) * 2018-01-31 2019-08-06 腾讯科技(深圳)有限公司 A kind for the treatment of method and apparatus of action behavior in video
JP2018142358A (en) * 2018-05-01 2018-09-13 東芝テック株式会社 Information processor and program
US10824874B2 (en) 2018-06-08 2020-11-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing video
JP2020004378A (en) * 2018-06-29 2020-01-09 バイドゥ オンライン ネットワーク テクノロジー (ベイジン) カンパニー リミテッド Method and device for information push
US10931772B2 (en) 2018-06-29 2021-02-23 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for pushing information
WO2020102005A1 (en) * 2018-11-15 2020-05-22 Sony Interactive Entertainment LLC Dynamic music creation in gaming
CN113038998A (en) * 2018-11-15 2021-06-25 索尼互动娱乐有限责任公司 Dynamic music creation in a game
US11328700B2 (en) 2018-11-15 2022-05-10 Sony Interactive Entertainment LLC Dynamic music modification
US11969656B2 (en) 2018-11-15 2024-04-30 Sony Interactive Entertainment LLC Dynamic music creation in gaming
JP2020135424A (en) * 2019-02-20 2020-08-31 KDDI Corporation Information processing apparatus, information processing method, and program
JP6997733B2 (en) 2019-02-20 2022-01-18 KDDI Corporation Information processing apparatus, information processing method, and program
CN109889891B (en) * 2019-03-05 2023-03-24 Tencent Technology (Shenzhen) Co., Ltd. Method, apparatus and storage medium for acquiring a target media file
CN109889891A (en) * 2019-03-05 2019-06-14 Tencent Technology (Shenzhen) Co., Ltd. Method, apparatus and storage medium for acquiring a target media file
WO2020246075A1 (en) * 2019-06-04 2020-12-10 Sony Corporation Action control device, action control method, and program
CN111126635A (en) * 2019-12-25 2020-05-08 Harbin Xinzhongxin Electronics Co., Ltd. Evaluation method for DIY store POS machine maintenance type selection based on customer satisfaction analysis
CN111126635B (en) * 2019-12-25 2023-06-20 Harbin Xinzhongxin Electronics Co., Ltd. Evaluation method for DIY store POS machine maintenance type selection based on customer satisfaction analysis
JP2022524471A (en) * 2020-02-21 2022-05-06 Google LLC Systems and methods for extracting temporal information from animated media content items using machine learning
JP7192086B2 (en) 2020-02-21 2022-12-19 Google LLC Systems and methods for extracting temporal information from animated media content items using machine learning
JP7464135B2 (en) 2020-09-03 2024-04-09 Nippon Telegraph and Telephone Corporation Movement amount estimation device, movement amount estimation method, and program
WO2022049690A1 (en) * 2020-09-03 2022-03-10 Nippon Telegraph and Telephone Corporation Movement amount estimation device, movement amount estimation method, and program
WO2022180858A1 (en) * 2021-02-26 2022-09-01 I'mbesideyou Inc. Video session evaluation terminal, video session evaluation system, and video session evaluation program
CN116452241B (en) * 2023-04-17 2023-10-20 Guangxi University of Finance and Economics User churn probability calculation method based on multimodal fusion neural network
CN116452241A (en) * 2023-04-17 2023-07-18 Guangxi University of Finance and Economics User churn probability calculation method based on multimodal fusion neural network

Also Published As

Publication number Publication date
JPWO2007043679A1 (en) 2009-04-23

Similar Documents

Publication Publication Date Title
WO2007043679A1 (en) Information processing device, and program
KR102018295B1 (en) Apparatus, method and computer-readable medium for searching and providing sectional video
US11488576B2 (en) Artificial intelligence apparatus for generating text or speech having content-based style and method for the same
JP6876752B2 (en) Response method and equipment
CN113569088B (en) Music recommendation method and device and readable storage medium
CN109086408A (en) Document creation method, device, electronic equipment and computer-readable medium
Buitelaar et al. MixedEmotions: An open-source toolbox for multimodal emotion analysis
US20140289323A1 (en) Knowledge-information-processing server system having image recognition system
WO2020081872A1 (en) Characterizing content for audio-video dubbing and other transformations
CN110517689A (en) Voice data processing method, device and storage medium
WO2005071665A1 (en) Method and system for determining the topic of a conversation and obtaining and presenting related content
CN109920409B (en) Sound retrieval method, device, system and storage medium
CN113010138B (en) Article voice playing method, device and equipment and computer readable storage medium
US9525841B2 (en) Imaging device for associating image data with shooting condition information
WO2007069512A1 (en) Information processing device, and program
WO2021149929A1 (en) System for providing customized video producing service using cloud-based voice combining
Maybury Multimedia information extraction: Advances in video, audio, and imagery analysis for search, data mining, surveillance and authoring
Wang et al. Generating images from spoken descriptions
Zhang Voice keyword retrieval method using attention mechanism and multimodal information fusion
Yang Research on music content recognition and recommendation technology based on deep learning
CN113407766A (en) Visual animation display method and related equipment
US20210337274A1 (en) Artificial intelligence apparatus and method for providing visual information
CN117521603A (en) Short-video text language model construction and training method
KR20230174986A (en) Apparatus and method for predicting favorite perfume
CN116013242A (en) Speech synthesis method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
WWE WIPO information: entry into national phase
    Ref document number: 2007540220
    Country of ref document: JP
NENP Non-entry into the national phase
    Ref country code: DE
122 EP: PCT application non-entry in European phase
    Ref document number: 06821872
    Country of ref document: EP
    Kind code of ref document: A1