WO2007043679A1 - Information processing device, and program - Google Patents

Information processing device, and program

Info

Publication number
WO2007043679A1
WO2007043679A1 (PCT/JP2006/320557)
Authority
WO
WIPO (PCT)
Prior art keywords
information
content
identifier
search
phoneme
Prior art date
Application number
PCT/JP2006/320557
Other languages
French (fr)
Japanese (ja)
Inventor
Masayoshi Ihara
Ryutaro Egawa
Hiroshi Otsuka
Kei Maruno
Shunji Mitsuyoshi
Original Assignee
Sharp Kabushiki Kaisha
Sgi Japan, Ltd.
Advanced Generation Interface Japan, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sharp Kabushiki Kaisha, Sgi Japan, Ltd., and Advanced Generation Interface Japan, Inc.
Priority to JP2007540220A
Publication of WO2007043679A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25: Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266: Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/26603: Channel or content management for automatically generating descriptors from content, e.g. when it is not made available by its provider, using content analysis techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40: Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47: End-user applications
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00: Details of television systems
    • H04N5/76: Television signal recording

Definitions

  • Content information acquisition means for acquiring content information; search condition input means for inputting search conditions; and specifying means for specifying, from the content information acquired by the content information acquisition means, content information that conforms to the search conditions input by the search condition input means, or a position within that content information.
  • As in Patent Document 1, a method has been proposed for detecting changes in content information during a content information search using a general information processing apparatus; it treats changes in volume as a feature amount and regards scenes in which the change exceeds a certain threshold as highlight scenes.
  • Here, a feature amount is a numerical value that quantifies, within a specified range of input information such as audio and video, time-series changes, changes between neighboring pixels, changes in color, acoustic frequency, and the like.
  • Various methods can be considered for converting such rates of change into numerical values. For audio, for example, the change along the frequency axis can be quantified using a cepstrum or an FFT; for video, the luminance and hue of adjacent pixels can be quantified as time-series changes, difference values, relative values, or absolute values. These are described in detail later in a modification example.
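To make this kind of quantification concrete, here is a minimal Python sketch (not from the patent; function and parameter names are illustrative) that turns an audio signal into per-frame cepstral feature vectors plus frame-to-frame difference values of the kind described above:

```python
import numpy as np

def audio_frame_features(samples, frame_len=512, hop=256):
    """Quantify an audio signal as per-frame spectral feature vectors.

    Illustrative sketch: low-order cepstral coefficients per frame, plus
    their frame-to-frame differences as a simple "rate of change" measure.
    """
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len, hop):
        frame = samples[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))     # change along the frequency axis
        log_spec = np.log(spectrum + 1e-10)
        cepstrum = np.fft.irfft(log_spec)[:20]    # low-order cepstral coefficients
        frames.append(cepstrum)
    feats = np.array(frames)
    deltas = np.diff(feats, axis=0)               # time-series change (difference values)
    return feats, deltas
```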
  • In Non-Patent Document 3, as an application of such technology, a search method based on phoneme symbol strings obtained by phoneme recognition and on image recognition has been proposed: a character string associated with an image is converted into a phoneme sequence or phoneme segment sequence, or conversely a phoneme sequence or phoneme segment sequence is converted into a character string, linking "still image - word set - text - voice - video" to one another.
  • In Patent Document 3, symbol strings based on phonemes and/or phoneme segments are registered in a database in association with geographical position information, and an information distribution device and a receiving device are proposed that search for and provide information containing the proper nouns common in city information. Patent Document 4 proposes retrieval of speech information indexed by phoneme recognition, and related techniques are also proposed in the documents cited therein.
  • A technique for recognizing emotions from voice feature information is disclosed in Patent Document 5, and a technique for detecting musical scales and instruments is proposed in Non-Patent Document 4.
  • Patent Documents 6 and 7 propose methods that recognize a moving image or still image, detect character strings and the like, and perform a search based on the detected character strings; methods for recognizing motion from images, known as gesture recognition and motion recognition, have also been proposed. Patent Document 8 proposes a method for recognizing facial images.
  • Patent Document 9 and the documents it cites propose related methods, but no method has been proposed for extracting the features of a specific scene by combining information obtained from multiple recognitions in a time-series manner, and thereby specifying a position on the content's time axis, a position on the display screen, or a position in a read-aloud text.
  • A state in which a plurality of different pieces of information arise in positional proximity to one another is generally called "co-occurrence". A "co-occurrence relation", "co-occurrence state", or "co-occurrence information" can be used to evaluate the conditions under which arbitrary information arises, by combining the information generated in the vicinity of some given information; for example, the meaning of sentences is estimated using a covariance matrix based on co-occurrence probabilities and co-occurrence information. Here, the positional neighborhood is a temporal and spatial neighborhood based on the time-series position, the reading position, and the display position.
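As an illustration of evaluating co-occurrence in a positional (here, temporal) neighborhood, the following Python sketch counts identifier pairs that arise within a small time window and turns the counts into co-occurrence probabilities. All names are assumptions for illustration, not the patent's:

```python
from collections import Counter

def cooccurrence_probs(events, window=3.0):
    """Estimate co-occurrence probabilities of identifier pairs.

    `events` is a list of (time_sec, identifier) tuples, e.g. outputs of
    phoneme, emotion, and image recognizers. Pairs closer than `window`
    seconds are counted as co-occurring. Illustrative sketch only.
    """
    events = sorted(events)
    pairs = Counter()
    for i, (t_i, id_i) in enumerate(events):
        for t_j, id_j in events[i + 1:]:
            if t_j - t_i > window:          # outside the temporal neighborhood
                break
            pairs[tuple(sorted((id_i, id_j)))] += 1
    total = sum(pairs.values()) or 1
    return {pair: n / total for pair, n in pairs.items()}

probs = cooccurrence_probs([(0.1, "phoneme:a"), (0.4, "emotion:joy"),
                            (0.5, "image:red"), (5.0, "phoneme:k")])
```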
  • Patent Document 10 proposes a method of indexing content information in a sensitivity-word space, and Non-Patent Document 5 proposes a search method that indexes video and audio with character strings based on utterance content; however, constructing an evaluation function for search using co-occurrence relationships among the recognition results in the content information, the feature values used for recognition, and the identifiers is not proposed.
  • In fields where human operators respond flexibly, such as product reputation surveys at call centers, searching content information such as moving images according to personal tastes, nursing patients in medical settings, and the reactions of virtual personalities in robots or agents, no method has been proposed that evaluates information using co-occurrence relationships constructed from multiple features and identifiers (symbols for distinguishing features) obtained from the environment, and that, based on the evaluation results, detects information and provides information and processing highly convenient for the user.
  • Regarding the syllables, phonemes, and phoneme segments in the present invention: for the Japanese "a-ka-sa-ta-na", a notation such as "a/ka/sa/ta/na" is used for syllables, and "a/k/a" for phonemes.
  • Phoneme recognition and phoneme segment recognition differ from general speech recognition. More specifically, because phoneme recognition and phoneme segment recognition do not use a grammar-related language model, the recognition result carries no semantic interpretation: the result is not converted into meaning-bearing symbols such as kanji, homonyms are not distinguished from one another, and nouns are not distinguished from verbs according to context. They are characterized by analyzing the utterance using a model and evaluating only the match between the utterance and the recognition symbols.
  • a "phoneme” refers to a vowel or consonant that is a component of speech
  • a "phoneme segment” is an element obtained by subdividing one phoneme into , The middle of “A”, the end of “A”, and the notation based on the change of phonemes for the utterances that are intermediate sounds such as the sound between “A” and “I” It may be written as “phoneme identifier” or “phoneme segment identifier”.
  • Patent Document 1: JP 2004-233541 A
  • Patent Document 2: JP 62-220998 A
  • Patent Document 3: JP 2004-54915 A
  • Patent Document 4: JP 2002-221984 A
  • Patent Document 5: JP 2002-91482 A
  • Patent Document 6: JP 2002-14973 A
  • Patent Document 7: JP 09-330400 A
  • Patent Document 8: JP 5-153581 A
  • Patent Document 9: JP 7-36883 A
  • Patent Document 10: JP 2005-107718 A
  • Patent Document 11: JP 2004-280158 A
  • Patent Document 12: JP 10-320400 A
  • Patent Document 13: Japanese Patent Application No. 2005-147048
  • Non-Patent Document 1: Masayuki Nakazawa, Takashi Endo, Kiyoshi Furukawa, Jun Toyoura, Takashi Oka (New Information Processing Development Corporation), "Study of speech summaries and topic summaries using phoneme symbol sequences from speech waveforms", IEICE Technical Report, SP96-28, pp. 61-68, June 1996.
  • Non-Patent Document 2: "Research and Development on Life Support Interface for Aged Society", Key Project Research Report by Aomori Prefectural Industrial Research Center, Vol. 5, Apr. 1998 - Mar. 2001.
  • Non-Patent Document 3: Takashi Oka, Hironobu Takahashi, Takuichi Nishimura, Nobuhiro Sekimoto, Hidehide Mori, Masanori Ihara, Hiroaki Yabe, Hiroaki Hashiguchi, Hiroshi Matsumura, "Pattern search algorithm map supporting 'CrossMediator'", Artificial Intelligence Study Group, volume 1, pages 1-6, Japanese Society for Artificial Intelligence, 2001.
  • Non-Patent Document 4: Masahiro Tani, "Integration of instrument sound features by Bayesian Network and application to instrument identification", 2003 IEICE General Conference, "D-14 Speech, Auditory", D-14-21, p. 188, March 2003.
  • Non-Patent Document 5: Satoshi Nagao, "Semantic Transcoding - Towards More Practical Semantic Web", Research on Human-Centered Intellectual Information Technology VI-3.6, Japan Information Processing Development Corporation Technical Research Institute, March 2003.
  • Conventional search methods have generally either used character strings and audio information associated with images and video, or evaluated identifiers and feature values obtained by a single recognition or feature extraction method. As a result, there has been the problem that searches based on abstract concepts difficult to express in language, on sensory concepts such as the excitement of a scene, or on personal tastes and subjectivity are difficult.
  • In Non-Patent Document 3, a search is performed using phoneme symbols acquired by phoneme recognition as identifiers; however, no method has been proposed for constructing a covariance matrix from co-occurrence information that combines identifiers and feature quantities from multiple recognition methods, such as emotion identifiers obtained by emotion recognition of speech, image identifiers obtained by image recognition, and motion identifiers obtained by motion recognition, and for constructing from it a new evaluation function for indexing or searching.
  • The inventor considered that by creating an evaluation function based on the co-occurrence relationships of the identifiers and feature quantities obtained as the results of such various recognitions, searches and indexing that were previously impossible, such as abstract searches for a "climax", become feasible. Furthermore, by letting the user or producer give an appropriate name to any evaluation function configured as an analysis result, and generating phoneme strings and phoneme segment strings from the named character string, user-defined evaluation functions and indexes can be used to specify search conditions, and the configured evaluation functions can be distributed; in this way a highly convenient search environment can be realized.
  • As technology related to the co-occurrence of information, methods have been proposed, based on Patent Document 9 and the documents it cites, that measure co-occurrence relations from the co-occurrence frequency of words and characters within the same sentence, using co-occurrence probabilities and covariance matrices, and that extract sentence features for semantic estimation.
  • In contrast, the present invention is characterized by using co-occurrence information, co-occurrence probabilities, covariance matrices, and co-occurrence matrices over the identifiers extracted by various recognition methods and the feature quantities used in recognizing them.
  • In Non-Patent Document 3, an image is uniformly segmented and the word strings statistically associated with the segmented image features are expanded into phonemes and phoneme segments, making it possible to search based on utterances or to find locations in a video where a given word is uttered. However, it was impossible to statistically classify, based on their co-occurrence states, combinations of the specific image feature tendencies, emotion feature tendencies, and voice feature tendencies obtained through recognition; to configure an evaluation function that assigns them an identifier; to associate a phoneme sequence or phoneme segment sequence with the utterance of a name denoting the identifier's target; and to construct an evaluation function indexed for searching those identifiers.
  • In view of the above, an object of the present invention is to provide an information search device and the like that can easily search arbitrary content information by using co-occurrence information based on various types of input information.
  • In order to solve the above problems, an information processing apparatus according to a first invention comprises: content information acquisition means for acquiring content information; search condition input means for inputting a search condition; and specifying means for specifying, from the content information acquired by the content information acquisition means, content information that conforms to the search condition input by the search condition input means, or a position within the content information. The apparatus further comprises: feature quantity extraction means for extracting a feature quantity from the content information; identifier generation means for generating an identifier from the feature quantity extracted by the feature quantity extraction means using an evaluation function; index information storage means for storing the feature quantity and/or the identifier as index information in association with the content or a position within the content; and search condition conversion means for converting the search condition input by the search condition input means into a feature quantity and/or an identifier. The specifying means comprises search specifying means for specifying the content or a position within the content by detecting a match between the index information and the search condition using the feature quantity and/or identifier converted by the search condition conversion means.
  • An information processing apparatus according to a second invention comprises: content information acquisition means for acquiring content information; search condition input means for inputting a search condition; and specifying means for specifying, from the content information acquired by the content information acquisition means, content information that conforms to the search condition input by the search condition input means, or a position within the content information. The apparatus further comprises: feature quantity extraction means for extracting a plurality of different feature quantities from the content information; identifier generation means for generating a plurality of different identifiers from the plurality of different feature quantities extracted by the feature quantity extraction means using evaluation functions; index information storage means for storing the plurality of different feature quantities and/or identifiers as index information in association with the content or a position within the content; and search condition conversion means for converting the search condition input by the search condition input means into a plurality of different feature quantities and/or identifiers. The specifying means comprises search specifying means for specifying the content or a position within the content by detecting a match between the index information and the search condition using the plurality of different feature quantities and/or identifiers converted by the search condition conversion means.
  • A third invention is the information processing apparatus according to the first or second invention, wherein the index information storage means further stores, in association with the content or a position within the content, index co-occurrence information configured based on the feature quantities and/or identifiers obtained from the content. The apparatus further comprises search condition co-occurrence information configuring means for configuring, as search condition co-occurrence information, co-occurrence information based on the feature quantities and/or identifiers converted from the search condition by the search condition conversion means, and the search specifying means comprises co-occurrence search specifying means for specifying the content or a position within the content by detecting a match between the search condition co-occurrence information configured by the search condition co-occurrence information configuring means and the index co-occurrence information.
  • In a fourth invention, the content includes character information, and the identifier generation means generates an identifier based on that character information.
  • In a fifth invention, associations between character information and identifiers are further stored as dictionary information, and the identifier generation means generates identifiers from the character information included in the content using the dictionary information.
  • A sixth invention is the information processing apparatus according to any one of the first to fifth inventions, further comprising: standard pattern dictionary information storage means for storing identifiers in association with standard patterns as standard pattern dictionary information; and identifier feature quantity conversion means for converting an identifier into a feature quantity of a standard pattern by using the standard pattern dictionary information.
  • In a seventh invention, the index information storage means further stores the feature quantities and/or identifiers in association with the content or a position within the content based on the real time of the content information, and the specifying means detects matches between the index information and the search condition for content information distributed in real time.
  • An eighth invention is the information processing apparatus according to any one of the first to seventh inventions, characterized by presenting, while content information is being searched and/or when a search or detection result is presented, advertisement information associated with the search and/or the index information.
  • In a ninth invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the content for use in phoneme recognition, or at least one of the identifiers is a phoneme identifier generated from that phoneme information.
  • In a tenth invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the content for use in phoneme segment recognition, or at least one of the identifiers is a phoneme segment identifier generated from that phoneme segment information.
  • In an eleventh invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the content for use in emotion recognition, or at least one of the identifiers is an emotion identifier generated from that emotion information.
  • In a twelfth invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from auditory information of the content for use in recognition, or an identifier generated from that auditory information.
  • In a thirteenth invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from visual information of the content for use in recognition, or an identifier generated from that visual information.
  • In a fourteenth invention, the content includes character information, and at least one of the plurality of different feature quantities extracted by the feature quantity extraction means, or of the identifiers generated by the identifier generation means, is a feature quantity extracted from the character information or an identifier generated from the character information.
  • A fifteenth invention is the information processing apparatus of the second invention, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means, or of the plurality of different identifiers generated by the identifier generation means, is a feature quantity or identifier extracted from program information.
  • A sixteenth invention is the information processing apparatus of the second invention, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means, or of the plurality of different identifiers generated by the identifier generation means, is a feature quantity or identifier extracted from sensor information.
  • A seventeenth invention further comprises evaluation function reconstructing means for reconstructing the evaluation function from co-occurrence information configured based on the feature quantities and/or identifiers obtained from the content.
  • An eighteenth invention is the information processing apparatus according to the third invention, comprising evaluation function reconstructing means for reconstructing the evaluation function from co-occurrence information configured based on the feature quantities and/or identifiers converted from the search condition by the search condition conversion means.
  • A nineteenth invention comprises: search result co-occurrence information configuring means for configuring co-occurrence information based on the result of specifying the content or a position within the content by the co-occurrence search specifying means; and evaluation function reconstructing means for reconstructing the evaluation function from the co-occurrence information configured by the search result co-occurrence information configuring means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, a phoneme feature quantity extracted from the content for use in phoneme recognition and/or a phoneme identifier obtained by phoneme recognition, and an emotion feature quantity extracted from the content for use in emotion recognition and/or an emotion identifier obtained by emotion recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, a phoneme segment feature quantity extracted from the content for use in phoneme segment recognition and/or a phoneme segment identifier obtained by phoneme segment recognition, and an emotion feature quantity extracted from the content for use in emotion recognition and/or an emotion identifier obtained by emotion recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, a phoneme feature quantity extracted from the content for use in phoneme recognition and/or a phoneme identifier obtained by phoneme recognition, an emotion feature quantity extracted from the content for use in emotion recognition and/or an emotion identifier obtained by emotion recognition, and a first feature quantity extracted from the content for use in a first recognition and/or a first identifier obtained by the first recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, a phoneme segment feature quantity extracted from the content for use in phoneme segment recognition and/or a phoneme segment identifier obtained by phoneme segment recognition, an emotion feature quantity extracted from the content for use in emotion recognition and/or an emotion identifier obtained by emotion recognition, and a first feature quantity extracted from the content for use in a first recognition and/or a first identifier obtained by the first recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, a phoneme feature quantity extracted from the content for use in phoneme recognition and/or a phoneme identifier obtained by phoneme recognition, and a first feature quantity extracted from the content for use in a first recognition and/or a first identifier obtained by the first recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, a phoneme segment feature quantity extracted from the content for use in phoneme segment recognition and/or a phoneme segment identifier obtained by phoneme segment recognition, and a first feature quantity extracted from the content for use in a first recognition and/or a first identifier obtained by the first recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • In an information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying content that matches the search condition from the content stored in content storage means, the apparatus comprises index recording means for recording, as an index in association with one another, an emotion feature quantity extracted from the content for use in emotion recognition and/or an emotion identifier obtained by emotion recognition, and a first feature quantity extracted from the content for use in a first recognition and/or a first identifier obtained by the first recognition; and the specifying means comprises index specifying means for specifying, from the content, content that matches the search condition based on the index information recorded by the index recording means.
  • The first identifier and/or the first feature quantity is an identifier and/or feature quantity based on auditory information and/or visual information and/or character information and/or sensor information.
  • The inventor noted that evaluation using co-occurrence information can be applied to the probabilistic evaluation of the conditions under which arbitrary information arises, by combining the information generated in the vicinity of some given information, and that co-occurrence relationship information can therefore be used for searching and learning, since it has been used for searches based on the semantic estimation of content information.
  • For example, in action movies a "scream", an "explosion sound", and an "explosion video, i.e. a screen change accompanied by radial movement in red and yellow image information" are pieces of information having a characteristic co-occurrence relation; the invention aims to solve the problems by evaluating, interpreting, searching, and learning such relations.
  • Specifically, by combining various recognition methods from the prior art, phonemes, phoneme segments, emotions, and image features are recognized for each frame of the audio and video streams of a moving image; the moving image is indexed using the identifiers obtained as recognition results; co-occurrence probabilities of the identifiers are configured for each frame; the transitions of the co-occurrence probabilities are accumulated over multiple frames on the basis of a co-occurrence matrix to obtain a covariance matrix; and the eigenvalues and eigenvectors of the covariance matrix are obtained to construct an evaluation function.
  • indexing can be performed based on the co-occurrence information of various recognition results in the content information.
  • The evaluation functions may be reconstructed by multivariate analysis and their number increased arbitrarily, and the name of an added evaluation function may be defined manually from the image and voice tendencies it detects. The evaluation functions may also be reconfigured based on the user's operations on the search results, and an HMM may be trained instead of using the eigenvalues and eigenvectors.
  • An index based on an evaluation function configured in this way enables the user to obtain search results, based on combinations of image features, acoustic features, and detected emotions, that were previously impossible, and the function for reconstructing the evaluation function according to usage makes it possible to search for content information better suited to the user's own subjectivity.
  • The gist of the invention is thus search and detection based not on combinations of conventional word information, but on the phoneme sequences or phoneme segment sequences contained in the voice of the content information, emotion identifiers based on emotion recognition, and the image and audio features that characteristically co-occur in a scene of a video.
  • In other words, the present invention records, classifies, and accumulates the co-occurrence states of images, phonemes, and emotions that co-occur without humans being conscious of it, and, by assigning identifiers again on the basis of the accumulated information and making them available for search and detection, aims to solve problems that could not be solved by conventional single-recognition search.
  • Since visual information and text information can also be used independently, various applications are possible: for telephone calls, the co-occurrence relationship between utterance recognition and emotion recognition arising in the voice information can be used, and the invention can serve as a tool for robots and for video production and editing.
  • For example, a co-occurrence matrix with a total of 250 elements is constructed for each frame to determine the co-occurrence probabilities; a covariance matrix of the co-occurrence probabilities is constructed by accumulating over frames (for example, 3 seconds); and the eigenvalues and eigenvectors of the covariance matrix are obtained to construct an evaluation function.
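A minimal numerical sketch of this pipeline, assuming flattened 250-element co-occurrence probability vectors per frame and a 3-second window at 30 fps (all names are illustrative, not the patent's):

```python
import numpy as np

def build_evaluation_function(frame_cooc, top_k=3):
    """From per-frame co-occurrence vectors, derive eigenvector-based scorers.

    `frame_cooc`: array of shape (n_frames, n_elements), e.g. flattened
    co-occurrence probabilities per frame, accumulated over ~3 seconds.
    """
    cov = np.cov(frame_cooc, rowvar=False)        # covariance of co-occurrence probs
    eigvals, eigvecs = np.linalg.eigh(cov)        # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1][:top_k]     # strongest co-occurrence axes
    axes = eigvecs[:, order]
    mean = frame_cooc.mean(axis=0)

    def evaluate(cooc_vector):
        # Score a new frame's co-occurrence vector against each principal axis.
        return (cooc_vector - mean) @ axes
    return evaluate

frames = np.random.rand(90, 250)                  # e.g. 3 s of 30 fps frames
score = build_evaluation_function(frames)(np.random.rand(250))
```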
  • A function name and identifier can be given by the user, or a character string with a high probability of co-occurring with the function obtained by multivariate analysis can be assigned as the function name and identifier, so that the function can be invoked according to the user's instructions. In addition, a dictionary can be constructed based on the co-occurrence information between video- and emotion-related identifiers and their associated character strings and phoneme strings.
  • That is, rather than converting phonemes into indexed character strings and searching on them as in the prior art, the present invention performs mutual conversion between identifier strings consisting of phonemes or phoneme segments and the identifiers obtained by the evaluation functions used for recognition, and constructs evaluation functions using co-occurrence matrices over the phoneme-based identifiers and identifier strings, other identifiers and identifier strings such as those for emotions and video, and feature quantities, thereby performing indexing, search, detection, and learning automatically or recursively based on user instructions.
  • The identifiers used are not limited to the phonemes and color information described above; various identifiers are used according to purpose, such as "emotion identifiers", "scale identifiers", "environmental sound identifiers", characters obtained by image recognition, and "person identifiers" and "object identifiers" accompanying image recognition. Here, "identifier" refers to a symbol discriminated from audio and video feature quantities on the basis of probability, likelihood, or distance by an evaluation function or HMM: an "emotion identifier" from emotion recognition, an "environmental sound identifier" from environmental sound recognition, and "characters", "person identifiers", "expression identifiers", "object identifiers", and moving-image "motion identifiers" from face detection and image recognition. Publicly known recognition technology can be used, and program information, text information, sensor information, and the like may be combined.
  • As a result, content information is automatically indexed with identifiers and feature quantities of emotions and images together with phonemes and phoneme segments, and a search is performed by combining these identifiers; for example, a place can be detected where feature values identifiable as "laughter" appear in the surrounding features and where the phoneme sequence of a specific line of dialogue appears.
  • This realizes an information processing device that provides search capabilities a conventional video search system cannot, automatically records programs exhibiting a characteristic tendency, and delivers an e-mail upon detection. It is also possible to construct a "laughing state" identifier or discriminant function by performing face detection and facial feature extraction simultaneously with the laughter emotion identifier and learning the co-occurrence information of the identifiers and feature quantities.
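As an illustration of such detection, here is a sketch (identifier names are assumptions for illustration) that scans a per-frame identifier index for positions where a laughter emotion identifier, a detected face, and optionally a phoneme substring co-occur:

```python
def detect_scenes(index, required=("emotion:laugh", "face:detected"), phonemes=None):
    """Scan a frame index for positions where identifiers co-occur.

    `index` maps frame_no -> set of identifier strings; `phonemes` is an
    optional phoneme substring (e.g. a line of dialogue) that must also
    appear among the frame's phoneme identifiers.
    """
    hits = []
    for frame_no in sorted(index):
        ids = index[frame_no]
        if all(r in ids for r in required):
            if phonemes is None or any(i.startswith("phoneme:") and phonemes in i
                                       for i in ids):
                hits.append(frame_no)
    return hits

scenes = detect_scenes({10: {"emotion:laugh", "face:detected", "phoneme:kaka"}},
                       phonemes="kaka")
```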
  • Further, by continuously extracting features from the voices of consumers and operators at a consumer consultation desk, performing phoneme recognition, identifying the product from the recognized phonemes, and recording the detected emotion for the identified product, a user's emotional evaluation of a specific product can be recorded and used for product quality analysis; alternatively, the manual of the target product can be displayed on the terminal screen when the operator utters the product name. The invention aims to solve the problems in this way.
  • By combining scale features, phoneme features, and emotion features, the sung voice in a piece of music and the user's own singing voice are recognized, and the phoneme sequence of the lyrics and emotion identifiers are obtained. Music is then searched by expanding input character strings into phoneme symbol strings, comparing scale transition states and the appearance frequencies of emotion features, and retrieving music with high similarity; this makes possible music searches matched to personal taste that did not exist before, and solves the problems.
  • First, the user's utterance is converted into a phoneme string.
  • Next, the actor names in EPG, BML, RSS, and teletext are converted into phoneme strings.
  • Then, an actor-name phoneme string that matches the user's utterance phoneme string is searched for.
  • Finally, the cast name associated with the actor name of the matched phoneme string is detected.
  • The phoneme string may also be expanded from words or keywords input as characters.
  • Further, a phoneme sequence index is constructed while performing phoneme recognition on the audio synchronized with the distributed moving image, and locations are searched for that match the phoneme sequence of a cast name based on an actor name detected from EPG, BML, RSS, or teletext.
  • The emotional features contained in the audio signal around the cast name, and the program genre, may also be evaluated.
  • Upon detecting that the phoneme string based on the cast name matches together with the emotion feature specified by the user, recording is started, playback skips to only the target ranges, or a ranked list is created and output as a search result to guide the user's operation, thereby achieving a convenient search and solving the problem; a sketch of this matching pipeline is given below.
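A minimal sketch of the actor-name phoneme matching described above, under stated assumptions: a toy word-to-phoneme lexicon stands in for a real grapheme-to-phoneme converter, and `epg_actors` is assumed to be already parsed from EPG/BML/RSS elsewhere.

```python
import re

def to_phonemes(text, lexicon):
    """Expand a character string into a phoneme sequence via a lexicon.
    Real systems would use a grapheme-to-phoneme converter; this is a toy."""
    return [p for w in re.findall(r"\w+", text.lower()) for p in lexicon.get(w, [])]

def find_matches(utterance_phonemes, epg_actors, lexicon):
    """Return (actor, cast_name) pairs whose actor-name phoneme string
    appears inside the user's utterance phoneme string."""
    utter = "".join(utterance_phonemes)
    hits = []
    for actor, cast_name in epg_actors:     # e.g. parsed from EPG/BML/RSS
        actor_ph = "".join(to_phonemes(actor, lexicon))
        if actor_ph and actor_ph in utter:
            hits.append((actor, cast_name))
    return hits

lexicon = {"tanaka": ["t", "a", "n", "a", "k", "a"]}
hits = find_matches(["t", "a", "n", "a", "k", "a"], [("Tanaka", "Detective")], lexicon)
```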
  • It is also possible to solve the problem by classifying, using multivariate analysis, the character strings and phoneme or phoneme-segment symbol strings obtained from speech feature values, identifiers such as emotion, scale, instrument sound, and environmental sound, and the identifiers of shapes, colors, characters, and actions recognized from feature values obtained from video, and using these as identifiers in the present invention.
  • The device may also learn the feature quantities of information that the user frequently records or skip-plays, automatically start recording or skip playback upon detecting the learned feature quantities, or deliver an e-mail or RSS notification upon detection; any processing may be performed as long as the problem is solved.
  • In other words, the present invention is characterized by indexing and searching not with conventional identifiers associated with voice, but by combining identifiers extracted from images and voices as feature quantities, such as emotion identifiers recognized from voice, environmental sound identifiers and instrument identifiers, and video identifiers, motion identifiers, and shape identifiers, to obtain search results; by learning the co-occurrence states of identifiers and feature quantities in these processes; by distributing information on the emotion identifiers and other identifiers described in this embodiment; and by performing search and detection based on the distributed information.
  • Character information need not be parsed: the evaluation may use only co-occurrence states based on simple word appearance frequency, or, instead of co-occurrence information between words, the search may use co-occurrence states at the phonetic-symbol level, expanded into phonemes and phoneme segments that, unlike kanji, carry no dimension of meaning, and the search results so obtained may then be analyzed.
  • In contrast to the conventional technology, the present invention makes indexing possible by combining symbols, identifiers, characters, and the like based on the recognition of multiple audio, video, image, and text features. Reconstruction of evaluation functions enables search processing for complex expressions of content information that take into account human subjectivity and emotion, which was previously impossible; it allows abstract searches associated with the adjectives and adverbs contained in utterances and character strings; and it aims to solve the problem of the digital divide by reducing the complexity of using the underlying processing equipment.
  • Furthermore, by performing meta-indexing, in which content information expressing linguistic adjectives and adverbs is indexed based on the feature quantities and/or co-occurrence information of identifiers associated with the various recognitions, and by constructing annotation information through phoneme extraction, grounding based on multidimensional identifiers centered on phoneme sequences and emotions is implemented; by reusing these, information retrieval and knowledge sharing can be realized.
  • FIG. 1 A diagram showing a basic configuration example of the apparatus according to the present embodiment.
  • FIG. 3 A diagram showing the operation of generating an identifier by feature quantity identifier conversion.
  • FIG. 4 A diagram showing a configuration example of video index data.
  • FIG. 5 A diagram showing a configuration example of video index data in the unit time designation method.
  • FIG. 6 A diagram showing the operation of index co-occurrence state learning.
  • FIG. 8 A diagram showing an example of a co-occurrence matrix of emotions, phonemes, and images.
  • FIG. 9 A diagram showing an example of a covariance matrix of emotions, phonemes, and images.
  • FIG. 12 A diagram showing an example of learning with basic search conditions.
  • FIG. 14 A diagram showing a configuration example of an index information generating device.
  • FIG. 17 A diagram showing the operation of the search method.
  • FIG. 18 A diagram showing an operation procedure of a basic character string search request and execution method.
  • FIG. 19 A diagram showing an example of search processing.
  • FIG. 20 A diagram showing an example of a usage environment in the present embodiment.
  • FIG. 21 A diagram showing an example of a processing procedure on the transmission side.
  • FIG. 22 A diagram showing an example of a processing procedure on the receiving side.
  • FIG. 23 A diagram showing state transitions of search processing.
  • FIG. 25 A diagram showing an example of a basic procedure for acquiring external information.
  • FIG. 26 A diagram showing an example of a search and arbitrary processing method using EPG information.
  • FIG. 27 A diagram showing state transitions in a product reliability survey application based on consumer emotion.
  • FIG. 28 A diagram showing an example of a search procedure for language-specific phoneme symbols.
  • FIG. 29 A diagram showing an example of a phoneme symbol search procedure for language-specific character strings.
  • FIG. 30 A diagram showing a configuration example of a symbol conversion function.
  • FIG. 31 A diagram showing an example of an international phoneme symbol conversion procedure.
  • FIG. 32 A diagram showing an example of a conversion dictionary between Japanese phonemes and international phoneme symbols.
  • FIG. 33 A diagram showing an example of conversion from international phonemes to Japanese phonemes.
  • FIG. 34 A diagram showing an example of conversion from phonemes to phoneme segments.
  • FIG. 35 A diagram showing an example of conversion from phoneme segments to phonemes.
  • FIG. 36 A diagram showing an example of a search procedure for international phoneme symbols.
  • FIG. 37 A diagram showing an example of an international phoneme symbol search procedure.
  • FIG. 38 A diagram showing an example of an international phoneme symbol search procedure.
  • As in the basic apparatus configuration example of FIG. 1, the apparatus according to the present invention includes an information processing unit 10, a storage unit 20, an information input unit 30, an information output unit 40, and a communication line unit 50. The apparatus may have a built-in display device such as a TV or monitor, or the display may be held externally.
  • The communication line unit 50 is configured to communicate with other information processing apparatuses, whether wired or wireless, and to perform mutual communication and control with them; for example, devices using the present invention may search for, browse, and provide information to one another via communication lines.
  • The communication line unit 50 has a function of acquiring and distributing arbitrary information; more specifically, it is configured by combining devices such as Ethernet (registered trademark), ATM (Asynchronous Transfer Mode), Fibre Channel, wireless LAN, and infrared communication as required, and can use any communication protocol such as IP, TCP, UDP, and IEEE 802.
  • The information input unit 30 includes devices capable of inputting information, such as a keyboard, a pointing device, a moving-image capture device, a television broadcast receiving circuit, and a microphone input; it has the function of saving input to the storage unit according to instructions and outputting information to the information output unit based on the processing and instructions of the information processing unit.
  • The information input unit 30 may also be combined, as necessary, with terminals connecting to other input devices such as motion capture devices, cameras, RFID readers, barcode readers, image scanners, switch panels, OCR devices, card readers, and the sensors described later.
  • The information output unit 40 is configured by devices capable of outputting information, such as an image display device and speaker output; quantized information is stored in and reproduced from the storage unit according to instructions from the information processing unit, or information is output by the processing or instructions of the information processing unit.
  • The information output unit 40 may also include other output devices such as a printer, an arbitrary actuator, a modeling device, or a milling machine, combined with connecting terminals as necessary; for example, a poster may be printed by outputting information based on search results, or a resin product may be fabricated.
  • The information processing unit 10 is configured by an arithmetic circuit based on electronic circuits such as a CPU, and processes information acquired from the information input unit 30 and the storage unit 20. The processed results are stored in the storage unit 20, reproduced, processed, and output to the information output unit 40 or the storage unit 20, or transmitted to and received from other information processing apparatuses via the communication line unit 50 for information exchange and distribution. Further, as shown in FIG. 1, the information processing unit 10 may be configured by program module code realizing the various processes necessary for search, and by dedicated electronic circuits for executing them.
  • The information processing unit 10 is generally composed of combinations of DSPs, reconfigurable processors, FPGAs, ASICs, and the like, and the storage unit 20 is known to be composed of RAM, ROM, flash memory, hard disks, optical disks, removable disks, and the like. [0101] The information processing unit 10 evaluates the degree of coincidence between a search condition, consisting of feature quantities and identifiers, and the index information, and performs the search. It comprises: a co-occurrence information learning unit 104 for learning the co-occurrence information obtained from the feature quantities, search conditions, and search results; a dictionary extraction unit 106 for extracting conversion-target information from the dictionary information storage unit; an index information generation unit 108 that determines identifiers from the extracted feature quantities by recognition processing and performs indexing; an index symbol string synthesis unit 110 that synthesizes index information for content information; a control unit 112 that controls each functional unit; a meta-symbol extraction unit 114 that extracts the necessary index information from content information in the manner of MPEG7, acquires markup-language information such as RSS and XML from the communication line unit, or acquires EPG information based on broadcast waves received from the information input unit, and extracts arbitrary symbol information and attributes; a content information extraction unit 116 that extracts, as feature quantities processable by the information processing device, features from natural information obtained from outside via the information input unit and from the video, images, and voices acquired from the communication line unit and the storage unit; an identifier feature quantity conversion unit 118 for converting identifiers obtained by recognition, or obtained externally through storage media or communication, into standard feature quantities; a feature quantity identifier conversion unit 120 for converting feature quantities obtained from content information and user input into identifiers; and an evaluation list output unit 122 that outputs an evaluation list as a search result, performing search, detection, and indexing in combinations according to need.
  • Content information may include: music based on audio information; meta information attached to the content; documents and program information based on text information, such as EPG and BML; musical scales as score information; general still images and moving images; visual information that may include polygon data and vector data as 3D information, texture data, motion data, and still images and moving images based on visualized numerical data; and content information for advertising purposes, as well as auditory information, text information, and sensor information.
  • A "position" may be a chronological position, coordinate information in the display, the reading position in a text, or the recording order or identification-number order of a chart, and co-occurrence information may be composed from a neighborhood that may be spatio-temporal coordinates based on positions and coordinates calculated from visual and auditory information.
  • The storage unit 20 includes an information recording/accumulation unit 22 for accumulating and recording each piece of information under the control of the information processing unit 10. The information recording/accumulation unit 22 may be configured using, for example, a semiconductor storage device such as RAM or flash memory, or an external hard disk, optical disk, or magnetic disk via an arbitrary interface, and the storage unit may also be configured with a replaceable storage medium.
  • Specifically, the storage unit 20 includes: a content information storage unit 202 that stores the moving images, still images, audio, and documents to be searched; an evaluation function storage unit 204 that stores recognition templates for evaluation functions related to identifiers, such as HMMs, Bayesian discriminant functions, or arbitrary distance functions; an index information storage unit 206 that stores identifiers and arbitrary symbol strings as indexes for searching content information; a feature quantity storage unit 208 that stores the feature quantity information extracted from content information; a program storage unit 210 that stores the program module code and parameters for realizing the various processes necessary for search; a co-occurrence learning storage unit 212 that stores HMMs and evaluation functions, such as recognized identifier recognition templates and identifier recognition templates relearned using the present invention; and a dictionary information storage unit 214 that stores dictionary information consisting of conversion table information for mutually converting an arbitrary identifier or feature quantity and another arbitrary identifier or feature quantity.
  • The target content information is described in more detail in "Content information examples", the feature quantities and identifiers used in "Feature quantity and identifier examples", and the dictionaries used for mutual conversion of identifiers and feature quantities in "Dictionary configuration examples". To use the information processing device 1 as a search device, the following are generally required: a step of inputting content information into the device and indexing it; a step of constructing, based on user input, a query identifier string (query) used for the search; a step of referring to the index based on the query identifier string and narrowing down the search results; and a step of listing the search results. The functions required for these are described in detail in "Basic index processing examples" and "Basic search processing examples", and the procedure for learning the co-occurrence state of shared index information is described in detail in "Co-occurrence state learning processing examples".
• As described in detail in “Procedure examples of information processing devices used in terminals and base stations”, an arbitrary processing unit or storage unit may be divided between a server and a client connected by communication, and by exchanging information between the server and the client, indexing, detection, and arbitrary processing associated with detection may be performed.
• The basic operation (processing procedure) of the indexing means will now be outlined according to the operation flow shown in the figure.
• First, natural information is input, such as video or audio based on content information, text information input by the user, index information related to content information, extracted meta information, extracted text information, program information received from outside, and sensor information.
• The natural information is auditory information, visual information, or sensor information. It is obtained as content information or advertisement information from an external device connected to the information input unit 30, from an external information distribution device via the communication line unit 50, or as content information acquired from an exchangeable external storage medium, and is stored in the content information storage unit 202 or, for advertisement information, in the advertisement information storage unit 216.
• The feature quantity extraction process extracts feature quantities from the input natural information. For example, when speech is input, processing such as an FFT is performed; when an image is input, a feature value is extracted by quantizing colors within the color space. Note that the feature extraction method can take various forms as described below, so the choice may depend on the implementation.
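• The following is a minimal Python sketch of the two feature extractors just mentioned, assuming 16 kHz mono audio frames and 8-bit RGB images; all names are illustrative and implementation-dependent, as noted above:

```python
import numpy as np

def audio_features(frame: np.ndarray) -> np.ndarray:
    """Log power spectrum of one windowed audio frame, obtained via FFT."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    return np.log(spectrum + 1e-10)

def color_features(image: np.ndarray, levels: int = 6) -> np.ndarray:
    """Pixel histogram over a quantized RGB space (6*6*6 = 216 web colors)."""
    q = (image.astype(np.uint32) * levels) // 256    # quantize each channel to 0..5
    bins = q[..., 0] * levels * levels + q[..., 1] * levels + q[..., 2]
    return np.bincount(bins.ravel(), minlength=levels ** 3)
```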
• Next, the feature quantity identifier conversion unit 120 supplies the extracted feature quantity to a plurality of evaluation functions in order to evaluate which specific identifier, among the identifiers in the same field, it corresponds to, and an identifier generation process is performed by a feature quantity identifier conversion process that selects the identifier with the highest similarity (step S0203). The feature quantity identifier conversion process used for the identifier generation process will be described later.
• It is also possible to execute an identifier generation process (step S0203) that directly uses, as an identifier, a character string of meta information attached to the content, or character information that is program information such as BML or EPG, without using an evaluation function; or one that converts a character string into an ID by using a dictionary function consisting of the dictionary information storage unit 214 and the dictionary extraction unit 106 and uses that ID as an identifier.
• Identifiers in the same field are, in the case of phoneme recognition, for example, the vowels, consonants, and silences among phoneme identifiers, which can be classified into identifiers such as “a/i/u/e/o”; about 30 types of phoneme identifiers are generally known for Japanese.
• These identifiers differ depending on the purpose: in order to recognize a plurality of different kinds of information such as phonemes, phoneme pieces, images, faces, musical instruments, environmental sounds, figures, and actions, classification is performed according to the field of recognition by extracting feature quantities.
• Next, the index information generation unit 108 executes indexing processing on the content information in time series to generate an index (step S0204).
• The indexing process may record in association with each other not only the identifiers and feature quantities that can be acquired from audio and video, but also, as described above, character information input by the user, index information and meta information related to the content information, extracted character information, and program information, sensor information, other content information, advertisement information, and the like received from outside.
• The result is recorded in the database (step S0205a), the MPEG file is modified (step S0205b), and the index information is recorded (step S0205c).
• Then the evaluation function process is executed (step S0302).
• The evaluation function process evaluates the likelihood of the input feature quantity using an evaluation function such as a distance function. It is then determined whether all target evaluation functions have been evaluated for the feature quantity (step S0303); if evaluation functions remain, the evaluation function process is executed for the remaining functions (step S0303; No → step S0302).
• When all evaluations with the target evaluation functions are complete (step S0303; Yes), the identifier with the highest likelihood is selected from the evaluation results (step S0304). By executing the symbol identifier output step (step S0305), which outputs the selected identifier, an optimum identifier can be obtained as the evaluation result of a plurality of evaluation functions.
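• As a minimal sketch of this loop (steps S0302–S0305), the following fragment scores one feature vector with every registered evaluation function and outputs the identifier of the best score; diagonal-Gaussian log-likelihoods stand in here for whatever evaluation functions (HMM, Bayes, distance) an implementation actually uses:

```python
import numpy as np

def log_likelihood(x, mean, var):
    """Diagonal-Gaussian log-likelihood used as one stand-in evaluation function."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

def to_identifier(x, templates):
    """templates: {identifier: (mean, var)}. Returns the most likely identifier."""
    scores = {ident: log_likelihood(x, m, v) for ident, (m, v) in templates.items()}
    return max(scores, key=scores.get)   # step S0304: pick the highest likelihood
```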
• Index information can be recorded by combining the above methods and identifiers and storing the occurrence time and disappearance time of each identifier in relation to the time axis of the content information and the scene name, as shown in Fig. 4.
• As shown in Fig. 5, whistle sounds, explosion sounds, and utterance phonemes are indexed according to environmental sound recognition of sounds generated in a scene, together with changes in the image.
• An index for specifying a position in the content information can also be configured using character information input by the user as described above, index information related to the content information, character information extracted from meta information, and program information and sensor information received from outside.
• The feature quantities extracted in the feature quantity extraction process (step S0202) and the identifiers created from those feature quantities in the identifier generation process (step S0203) are acquired for both video and audio, and indexed.
• The identifiers obtained by indexing (step S0204) are shown in the figure.
• In the recording steps (S0205a, S0205b, S0205c), the phoneme symbols and phoneme recognition feature values are recorded in the row of the phoneme identification type item in association with the time axis information of the content information: the phoneme identifier is associated with the phoneme symbol, and the feature value for phoneme recognition is associated with the time axis information of the content information.
• In the recording steps (S0205a, S0205b, S0205c), “index co-occurrence information” is recorded as index information of a positional neighborhood in the content information, based on the plurality of identifiers and feature amounts associated with recognition feature extraction.
• This index co-occurrence information can be generated and recorded for use in the learning described later.
• Index information can be realized by describing these items as a text string; when modifying the MPEG file (step S0205b), the index symbol synthesis unit 110 may combine the index information into the meta information description area extracted from the MPEG file by the meta symbol extraction unit 114.
• The index information need not consist of character string information; it may be a numeric ID with a one-to-one relationship, such as a character ID, or an ASCII code converted from a character string.
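• A minimal sketch of such an index record and its text-string serialization follows, assuming each entry stores an identifier with its occurrence and disappearance times on the content time axis (the field layout is illustrative):

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    identifier: str   # e.g. phoneme "a", emotion "joy", a color ID
    onset: float      # occurrence time in seconds on the content time axis
    offset: float     # disappearance time in seconds

def serialize(entries) -> str:
    """One tab-separated record per line, ready for a meta information area."""
    return "\n".join(f"{e.identifier}\t{e.onset:.2f}\t{e.offset:.2f}" for e in entries)
```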
  • FIG. 6 is a diagram showing a basic processing procedure of index co-occurrence state learning processing.
• “Index co-occurrence information” is index information of positional neighbors configured based on a plurality of identifiers and feature quantities.
• First, an index of auditory information is extracted by a phoneme identifier consisting of phoneme symbols based on the phonemes recorded for each frame by the indexing means (step S0601).
• Next, an index of visual information is extracted by extracting the color identifier from the feature values of the image data of the same frame as the detected phoneme (step S0602).
• An emotion information index is then extracted based on the emotion identifier obtained by emotion recognition in the same frame (step S0603).
• A co-occurrence matrix (FIG. 8) for each frame constituting the co-occurrence information is constructed based on each piece of extracted index information (step S0604).
• In this way, “index co-occurrence information” is obtained as index information of a positional neighborhood composed of a plurality of identifiers and feature quantities.
• Index co-occurrence information may be configured around the boundary values of 14 Hz, 27 Hz, 55 Hz, and 110 Hz at which humans perceive continuity.
• Character co-occurrence information may also be included in the index co-occurrence information.
• Next, learning processing (step S0605) is executed based on the “index co-occurrence information” formed from the identifiers and feature quantities of positional neighbors, using the co-occurrence matrix formed from the index information in step S0604.
• As an example of the neighborhood, the feature values and identifiers used for learning are aggregated (steps S0605a and S0605b): for a moving image of 30 frames per second, aggregation may be performed every 90 frames (3 seconds) or at another predetermined interval; aggregation may continue until a statistical test shows the distance from the past average exceeding a threshold; or the range of information detected by a known detection technique may be held constant. Steps S0605c and S0605d are then executed.
• As a result, the evaluation function is generated and reconstructed, and the generated and reconstructed evaluation function is stored as learning information in the co-occurrence learning storage unit 212 (step S0606).
• In step S0605, the co-occurrence information of the identifiers is first totaled for each frame (step S0605a).
• The time width over which co-occurrence information is counted is a predetermined number of frames or length of time; for example, inter-frame co-occurrence information is generated by adding the co-occurrence information of the identifiers every 90 frames (3 seconds) (step S0605b).
• Next, a covariance matrix is generated from the inter-frame co-occurrence information, and the eigenvalues and eigenvectors of the co-occurrence matrix are calculated from the covariance matrix to generate learning information (step S0605c).
• Finally, a standard template of the evaluation function is generated as the learning result (step S0605d).
  • An evaluation function is constructed by executing these processes.
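• A minimal sketch of steps S0605a–S0605d follows, assuming per-frame identifier count vectors (such as the 250-element phoneme/emotion/color vocabulary described later) at 30 frames per second:

```python
import numpy as np

def learn_cooccurrence(frame_counts: np.ndarray, window: int = 90) -> dict:
    """frame_counts: (num_frames, num_identifiers) identifier counts per frame."""
    # S0605a/S0605b: aggregate the counts over fixed windows (90 frames = 3 s)
    n = (len(frame_counts) // window) * window
    agg = frame_counts[:n].reshape(-1, window, frame_counts.shape[1]).sum(axis=1)
    # S0605c: covariance of the aggregated co-occurrence vectors, then eigenpairs
    cov = np.cov(agg, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric matrix, so eigh suffices
    # S0605d: the mean/covariance pair serves as a standard template
    return {"mean": agg.mean(axis=0), "cov": cov,
            "eigvals": eigvals, "eigvecs": eigvecs}
```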
• The frame width to be aggregated and the time length of one frame can be specified arbitrarily depending on the device configuration; co-occurrence information may be configured around the boundary values of 14 Hz, 27 Hz, 55 Hz, and 110 Hz at which humans perceive continuity, and the aggregated inter-frame information may be used as the “index co-occurrence information”.
• The standard template (function parameters) of the configured evaluation function is stored on the storage medium so that it can be reused (step S0606).
• Specifically, the evaluation function and related data generated in step S0605d are stored in the co-occurrence learning storage unit 212.
• By using the evaluation function configured in this way for the feature quantity identifier conversion in step S0203 of the indexing flow and performing the indexing procedure, the evaluation function based on co-occurrence information can be used for indexing content information.
• The co-occurrence information based on the index information used for this learning will now be described specifically with reference to the figures.
• For the identifiers, there are 30 phonemes (5 vowels, 24 consonants, 1 silence), 4 emotions (joy, anger, sorrow, pleasure), and the 216 Web Colors (also called “web-safe colors” or “browser common colors”), whose identifiers indicate the number of display pixels of each color; combining these yields a 250-element-by-250-element co-occurrence matrix and covariance matrix.
• In this configuration, sensor information based on sensor inputs associated with the content in time series may be used as necessary, so that terms are included in the co-occurrence matrix according to the type of sensor information.
• Items may also be added to the co-occurrence matrix according to index information related to the content or character information in the meta information, and character information may be used to designate the name of a standard pattern of an evaluation function consisting of co-occurrence information when set as a search condition.
• Fig. 8 shows an example of co-occurrence information. The same elements are entered on the horizontal axis and the vertical axis, and the number of appearances relating to image and sound within a frame of the moving image is entered at each intersection.
• The number of occurrences indicates how many times an identifier appears in a frame; it is a count of how many arbitrary phoneme, pixel, and emotion identifiers occur within a short time frame.
• In this example, the matrix contains “0” for the co-occurrence of the emotion “joy” and the vowel “a”, and “6” for the appearance frequency of red as an image identifier together with the emotion “joy”. Since these values are extracted from content information and the number of identifiers recognized per frame is not necessarily constant, the occurrence counts may be normalized for each type of identifier to obtain probability values, and a probability transition matrix between frames may be constructed based on the occurrence probabilities.
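• A minimal sketch of building such a per-frame co-occurrence matrix and normalizing it into probability values, assuming each frame's observations are given as an {identifier: count} mapping over one shared vocabulary (phonemes, emotions, colors):

```python
import numpy as np

def cooccurrence_matrix(frame_obs, vocab):
    """frame_obs: iterable of per-frame {identifier: count} dicts."""
    idx = {ident: i for i, ident in enumerate(vocab)}
    mat = np.zeros((len(vocab), len(vocab)))
    for obs in frame_obs:
        for a, ca in obs.items():
            for b, cb in obs.items():
                mat[idx[a], idx[b]] += ca * cb   # identifiers seen in the same frame
    return mat

def normalize_rows(mat):
    """Optional normalization into probability values, as suggested above."""
    sums = mat.sum(axis=1, keepdims=True)
    return np.divide(mat, sums, out=np.zeros_like(mat), where=sums > 0)
```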
• Fig. 9 shows an example of a covariance matrix of emotion features, phoneme features, and video features.
• The horizontal axis and the vertical axis carry the names of the respective feature quantities, and each entry shows how the feature quantities acquired over several frames of a moving image lasting several seconds scatter around the average over the whole span.
• The emotion features indicate how much each of the four emotions varies; for phonemes and images, the distance evaluation results indicate how far each distance deviates from the average.
• For example, the covariance of the fourth emotion parameter and the first emotion parameter is “0.42”, and the correlation between the first video parameter and the first emotion parameter is “0.32”; since these values are extracted from content information, they are not always constant.
• The present invention is characterized in that, from co-occurrence conditions specified by a person for search, co-occurrence information detected during indexing, and co-occurrence information frequently used by users as search results, it constructs a co-occurrence matrix based on identifiers of different natures, co-occurrence probabilities based on that matrix, a covariance matrix based on features of different natures, and an evaluation function for searching, and performs search and detection with them.
• The standard pattern is extracted by learning feature quantities based on input of natural information whose identifier is specified in advance, and the evaluation function is configured from the extracted standard pattern.
• The standard pattern may also be extracted using identifiers configured by self-organization through multivariate analysis.
• The obtained standard pattern is stored in the evaluation function storage unit 204 as necessary, and the standard pattern dictionary information is stored in the dictionary information storage unit 214 as association information for mutually converting identifiers and standard patterns.
• The standard pattern is used in combination with the evaluation function to identify identifiers. It consists of the mean and variance of a population of feature quantities attributed to a specific identifier, against which sample feature quantities of unspecified identifier are evaluated; the evaluation function evaluates, for example, the Euclidean distance or Mahalanobis distance. It is sometimes called a standard template, standard parameters, or evaluation function parameters.
• The standard pattern is a parameter that can be generated by methods such as multivariate analysis using the input feature values; based on it, any identifier evaluation function such as an HMM, Bayes discriminant function, Mahalanobis distance, or Euclidean distance may be used. It is generally known that the parameters constituting these evaluation functions are configured by mathematical methods such as multivariate analysis, so the extraction method and learning method depend on the implementation.
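• A minimal sketch of one such distance-based evaluation function follows, assuming a standard pattern of population mean and covariance per identifier; the classifier attributes a feature vector to the identifier at the smallest Mahalanobis distance:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def classify(x, patterns):
    """patterns: {identifier: (mean, cov)} learned from labeled populations."""
    return min(patterns, key=lambda k: mahalanobis(x, *patterns[k]))
```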
• Multivariate analysis is performed using the evaluation functions, and classification is performed by self-organization.
• Each classified evaluation function may be manually given a name and an identifier, or a character string contained in the content information that co-occurs with high probability with the evaluation function obtained by multivariate analysis may be given as the function name or identifier, so that the evaluation function can be used for search and detection by the user specifying the evaluation function name.
• The identifier information recorded in association may be the symbol of a phoneme or phoneme piece; a designation or name, identifier, or identifier string given to the population used to construct an identifier evaluation function; or the representative feature average itself. Not only phonemes and phoneme pieces but also the separately described identifiers, features, and combinations of images, sounds, and emotions may be used.
• Character information input by the user, index information and meta information related to the content, character information extracted from meta information, and program information and sensor information received from outside may also be used.
• When, at an arbitrary position of advertisement information indexed by the same method and of the indexed content information, the identifiers and feature quantities of the content information and those of the advertisement information are evaluated as similar by the above-described evaluation functions, a step of associating the advertisement may be executed during indexing; or, only while playback of the content information is paused, an arbitrary advertisement, or the advertisements associated with the evaluation function, may be played back. These evaluation functions may also be reconstructed using the “Example of identifier reconstruction” and the identifier learning with search, detection, and indexing described later.
• Meta information or EPG information recorded in the content can also be used as identifiers: by acquiring the identifiers used for the index as shown in Fig. 7, an evaluation function that evaluates the co-occurrence state can be constructed and searched.
• Processing for acquiring program information such as EPG and BML may be added, and the co-occurrence state may be configured and indexed using the EPG and BML program information acquired during broadcasting.
• The EPG acquired as character information in step S0701 is used directly as an identifier of program information, and the other identifiers and features are obtained in steps S0601 to S0603.
• Using the program information as an identifier, the co-occurrence matrices of Fig. 8 and Fig. 9 are constructed with the feature quantities of the other identifiers that have a co-occurrence relationship within the same program information.
• The name of an evaluation function may also be indexed by associating character information with program information, using the character string of the program field based on the character information and the program information.
• Then, based on the acquired co-occurrence information, the learning process from step S0703 to step S0705, corresponding to step S0605, is performed, and an evaluation function for constructing an identifier can be constructed; if necessary, the content may be indexed again using the acquired function.
• When a search condition such as image capture, speech, or character string input is input (step S1001), query generation processing is executed based on the input search condition (step S1002), and a query is generated.
• If the input is speech, a phoneme sequence is generated by phoneme recognition, or a phoneme-piece sequence by phoneme-piece recognition, of the user's utterance, and a query is generated based on the phoneme sequence or its conversion into a phoneme-piece string sequence; if the input is image capture, a query based on image recognition is generated. In this way a search condition is generated by each recognition method.
• Search conditions are configured using identifiers acquired by multiple recognition methods for input character strings, visual information, and auditory information, and search is performed based on the co-occurrence relationships between the specified identifiers and character strings.
• “Search condition co-occurrence information” is constructed in the same way as the “index co-occurrence information”, and can be used in queries for similarity evaluation against the “index co-occurrence information” of the present invention.
• The input character strings and the identifiers obtained from the respective recognition results are converted into, or associated with, character strings and identifiers by the dictionary function based on the dictionary information storage unit and the dictionary extraction unit.
• “Search condition co-occurrence information” can thus be configured based on the co-occurrence relationship between “information recognized from the search conditions” and “information entered as search conditions”, or between “information recognized from the search conditions” and “information related to the information entered as search conditions”, and used as a query; it can also be used as the “index co-occurrence information” used for the learning of the present invention.
• Character information input by the user, index information related to the content, character information extracted from meta information, program information received from outside, and sensor information may also be used.
• Character strings indicating emotion identifiers and image identifiers, and symbol strings such as phoneme strings and phoneme-piece strings, may be entered by text input, menu selection, or voice input; they can be converted into identifiers, feature quantities, or identifier strings for search, and used to specify a position in the content information.
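• A minimal sketch of converting an input character string into a query identifier string through such a conversion dictionary follows; the dictionary entries below are illustrative assumptions, not part of the stored dictionary information:

```python
# Assumed dictionary mapping designations to phoneme strings.
phoneme_dictionary = {
    "explosion": ["b", "o", "k", "a", "a", "n"],
    "whistle":   ["p", "i", "i"],
}

def query_from_text(text: str):
    """Look up each word; unknown words fall back to the raw string itself."""
    return [p for word in text.lower().split()
              for p in phoneme_dictionary.get(word, [word])]

print(query_from_text("explosion"))   # ['b', 'o', 'k', 'a', 'a', 'n']
```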
• The search is performed repeatedly over the content information to be searched, and a search process that evaluates the match between the index and the query is executed for all the content information (step S1003).
• As the search process executes, the “index co-occurrence information” based on the identifiers or feature amounts of the content information to be searched is compared with the “search condition co-occurrence information”, and search results are obtained.
• The match between “index co-occurrence information” and “search condition co-occurrence information” may be evaluated by DP or a distance function; each piece of co-occurrence information may be evaluated by an evaluation function and the similarity, identity, and degree of matching of the acquired identifiers compared by match and distance evaluation; or, instead of evaluating all identifiers and feature quantities, similarity, identity, and degree of coincidence may be compared by comparing selected feature quantities.
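• A minimal sketch of DP-based match evaluation between a query identifier string and an index identifier string, using plain edit distance (CDP and identifier-specific distance functions would refine this):

```python
def dp_match(query, index_seq):
    """Edit distance between two identifier strings; smaller = better match."""
    m, n = len(query), len(index_seq)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if query[i - 1] == index_seq[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]
```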
• Based on the acquired search results, the degree of matching of the search evaluation results is evaluated and the results are ranked (step S1004). An evaluation result list display process (step S1005) is then executed to create and display an evaluation result list based on the ranked search evaluation results.
• At this time, advertisement information in the storage unit may be displayed to the user, an advertisement obtained through the communication line may be presented, or the advertisement content associated during the earlier indexing may be acquired from the storage unit or the communication line unit and presented to the user.
• As shown in Fig. 13, a step of acquiring content in a time-sharing manner (step S1301), a step of confirming completion of content acquisition (step S1302), a step of indexing while extracting features and generating identifiers from the acquired content (step S1303), and a step of comparing the “search condition co-occurrence information” with the “index co-occurrence information” and detecting matching points (step S1304) are executed; step S1305 branches according to the detection, and an arbitrary process (step S1306) may be executed as described later, such as starting recording, switching channels, notification, e-mail delivery, or changing a robot's operation.
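• A minimal sketch of this time-sharing detection loop (steps S1301–S1306), with all four callbacks assumed to be supplied by the implementation:

```python
def detect_loop(acquire_chunk, build_index, matches_condition, on_detect):
    while True:
        chunk = acquire_chunk()        # S1301: acquire content in a time-sharing way
        if chunk is None:              # S1302: acquisition finished
            break
        index = build_index(chunk)     # S1303: extract features, generate identifiers
        if matches_condition(index):   # S1304: compare with the search condition
            on_detect(chunk)           # S1305/S1306: branch into recording, channel
                                       # switching, notification, e-mail delivery, ...
```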
• For the “index co-occurrence information” applied to the content by the above-described method, a dictionary is consulted based on an input character string given as a search condition, and the string is converted into the feature quantities and identifiers related to character strings for search; the phoneme sequence and phoneme-piece sequence generated from input speech, and other features and identifiers, are used for direct search, or the dictionary is consulted based on the phoneme sequence and phoneme-piece sequence generated from the input speech.
• By comparing the index given by the “index co-occurrence information” for the content information with the search condition given by the “search condition co-occurrence information” input by the user, it is possible to search for and find places where the “search condition co-occurrence information” and the “index co-occurrence information” match, and to specify the position on the time axis, the position on the display screen, or the position in reading aloud within the content.
• As the matching evaluation method for the search evaluation results, methods using an HMM or Bayes discriminant function with probabilities and distances, attribution to clustered populations by multivariate analysis, and symbol string matching methods such as DP and CDP are well known; more details are given in “Examples of methods for evaluating matching between feature quantities and identifier strings”.
• An identifier used in a query generated from an input character string, input speech, or input image is converted into a feature quantity by the identifier feature amount conversion process executed by the identifier feature amount conversion unit 118. This identifier feature quantity conversion process will be described below.
• First, target symbol extraction processing is executed (step S1102).
• The target symbol extraction process selectively extracts, for an input identifier (or identifier string), the feature quantities associated with the identifier using dictionary information, in order to convert the identifier into feature quantities.
• Next, it is determined whether identifier subdivision, which divides phonemes into phoneme pieces as necessary, is required (step S1103). If further subdivision is judged necessary (step S1103; Yes), for example when the identifier is a phoneme, symbol subdivision processing is executed (step S1104), and the target symbol extraction processing is executed again after the subdivision.
• If it is determined that subdivision is not necessary (step S1103; No), the feature values are output based on the selected feature quantities so that the distance between feature values can be evaluated according to the identifier (step S1105).
• Since the identifier feature quantity conversion process described above is executed, an input identifier or identifier string is converted into feature quantities, and a search based on feature quantities becomes possible.
• By adding to the normal search procedure a process of learning the co-occurrence state of the search conditions (step S1202), a process of learning the co-occurrence state of the search results (step S1206), and a process of learning the co-occurrence state of the search results selected by the user (step S1209), it becomes possible to learn co-occurrence information associated with searches according to the user's intent and usage, and an evaluation function for indexing can be configured to suit the user.
• First, search conditions are input by the user through voice input, character string input, or image input (step S1201). Then, based on the input character string as search condition information, or on the feature quantities and identifiers obtained from the phoneme string or phoneme-piece string of an utterance or from an image, co-occurrence information of related features and identifiers extracted from the dictionary information storage unit 214 by the dictionary information extraction unit 106 is acquired (step S1202). An evaluation function is constructed based on the learned co-occurrence information, and the evaluation function is stored (step S1203).
• For example, when search processing is selected via the command dictionary by the keyword “search”, the phoneme string “b/o/k/a/a/a/a/n/n/n/n/n” is set as the search condition phoneme string, and the identifier of an explosion sound evaluation function configured by collecting explosion sound features, together with the image feature “an image in which the area of the warm color system increases in time series”, are set as search conditions, so that a co-occurrence state of multiple identifiers and feature quantities can be configured.
• In addition, by setting the similar explosion onomatopoeia “Dokaan” as “d/o/k/a/a/a/a/n”, the search conditions may be configured so that a search can be performed using related identifiers, feature quantities, or identifier strings derived from the associated imitative sound; these may also be converted into feature quantities, identifiers, or identifier strings by different recognition methods and added as search conditions.
• In this way, “search condition co-occurrence information” similar to the “index co-occurrence information” is constructed based on the phoneme sequence, the identifier of the explosion sound evaluation function, and the image feature, and the search conditions can be learned by configuring an evaluation function according to the procedure of the “co-occurrence state learning process” described above.
• The feature value that the warm color system spreads can be measured by evaluating the time-series increase in the screen area occupied by warm reds and yellows.
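• A minimal sketch of that measurement, assuming 8-bit RGB frames and a crude warm-color test (the thresholds are illustrative):

```python
import numpy as np

def warm_area(frame: np.ndarray) -> int:
    """Number of pixels judged warm (reddish/yellowish) in one RGB frame."""
    r, b = frame[..., 0].astype(int), frame[..., 2].astype(int)
    return int(np.count_nonzero((r > 150) & (b < 100)))

def warm_growth(frames) -> float:
    """Positive when the warm-colored area grows over the frame sequence."""
    areas = [warm_area(f) for f in frames]
    return float(np.polyfit(np.arange(len(areas)), areas, 1)[0])   # fitted slope
```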
• If the character string, phoneme sequence, or phoneme-piece sequence input in step S1201 is stored in the dictionary information storage unit 214, it may be converted into another identifier or feature amount based on the information extracted from the dictionary information storage unit 214 by the dictionary extraction unit 106 and then used as co-occurrence information for learning; or the identifier feature amount conversion unit 118 may be used to convert an identifier into a feature amount for use in a search.
• Next, a search based on the co-occurrence information specified as the search condition described above is executed (step S1204), and search results with a high degree of matching with the search condition are acquired. Then, for example, for target scenes obtained from the content information as search results with a matching rate exceeding 80%, the co-occurrence information based on the feature amounts and identifiers used in the attached index information is acquired from the search results, and step S1206 is executed. The learned co-occurrence information is then stored.
• Next, when a search result is selected by the user (hereinafter, the search result selected by the user is referred to as the “selected search result”) (step S1208; Yes), the co-occurrence information of the search result is learned based on the selected search result (step S1209).
• The evaluation function is reconstructed based on the co-occurrence information learned in step S1209 and stored (step S1210).
• When the user again selects a search result to use from among the search results (step S1211; Yes), the process is executed again from step S1209.
• The content information searched in this way may fall within a range classified as one content genre or category: a single image as content, a photo collection of images, a piece of music, the chorus part of a piece of music, a movie or video work, a scene in a work, or the common image or sound features of works in a specific field. Since search results can be acquired based on the co-occurrence tendencies of specific identifiers and feature quantities in such content information, content scene search and title search by user instruction become possible.
• In step S1212, it is selected whether the user wishes to input a search again, that is, whether to end the process.
• When the operation of inputting a search condition again is performed (step S1212; No), the process transitions to step S1201 and is executed.
• When an operation indicating that no search condition is to be input is performed (step S1212; Yes), the process ends.
• As described above, identifiers and feature quantities determined by evaluation functions based on the co-occurrence information of the configured plurality of feature quantities and plurality of identifiers are recorded in association with the content information, and search and detection are performed. It becomes possible to search for more complex hobbies, preferences, and interests using phonemes, phoneme pieces, and/or emotions, and/or other identifiers, and/or the co-occurrence states of their features, improving the convenience of information retrieval.
• Co-occurrence information is constructed from the phoneme symbol strings, phoneme-piece symbol strings, and various identifier strings acquired as search results and search conditions, as in the indexing described above; match evaluation can then be performed, and a search for content information with high similarity to the search conditions can be executed.
• Combined index information configured in the same manner as the above-described indexing may also be built using the co-occurrence information of the selected color identifiers.
• Identifiers such as emotion, musical scale, musical instrument sound, and environmental sound, and/or features obtained from video, may be used.
• The identifiers may be constructed by analyzing, classifying, and learning, with multivariate analysis techniques, identifiers such as shapes, colors, characters, and actions and the features associated with the identifiers described above and later; new identifiers may be constructed and used according to the implementation, as detailed in “Examples of identifier reconstruction”.
• The identification function information configured in this way can also be exchanged and distributed based on the “Example of information sharing procedure between users” and used to improve convenience; and, as detailed in “Examples of procedures for information processing devices used in terminals and base stations”, a server-client model may divide processing between servers and clients and exchange information between devices, so that equivalent services and infrastructure, search, indexing, and arbitrary processing associated with detection may be provided.
• For example, a temperature sensor may be attached to a surveillance camera or the like to detect ambient temperature changes and image feature changes: when an explosion occurs, the phoneme identifiers in the co-occurrence information described above, the feature of an increase in the number of warm-colored pixels in the screen, and the temperature rise added to the co-occurrence matrix as temperature sensor information are recorded together, so that the co-occurrence information associated with explosions can be learned, indexed, and searched.
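• A minimal sketch of appending such a sensor term to one frame's co-occurrence observations; the identifier name and temperature threshold are assumptions:

```python
def add_sensor_terms(frame_obs: dict, temperature_delta: float) -> dict:
    """frame_obs: {identifier: count} for one frame; adds a sensor identifier."""
    if temperature_delta > 5.0:              # an assumed rise worth recording
        frame_obs["sensor:temp-rise"] = 1    # co-occurs with phonemes, colors, etc.
    return frame_obs
```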
• Input may also come from multiple channels.
• By constructing features and identifiers that exploit stereo images and stereo audio using the input differences between channels, position and movement can be estimated; and the fact that certain events are related even when they differ in time series may be detected by evaluating the co-occurrence relationships of the identifiers and features of different channels over a time-series width of several seconds to several minutes or more.
• The first feature of the present invention is, in the indexing of content information described later, to perform various kinds of indexing, combine phoneme information, phoneme-piece information, and/or emotion information with auditory information, visual information, character information, program information, sensor information, and the like, learn based on the co-occurrence information of the attached indexes, and perform search based on that co-occurrence information. The second feature is that, as in the example of the search process of the present invention, a search by speech input, image input, or character string input is performed on the identifiers and feature quantities used in the present invention, using a dictionary that assigns phoneme strings and phoneme-piece strings based on each designation.
• As for phonemes and phoneme pieces, information indicating their continuous state, that is, how these elements change, may be treated as “continuous phonemes” and “continuous phoneme strings”; a “phoneme string” or “phoneme-piece string” refers to an information string in which these phonemes or phoneme pieces are arranged as symbols or identifiers.
• Various identifiers that can be expressed as “strings” can likewise be subjected to matching evaluation as identifier strings.
• As for each identifier recognition method, feature quantity extraction method, identifier string match evaluation method, information classification method, information learning method, communication transmission procedure, type of storage medium, type of communication medium, configuration of the information processing device, configuration of terminals and distribution base stations, shape of the device, size of the device, installation location of the device, and the sensors used in the device, devices may be combined arbitrarily as necessary and implemented as programs.
• Advertisements and promotions implemented based on the present invention may be combined arbitrarily with conventional inventions: fees may be changed depending on the frequency of access to an advertisement, the frequency of use of the content, and the quality, size, and duration of the advertisement; prizes may be provided through quizzes or questionnaires; and interactive advertising can be realized by statistically processing advertisement results relating to objects detected using the present invention.
• Using the search specification function to search for and specify content according to the search conditions, information matching the conditions distributed in real time can trigger e-mail delivery, switching to a channel that matches the conditions, starting recording or playback, having a robot or agent start an utterance, playing back the recorded content of another channel retroactively to the detection time, or changing device settings; it is also possible to construct a shortcut including a link to a detection result, or to present to the user contents aggregated using the detected information.
• The present invention performs indexing using the other feature quantities and identifiers described later; learns new identifiers and reconstructs identifiers according to the co-occurrence state of the indexes; sets search conditions using co-occurrence states; learns new identifiers and reconstructs existing identifiers based on search conditions specified by the user; learns new identifiers and reconstructs existing identifiers based on the co-occurrence states in the search results acquired according to the search conditions; and performs search and detection based on co-occurrence information combining the new identifiers and the reconstructed existing identifiers, and based on identifiers and feature values constructed by multivariate analysis and learning of a large amount of co-occurrence information.
• For the identifiers and feature quantities used in the present invention, phoneme strings, phoneme-piece strings, and emotion identifiers or their symbol strings based on their designations are used.
• A conversion dictionary and/or a co-occurrence dictionary is configured between input character strings given as search and detection conditions, identifiers associated with recognition of input voice and input images, internal IDs associated with feature quantities, nominal character strings, and the symbol strings of the phoneme strings and phoneme-piece strings used for recognition. After extracting identifiers and feature quantities based on the input speech, input character strings, and input images given as search and detection conditions, the necessary targets are selected using the identifier conversion dictionary, the co-occurrence dictionary, and an evaluation function based on the covariance matrix of the feature quantities; the input character strings given as search and detection conditions, the identifiers associated with recognition of input speech and input images, and the images associated with utterance phoneme sequences and/or utterance phoneme-piece sequences are then used for the search and detection described above.
• Identifiers and feature quantities can be stored and recorded in association with content by recording them together with time information in a dedicated database, by creating an index file as a separate file usable simultaneously with the video and audio information, by inserting them into an MPEG file or other video stream, or by updating the vacant area, comment area, or meta information description area of an MPEG file; they may also be broadcast using program information, or text broadcasting described in a markup language such as EPG or BML, then received by the user and stored on a storage medium by the methods described above, so that the index information according to the present invention can be used.
• Hereinafter, the following are described: “Examples of content information” for the target content information; “Examples of features and identifiers” for the feature quantities and identifiers usable for co-occurrence information; “Example of dictionary configuration” for converting identifiers and feature quantities into phoneme and phoneme-piece symbol strings and between identifiers; “Example of methods for converting natural information into feature quantities” for constructing dictionaries and converting content information into identifiers; “Example of methods for evaluating matching between feature quantities and identifier strings” for evaluating similarity to detect the target range in a search; and “Example of search method based on the present invention”.
• The contents are generally well known to include movies, dramas, photographs, news reports, illustrations, paintings, music, promotional videos, novels, magazines, games, papers, textbooks, dictionaries, books, comics, catalogs, posters, broadcast program information, and the like; in the present invention, public information, map information, product information, sales information, advertisement information, reservation status, viewing status, road status, information such as questionnaires, surveillance camera images, satellite photos, blogs, models, dolls, and robot camera and microphone inputs may also be included.
• Time-series changes in video, time-series changes in speech, text for which time-series changes in the reader's reading position are expected, electronic information in markup-language notation such as HTML, and the search indexes generated from them: for such information too, a plausible reading position may be interpreted as a time axis, and punctuation marks, sentences, and paragraphs may be treated as frames.
• Content information thus consists of natural information including: meta information attached to the content; EPG and BML as document and program information based on text information; musical scales as musical score information; general still images and moving images; visual information that may include polygon data and vector data as 3D information, texture data, motion data, and still images and moving images based on visualized numerical data; content information for advertising and promotional purposes; as well as auditory information, text information, and sensor information.
• The feature values and identifiers used in the present invention are defined mainly over auditory information, visual information, and sensor information as natural information.
• Phonemes, phoneme pieces, and emotions are associated with the auditory information, visual information, and sensor information, and search is performed by evaluating the co-occurrence state of such information.
• Evaluation functions such as HMMs are constructed from them. Then, based on the nominal phoneme sequence, nominal phoneme-piece sequence, character string ID, or numeric ID associated with an evaluation function, identifiers such as environmental sound identifiers, noise identifiers, and mechanical sound identifiers can be configured.
• Usable identifiers include landscape identifiers indicating urban areas, green spaces, coasts, mountains, deserts, weather, facial expressions, and how sunlight falls according to time and season; object identifiers indicating objects such as cars, people, faces, flowers, animals, and plants; image identifiers indicating image features such as brightness, color, and contour; motion identifiers indicating the movement speed of an object, changes in motion, and changes in state associated with behavior; and display position identifiers indicating the appearance positions of image-system identifiers within the image range. Feature values extracted from moving images and still images are classified into populations collected for each designation, and based on the classified populations, evaluation functions such as distance functions by multivariate analysis and HMMs by learning are constructed.
• Scene identifiers, object identifiers, and motion identifiers based on the feature quantities of moving images, and/or identifiers based on single moving images or still images, can thus be configured.
• As emotion information by simple recognition, general emotions such as joy, anger, sorrow, and pleasure based on facial expressions and voice tone can be used; words indicating emotions and mental states described in the literature on psychology may also be detected and recognized and used as identifiers.
• The identifiers and feature quantities are not limited to the frequency of appearance of colors in one frame, as in the previous embodiment, or to phonemes: identifiers and feature quantities spanning multiple frames; identifiers and feature quantities based on the transition information of identifiers and feature quantities across multiple frames; feature quantities and identifiers with coordinate information in the display screen; feature quantities and identifiers with coordinate information in a spatial coordinate system obtained by arithmetically estimating positions using visual and auditory information; feature quantities extracted in association with the time axis; the depth restored from detected feature quantities by arithmetic spatial calculation; or the depth as the coordinate information of 3D image information may all be used.
• Identifiers may be represented using identifiers that associate character string IDs or numeric IDs with evaluation functions based on features of speech, moving images, or still images.
• Usable identifiers, such as arbitrary character strings recognized by the user for speech, video, or still images, may be combined and used as identifier strings, and the evaluation values obtained when identifiers are recognized using the evaluation functions may be shared.
• Arbitrary combinations of evaluation values, such as occurrence probabilities, covariance matrices of feature values, HMM output probabilities, HMM transition probabilities, distance values of distance functions, and DP match evaluation values, can be used to reconstruct HMMs and evaluation functions, and identifier strings can be configured based on the chronological changes of identifiers.
• An identifier may be given to a population learned by self-organization associated with multivariate analysis to perform search, recognition, detection, and indexing; these identifiers may be used as search conditions, or used as feature quantities for evaluating the identifier of a population learned by combining feature quantities associated with the plurality of images, videos, and sounds used for the multivariate analysis.
• Hash values obtained arithmetically from the feature value average, variance, nominal character string, phoneme sequence, or phoneme-piece sequence associated with an arbitrary identifier may also be used for indexing.
• Phoneme durations may be classified into several types such as “long, medium, short”, and symbols or identifiers carrying length information, such as “identifier-long” and “identifier-short”, or carrying position information within the range of one phoneme, such as “phoneme-front” and “phoneme-rear”, may be used; new identifiers may also be constructed by combining those identifiers and symbols as symbol strings or identifier strings.
• The evaluation values output by the evaluation function over the interval in which an identifier continues may be used as the length information for classifying the identifiers described above, or as weight information associated with identifier lengths.
• Text information and character information using character strings can be combined with any document processing method, and can be realized by combining feature extraction methods for character strings as described in the patents and documents related to them; information evaluation methods that use co-occurrence states, such as the “Examples of search and optional processing with multiple identifiers and multiple search conditions” described later, can also be used.
• A symbol co-occurrence frequency or the like may be used as a feature amount, or an identifier associated with sentence analysis or recognition based on that feature amount may be used.
• Environmental sounds may be processed as onomatopoeia and evaluated based on recognition of phoneme sequences and phoneme-piece sequences, so that environmental sound features, environmental sound identifiers, sound effect identifiers, and phoneme and phoneme-piece features may be learned to construct new features or new identifiers as onomatopoeia features or onomatopoeia identifiers; person identifiers based on voice quality or its changes, or the models used for recognition with emotion identifiers, may also be used to improve the recognition rate.
• An identifier specified in an arbitrary protocol and the name of an article related to that identifier may be used in association with each other.
• For example, in the method called General MIDI, an ID and a musical instrument name are associated.
• Co-occurrence information, and the feature quantities and identifiers configured based on the co-occurrence state, concern any identifier, any feature quantity, or the distance information or probability output as a recognition result corresponding to any identifier.
• The information is based on having occurred simultaneously within a specified time range.
• The specified time range need not be a general expression of time: the co-occurrence range may be determined in units that take into account time-series transitions, such as the number of frames (fields) in a time-series moving image, the degree of deviation from the average of neighboring frames, or, for reading aloud, the number of characters, character position, number of words, number of sentences, number of paragraphs, and number of chapters and pages.
• The character information may include text information and program information.
• Solid (3D) information may be generated using 2.5D features and 3D features based on image features obtained from multiple pieces of imaging information; the degree of coincidence may then be evaluated against the generated 3D information, polygon information, and texture information by evaluating distances and the matching of 3D shapes in pseudo-3D and 3D image search, using the coordinate positions from the centroid of the pseudo-3D and 3D information and the eigenvalues and eigenvectors of the coordinate group.
• The scale type is scale information such as “do-re-mi-fa-sol-la-si-do”.
• The tempo, rhythm, chord information, and the like associated with the transition state of scale identifiers along the time axis may also be included.
• Recognition of the instrument type can be realized by learning the acoustic information of instruments collectively; according to the known literature, a recognition rate of over 90% is achieved for single tones.
• Recognition of environmental sound types can use frequency characteristics such as FFT, cepstrum, mel cepstrum, directional patterns, and formant extraction; volume characteristics; changes of those characteristics over time; sound characteristics based on differences in volume, phase, and frequency components of sound recorded at different positions; sound source position based on left-right phase differences and volume differences; and timbre based on frequency distribution characteristics and pitch transitions, for wave sounds, wind sounds, and the like. As with instrument identification, recognition is possible by applying evaluation functions collectively to each body of characteristic information, and the same approach can be applied to machine sound types.
  • examples include engine sounds and the exhaust sound of a steam locomotive, the sound of running on track, wind, animals and insects, birds, waves, trees, horns, screams, cries, weeping, and laughter.
  • identifiers may be based on information such as natural sounds, mechanical sounds, sounds produced by living things, explosion sounds, and so on; acoustic identifiers include scale identifiers, volume identifiers, tone identifiers, chord identifiers, and the like.
  • a sound-position or sound-source-direction identifier distinguishes whether sound is generated above, below, to the left, or to the right; an echo-state identifier distinguishes the size of a room based on the speed of indoor reflected sound.
  • a timbre identifier discriminates, for example, a trumpet from a piano; a machine-sound identifier covers machine sounds, engine sounds, tappet sounds, screw sounds, exhaust sounds, tool sounds, furniture sounds, flight sounds, and noise; a nature-sound identifier covers wind sounds, wave sounds, roaring sounds, and explosion sounds (environmental-sound or sound-effect identifiers); a speech identifier may comprise language identifiers, speech-speed identifiers, exclamation identifiers, cheer identifiers, and hoarseness identifiers. Combining features specific to each of these identifiers is conceivable.
  • image-type identifiers start from feature amounts such as contours based on luminance differentiation, hue differences, color density, or differences between them, including face types that recognize a human face from its shape.
  • motion types such as sign language, gestures, dance, and animal behavior may also be extracted.
  • since the numerical value at the upper right of the screen has a high co-occurrence rate with time-of-day caption images, feature amounts and identifiers can be configured from information indicating the position at which time information is detected, display-position types relating to the orientation of objects within the display range, and image identifiers.
  • examples include luminance identifiers, saturation identifiers, hue identifiers, contour identifiers, motion identifiers, image-position identifiers, speed identifiers, and moving-direction identifiers; object identifiers such as animal, plant, machine, tool, and furniture identifiers; person identifiers, material identifiers, sign identifiers, landscape identifiers, and shape identifiers; face identifiers, facial-expression identifiers, mouth-shape identifiers, clothing identifiers, hairstyle identifiers, skin identifiers, body identifiers, posture identifiers, and waveform-shape identifiers; and character identifiers such as language identifiers, font identifiers, character-size identifiers, and symbol types. Combining feature quantities specific to these identifiers is conceivable.
  • the present invention concerns indexing the co-occurrence states associated with image, voice, and emotion recognition results and using them for learning, search, and search-result presentation, including generating search conditions from the phonemes and phoneme segments corresponding to designated names; the individual recognition technologies themselves are not the subject of the invention.
  • further examples are character and symbol types, which can be distinguished by assigning meanings and sounds to symbols; sign types, whose meanings are discriminated from graphical symbols; shape types, which discriminate elements of the image features mentioned above, such as corners, curves, and contours; graphic-symbol types, which distinguish shapes and element combinations whose meanings are fixed to some extent; and EPG, text broadcasting, BML, and RSS for discriminating the contents of broadcast programs as program information.
  • the program content can likewise be obtained via BML.
  • a temperature sensor, a gas sensor, or a motion sensor may be added to the present invention to provide identifiers accompanying sensor input; the input information from those sensors may be classified by whether it poses a danger to human life, identifiers may be constructed from those classes, co-occurrence information associated with the related images and sounds may be collected, and the results may be used for protective evaluation of human safety by robots and for safety evaluation of the device itself.
  • a heart-rate sensor, brain-wave sensor, muscle-current sensor, and skin-resistance sensor may be combined to constitute a medical psychoanalysis apparatus.
  • location identifiers can be acquired from location information such as GPS and linked to perform searches or to learn co-occurrence states.
  • services and devices based on the co-occurrence state of feature quantities and identifiers may be configured using multi-layer Bayes classifiers, multi-layer HMMs, multi-layer neural networks, and the like for recognition, classification, discrimination, and evaluation.
  • identifiers may be constructed that respond only to specific sounds: a particular noise, particular instruments such as piano or drums, dogs and cats, the mechanical sounds of cars and factories, cheers, scales, and so on.
  • the same feature extraction and recognition processing is performed on video input from a device external to the apparatus of the present invention, identifying persons and facial expressions from faces; identifying articles, characters, figures, symbols, and signs from shapes and colors; and identifying movements from inter-frame differences and changes in sound-source position. These results may be recorded and used for indexing, and in the future indexing may extend to odor, taste, temperature, humidity, weight, hardness, viscosity, density, size, environment, chemical composition, and other physical properties.
  • the information processing section handles natural information obtained from the outside via the information input section, as well as music, documents, and musical-score information based on video, image, sound, and voice information acquired from the communication line section and the storage section.
  • there is a feature amount extraction unit that extracts feature amounts from content information the device can process, such as music, still images and moving images, polygon data and vector data, and numerical data, as well as content obtained from other information processing devices.
  • there is a co-occurrence information learning unit that learns the co-occurrence information obtained from feature quantities, search conditions, and search results.
  • there is an index information generation unit that performs indexing by determining identifiers from the extracted feature amounts through recognition processing; the index is made up of features and index identifiers.
  • there is an index search evaluation unit that evaluates the degree of matching between the search condition and the index information.
  • there is an evaluation list output unit that outputs the result as an evaluation list of search results.
  • there is a feature-value-to-identifier conversion unit that converts feature quantities acquired from content information and user input into identifiers, whether the identifiers are obtained from the outside through a storage medium or communication or extracted internally from the content.
  • there is an identifier-to-feature-quantity conversion unit that converts identifiers into feature quantities, a dictionary extraction unit that extracts conversion information from the dictionary information storage section, and index information for the content such as MPEG7 metadata.
  • there is a meta-symbol extraction unit that, after acquiring information from markup languages such as RSS and XML via the communication line section, or acquiring EPG, BML, RSS, and text-broadcasting information from the received broadcast wave, extracts the instructions, variables, and attributes in arbitrary symbol information; search, detection, and indexing may be performed by combining these units as necessary.
  • identifiers and feature quantities associated with content can be stored and recorded, together with time information, in a dedicated database or in an index file kept as a separate file and used together with the video and audio information.
  • for MPEG files and other video streams, the empty areas, comment areas, and meta-information description areas of the MPEG file may be updated, or the data may be received and saved using markup languages such as EPG, BML, RSS, and teletext as described above.
  • character strings extracted from BML, RSS, and websites related to the content, and the co-occurrence information of those character strings, can be combined arbitrarily to form an identification function or HMM, or to configure an identifier corresponding to the configured HMM or identification function.
  • using the distances, matching degrees, and HMM output probabilities produced as identification results as features, co-occurrence information can be learned as described in "Examples of identifier learning based on search, detection, and indexing" and "Example of identifier reconstruction".
  • learning or identifier construction may be performed in this way, or an arbitrary classification evaluation function may be configured by combining the various feature quantities mentioned above, classifying them by multivariate analysis, and assigning identifiers.
  • dictionary function for mutually converting the identifiers and feature quantities used in the present invention will be described using the dictionary information storage unit 214 of the storage unit 20 and the dictionary extraction unit 106 of the information processing unit 10.
  • these dictionaries can be implemented by general-purpose programs using well-known algorithms and structures such as hash tables, map structures, and databases; the dictionary information used by the dictionary function is stored on a storage medium. Relating groups of information through an index is generally well known and can be implemented by any published method, so the details depend on the implementation.
  • one method is based on a step of inputting an identifier and a step of selecting and outputting another identifier associated with that identifier; another is based on a step of inputting an identifier and a step of selecting and outputting the identification function associated with the input identifier.
  • an identifier is information for classifying information recognized by an evaluation function; an identifier string is information in which identifiers of the same system are arranged in time series; and co-occurrence information is preferably a collection of arbitrary identifiers that stand in a co-occurrence relationship.
  • these dictionaries are indexed by arbitrary keywords and IDs. More specific examples are composed of symbol identifiers, variables, and feature quantities, as in a control dictionary or a Japanese-phoneme to international-phoneme-symbol conversion dictionary; any combination of the above identifiers and features may also be used, as in a Japanese word-to-phoneme-sequence conversion dictionary, a motion-identifier-to-name phoneme-sequence conversion dictionary, or a face-image-identifier-to-name phoneme-sequence conversion dictionary.
  • the character string "Japanese” is converted to a phoneme string "n / i / h / 0 / n / g / o”
  • the action identifier caller string conversion dictionary If it is a face image identifier name phoneme sequence conversion dictionary, it is converted to an identifier indicating “nodding motion”, r u / n / a / z / u / k / uj, and V phoneme sequence symbol. Executed according to the identifier indicating “Taro's face” and conversion to the rt / a / r / o / uj t ⁇ ⁇ phoneme sequence symbol.
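A minimal sketch of such mutual-conversion dictionaries as plain hash maps, using the example entries from the text; the identifier names (MOTION_NOD, FACE_TARO) are hypothetical:

```python
# Conversion dictionaries as hash maps; each maps one kind of identifier
# to its associated phoneme string, as in the examples above.
word_to_phonemes = {"nihongo": "n/i/h/o/n/g/o"}       # word -> phonemes
motion_to_phonemes = {"MOTION_NOD": "u/n/a/z/u/k/u"}  # motion id -> name
face_to_phonemes = {"FACE_TARO": "t/a/r/o/u"}         # face id -> name

# Inverse dictionaries support many-to-one / one-to-many lookups.
phonemes_to_face = {v: k for k, v in face_to_phonemes.items()}

print(word_to_phonemes["nihongo"])    # n/i/h/o/n/g/o
print(phonemes_to_face["t/a/r/o/u"])  # FACE_TARO
```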
  • correlations that are one-to-one, one-to-many, or many-to-one are recorded and stored quantitatively, enabling processing such as identifier conversion; these dictionaries are composed of reference information groups for conversion. For many-to-one dictionaries, the dictionary information may be configured based on co-occurrence information, using eigenvalues and eigenvectors derived from that co-occurrence information.
  • a dictionary of evaluation functions and identifiers may be configured so that features and identifiers can be converted using phoneme strings, phoneme-segment strings, numeric IDs, and character-string IDs.
  • conversely, a dictionary may be configured that converts a phoneme sequence, phoneme-segment sequence, numeric ID, or character-string ID into an evaluation function or an identifier.
  • a dictionary associating identifiers with language-dependent words may be constructed for each of the feature quantities and identifiers described above, so a dictionary combining arbitrary identifiers and feature quantities may be constructed.
  • by associating abstract words, adverbs, adjectives, and unknown nouns with those words and phoneme strings and learning the co-occurrence state of their features, identifiers usable for retrieval can be rebuilt; such identifiers may be used as feature values, or hash values may be computed from the phoneme sequences or phoneme-segment sequences associated with these identifiers by an arithmetic process such as MD5 or CRC.
  • phoneme-string sequences and hash values are stored in association with each other to search the dictionary efficiently by the identifiers and feature values related to phonemes and phoneme segments, or to relate different identifiers, identifiers and feature values, phoneme sequences, phoneme-segment sequences, and identifiers and phoneme strings. A hashing sketch follows.
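A sketch of hashing a phoneme sequence with MD5 or CRC32 so the hash can serve as a compact dictionary key; the associated identifiers are illustrative:

```python
import hashlib
import zlib

def phoneme_hash(phoneme_seq: str) -> str:
    """MD5 hash of a phoneme sequence, usable as a compact dictionary key."""
    return hashlib.md5(phoneme_seq.encode("utf-8")).hexdigest()

def phoneme_crc(phoneme_seq: str) -> int:
    """CRC32 alternative when a short integer key is preferred."""
    return zlib.crc32(phoneme_seq.encode("utf-8"))

key = phoneme_hash("t/a/r/o/u")
index = {key: ["FACE_TARO", "VOICE_TARO"]}  # hash -> associated identifiers
print(index[phoneme_hash("t/a/r/o/u")])     # ['FACE_TARO', 'VOICE_TARO']
```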
  • using the indices associated with the dictionary, a dictionary can be constructed from the correlation between arbitrary features and identifiers extracted from video and audio; by combining dictionaries, identifiers of images, sounds, and emotions can be converted into phoneme sequences, phoneme-segment sequences, and character strings, and new dictionary information can be constructed as a conversion table that evaluates the co-occurrence state of identifiers and features related to images, sounds, and emotions.
  • a phoneme-string dictionary structure or a phoneme-segment-string dictionary structure may be used; because this depends on the combination of extracted features and identifiers, it is implementation-dependent.
  • an arbitrary word character string is selected and converted by the conversion dictionary into an arbitrary identifier or feature amount associated with that word.
  • a search may use an identifier evaluation function associated with an arbitrary phoneme sequence, phoneme-segment sequence, or identifier via the conversion dictionary, or speech may be searched directly using phoneme sequences, phoneme-segment sequences, or emotion identifiers.
  • a conversion dictionary can also be constructed from the co-occurrence information.
  • these dictionaries are not limited to phonemes or phoneme strings; they may be co-occurrence dictionaries configured from the co-occurrence states of the arbitrary identifiers and feature quantities described elsewhere.
  • conversion from an arbitrary name to co-occurrence information, and conversion to a phoneme string or phoneme-string sequence based on words in an arbitrary language associated with that co-occurrence information, may be used at any time.
  • users can have speech synthesized from recognized phoneme sequences and phoneme-segment sequences, search based on phoneme sequences and phoneme-segment sequences, and review the phonetic characters and words associated with them, or the device may ask the user to make the final decision.
  • feature extraction will be described based on the feature extraction program stored in the program storage section 210 of the storage section 20 and the feature quantity extraction unit 116 of the information processing section 10.
  • these feature-quantity extraction functions can be implemented by general-purpose programs using various known algorithms, so they basically depend on the implementation.
  • motion feature quantities are extracted from changes in image shape, moving images, and inter-frame differences, and can be combined with autocorrelation-coefficient extraction, higher-order autocorrelation extraction, and the like as feature quantities for video and images.
  • for audio, FFT features, cepstrum, mel-cepstrum, directional patterns, formant extraction, rhythm extraction, harmonics extraction, autocorrelation-coefficient extraction, higher-order autocorrelation extraction, frequency features, volume features, and the like can be used.
  • the changes of these features over time can also be extracted.
  • multi-order difference features may be used, such as frequency components, frequency distribution, volume, sound-source direction, differences between them, differences of differences, and the averages, variances, and standard deviations of these quantities or the exponent parts of their values; for images, color distribution, luminance distribution, hue distribution, the differential and integral values of color, luminance, and saturation, similarly analyzed RGB values, HSV values, Y/R-Y/B-Y values, and YCM values and their respective frequency-component distributions, as well as multi-order difference features such as differences in color, brightness, and frequency, differences of differences, and the averages, variances, standard deviations, and exponent parts of these values.
  • further examples include feature amounts based on the display position of an object within the image range of a recognized image-related identifier or image-related feature amount; for moving images, the time-axis transition of the image features mentioned above; and 3D image features such as 2.5D features, various 3D features restored from 2.5D features, 3D image coordinate information used in CG, 3D texture information, 3D motion information, 3D color-change information, 3D light-source-change information, and 3D hardness/texture information, together with arbitrary image recognition, 2.5D image feature extraction, and combinations of these feature-extraction methods. Identifiers recognized from these feature quantities, such as time information, weather information, season information, regional information, and cultural information, can also be used. A simple feature-extraction sketch follows.
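A hedged sketch of a few of the simpler image features listed above (mean color, luminance distribution, and a time-axis difference as a motion measure), using NumPy only; the bin count and the BT.601 luminance weights are our choices:

```python
import numpy as np

def frame_features(frame, prev_frame):
    """Simple per-frame image features: mean color, a luminance
    histogram, and an inter-frame difference as a motion measure."""
    rgb = frame.astype(float)
    luminance = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    hist, _ = np.histogram(luminance, bins=8, range=(0, 255))
    motion = np.abs(rgb - prev_frame.astype(float)).mean()
    return {
        "mean_rgb": rgb.reshape(-1, 3).mean(axis=0),
        "lum_hist": hist / hist.sum(),  # normalized luminance distribution
        "motion": float(motion),        # time-axis difference feature
    }

prev = np.zeros((16, 16, 3), dtype=np.uint8)
curr = np.full((16, 16, 3), 64, dtype=np.uint8)
print(frame_features(curr, prev)["motion"])  # 64.0
```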
  • conversion from feature quantities to identifiers will be described using the feature-quantity-to-identifier conversion program stored in the program storage section 210 and the feature-quantity-to-identifier conversion section 120 of the information processing section 10.
  • these conversion functions can be implemented by general-purpose programs using various known algorithms, so they basically depend on the implementation.
  • many methods have been proposed in the past: giving the feature amounts classified under the same identifier to an HMM and learning its transition and output probabilities for use as an evaluation function; obtaining the average, variance, and covariance matrix of the features classified under the same identifier, then finding the eigenvalues and eigenvectors to construct a distance function; using a Bayes discriminant function or the Mahalanobis distance between the centroid of the identifier's information group and the input sample; or simply using the Euclidean distance between the input sample and the average vector of the identifier group. Since these procedures depend on the implementation, any method can be used; a sketch follows.
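A sketch of the Mahalanobis-distance variant under stated assumptions: per-identifier mean and covariance are estimated from samples, and recognition picks the identifier with the smallest distance; the identifier names and data are synthetic:

```python
import numpy as np

class IdentifierEvaluator:
    """Per-identifier Gaussian statistics; classification picks the
    identifier whose Mahalanobis distance to the input is smallest."""

    def __init__(self):
        self.stats = {}  # identifier -> (mean, inverse covariance)

    def fit(self, identifier, samples):
        x = np.asarray(samples, dtype=float)
        mean = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
        self.stats[identifier] = (mean, np.linalg.inv(cov))

    def mahalanobis(self, identifier, v):
        mean, inv_cov = self.stats[identifier]
        d = np.asarray(v, dtype=float) - mean
        return float(np.sqrt(d @ inv_cov @ d))

    def recognize(self, v):
        return min(self.stats, key=lambda ident: self.mahalanobis(ident, v))

rng = np.random.default_rng(0)
ev = IdentifierEvaluator()
ev.fit("phoneme_a", rng.normal(0.0, 1.0, size=(100, 4)))
ev.fit("phoneme_i", rng.normal(3.0, 1.0, size=(100, 4)))
print(ev.recognize([2.9, 3.1, 3.0, 2.8]))  # -> phoneme_i
```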
  • recognition means selecting the correct identifier from among several candidates for an input feature amount, by comparing the evaluation function of each candidate identifier against the input and choosing the identifier with the minimum distance or maximum output probability. For an input feature quantity known in advance to belong to identifier X, similarity is evaluated with the evaluation functions of identifiers X, Y, and Z; if the best-scoring evaluation function is that of X, recognition of the identifier succeeded.
  • phonemes, phoneme segments, emotion identifiers, and other arbitrary identifiers are generally evaluated with an evaluation function that obtains a likelihood using the various distance and probability functions described above. These evaluations are segmented along the content's time axis and display positions and evaluated sequentially per segment to assign an identifier, or the time axis is divided into arbitrary unit times and each frame is evaluated sequentially to assign time-series identifiers; in this way the features used for indexing can be converted into identifiers.
  • one frame of FFT, cepstrum, mel-cepstrum, or directional-pattern data may be a vector of arbitrary dimension; for image features of moving and still images, one frame may be configured with an arbitrary pixel size, and inter-frame and inter-pixel error vectors may likewise be given in arbitrary dimensions. Since the method of obtaining the feature value at this point depends on the implementation, any method may be used.
  • distances that may be used include the Mahalanobis distance and the output of a Bayes discriminant function used as a distance; values derived from probabilities, such as the inverse of a probability, its natural logarithm, or the exponent part of such values; the city-block distance, chessboard distance, octagonal distance, and Minkowski distance; similarities and weighted variants of those distances; and distance calculations using combinations of eigenvalues and eigenvectors, eigenvalue and eigenvector norms, maximum eigencomponents, and the like. A sketch of the Minkowski family follows.
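For reference, the Minkowski distance unifies several of the distances named above: p = 1 gives the city-block distance, p = 2 the Euclidean distance, and the limit p to infinity the chessboard distance:

```python
import numpy as np

def minkowski(u, v, p):
    """Minkowski distance; p=1 is city-block, p=2 Euclidean, and
    p=inf the chessboard (maximum-coordinate) distance."""
    diff = np.abs(np.asarray(u, float) - np.asarray(v, float))
    return float(np.max(diff)) if p == np.inf else float((diff ** p).sum() ** (1 / p))

u, v = [0, 0], [3, 4]
print(minkowski(u, v, 1))       # 7.0  city block
print(minkowski(u, v, 2))       # 5.0  Euclidean
print(minkowski(u, v, np.inf))  # 4.0  chessboard
```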
  • output from a device with AD conversion, such as a microphone or camera sensor, is input.
  • depending on the identifier, features for speech include FFT, cepstrum, mel-cepstrum, and directional patterns; for images, the differential information of luminance and saturation, contour information, and difference information along the time axis.
  • the feature quantity is extracted by the method best suited to the identifier.
  • a feature evaluation step is performed using recognition based on Bayes classifiers, HMMs, and distance functions, and the step of selecting the most probable identifier (nearest in distance) is executed. By outputting the selected symbols and identifiers as recognition results, phoneme and phoneme-segment symbols, emotion identifiers, image identifiers, face IDs, recognized characters, environmental-sound IDs, mechanical-sound identifiers, landscape identifiers, scale identifiers, and the like are obtained and used for indexing.
  • these procedures are executed as a step in which an identifier is evaluated, selected, and output using a plurality of evaluation functions, via an evaluation-function processing step and a step of confirming that all evaluation functions have finished.
  • if the processor can handle analog values, they may be input directly; evaluation calculation and matching may then be performed on the analog values as they are, or digital values may be converted to analog values for the evaluation calculation.
  • identifiers and feature quantities of instrument types measure the distance from a population that collected the sounds of each instrument; likewise for engine sounds, exhaust sounds, door sounds, and so on.
  • identifiers and feature quantities of mechanical-sound types measure the distance from a population that collected the sounds of each mechanical sound, and identifiers and feature quantities of environmental-sound types measure the distance to a population collected for each environmental sound, such as wind, waves, birds, and animals.
  • an index based on image-related identifiers starts from the image type; to discriminate a person in the video, person types based on face type, clothes, and physique, along with gestures and facial expressions, can be used.
  • identifiers may also be used for indexing with sign types, with shape types for cars, ships, desks, telephones, and so on, or with graphic-symbol types such as toilet or emergency-exit pictograms.
  • as an indexing method, recognition can be run for every search, but once index information has been constructed it can be reused any number of times as long as the content does not change. Indexing may therefore be performed at any convenient time, such as when the content is first registered in storage, when it first becomes a search target, or when usage of the device itself is low; after indexing, the content information may be registered so that it can be handled by external devices.
  • this indexing is not limited to indexing recorded content by recognizing at an appropriate unit time (for example, every 16 milliseconds) at recording time; during a live broadcast program, the index information may be distributed in real time, indexing simultaneously with the broadcast.
  • the indexing device executes the audio/video input step (S0201), acquiring content information from outside.
  • the content acquired here is not limited to video and audio as described above; it may be arbitrary content information such as still images, document information, BML, EPG, recognized subtitles, and character strings contained in the video.
  • content information acquired from the information input unit 30, the communication line unit 50, or the storage unit via an exchangeable storage medium is converted into numerical data as feature values by the feature value extraction unit 116, which executes the feature value extraction step S0202.
  • the feature quantities used in this conversion step S0202 are the same as those described in "Examples of converting natural information into feature quantities", "Examples of feature quantities and identifiers", and "Prior art".
  • feature extraction methods have been proposed for still images, audio, and sentences.
  • feature extraction is organized by type: a still-image feature extraction unit and a moving-image feature extraction unit for visual information; an emotion feature extraction unit, phoneme feature extraction unit, and phoneme-segment feature extraction unit for auditory information; and a program information extraction unit for character information.
  • the feature quantity is extracted by the applicable feature extraction method.
  • for speech waveforms, cepstrum analysis yields phonemes or symbol strings of phoneme segments; any known feature-quantity extraction method may be used.
  • in step S0203, an identifier is determined and assigned by the feature-value-to-identifier conversion unit 120.
  • conventional methods, as in "Example of feature value and identifier" and "Conventional technology", can be used.
  • step S0204, which performs indexing by associating an arbitrary feature amount, or an identifier recognized from it, with the time series of the content information, is executed to construct the index information.
  • the configured index information is recorded into the MPEG information by the index symbol string synthesizing unit 110, either as an additional stream or as a change to existing MPEG7 information, or it is stored in the information recording/accumulating unit 22 as a separate file.
  • the user can then search it as needed. A sketch of the flow follows.
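A minimal sketch of the S0201 to S0204 flow; extract_features and to_identifier stand in for the feature value extraction unit 116 and the conversion unit 120, which are not specified here:

```python
def build_index(content_frames, extract_features, to_identifier):
    """Associate each time-series frame with a recognized identifier."""
    index = []
    for t, frame in enumerate(content_frames):   # S0201: input
        features = extract_features(frame)       # S0202: feature extraction
        identifier = to_identifier(features)     # S0203: identifier assignment
        index.append((t, identifier, features))  # S0204: index entry
    return index

# Toy usage: "frames" are numbers, features are parity, identifiers labels.
index = build_index(range(4), lambda f: f % 2,
                    lambda x: "even" if x == 0 else "odd")
print(index)  # [(0, 'even', 0), (1, 'odd', 1), (2, 'even', 0), (3, 'odd', 1)]
```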
  • a symbol string based on several types of feature quantities and identifiers is generated in association with the content and can be configured as "index co-occurrence information"; metadata using this "index co-occurrence information" can then be constructed as content-accompanying information.
  • co-occurrence information refers, for example, to emotional changes associated with audio, acoustic changes associated with video changes, emotional changes associated with video changes, and changes in subtitles, EPG, BML, RSS, and text broadcasting associated with video and audio changes.
  • the content is indexed by phonemes and/or phoneme segments and/or emotion identifiers, and similarly by other identifiers such as scales, environmental sounds, recognized character strings, and image identifiers.
  • the index is constructed from correlated changes in the content and is characterized by searching with it, or by learning the feature quantities extracted from search conditions and search results and constructing new identifiers.
  • in the step of learning the co-occurrence state, the content is indexed while the co-occurrence states of various identifiers and feature quantities are learned and autonomously classified by multivariate analysis, cluster analysis, and the like. Indexing may be performed per cluster, and the user may assign an arbitrary character string or phoneme/phoneme-segment string to each classified cluster for use in search.
  • a step of inputting a symbol string or identifier string requiring conversion is performed by the user or the device. For an ordinary input character string, a target extraction step is executed that extracts a phoneme string, phoneme-segment string, or arbitrary identifier via the conversion dictionary, based on the input word.
  • if necessary, an identifier segmentation process is executed that refines the obtained identifier, for example from phoneme to phoneme segment, or from image to image element.
  • an image element here is a partial element of an image: taking a face image as an example, the face image shows the entire face, while a face image element is a component of the face, such as the eyes, nose, or mouth; it is an element assigned an identifier based on the classification obtained when an arbitrary image tendency is separated into parts.
  • an identifier-average setting step is performed using the sample average value of the corresponding identifier, and a feature value constituted by that average is output. Because a feature value converted this way always represents the centroid of the identifier's population, its distance to the identifier centroid is 0 when given to the identifier's evaluation function, and it is always recognized correctly. [0262] Through this conversion, the distance between a feature quantity converted from an arbitrary identifier X and a feature quantity converted from an arbitrary identifier Y can be evaluated; since distance evaluation between identifiers is realized in a common feature space, a conversion dictionary between identifiers can also be constructed. A sketch follows.
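A sketch of identifier-to-feature conversion via the population mean, as described above; the populations and identifier names are synthetic:

```python
import numpy as np

# Converting an identifier back to a feature value by using the sample
# mean of its population: given to the identifier's own evaluation
# function, this mean has distance 0 to the identifier centroid.
populations = {
    "sea": np.array([[0.8, 0.10], [0.9, 0.20], [0.7, 0.15]]),
    "forest": np.array([[0.1, 0.90], [0.2, 0.80], [0.15, 0.85]]),
}

def identifier_to_feature(identifier):
    return populations[identifier].mean(axis=0)

# Distance between two identifiers via their mean feature vectors.
d = np.linalg.norm(identifier_to_feature("sea") - identifier_to_feature("forest"))
print(identifier_to_feature("sea"), round(d, 3))
```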
  • this applies to discriminators for arbitrary audio not tied to linguistic speech information, such as scales, environmental sounds, noise, laughter, and emotional characteristics obtainable from voice: for a scale identifier, the average feature of each note can be used; for an environmental-sound identifier, the average feature of each wave or wind sound; for an emotion identifier, the average value of the feature type associated with that emotion. These can be used for conversion into feature values.
  • an image identifier is selected through a conversion dictionary using a word such as "maru" (circle) or "batsu" (cross), converted into the phoneme or phoneme-segment sequence associated with the image identifier, and then converted into a speech feature value.
  • for example, one can search for the location where "maru" or "batsu" is uttered, or search for where a circle or cross is displayed using the image feature values associated with the "maru" or "batsu" image identifier; identifiers can thus serve searches with different goals.
  • an identifier sequence can be constructed by determining identifier transition probabilities according to spatial changes (front, back, left, right) of the image features, or by determining the time series of those features.
  • the feature identifier may be configured after converting a motion identifier into an identifier sequence with an optimal spatial and time-series arrangement.
  • a method using a distance function is well known for evaluating feature quantities.
  • feature quantities are composed of vectors.
  • the Euclidean distance between feature quantities is measured: for a first input vector and a second input vector obtained by the same feature extraction method, the cumulative value of the squared differences of each element in the feature vectors is obtained.
  • the distance between vectors can thus be measured by giving two vectors of the same dimensionality, produced by the same feature extraction method, to the distance function.
  • the distance between feature quantities is easily obtained by any known method, but by itself it cannot be used for match/mismatch evaluation of the identifiers related to those features, so the user must set an arbitrary threshold. For example, if the input feature value deviates by more than 3σ from the average and standard deviation of the samples classified into the same population, it is judged a mismatch; if less, a match. This makes it possible to determine whether a feature value matches an identifier, and to evaluate the match and similarity between index co-occurrence information and search-condition co-occurrence information. A sketch follows.
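A sketch of the 3σ match/mismatch rule using the Euclidean distance to the population mean; the exact threshold style (mean spread plus n·σ of the spread) is one possible reading:

```python
import numpy as np

def matches_population(v, samples, n_sigma=3.0):
    """Match/mismatch by the 3-sigma rule described above: the input
    matches the identifier if its distance to the population mean is
    within n_sigma standard deviations of the population's own spread."""
    x = np.asarray(samples, float)
    mean = x.mean(axis=0)
    dists = np.linalg.norm(x - mean, axis=1)  # spread of the samples
    d = np.linalg.norm(np.asarray(v, float) - mean)
    return d <= dists.mean() + n_sigma * dists.std()

samples = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.05, 1.0]]
print(matches_population([1.0, 1.05], samples))  # True  (close to the cluster)
print(matches_population([9.0, 9.0], samples))   # False (clearly deviates)
```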
  • DP matching is well known for evaluating match/mismatch between identifier strings, and it can locate an identifier sequence within a longer identifier sequence. More specifically, "a, a, a, a, b, b, b, b" and "a, a, a, a, a, b, b" match 100% in symbols and order, while "a, a, a, a, b, b, b, b" and "a, a, a, c, c, b, b, b" are estimated to match 75%. For matching evaluation of identifier strings, any matching function such as CDP, Shift-CDP, mp-CDP, RIF-CDP, or Self-applicative CDP can be used as needed.
  • if frames match, the local evaluation is "0"; if they do not, it is "1". If all frames match, the cumulative value is "0" and the degree of mismatch is 0%; if no frames match, the cumulative value equals the number of frames and the degree of mismatch is 100%.
  • because sample frame lengths vary, the difference in length can be corrected by dividing the cumulative distance resulting from DP matching by the sum of the two frame counts. A sketch follows.
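A sketch of DP matching with a 0/1 local cost and length normalization by the sum of both frame counts, as described above (CDP and its variants are not reproduced here):

```python
def dp_mismatch(seq_a, seq_b):
    """DP matching with 0/1 local cost (0 if frame identifiers match,
    1 otherwise); the cumulative distance is divided by the sum of both
    frame counts to correct for length differences (0.0 = perfect match)."""
    n, m = len(seq_a), len(seq_b)
    d = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m] / (n + m)

a = list("aaaabbbb")
b = list("aaaccbbb")      # two frames replaced by 'c'
print(dp_mismatch(a, a))  # 0.0   -> identical sequences
print(dp_mismatch(a, b))  # 0.125 -> two mismatching frames out of 16
```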
  • the identifier whose matching-function result distance is smallest (the smallest cumulative distance, hence the highest match rate) can be output as the recognition result.
  • indexing may group consecutive identical identifiers by detecting where the identifier changes between time-series frames.
  • the number of consecutive frames can be used as a weight in matching evaluation: if the difference between the weights of the same identifier is small, the identifiers are evaluated as matching; alternatively, the distance from the population centroid of the time-series identifier can serve as the feature amount.
  • a matching-score evaluation function can be constructed using the transitions of multiple identifier distances in time series, and identifier information may be reduced, for example, to one frame per 20 seconds or, conversely, refined to a much finer granularity.
  • the distances output from the distance evaluation function, and the mean distance over consecutive intervals, may be utilized in evaluating identifier boundaries.
  • time-series changes of identifiers are evaluated by a matching-degree procedure such as DP or CDP, and with the obtained evaluation values the degree of matching may be displayed on screen, ranked, presented as a list, or announced by speech synthesis.
  • the search device indexes various contents as described above.
  • this indexing may cover real-time distribution information such as TV broadcast programs, indexed at an appropriate unit time (for example, every 16 milliseconds) at recording time; alternatively, only the locations where changes occur may be recorded. The index information can be distributed via EPG, BML, RSS, teletext, and so on, or recorded in association with DVD files; for a text file, index information may be configured per word, sentence, section, or chapter. The indexed information is searched by converting the user input into identifiers matching those used for indexing.
  • the search device executes a speech/character-string input step to specify a search condition for the indexed content.
  • search conditions can be broadly classified into audio, character strings, and moving or still images.
  • in voice search, phonemes, phoneme segments, and emotion identifiers are recognized from the speech the user utters for the search, and a direct search is performed using the phoneme and phoneme-segment strings.
  • an identifier conversion dictionary may be consulted to include other feature quantities and identifiers associated with the phoneme strings and phoneme-string sequences in the search condition, and an instruction dictionary may be applied to the recognized phonemes and phoneme strings.
  • the search can then use the other feature values or identifiers associated with the phoneme sequences or phoneme-segment sequences, excluding any detected commands, and processing may take the user's emotion into account based on the recognized emotion identifiers.
  • search by character string is performed either by using the search string directly or by referring to the identifier conversion dictionary to obtain other feature quantities associated with the string.
  • the search character string may also be converted into a phoneme or phoneme-segment sequence using the identifier conversion dictionary before searching, and processing may take the user's emotion into account based on recognized emotion identifiers.
  • search by moving or still image recognizes the image identifier or motion identifier to be used from video, moving images, or still images captured by the user, and either executes the search directly with that identifier,
  • or includes other feature quantities or identifiers associated with the recognized image or motion identifier in the search condition,
  • or, excluding any detected commands, searches using the other feature quantities or identifiers associated with the image or motion identifier, converting the related identifiers into phoneme strings and phoneme-segment strings using the identifier conversion dictionary.
  • the search may then be performed, with processing optionally taking the user's emotion into account based on recognized emotion identifiers.
  • the common feature of these search-condition construction methods is that information not yet symbolized is first converted into identifiers, then converted via the identifier conversion dictionary into other associated identifiers; if necessary, an identifier is converted into its average feature value so that the search can use feature values.
  • for example, Taro's face image is presented and a voice search based on the recognized name finds scenes where Taro is called by someone; by adding the condition that the voice calling Taro has Hanako's voice quality, one can search for scenes where Hanako calls Taro.
  • the search conditions acquired here are information entered according to user instructions and use not only video and audio but also still images, document information, EPG, BML, RSS, and character broadcasting.
  • feature quantities and identifiers may be configured from them.
  • step S1001 is executed, in which a search condition suitable for the search is input, converting identifiers to feature quantities as in the "Example of a method for converting identifiers to feature quantities" above.
  • in step S1001, identifiers and feature quantities for the user-specified search condition are selected on the same basis as the content-information index, and the query generation step S1002 that configures the search condition is executed.
  • the various identifiers and feature quantities available for search may be combined, even when the search condition is given only as a general character string.
  • search-condition co-occurrence information, a search condition combining different modalities such as the visual and the auditory, is constructed and given to the search device.
  • suppose the device is instructed with the character string "search for sea images"; it contains the content string "sea" and the command string "image search".
  • excluding the command string, the image feature values associated with the string "sea" are used to construct search conditions from the co-occurrence information of color features and motion features, or from the co-occurrence information of color identifiers and motion identifiers, in order to detect "sea".
  • the search condition is configured by the evaluation function so constructed; if the index was built with a "sea" evaluation function, the search condition is configured by converting to the "sea" identifier. "Search-condition co-occurrence information" can be configured in this way. A query-generation sketch follows.
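A sketch of query generation (step S1002) for the "sea" example; the command-dictionary and conversion-dictionary contents are illustrative:

```python
COMMANDS = {"image search", "voice search"}
CONVERSION = {"sea": {"color_id": "blue_dominant", "motion_id": "wave_motion"}}

def build_query(tokens):
    """Split an instruction into command and condition terms, then expand
    each condition term into associated identifiers via the dictionary."""
    commands = [t for t in tokens if t in COMMANDS]
    conditions = {}
    for t in tokens:
        if t not in COMMANDS and t in CONVERSION:
            conditions.update(CONVERSION[t])
    return {"commands": commands, "conditions": conditions}

print(build_query(["sea", "image search"]))
# {'commands': ['image search'],
#  'conditions': {'color_id': 'blue_dominant', 'motion_id': 'wave_motion'}}
```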
  • the search condition may also be given by voice.
  • a command corresponding to the command utterance's phoneme sequence is registered in the command dictionary under "voice search", for when the user gives spoken instructions such as "voice search, explosive utterance".
  • excluding the command phoneme sequence, the phoneme and phoneme-segment sequence of the remaining part is used to detect and search, by conventional methods, utterances in the content with explosive pronunciation; utterances co-occurring with a feeling of sadness, such as "I don't like it!", may likewise be detected and searched for, and for a weekly serial drama,
  • the audio may be compared with the theme song, and if the degree of match is high, the scene may be evaluated as a highlight.
  • "search-condition co-occurrence information" is constructed from combinations of search conditions used at the same time. It can serve as a search condition for evaluating matches and similarities, or such information can be collected from multiple users, and an evaluation function can be constructed from the collected "search-condition co-occurrence information".
  • the index information is read from the information recording/accumulating unit of the storage unit, and the read index information and the given search-condition information are evaluated by DP, distance functions, and the like; based on the saved index information, a search following the "Example of how to evaluate matching between feature quantities and identifier strings" selects content and positions within content.
  • the search step (S1003) is thus executed.
  • for each identifier and feature quantity, frame portions with high similarity to the search condition and index portions with high similarity are detected, favoring locations where similarity is high across multiple identifiers and feature quantities.
  • similarity may be implemented by combining the similarity evaluation methods mentioned above, such as the degree of coincidence by DP, distance evaluation methods, and probability evaluation methods.
  • this evaluation may produce an evaluation list without any ranking; a list simply ranked by the maximum and minimum of the sum of each identifier's evaluation distance and evaluation probability; or, based on a logical expression such as an OR or AND expression, a list ranked by values selected by narrowing down or by values calculated according to the logical expression. An evaluation list based on values calculated from a logical expression corresponds, for example, to the condition "(blue or green) and large amount of motion" for a video, expressed as a function.
  • a distance evaluation function can be constructed, a probability function can be constructed from co-occurrence probabilities, or multiple pieces of co-occurrence information can be combined, enabling a search that evaluates similarity based on co-occurrence information. Since similarity is considered high when the distance value is small, or when the probability value is large, ranking over multiple identifiers and feature quantities can be realized as the evaluation of search results.
  • the blue feature in this example is the appearance frequency, over the entire screen, of pixels whose hue lies within ±15 degrees of blue; the blue feature average can be taken as the average of the blue features over all content in the archive, and the same applies to green and red. Any method can be used, as this depends on the implementation.
  • natural colors differ by season, and this can be taken into account.
  • the motion feature is based on the time-axis difference of the video, or it may be the magnitude of the motion vectors used in MPEG4 and similar codecs.
  • features may be based on image-change information occurring at arbitrary time intervals, such as ±15 frames from the current frame, and those features, which may be averaged over the archive, are optionally normalized and corrected.
  • the composition of evaluation formulas from these feature quantities depends on the implementation, so any combination can be used; one possible sketch follows.
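One possible sketch of the "(blue or green) and large amount of motion" condition: OR is taken as the maximum of the normalized color scores and AND as the minimum with the normalized motion score; the hue convention (degrees, blue at 240) and the normalization by archive averages are assumptions:

```python
import numpy as np

def blue_feature(hsv_frame, center_deg=240.0, width_deg=15.0):
    """Fraction of pixels whose hue lies within +/-15 degrees of blue,
    as in the example above (hue channel assumed in degrees, 0-360)."""
    hue = hsv_frame[..., 0]
    return float(np.mean(np.abs(((hue - center_deg + 180) % 360) - 180) <= width_deg))

def motion_feature(frame, prev_frame):
    """Mean absolute inter-frame difference as a simple motion measure."""
    return float(np.mean(np.abs(frame.astype(float) - prev_frame.astype(float))))

def rank_score(blue, green, motion, blue_avg, green_avg, motion_avg):
    """'(blue or green) and large motion': OR as the max of the normalized
    color scores, AND as the min with the normalized motion score."""
    color = max(blue / blue_avg, green / green_avg)
    return min(color, motion / motion_avg)

frame = np.zeros((8, 8, 3)); frame[..., 0] = 240  # all-blue HSV frame
print(blue_feature(frame))                         # 1.0
print(rank_score(1.0, 0.1, 0.8, blue_avg=0.3, green_avg=0.3, motion_avg=0.4))
# 2.0 -> both the color and the motion condition exceed their averages
```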
  • instead of color features alone, image recognition and speech recognition technology can be combined to configure an arbitrary evaluation function from the obtained face IDs, motion IDs, image IDs, and phoneme or phoneme-segment identifiers.
  • distances between identifiers can be evaluated using DP and the like as described above, and distances between features with an arbitrary distance function; similarity evaluation can likewise apply HMMs and distance functions, as detailed in the descriptions of identifiers, features, and their mutual conversion above. Performance can of course also be improved through efficient classification combining evaluation methods such as multi-layer Bayes classifiers and neural networks.
  • the search results obtained here are listed in descending order of similarity and presented to the user, with the similarity value displayed as a ranking index.
  • the browsing step (S1005) is executed: the search result list is output to the output unit and displayed on screen, or sent to the user terminal via the communication line unit and presented to the user.
  • a user processing-continuation confirmation step (S1006) then evaluates whether the user has requested another search.
  • identifiers may be learned using co-occurrence information from search results and auxiliary information associated with EPG, RSS, HTML, XML, BML, teletext, and so on; a service can be realized that takes an arbitrary configuration and executes searches by selectively using arbitrary identifiers or feature amounts.
  • the character string for the search can be acquired from the broadcast receiving unit, from the communication line unit connected to the Internet, or from recorded information in the storage unit, by any means such as XML, HTML, MPEG7, RSS, teletext, BML, or EPG; the search can then be performed by converting the feature quantities serving as the search index into identifier strings based on those character strings. This too may be realized as a service that executes searches by selectively using feature quantities: search conditions can be generated from the search string.
  • search by character string is implemented by selecting and using the identifiers and identifier features associated with an arbitrary character string, via the character-string-to-identifier conversion dictionaries and identifier-to-feature conversion dictionaries belonging to each feature extraction method.
  • for example, a content search can be performed by converting a performer's name into a phoneme or phoneme-segment sequence; or, from the word "action movie", the appearance frequency of explosion sounds in content classified as action movies can be determined, the average explosion-sound frequency over multiple action movies computed, an action-movie evaluation function configured, and content indexed and searched with that function. A sketch follows.
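A sketch of the action-movie example: the average explosion-sound appearance frequency over known action movies defines a simple evaluation function; all numbers are invented:

```python
import numpy as np

# Explosion-sound appearance rates (per minute) measured over several
# known action movies; their mean and spread define the evaluation function.
action_movie_rates = [0.8, 1.2, 0.9]
mean_rate = float(np.mean(action_movie_rates))
std_rate = float(np.std(action_movie_rates))

def action_movie_score(explosion_rate: float) -> float:
    """Higher when the content's explosion-sound frequency is close to
    the average learned from the action-movie population."""
    return float(np.exp(-((explosion_rate - mean_rate) ** 2)
                        / (2 * std_rate ** 2 + 1e-9)))

print(round(action_movie_score(1.0), 3))   # close to 1: action-movie-like
print(round(action_movie_score(0.05), 3))  # near 0: unlikely to be one
```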
  • a co-occurrence state of arbitrary feature amounts or identifiers can be derived from the search results.
  • this co-occurrence state can be configured using co-occurrence probabilities, co-occurrence matrices, and covariance matrices; for example, the co-occurrence information within the top 10 results with a matching rate of 70% or higher under certain conditions can be selected and used for learning. If co-occurrence information configured this way is viewed by the user many times, or used from outside many times via the information-sharing method described later, it is judged to be highly useful.
  • an evaluation function based on the co-occurrence state can then be constructed, and the co-occurrence learning storage unit and the evaluation function storage unit record the new identifier and feature co-occurrence information and evaluation functions.
  • the identification function can also be reconfigured using the bias toward "sea" feature quantities, as in the "Example of identifier reconstruction", and reflected in the learned co-occurrence information.
  • if an image has the horizon in the center, the lower half of the image will show more blue with wave motion, so an evaluation function can be constructed from the image features for "sea" and "coast".
  • a new function can be configured to remove excluded results from the previous search results, or to exploit cases where the co-occurrence probability is low for a given target identifier or condition yet high for another identifier.
  • a user interface for evaluating the search results may be provided so that users can improve the performance.
  • search by character string can be combined with the content title, and content attributes such as genre and director can be used.
  • search efficiency may be improved by combining such attributes, or an arbitrary name may be given to the co-occurrence state derived from the search conditions, identifiers, and features so that it can be reused for repeated searches, detection, and instructions.
  • the search conditions and search expressions can be exchanged and distributed via a communication line.
  • any markup-language information such as HTML, XML, RSS, and BML can be converted into the identifiers described above, such as phonemes, environmental-sound identifiers, and image identifiers, using the phoneme-symbol conversion dictionary in the dictionary information storage unit of the storage unit; arbitrary processing associated with search and detection can be performed, or the usage status can be recorded and identifiers re-learned using the co-occurrence information of frequently used search conditions according to the recorded results.
  • the identification functions configured as described above, the search results, and the co-occurrence state information in the search results can be made browsable and obtainable by other devices via the communication line, as in the "Examples of information sharing procedures between users", or on any site, and reused using technology such as P2P software.
  • any user may use the information with billing applied, or it may be sold on a storage medium.
  • the usage fee may vary with the accuracy and detail of the information, the speed of processing, the number of uses, the usage time, and so on; search results obtained using the present invention may themselves be charged for, priced differently, or encrypted to protect the value of that information.
  • co-occurrence state information, evaluation functions, and evaluation parameters are stored in the storage unit of the device, or acquired externally via the communication line unit as necessary; meta-information generated using the evaluation functions and identifiers may be presented to other users or sold.
  • the user inputs a detection condition that triggers an arbitrary process, in the same manner as a search condition.
  • the input may be audio, video information, a character string, an identifier obtained by the present invention, or a combination thereof.
  • the present invention executes the steps of configuring the co-occurrence state from the combination of feature quantities and identifiers and setting the detection condition, in the same procedure as when searching with the detection condition.
  • detection is triggered, for example, on the condition that the distance from the centroid of a specific identifier, identifier string, or feature quantity is within 1σ, that the probability of a specific identifier, identifier string, or feature quantity is 60% or more, or that the matching degree between identifier strings exceeds 60%. The 60% value reflects the fact that in phoneme recognition, emotion recognition, and image recognition, results of roughly 60% or more are generally considered practical; it may be changed to any rate depending on the user environment. If the recognition rate stays below 20% continuously, the current process may be stopped, or a flag may be set marking the content for fast-forward or deletion. A sketch follows.
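A sketch of the detection trigger with the 60% and 20% thresholds named above; the callback and flag are illustrative:

```python
def check_detection(match_rate: float, on_detect, flags: dict):
    """Fire the registered process above 60%; mark content for
    fast-forward or deletion when the rate stays under 20%."""
    if match_rate >= 0.60:
        on_detect()                     # execute the registered process
    elif match_rate < 0.20:
        flags["skip_or_delete"] = True  # candidate for fast-forward/delete

flags = {}
check_detection(0.72, lambda: print("detected: explosive utterance"), flags)
check_detection(0.12, lambda: None, flags)
print(flags)  # {'skip_or_delete': True}
```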
  • the information acquired from the broadcasting station, the network, or the imaging device is recognized, and it is detected whether the content information is what the user intended.
  • the input search condition obtains the cast name associated with the information the user entered by referring to EPG, BML, RSS, and teletext, and a phoneme and phoneme-segment search is executed by the method described above.
  • While recording constantly, recording may be kept retroactively for one hour from the location where the cast name is spoken; alternatively, the range to be deleted may be determined for each change detected via EPG, BML, RSS, or text broadcasting for each program, each CM period, or each change in screen features, or a boundary may be set in the content at each such change and used as an indicator the user can point to. The specified range is thus composed of multiple detection locations within the content, and it is classified and stored as either a storage target or a deletion target.
  • New identifiers can be learned from the co-occurrence states, namely "co-occurrence information based on search results", "co-occurrence information extracted by indexing", and "co-occurrence information based on user-specified detection conditions and/or search conditions", by defining identifier strings through probability evaluation functions based on co-occurrence probabilities, distance evaluation functions based on eigenvalues, classification using HMMs, and classification based on multivariate analysis together with construction of evaluation functions. To that end, the collecting step and the step of constructing the co-occurrence probabilities, the co-occurrence matrix, and the covariance matrix are executed (a minimal construction sketch follows below).
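  • The following sketch shows one plausible construction of the co-occurrence matrix and the feature covariance matrix with its eigenvalues; the data shapes and helper names are assumptions for illustration only:

```python
import numpy as np
from collections import Counter
from itertools import combinations

def cooccurrence_counts(frames):
    """frames: list of sets of identifier labels observed in the same frame."""
    counts = Counter()
    for ids in frames:
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return counts

def cooccurrence_probability(counts, pair, total_frames):
    """Fraction of frames in which the identifier pair co-occurred."""
    return counts.get(pair, 0) / total_frames

def covariance_and_eigen(feature_vectors):
    """Covariance over per-frame feature vectors; its eigenvalues and
    eigenvectors can parameterize a distance evaluation function."""
    X = np.asarray(feature_vectors)          # shape: (frames, dims)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return cov, eigvals, eigvecs

# e.g. frames = [{"anger", "phoneme:k/o/r/a"}, {"joy", "env:applause"}]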
  • The unit frame can be specified arbitrarily, according to the implementation and the user's definition. If fine granularity is required it can be set to one frame of video, such as 16 ms; conversely, a coarser time unit such as 3 seconds (180 frames) may also be used.
  • Frames in which statistically distant features are detected are also acquired in this step.
  • Co-occurrence information is constructed from the information thus obtained; learning is performed using an HMM or a covariance matrix, or a distance function is constructed, and the result is saved to the storage unit.
  • For the information presented as a search result, or the information specified as a search condition or detection condition, a step is performed that collects the co-occurrence information of the identifiers and feature quantities of the content selected by the user as a sample. The co-occurrence information of identifiers and feature quantities is then acquired from the collected samples.
  • There are various combinations of co-occurrence information, as described separately and in "Examples of indexing and searching with multiple identifiers and multiple search conditions, optional processing" described later.
  • A learning sample can be obtained by collecting the specified search conditions and detection conditions as samples, and an evaluation function can be constructed from those learning samples.
  • As a range-specification method for the co-occurrence matrix, any method can be used: one work or one program, a range in which arbitrary identifiers co-occur, or a range specified by segments based on the appearance of a specific identifier. Image features and speech features may be categorized by multivariate analysis and the appearance times of those features evaluated; an evaluation function may be created by constructing a co-occurrence matrix, co-occurrence probabilities, and a covariance matrix from the classified information; and the occurrence frequency of the identifiers obtained as an evaluation result, such as the appearance histogram of those identifiers per unit time, may be used to construct and evaluate scene features and evaluation functions.
  • For a search result extracted under a given search condition, identifiers and feature quantities other than the search condition that show a high co-occurrence probability (for example, 70% or more) or a short distance (for example, within 3σ of the distance average) can be taken as new targets for the learning method used in the co-occurrence information composition step; conversely, those with a low probability of belonging (for example, farther than 3σ) can be excluded, with the same learning method applied in either case.
  • The feature quantities used for identifier reconstruction are configured arbitrarily from values such as the output value of the evaluation function, the output probability of the HMM, and the similarity between identifier strings. A combination of vector co-occurrence states may be used as a covariance matrix, or a co-occurrence matrix of identifiers may be constructed.
  • A character string is assigned to the evaluation function, the function is stored in the storage unit, and a learning result is thereby obtained.
  • The character strings assigned to identifiers and features can be used as tag names in markup languages such as XML, or the assigned character strings themselves can be converted into identifier symbol strings such as phonemes and phoneme segments so as to support user voice input; an evaluation function associated with facial-expression identifiers, shape identifiers, motion identifiers, and the like can likewise be configured so as to respond to user video input.
  • If the distance evaluation result between the search condition and the center of gravity of the co-occurrence information is within 3σ, or the probability evaluation result is 80% or more, the co-occurrence information of the index in the target range of the selected content is treated as a co-occurrence matrix or co-occurrence probability, and a new evaluation function is constructed from the identifiers and features used in that index. The evaluation function may be a Bayes discriminant function, a Mahalanobis distance function, or the like (see the sketch below).
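  • A minimal sketch of a Mahalanobis-style distance evaluation function learned from co-occurring feature samples; the class shape and the use of a pseudo-inverse are assumptions, not the patent's specified implementation:

```python
import numpy as np

class MahalanobisEvaluator:
    """Distance evaluation function fitted to co-occurring feature samples."""
    def __init__(self, samples):
        X = np.asarray(samples)             # shape: (n_samples, n_dims)
        self.mean = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        self.inv_cov = np.linalg.pinv(cov)  # pseudo-inverse for numerical stability

    def distance(self, x):
        d = np.asarray(x) - self.mean
        return float(np.sqrt(d @ self.inv_cov @ d))
```

A candidate whose distance from the centroid falls within the 3σ criterion above (or whose probability under a Bayes discriminant exceeds 80%) would then be treated as a match.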
  • The features of the present invention, built on various identifier recognition and feature extraction methods of the conventional art, are: indexing based on co-occurrence information of emotion identifiers, phonemes, and phoneme segments with other sound identifiers and image identifiers, free of fixed frame-width and time-width specification; the range selection method; the identifier string matching method; search and detection using that indexing; recording and playback processing started by detection; learning of co-occurrence information during indexing; use of search results; and an identifier conversion dictionary that allows the new identifiers and new features obtained above to be specified as search conditions using phoneme strings and phoneme-segment string sequences.
  • The co-occurrence information obtained by associating identifiers and feature quantities may be used to construct a discriminant function, or the co-occurrence probabilities of identifiers may be combined in the following configurations: learning with feature quantities, learning with a feature-quantity covariance matrix, learning with identifier co-occurrence probabilities together with a feature-quantity covariance matrix, learning with the output of a distance function as a feature quantity, learning that uses the output probability of an HMM evaluating an identifier as a feature quantity, and learning that uses the transition probability of an HMM evaluating an identifier as a feature quantity. Such combinations may be given as HMM learning parameters, or a covariance matrix may be formed to derive eigenvalues and eigenvectors.
  • Co-occurrence information can be constructed by acquiring the emotion identifiers that occur around an utterance part; by learning the co-occurrence state of the anger emotion and the phoneme string [k/o/r/a], together with the feature quantities in that co-occurrence state, a new identifier such as "angry [k/o/r/a]" can be constructed.
  • The information used for learning by reconstruction may employ the DP matching rate, the ratio of emotion features and emotion identifiers, or the likelihood, probability, or distance based on the evaluation function of the phoneme string or phoneme-segment string and the evaluation function of the emotion identifier. In this case it is also possible, for example, to combine associations using feature-quantity extraction methods such as video features, image features, moving-image features, still-image features, scale features, and environmental-sound features, constructing facial-expression identifiers with emotions that include facial features.
  • The information range targeted for re-learning an identifier may be configured based on a specified boundary condition, such as an arbitrary user-specified time width.
  • Examples of identifier associations include: program information with display position, emotion, phonemes and phoneme segments, landscape images, text, environmental sound, scale and tempo, chords and chord progressions, facial-expression images, object images, and motion information; display position with emotion, phonemes and phoneme segments, landscape images, text, environmental sound, scale and tempo, chords and chord progressions, facial-expression images, object images, and motion information; emotion with phonemes and phoneme segments, landscape images, text, environmental sound, scale and tempo, chords and chord progressions, facial-expression images, object images, and motion information; phonemes and phoneme segments with landscape images, text, environmental sound, scale and tempo, chords and chord progressions, facial-expression images, object images, and motion information; landscape images with text, environmental sound, scale and tempo, chords and chord progressions, facial-expression images, object images, and motion information; and text with environmental sound, scale and tempo, chords and chord progressions, and so on.
  • When the feature quantities and identifiers input as search conditions show high similarity (for example, 80%) to the content obtained as a search result, the other identifiers and features associated with that same content but not specified in the search conditions are recorded in the co-occurrence information storage unit together with the specified search conditions. When the accumulation of information related through such co-occurrence states exceeds a certain value (for example, 1000 samples, or n times the number of evaluation dimensions), a co-occurrence matrix based on the co-occurrence information can be constructed, the covariance matrix and co-occurrence probabilities obtained, and learning performed with a distance evaluation function or HMM to reconstruct the evaluation function (see the sketch below).
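  • A sketch of this accumulate-then-relearn trigger; the threshold rule (1000, or a multiple of the evaluation dimensionality) follows the text, while the store layout and factor value are illustrative assumptions:

```python
class CooccurrenceStore:
    """Accumulates co-occurrence records until relearning is warranted."""
    def __init__(self, n_dims: int, factor: int = 10):
        self.records = []
        self.threshold = max(1000, factor * n_dims)

    def add(self, search_conditions, other_identifiers, features) -> bool:
        self.records.append((search_conditions, other_identifiers, features))
        return len(self.records) >= self.threshold  # True -> relearn now

store = CooccurrenceStore(n_dims=64)
if store.add(("piano",), ("joy",), [0.2, 0.7]):
    pass  # rebuild the covariance matrix / HMM / distance function here
```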
  • Information with large variance and information with low probability can be excluded, improving calculation efficiency by reducing the number of evaluation dimensions; for fixed phrases such as command-control words or specific words, the result can be used directly as a command.
  • Rather than a phoneme string or phoneme-segment string expanded from a character string, the recognized phoneme string or phoneme-segment string may be used in accordance with the user's affirmative or negative response to the device's recognition, and the evaluation function template for identification may be updated accordingly.
  • If, within a few seconds before or after the phoneme or phoneme-segment string recognized as "Waichi", an explosion sound is recognized as an environmental sound above a certain detection rate, the phoneme or phoneme segment of "Waichi" becomes a learning target as co-occurrence information; by reconstructing the evaluation function, a search for "explosive sound" will then also cover the phoneme string "Waichi".
  • Motion feature amounts are likewise used as learning targets for co-occurrence information. The co-occurrence state of "Waichi" phonemes and phoneme segments, "radial" motion features, and "explosive sound" environmental-sound identifiers and sound-effect identifiers constitutes an identification function for searching for explosion scenes (a toy detector is sketched below).
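  • As a toy illustration of such an identification function over co-occurring cues (the identifier labels, the cue list, and the scoring rule are invented for the example):

```python
def explosion_score(window_scores: dict, min_score: float = 0.6) -> float:
    """Fraction of the explosion-scene cues observed in one evaluation window.

    window_scores maps identifier labels to recognition confidences, e.g.
    {"env:explosive_sound": 0.9, "motion:radial": 0.7, "phoneme:w/a": 0.4}.
    """
    cues = ("env:explosive_sound", "motion:radial", "phoneme:w/a")
    hits = [c for c in cues if window_scores.get(c, 0.0) >= min_score]
    return len(hits) / len(cues)

# e.g. flag the window as an explosion scene when explosion_score(...) >= 2/3
```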
  • Using the co-occurrence matrix of identifiers as co-occurrence information, it is possible to determine co-occurrence probabilities from its configuration, or to determine eigenvalues and eigenvectors from the covariance matrix of feature vectors and construct a Bayes discriminant function or Mahalanobis distance. By then constructing a likelihood evaluation function for the content information to be searched, the presence or absence of a phoneme string such as "Waichi" can be evaluated when searching for a "sad scene".
  • An input search-condition character string containing emoticons such as "(1)" or "(;;)" can be treated as an emotion identification string such as "joy" or "sadness": it can be converted into emotion feature quantities and emotion identifiers via the character-string identifier conversion dictionary and used for the search.
  • For the likelihood evaluation function configured as described above, the parameters and templates of the evaluation function are stored in the co-occurrence learning storage unit and the evaluation function storage unit, and the relationships between specified character strings or words and the phoneme strings and phoneme-segment strings based on their utterance are registered in the dictionary unit. Search conditions may also be made available via communication lines so that the utility value of search conditions can be evaluated. The evaluation of utility value may use third-party usage frequency as a learning sample; "output probabilities of various identifiers" and/or "co-occurrence probabilities of various identifiers" and/or "transition probabilities of various identifiers" and/or "various feature quantities" may be combined into one set of feature quantities, evaluated via the covariance matrix to derive eigenvalues and eigenvectors, used to construct evaluation functions, or given to an HMM as features for training.
  • Identifiers and feature quantities whose divergence from the distance average, computed from identifiers and feature quantities with high co-occurrence probability arising during indexing and in frequently used search results, exceeds 3σ, or whose co-occurrence probability and/or appearance probability is particularly high relative to the average, may be used as distinguishing flags.
  • The dictionaries may be switched according to the co-occurrence state: the phoneme and phoneme-segment recognition dictionaries may be switched according to emotion recognition or according to changes in the recognized environmental sound, the image recognition dictionary for display objects may be switched using the recognized landscape image, or the phoneme and phoneme-segment recognition dictionaries may be switched according to the recognized image. Information based on such co-occurrence relationships can be treated as sensitivity (kansei) information and used to search content information.
  • The devices and terminals are configured as shown in FIG. 20: a user terminal, a distribution base station, a device such as a robot controlled by the terminal and the base station, and a remote controller to be controlled. A user of either a terminal or a base station speaks to the terminal, and the terminal or base station executes one of the following processing procedures for recognition.
  • In the first method, feature values are extracted from the speech of the utterance or from captured video images, and the feature values are transmitted to the target relay point or base station apparatus. The base station apparatus receiving the feature values generates a phoneme symbol string and/or phoneme-segment symbol string, an emotion symbol string, and other image identifiers from them, and then selects and executes the matching control means based on the generated symbol string.
  • The second method performs feature extraction from the speech of the utterance and the captured video image, generates in the terminal the identifiers that accompany recognition, such as the phoneme symbol string and/or phoneme-segment symbol string, emotion symbol string, and other image identifiers, and transmits the generated symbol string to the target relay point or base station apparatus. The controlled base station apparatus selects and executes the matching control means based on the received symbol string.
  • The third method performs feature extraction from the speech of the utterance and the captured video image, recognizes the phoneme string and/or phoneme-segment symbol string, emotion symbol string, and other image identifiers from the features generated in the terminal, selects the control content based on the recognized symbol strings, and transmits it to the base station apparatus or information relay apparatus that performs the control.
  • The fourth method transmits the speech waveform of the utterance or the captured video image as-is from the terminal to the controlling base station apparatus; the controlling apparatus recognizes the phoneme symbol string and/or phoneme-segment symbol string, emotion symbol string, and other image identifiers, selects a control means based on the recognized symbol string, and the selected control is executed at the relay point or base station apparatus. Similarly, emotion identifiers can be extracted from voice features and symbols, as can sound and video features and identifiers such as environmental sounds.
  • In short, the terminal may simply transmit only the waveform, transmit the feature amounts, transmit the recognized identifier string, or transmit the processing procedure such as the command or message associated with the identifier string (the four levels are sketched below).
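  • A small protocol sketch of the four terminal-to-base-station exchange levels just described; the enum and message shape are assumptions for illustration:

```python
from enum import Enum, auto

class Level(Enum):
    WAVEFORM = auto()      # fourth method: send raw audio/video as-is
    FEATURES = auto()      # first method: extract features, recognize remotely
    IDENTIFIERS = auto()   # second method: recognize locally, send symbol strings
    COMMAND = auto()       # third method: select the control content locally

def build_message(level: Level, waveform=None, features=None,
                  identifiers=None, command=None) -> dict:
    """Package exactly the payload that corresponds to the chosen level."""
    payload = {Level.WAVEFORM: waveform, Level.FEATURES: features,
               Level.IDENTIFIERS: identifiers, Level.COMMAND: command}[level]
    return {"level": level.name, "payload": payload}

# e.g. build_message(Level.IDENTIFIERS, identifiers=["k", "o", "r", "a"])
```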
  • The configuration shown in Fig. 21 is the transmission side and the configuration shown in Fig. 22 is the reception side; the two can also transmit and receive between each other.
  • An instruction dictionary for converting an input phoneme string or phoneme-segment string into an associated processing procedure can hold new control commands and media information, and can be used on either the terminal side or the distribution base station side.
  • A user provides a speech waveform to the terminal or device by uttering speech.
  • the terminal-side device analyzes the given voice and converts it into features.
  • The converted features are recognized and converted into identifiers using recognition technology combining HMM and Bayes methods.
  • The converted identifier means a phoneme, a phoneme segment, an emotion identifier, or one of various image identifiers; as described elsewhere, for audio it may be an identifier based on phonemes, environmental sounds, or musical scales, and otherwise an identifier based on images or actions. Based on the obtained identifier, the phoneme/phoneme-segment symbol string dictionary is consulted by DP matching to select an arbitrary processing procedure (see the sketch below), and the selected processing procedure is transmitted to the target device to execute control. It is therefore possible to use a mobile terminal as a remote control, or to control home appliances through a robot, and to smoothly detect the face, voice, and facial expression of the other party at the communication destination. The device may also be configured to display an emotion index or a rendering of utterances, or as an interactive device for communication with a disabled person, provided with a braille output unit or the like.
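  • A minimal sketch of DP matching a recognized phoneme string against a command dictionary; the dictionary entries and the reuse of the 60% criterion are illustrative assumptions:

```python
def edit_distance(a, b) -> int:
    """Classic dynamic-programming (DP) string alignment over symbol sequences."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

COMMANDS = {
    ("r", "o", "k", "u", "g", "a"): "start_recording",  # "rokuga" (record)
    ("s", "a", "i", "s", "e", "i"): "start_playback",   # "saisei" (play)
}

def select_command(phonemes):
    """Pick the closest dictionary entry; require a 60% normalized match."""
    query = tuple(phonemes)
    best = min(COMMANDS, key=lambda k: edit_distance(k, query))
    score = 1 - edit_distance(best, query) / max(len(best), len(query))
    return COMMANDS[best] if score >= 0.6 else None
```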
  • Information processed in such a procedure can be sent at any chosen conversion level: natural information such as video and audio can be transmitted as-is without conversion to feature values, transmitted after conversion to feature values, transmitted after conversion to identifiers, or transmitted after control information has been selected, and the receiving side can accept any of these states. The receiver is configured to process the received information and, based on the acquired information, send it on to the distribution station or control device, or perform arbitrary processing such as searching, recording, mail distribution, communication, machine control, and device control.
  • Identifier strings, character strings, and feature amounts formed into appropriate queries are transmitted to the distribution-side base station, and information matching the query is obtained.
  • The control dictionary configuration shown in Fig. 24 is used so that control items can be selected over the communication line when controlling by voice, even while advertisements are displayed during communication or search wait times; control dictionaries are exchanged as in the example. This control command dictionary is composed of phonemes, phoneme segments, emotion identifiers, and any other identifiers, feature quantities, and device control information as described above. It can be made reusable, and by updating or reconfiguring the dictionary information for a search that associates arbitrary identifiers with feature quantities, trendy search keywords can be kept up to date.
  • Infrared control information to be transmitted to a product controllable by a conventional infrared remote controller may be selected as the device control information, or a series of operations may be batch-processed by combining such control information. Depending on the CPU performance of the apparatus, the feature information may be transmitted to the information processing apparatus for voice-based control without recognizing the identifiers locally.
  • A server-client model may be introduced in this way: the server and client divide the processing steps arbitrarily, are connected by communication, and exchange arbitrary information, on which basis search and indexing may be implemented.
  • Information acquired by client terminals such as DVD recorders, network TVs, STBs, HDD recorders, music recording/playback devices, and video recording/playback devices from core servers at the communication destination can be transmitted via infrared communication, FM or VHF frequency-band communication, or IEEE 802-series wireless communication. Teletext can be used on a mobile terminal or mobile phone; the client terminal can be controlled by voice, character-string input, or gestures such as shaking the mobile terminal or phone; and the mobile terminal or phone may also be used as a general remote control for operating the client terminal.
  • In the environment shown in FIG. 20, the user selects the search condition formula constructed on his or her own device, together with the identifiers, feature quantities, and/or function parameters used in it, and via the communication line and/or a storage medium the search condition formula and/or identifiers and/or feature quantities and/or function parameters may be disclosed or released to any third party, or shared using P2P software. Combinations of search conditions, identifiers, feature quantities, and function parameters reflecting the preferences and values of celebrities, specialized magazines, and professionals can also be sold via communication lines or as magazine supplements.
  • When another party's search condition formulas and/or function parameters have been copied to a storage medium or downloaded via the communication line by the procedure shown in Fig. 25 and are used for indexing, those search condition formulas can be used on one's own device provided that the identifiers selected by the feature-quantity extraction method or the discriminant function have the same configuration. Measures should be taken to prevent viruses from being included in this distributed information.
  • A user who can acquire or convert information such as evaluation functions and search conditions related to a search can share them with others on other devices, and a search condition formula can be acquired by the same method.
  • An identifier co-occurrence matrix can be used to convert between identifiers based on co-occurrence information, such as the conversion between international phoneme symbols and language-dependent phoneme symbols described later, or to convert other identifiers into phoneme symbols; transformation in the information space using evaluation functions such as HMM, Bayes, and membership probability is also possible.
  • The dictionary that converts between phoneme strings or phoneme-segment strings and processing procedures is distributed on the terminal side as well. New control commands, media types, format types, device names, phoneme symbol strings, image features, emotion identifiers, and similar symbol strings related to the base station may be expressed in markup languages such as XML and HTML described later, in RSS, or via CGI, and information configured in this way may be transmitted, received, or distributed.
  • Terminal A, the first user's device, attempts to connect to another terminal C or to an information processing device that can communicate with base station B via the Internet, and confirms what information can be used for search by other devices using RSS or CGI.
  • Terminal A executes an evaluation function acquisition step to acquire detailed information on the target search execution method using the communication line or infrared, obtaining the numerical information, identifier symbol strings, and evaluation expressions necessary for function construction, together with any other information necessary for the search.
  • Identification functions, DP templates, and HMMs that are used less frequently are deleted, and a new evaluation function is stored based on the information acquired earlier; the evaluation function switching step is executed so that functions can be reused without being re-acquired and re-registered every time. Alternatively, the evaluation function may be acquired by communication each time and stored in the storage unit, with the stored evaluation function deleted when the service ends or the power is turned off, or it may be obtained from a distributed storage medium (a cache sketch follows below).
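  • One plausible realization of this switching step is a least-recently-used cache of evaluation functions fetched over the communication line; the class and its capacity are assumptions, not the patent's design:

```python
from collections import OrderedDict

class EvaluationFunctionCache:
    """Keep recently used evaluation functions; evict the least-used ones."""
    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.cache = OrderedDict()   # name -> (parameters, templates)

    def get(self, name, fetch):
        """fetch: callable acquiring parameters from terminal C / base station B."""
        if name in self.cache:
            self.cache.move_to_end(name)         # mark as recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # drop the least-used function
            self.cache[name] = fetch(name)
        return self.cache[name]
```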
  • The information exchange target need not be a base station or another terminal; any embodiment may be considered as long as the device includes, in a configuration related to the present invention, an information processing unit such as a robot or remote controller using the present invention, an information input/output unit, and a storage unit.
  • A control method is obtained by a method similar to the procedure example of the information processing apparatus used in the terminal and base station described above, and a dictionary is provided that converts phoneme-string symbols for input commands into control commands, so that voice operation can be realized by recognizing a person's utterance and executing the target command. At this time emotions are analyzed from the voice information; if the detected result is the emotion of "sadness" and an associated phoneme or phoneme segment is detected in the utterance, a comforting context is selected, while if a feeling of "anger" is detected together with a phoneme or phoneme segment associated with the utterance "Kora", a soothing context is selected; such processing means may be implemented.
  • A message apologizing to the user may be presented by voice or as a character string, or a camera or the like may be used.
  • The above-mentioned arbitrary identifiers and features, such as phonemes, phoneme segments, emotion identifiers, and image identifiers, can be used; recognition based on volume may also be performed, and processing may be selected and changed according to the combination of identifiers. Recognition results such as emotion identifiers, instrument identifiers, scale identifiers, and environmental-sound identifiers may likewise be used.
  • The recognition results of emotions, phoneme strings, and phoneme-segment strings associated with the user's utterance at the time of evaluation are associated with positive or negative meanings. Reinforcement learning is performed when a phoneme string with a positive meaning, or an emotion identifier associated with a positive meaning such as "joy" or "relief", is detected. Conversely, if the recognition result is linked to a negative meaning, such as a phoneme or phoneme-segment symbol string like "no use", or the associated emotions are "sadness", "anger", or "disappointment", the target may be removed from the next round of reinforcement learning, or a new feature group of negative meanings may be created and reinforcement learning performed to learn the negative targets.
  • Keywords related to operable processes may be displayed on the screen, and a phoneme string or phoneme-segment string list may be selected or spoken and presented to the user. This makes it possible to realize a voice user interface based on emotion-aware phoneme/phoneme-segment recognition that does not rely on general-purpose speech recognition.
  • The dictionary that converts phoneme strings, phoneme-segment strings, and emotion identifiers into processing procedures may reside on the terminal side or the distribution base station side. New control commands, media types, format types, device names, and symbol strings such as phoneme symbol strings, image features, and emotion identifiers can be transmitted, received, and distributed using markup languages such as XML, HTML, and RDF, RSS, and CGI, described later; combining these well improves convenience.
  • A procedure for using combinations of co-occurrence information based on multiple identifiers and feature quantities will now be described more specifically. As an outline, an example search processing procedure using multiple types of identifiers and an arbitrary processing procedure based on such a search are shown, followed by specific examples of combinations associated with each identifier.
  • The combination of these identifiers and feature quantities may involve two or three as required, or the co-occurrence probabilities of four or more, even more than ten, identifiers may be combined in an implementation.
  • The co-occurrence state or co-occurrence information in the present invention is based on information configured using natural information, including auditory and visual information, sensor information, and identifiers and feature quantities acquired from video and/or audio. It comprises multiple related pieces of information, including sensor information detected as distributed text information, and is characterized by the fact that their identifiers and features occur within an appropriate unit time according to usage; it can also be constructed as time transitions over multiple pieces of co-occurrence information. Probabilistic transition matrices of these are used as "index co-occurrence information" for content index information and as "co-occurrence search condition information" configured from the search conditions entered by users.
  • The boundary of the range over which identifiers and feature quantities are evaluated may be a number of frames divided on the time axis, a point where the divergence of the feature quantity obtained by an arbitrary identification method exceeds or falls below a threshold, or an identifier boundary obtained by any detection or identification method.
  • The distribution information is indexed using EPG, BML, RSS, text broadcasting, and text contained in subtitles and video, while also checking the bias of which identifiers co-occur in any given range.
  • The character string or identifier ID associated with an identifier obtained as a search result is converted into another identifier or identifier string by the conversion dictionary, and the part of the content information matching that identifier or identifier string is searched.
  • The name of a performer is obtained from EPG, BML, RSS, teletext, or text information such as recognized subtitles and text contained in the video, or from the phoneme string entered or spoken by the user; the matching performer name is detected, and the places in the video information where that name is spoken or displayed in subtitles are detected. The detected location is treated as a scene related to the user's purpose, and the content information is played back, recorded, or skipped there, or recording is started at a specific title image feature.
  • From EPG, MPEG7, BML, RSS, XML, Web sites, recognized subtitles, and character strings contained in the video, program information such as the composition of performers, titles, directors, producers, sports team names, actor casts, family relationships, and human relationships can be used as identifiers.
  • Searches for scenes in which the main character and an enemy co-occur, or the main character and a lover co-occur, can be given multivariate analysis based on image features, the emotions expressed in scenes, the phoneme strings and phoneme-segment strings associated with the voices produced in scenes, and the changes in video features within scenes. A method of indexing, searching, detecting, and learning using these phoneme strings, phoneme-segment strings, program information, image features, or image identifiers is likewise possible.
  • An input character string is converted into a symbol string using phonemes or phoneme segments, or symbol information based on phonemes or phoneme segments from the user's utterance is used together with identifiers recognized from emotions, environmental sounds, or image features; a query is constructed, and recording of broadcast contents to the information storage device based on the present invention is started. The symbol string is evaluated at the same time as recording, the match against pre-registered symbol strings is evaluated, and if the match exceeds a certain percentage, the hour before and after that point is registered for long-term storage.
  • A method of narrowing down detection targets using a co-occurrence matrix or co-occurrence probabilities from statistical processing may be used, or a co-occurrence dictionary may be configured by classifying identifiers and feature quantities.
  • The input content information is indexed by emotions and environmental sounds recognized from audio, and by image features, motion identifiers, and object identifiers recognized from video, and is recorded as a database according to the present invention. The speech or character string input by the user is converted into a symbol string using phonemes or phoneme segments and given to the recorded database as a query, and the search result is presented as the detected target information.
  • Sounds generally called onomatopoeia, such as "wan-wan" and "dokan", are recognized relatively consistently as phonemes or phoneme segments, so they can serve as search indexes assisting the environmental-sound identifiers. Likewise, emotion identifiers for the search can be selected from emoticons such as "(;;)" in a character string, setting the emotion identifier to "joy" or "sorrow", and the search condition can be configured accordingly to perform the search (a conversion-dictionary sketch follows below).
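  • A toy sketch of the two conversion dictionaries just mentioned; every entry here is an invented example, not content of the patent's actual dictionaries:

```python
EMOTICON_TO_EMOTION = {
    "(;;)": "sorrow",
    "(^_^)": "joy",
}

ONOMATOPOEIA_TO_PHONEMES = {
    "dokan": ["d", "o", "k", "a", "N"],        # explosion-like sound
    "wanwan": ["w", "a", "N", "w", "a", "N"],  # dog bark
}

def build_query(text: str):
    """Turn raw user input into identifier-based search conditions."""
    conditions = []
    for token in text.split():
        if token in EMOTICON_TO_EMOTION:
            conditions.append(("emotion", EMOTICON_TO_EMOTION[token]))
        elif token in ONOMATOPOEIA_TO_PHONEMES:
            conditions.append(("phonemes", ONOMATOPOEIA_TO_PHONEMES[token]))
    return conditions

# e.g. build_query("dokan (;;)") ->
#   [("phonemes", [...]), ("emotion", "sorrow")]
```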
  • The search technology of the present invention may also be used as artificial intelligence for chats, agents, and robots, classifying the identifiers and feature quantities usable for dialogue between the device and humans to form a co-occurrence dictionary.
  • A proper noun is converted into a phoneme or phoneme-segment symbol string, the emotion features or emotion identifiers occurring near the proper noun are evaluated, and by evaluating the bias of the user's emotion when the proper noun is uttered, it is possible to search according to the user's preference.
  • Facial-expression features under specific emotions are detected, and facial features are statistically learned so that facial expressions can be discriminated; alternatively, the face is normalized to a fixed direction and size using 3D or 2.5D features, after which parts showing change or movement are learned as separate items. It is also possible to assign identifiers separating parts of the face, such as eyes and mouth, and learn facial-expression changes, or to classify bodies, machines, and devices in the same way for use in other searches.
  • Linked to any tag or name in EPG, BML, RSS, or teletext, a sports program is detected from the EPG, a change in score is detected from the BML, and when a change in score is displayed, the playback position is moved to the place where excitement is detected from the emotion features, thereby detecting the sport's highlight; by learning the image features around that time, the highlight can subsequently be detected from the images alone.
  • When the partial motion features are large and their motion directions are not parallel, and warm-colored red and yellow features appear on the screen, index information is recorded in synchronization with the moving image as an explosion scene.
  • When blue dominates the screen and wave sounds are detected, the scene is recorded as a seaside scene.
  • When a slowly moving white block is detected against blue and a wind sound is detected, the index information is recorded as a sky scene (these three rules are sketched below).
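  • The three indexing rules above, written as a small rule table over recognized identifiers; the cue labels are invented stand-ins for the actual feature detectors:

```python
SCENE_RULES = [
    ("explosion", {"motion:large_nonparallel", "color:warm_red_yellow"}),
    ("seaside",   {"color:blue_region", "env:wave_sound"}),
    ("sky",       {"color:blue_region", "shape:slow_white_block", "env:wind_sound"}),
]

def index_frame(identifiers) -> list:
    """Return scene labels whose cue sets are fully present in this frame."""
    present = set(identifiers)
    return [label for label, cues in SCENE_RULES if cues <= present]

# e.g. index_frame({"color:blue_region", "env:wave_sound"}) -> ["seaside"]
```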
  • With this index information in place, the frequency of index appearance is calculated over the entire video length, and the similarity of these frequencies is evaluated to detect the bias of expressions on screen and the user's browsing tendencies; a search based on the user's browsing status and the frequency of identifiers appearing in the content is thereby realized (see the sketch below). By setting evaluation functions and evaluation-result thresholds that match the user's hobbies and preferences, arbitrary processing such as recording and playback of content information can be performed and searches executed.
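  • A minimal sketch of comparing index-appearance histograms (identifier frequency normalized by content length) with cosine similarity, one plausible way to realize the frequency-similarity evaluation above:

```python
import math

def frequency_histogram(index_entries, content_length: float) -> dict:
    """Identifier appearance counts normalized by the content length."""
    hist = {}
    for identifier in index_entries:
        hist[identifier] = hist.get(identifier, 0) + 1
    return {k: v / content_length for k, v in hist.items()}

def cosine_similarity(h1: dict, h2: dict) -> float:
    keys = set(h1) | set(h2)
    dot = sum(h1.get(k, 0) * h2.get(k, 0) for k in keys)
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```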
  • The composition of performers, titles, directors, producer names, and the family and human relationships of actors as casts may be used as identifiers, or converted to phonemes so that matches can be evaluated jointly.
  • The identifiers and feature values obtained serve as index information; by evaluating the distance and coincidence rate against index information consisting of music identifiers and feature values registered in the database, music information matching the user's hobbies and interests can be searched.
  • For instrument types, scenes and pages where any instrument is played or displayed can be searched from the co-occurrence information of instrument name and acoustic features, or instrument name and image features. A scene where a piano appears can be searched as a phoneme string by pronouncing "piano [p/i/a/n/o]", or based on that phoneme string; the audio or video stream detected by the search instruction may then be recorded or skip-played according to those features, the EPG, and the BML.
  • A co-occurrence dictionary may be configured by classifying the acceptable identifiers and feature quantities.
  • The above scenes can also be searched using car tappet sounds, engine sounds, and locomotive exhaust sounds; the names of these sounds can be converted into phoneme strings or phoneme-segment strings and used for the search. If the search condition is "engine sound", engine sounds are searched; for an engine scene, a search combining the engine's image features with engine sound at volume can be performed in the same manner.
  • Scene search based on a person's emotional behavior can be performed by association with the person type; a phoneme or phoneme-segment symbol string can be used by presenting a word, and the expression or emotion name can be converted into a phoneme string or phoneme-segment string for use in the search. Since the action type is related to the face type and the emotion type as described above, it can be associated with the person type so that scenes can be searched based on a person's emotional behavior, gestures, actions, and gait, and input video information can be detected by associating motion identifiers with phoneme or phoneme-segment strings. When sign-language information is detected and uttered by speech synthesis, the utterance is converted to a phoneme string; CG can be used to reproduce the actions associated with the phoneme string and display sign language, and the names of these operations can be converted into phoneme strings or phoneme-segment strings for use in searches.
  • For landscape types, natural images and city images are classified based on co-occurrence information of image features such as color features and the existence probability of straight lines and curves per unit area; based on scene names, phoneme strings can be converted to features, or the phoneme or phoneme-segment strings of content uttered while viewing the scene can be indexed and searched.
  • For location information, by associating landscape types with phoneme strings, information on any area can be searched from a large accumulation of movies and broadcast images based on arbitrary image characteristics. A travel guide can be built from the image features of locations used in famous scenes, and similar landscapes can be detected; location names may be converted into phoneme strings and used for the search.
  • With the display position type, it is possible to evaluate what kind of image is at which position in the screen, specify the range, display it, and have the user call out its name; using the device as an index for learning the display content is one conceivable method. For example, numbers are displayed at the detected positions and the user is asked "Who is No. 1?", "Who is No. 2?"; the system learns from the user calling out names, speaking and confirming the phoneme strings and phoneme-segment strings, or uses phoneme strings such as "w/a/k/a/r/a/n/a" associated with keywords for specific control.
  • A character string identified by the recognition process is converted into a phoneme string or phoneme-segment string to be searched; for a still image, the audio and video related to the word that was clicked or range-specified can be displayed and searched, and characters and font names can be converted into phoneme strings and phoneme-segment strings for use in the search.
  • With shape types, round objects, square objects, and pointed objects can be detected, allowing detection of obstacles that hinder a robot's movement or objects dangerous to humans; alternatively, a search can be performed using phoneme or phoneme-segment strings of abstract keywords based on the associated image features. Fixed video such as an opening telop in a given program can be associated with the phoneme or phoneme-segment strings of fixed utterances such as opening announcements and searched, and the names of these shapes can be converted into phoneme or phoneme-segment strings for the search. It is also possible to use a waveform shape type, statistically analyzing changes in brain waves and pulse waves extracted from multiple locations and assigning an identifier for use in searching.
  • Program information such as performers, authors, moderators, and program titles can be used as an index, and the names of program genres and categories may be converted into phoneme strings or phoneme-segment strings and used for the search.
  • Examples of search with images and environmental sounds, search with environmental sounds and EPG, BML, RSS, and teletext, audio/video search with multiple identifiers, and an application example of optional processing triggered by identifier detection will be described with reference to FIG.
  • Phonetic symbols and emotion symbols are indexed at the same time as recording; the recording range may be decided from markup languages such as EPG, BML, RSS, and teletext, or from ranges associated with services using CGI, and unnecessary parts may be deleted or scenes skipped automatically during playback. To this end, a specific keyword is converted into phonemes, recording proceeds as a temporary file while phoneme matches are confirmed, and an index together with emotion features is constructed whenever a target keyword is detected.
  • EPG, BML, RSS, and text broadcasting can be used to classify files and file names, target video and still images, audio, text, and the related information concerning their time-series presentation order.
  • Devices that perform playback and recording may construct phoneme strings and phoneme-segment strings for the information to be presented and distribute them via EPG, BML, RSS, and teletext; for the user's convenience, the device can search, record, and play back recorded content and recording targets using the phoneme and phoneme-segment strings based on the received EPG, BML, and RSS.
  • A device that executes these services may be a desktop information processing device or a portable information terminal, and the contents of the present invention can be implemented via a communication base station using them: a device using the present invention at home can be called from the mobile terminal, or information recognized by the mobile terminal can be mailed to the device at home.
  • When the home device using the present invention receives the keywords "famous husband (Arinao [/a/r/i/n/a/o/])" and "record (Rokuga [/r/o/k/u/g/a/])", it starts recording all receivable channels; excluding the command section, the keywords are expanded into phonemes and recorded, and the recorded content is detected by searching the phoneme symbol strings.
  • The matching degree is set to 60%; content is recorded while a save-flag boundary is set every minute, and if no location exceeding the 60% match occurs within a one-minute segment, the recorded content information is deleted after one hour. When a location where the keyword matches 60% or more is detected, the content is kept, for example, from up to one hour before, and/or up to the program boundary given by EPG, BML, RSS, text broadcasting, and the like.
  • Broadcasts containing the word "famous husband (Arinao [/a/r/i/n/a/o/])" thus automatically save about one hour around the occurrence of that word (a retention sketch follows below).
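  • A sketch of this retention policy: record continuously, mark a save boundary every minute, and discard segments whose neighborhood never reached the 60% keyword match within one hour. The class shape is an assumption; the 60% and one-hour values follow the text:

```python
import time

class RetentionBuffer:
    """Keyword-triggered keep/delete policy over one-minute recording segments."""
    MATCH_THRESHOLD = 0.60
    TTL_SECONDS = 3600                 # delete unmatched content after one hour

    def __init__(self):
        self.segments = []             # (boundary_time, matched_flag)

    def close_minute(self, best_match_rate: float, now: float = None):
        """Close the current one-minute segment with its best keyword match."""
        now = now if now is not None else time.time()
        self.segments.append((now, best_match_rate >= self.MATCH_THRESHOLD))

    def sweep(self, now: float = None):
        """Keep matched segments; drop unmatched segments older than the TTL."""
        now = now if now is not None else time.time()
        self.segments = [(t, hit) for t, hit in self.segments
                         if hit or now - t < self.TTL_SECONDS]
```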
  • The video recorded by the present invention may be ranked according to the number of appearances and the degree of coincidence of the words and displayed as a list.
  • Face detection may be performed at the same time, and learning may be repeated in association with the actor's name and facial features to learn whether or not a specific person is on screen.
  • This improves learning efficiency and the performance of automatic detection recording; the device may also perform the learning independently.
  • When a celebrity (Arina) is an actor, he or she may be called by a different name in video and audio works.
  • In that case, the program search can be executed by the following procedure. Actor names in the performer lists of various programs in EPG, BML, RSS, and text broadcasting are given in kanji; using information converted from the kanji or English words into symbol strings of phonemes and phoneme segments, the actor name is looked up from the user's utterance or from text entered in the usual way, and the target actor name is extracted. Next, the cast (role) name associated with the actor name is extracted, and a symbol string of phonemes and phoneme segments based on the cast name is constructed while referring to the dictionary. A search using this phoneme- or phoneme-segment-based symbol string is then performed on video and audio work information indexed by such symbol strings (see the sketch below). As a result, scenes associated with the target actor's cast name can be found, linking the search to EPG, BML, RSS, and teletext in a way that was not possible with conventional phoneme and phoneme-segment searches, and improving the convenience of searching within video and audio works.
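  • A sketch of this pipeline, with the two dictionaries as illustrative stand-ins; match_fn could be the normalized DP score from the command-dictionary sketch earlier:

```python
ACTOR_TO_CAST = {"actor_name": "cast_name"}      # from EPG/BML/RSS/teletext
CAST_TO_PHONEMES = {"cast_name": ["k", "a", "s", "u", "t", "o"]}  # dictionary lookup

def search_by_actor(actor, indexed_content, match_fn, threshold: float = 0.6):
    """indexed_content: iterable of (time_offset, phoneme_index) pairs."""
    cast = ACTOR_TO_CAST.get(actor)
    query = CAST_TO_PHONEMES.get(cast, [])
    hits = []
    for position, phonemes in indexed_content:
        if query and match_fn(query, phonemes) >= threshold:
            hits.append(position)    # scene where the cast name is uttered
    return hits
```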
  • If indexes by explosion-sound identifiers, indexes by phoneme symbol strings associated with shouted words, and times at which prosody, music, laughter, and the like were recorded as identifiers appear frequently, the video information is likely an action program; these can be aggregated to create an evaluation function that evaluates and searches for the degree to which content is an action program, along with phonemes associated with dark video periods and screams in the video information.
  • If the index appearance frequency of symbol strings or emotion identifier strings associated with screams, relative to the entire video time length, exceeds the average appearance frequency of scream-related indexes across many other video and audio works, a function that evaluates horror programs can be created to evaluate the degree of horror and search for it; similarly, by recording when and how conference information was captured, the undulations of emotion and changes in content can be classified, and such a device can be realized.
  • Environmental-sound segments are constructed for environmental sounds in the same way. Visemes can be decomposed into time series and viewed as viseme segments, video images can be viewed as motion elements and motion segments for moving images, and images can likewise be treated as image elements and image segments for image information; a new index for search may be reconstructed by viewing each as such segments.
  • The present invention may be implemented by a device whose functions are reduced to one of the three methods. It can be used for a surveillance camera and the like: a broken window or door is detected when the image-feature evaluation distance of the window/door discriminant function deviates from the average, and crimes are detected by noticing, for instance, that a person has not moved for a long time in front of a locked door. Scene boundaries of moving images can be detected for use in video editing machines; markup languages can be used; phonemes and phoneme segments can be derived from character strings; voice and other identifiers can be used; weather can be detected from image features and indoor equipment controlled for ventilation and lighting; and billing and payment can be performed through personal authentication by name, password, face recognition, and utterance.
  • The dictionary that converts phoneme strings and phoneme-segment strings into processing procedures, whether on the terminal side or the distribution base station side, can carry correction information, new programs, actor names, program genres, distribution station names, image features, voice features, emotion identifiers, and the like; these may be sent, received, and distributed using markup languages such as XML and HTML, RSS, CGI, and so on, and combining them well improves convenience.
  • A device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the contents of the present invention may be implemented via a communication base station.
  • FIG. 27 shows a CRM (Customer Relationship Management) system using the present invention, as an application of the above-mentioned search example using proper nouns and emotion identifiers and of the example of executing arbitrary processing by audio-video search with multiple identifiers.
  • Utterances carrying consumer emotions are analyzed and indexed using multiple analysis and identification devices according to the present invention. The reputation of a product as seen by consumers is derived from the phoneme string indicating the product of a specific model number and the accompanying emotions of anger and sadness. The number of occurrences of emotion features and of phoneme symbol strings identifying products can be analyzed quantitatively, the results can be displayed using a markup language such as HTML or XML, described later, or CGI, and the manuals of the identified products can be displayed.
  • the consumer requests a consultation from the consultation service operator over the phone or in the store.
  • The voice feature values of both operator and consumer are extracted, and emotions, phonemes, and phoneme segments are recognized from the extracted features; the phonemes, phoneme segments, and emotions recognized by the above-described method are stored in the information storage device.
  • As a relevance evaluation method, the consumer evaluation may be rated low when anger or sadness emotions occur in audio information in which a specific product model number is detected. By evaluating the distribution of phoneme symbol strings and emotion identifiers recognized in the speech information in this way, the consumer's feelings about the product can be evaluated quantitatively, enabling quantitative analysis of the product's reliability (see the sketch below).
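  • A sketch of this quantitative evaluation: count how often negative emotion identifiers co-occur with a product's phoneme string in recorded calls. The record layout and scoring rule are illustrative assumptions; match_fn could again be a DP-based phoneme match:

```python
NEGATIVE_EMOTIONS = {"anger", "sadness"}

def product_sentiment(calls, product_phonemes, match_fn, threshold: float = 0.6):
    """calls: iterable of {"phonemes": [...], "emotions": {...}} records."""
    mentions, negative = 0, 0
    for call in calls:
        if match_fn(product_phonemes, call["phonemes"]) >= threshold:
            mentions += 1                              # product model number heard
            if NEGATIVE_EMOTIONS & set(call["emotions"]):
                negative += 1                          # with anger/sadness nearby
    return {"mentions": mentions,
            "negative_ratio": negative / mentions if mentions else 0.0}
```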
  • The manual for the searched product "1X5 [/i/ch/i/e/cl/k/u/s/u/g/o/]" is displayed on the operator's screen, so that consumer questions can be answered. At this time, the consumer's emotion can be recognized and stored in association in the information storage device, allowing the emotional evaluation of a product to be recorded quantitatively.
  • When a device using the present invention searches for a target product name, the criterion for the matching degree of phoneme and phoneme-segment symbol strings is set to 60%, a list of products exceeding 60% is constructed and displayed, and the operator may select the manual for the target product from it.
  • the dictionary for converting between phoneme sequences or phoneme-piece sequences and processing procedures is not limited to the terminal side or the distribution base station side. Even when phoneme symbol strings for new product names or product genres, together with symbol strings such as image features, voice features, and emotion identifiers, are sent, received, and distributed using markup languages such as XML and HTML, or RSS and CGI, convenience can be achieved by combining them well.
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal. It is also possible to analyze the operator's psychological state and confirm that the work does not cause excessive stress.
  • the contents of the present invention may be implemented via a communication base station using these devices.
  • the user speaks into the user-side browser.
  • the features of the spoken speech are extracted.
  • in the first method, these feature values are transmitted to the target device, and the device that has received them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature values. Then, based on the generated symbol strings, the matching control means is selected and executed.
  • in the second method, a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string are generated in the user's browser, and the generated symbol strings are transmitted to the target device.
  • the controlled device selects and executes the matching control means based on the received symbol string.
  • the third method recognizes phonemes and/or phoneme pieces and emotion symbol strings based on the feature values generated in the user's browser, selects the control content based on the recognized symbol strings, and sends it to the device to be controlled.
  • in the fourth method, the speech waveform is transmitted as it is from the user's browser; the controlling device recognizes the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string, selects a control means based on the recognized symbol strings, and the controlled device executes the selected control.
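The four methods differ only in where the recognition pipeline is cut between the user-side browser and the controlled device. A schematic sketch, with stage names chosen purely for illustration:

```python
# Stages of the voice-operation pipeline, in order:
#   capture -> feature extraction -> symbol recognition -> control selection -> execution
# Each method names the last stage performed on the browser side; everything
# after it runs on the controlled (or distribution) device.
PIPELINE = ["capture", "features", "symbols", "control_selection", "execution"]

METHODS = {
    1: "features",           # browser sends feature values
    2: "symbols",            # browser sends phoneme/phoneme-piece/emotion symbols
    3: "control_selection",  # browser sends the selected control content
    4: "capture",            # browser sends the raw speech waveform
}

def split(method):
    """Return (browser-side stages, device-side stages) for a given method."""
    cut = PIPELINE.index(METHODS[method]) + 1
    return PIPELINE[:cut], PIPELINE[cut:]

for m in sorted(METHODS):
    local, remote = split(m)
    print(f"method {m}: browser={local} device={remote}")
```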
  • the dictionary that converts between phoneme sequences or phoneme-piece sequences and processing procedures may be on the terminal side or on the distribution base station side. Even when phoneme symbol strings related to correction information, new tags, variables, and attributes, together with symbol strings such as image features, audio features, and emotion identifiers, are sent, received, and distributed using markup languages such as XML and HTML, or RSS and CGI, convenience can be achieved by combining them well.
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal.
  • the contents of the present invention may be implemented through a communication base station or may be implemented in combination.
  • the in-car audio device detects the phoneme sequence "accident situation (j/i/k/o/j/o/u/k/y/o/u)", and the information is transmitted to the base station and received via VICS, a mobile phone, or any other communication means.
  • the information transmitted from each vehicle may also be captured by roadside equipment such as Orbis cameras and transmitted to the base station.
  • the dictionary for converting between phoneme sequences or phoneme-piece sequences and processing procedures is not limited to the terminal side or the distribution base station side. Even when phoneme symbol strings for new place names, titles, addresses, or roads, together with symbol strings such as image features and emotion identifiers, are sent, received, and distributed using VICS, markup languages such as XML and HTML, or RSS and CGI described later, convenience can be achieved by combining them well.
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal.
  • the contents of the present invention may be implemented by performing a search via a communication base station or performing a search independently.
  • next, examples of karaoke and music sales systems will be described.
  • song titles and chorus lyrics are recorded as phoneme strings, phoneme-piece strings, and musical scale strings, and can be used for title search in karaoke by searching for matching portions. Furthermore, in addition to such feature structures, it is possible to compare the appearance frequency, appearance distribution structure, and appearance position distribution of scale symbols and to search for items with high coincidence.
  • the user's preference may be learned from the co-occurrence information obtained by such searches. When the user selects a piece of music, plays it back, and then repeatedly selects it or listens to it to the end, the user may be judged to have affirmed the search result; conversely, if the music is played only once or the user immediately skips to the next song, it may be interpreted as a negative judgment.
  • the "00 band" of the query may be spoken by voice and processed with natural language processing, may be searched by expanding a character string input into phonemes, or may be searched as a character string; the similarity of music features and emotion features may also be evaluated.
  • the tendency of emotion identifiers generated according to music is extracted by statistical processing for each music genre and subjected to multivariate analysis. By searching for music genre identifiers based on the user's sensitivity parameters according to the present invention, or by evaluating the similarity of the appearance tendencies of emotion identifiers in music, music close to the user's sensitivity trend can be found and presented, so a service that recommends music according to the user's preference is also possible.
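A hedged sketch of such a recommendation, assuming each song's emotion identifiers have already been aggregated into an appearance-frequency distribution and comparing it to a learned user profile with cosine similarity (the songs, labels, and numbers are invented):

```python
import math

# Hypothetical appearance frequencies of emotion identifiers per song; in the
# scheme above these would be extracted from the music by statistical processing.
songs = {
    "song_a": {"joy": 0.7, "sadness": 0.1, "anger": 0.2},
    "song_b": {"joy": 0.1, "sadness": 0.8, "anger": 0.1},
    "song_c": {"joy": 0.6, "sadness": 0.3, "anger": 0.1},
}
user_profile = {"joy": 0.65, "sadness": 0.25, "anger": 0.10}  # learned preference

EMOTIONS = ["joy", "sadness", "anger"]

def cosine(p, q):
    """Cosine similarity between two emotion-identifier distributions."""
    dot = sum(p[e] * q[e] for e in EMOTIONS)
    norm_p = math.sqrt(sum(p[e] ** 2 for e in EMOTIONS))
    norm_q = math.sqrt(sum(q[e] ** 2 for e in EMOTIONS))
    return dot / (norm_p * norm_q)

ranked = sorted(songs, key=lambda s: cosine(songs[s], user_profile), reverse=True)
print(ranked)  # recommend songs whose emotion distribution is closest to the user's
```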
  • the dictionary for converting between phoneme sequences or phoneme-piece sequences and processing procedures is not limited to the terminal side or the distribution base station side. Even when symbol strings such as scale symbol strings and emotion identifiers are sent, received, and distributed using markup languages such as XML and HTML, or RSS and CGI, convenience can be achieved by combining them well.
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal.
  • the contents of the present invention may be implemented via a communication base station.
  • humming-and-lyrics search in the prior art separates the act of humming from the act of uttering lyrics, and thus differs from the search based on co-occurrence information in the present invention.
  • as an application of voice operation of the present invention, the user speaks into an information terminal and/or a terminal-side browser.
  • the feature amount is extracted from the spoken voice.
  • in the first method, these feature amounts are transmitted to the target device, and the distribution device that has received them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature amounts. Then, based on the generated symbol strings, a matching control unit on the distribution apparatus side is selected and executed.
  • in the second method, a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string are generated in the information terminal and/or the terminal-side browser, and the generated symbol strings are transmitted to the target distribution device. The distribution apparatus side then selects and executes the matching control and distribution means based on the received symbol strings.
  • the third method recognizes phonemes and/or phoneme pieces and emotion symbol strings based on the feature values generated in the information terminal and/or the terminal-side browser, and the control contents are selected based on the recognized symbol strings and transmitted to the distribution apparatus side that executes the control.
  • the distribution apparatus that has received the control method performs information processing based on the control method and provides information.
  • in the fourth method, the information terminal and/or terminal-side browser transmits the speech waveform as it is; the controlling distribution apparatus recognizes the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string, selects a control means based on the recognized symbol strings, and executes the selected control.
  • note that, just as emotion identifiers can be extracted from voice and features can be extracted from symbols, acoustic and video features and identifiers such as environmental sounds can be handled in the same way.
  • phoneme symbol strings may be embedded in the CGI and HTML for the displayed products, so that search and evaluation based on those symbols moves to the matching page and displays product orders and details.
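For instance, the phoneme string could be carried as a markup attribute of each product link and matched against the recognized query; the `data-phonemes` attribute name and page markup below are assumptions for illustration, not part of the specification:

```python
import re

# Each product page carries its phoneme symbol string in the markup so that a
# spoken query, once recognized as phonemes, can be matched against the pages.
pages = [
    '<a href="/products/1X5" data-phonemes="i ch i e k u s u g o">1X5 details</a>',
    '<a href="/products/1X6" data-phonemes="i ch i e k u s u r o k u">1X6 details</a>',
]

def find_page(recognized, pages):
    """Return the href of the first page whose embedded phoneme string matches."""
    for page in pages:
        m = re.search(r'data-phonemes="([^"]+)"', page)
        if m and m.group(1) == recognized:
            return re.search(r'href="([^"]+)"', page).group(1)
    return None

print(find_page("i ch i e k u s u g o", pages))  # -> /products/1X5
```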
  • searches may be performed on books, AV content, digital materials, cosmetics, pharmaceuticals, food, automobiles, and other industrial products and items with any proper nouns.
  • a method may also be considered in which each proper noun is uttered by multiple speakers so that the same word is provided with recognition templates for multiple phonemes and phoneme pieces, thereby improving the search rate for the phoneme strings of the pages to be used.
  • an application system such as an expert system may be constructed by using a part of the processing procedure of such an ordering system.
  • the dictionary that converts between phoneme sequences or phoneme-piece sequences and processing procedures may be on the terminal side or on the distribution base station side, and is not limited to phoneme symbol strings related to new products or product genres. Symbol strings such as image features, voice features, and emotion identifiers may also be sent, received, and distributed using markup languages such as XML and HTML, or RSS and CGI described later.
  • these services themselves may be content distribution services such as movies, photographs, and novels, or even digital material distribution services and product sales services. The device that executes them may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the contents of the present invention may be implemented via a communication base station.
  • in a book-reading service offered when a book is sold, it is possible to retrieve the position of any speech or sentence by using phonemes or phoneme pieces, or by evaluating the emotions contained in them based on recognition.
  • when speech synthesis is used for reading aloud, the utterance dictionary and template can be changed to the voice of a favorite celebrity by changing the speech synthesis template for the speaker's phonemes.
  • the utterance dictionary or template for the speech synthesis parameters used in reading aloud can also be changed in accordance with changes of emotion, and these can be combined for convenience.
  • the user speaks into the remote control.
  • the feature amount is extracted from the spoken voice.
  • in the first method, these feature values are transmitted to the target device, and the device that has received them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature values, and a matching control means is selected and executed.
  • in the second method, a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string are generated in the remote controller, and the generated symbol strings are transmitted to the target device.
  • the controlled device selects and executes a matching control means based on the received symbol string.
  • the third method recognizes phonemes and/or phoneme pieces and emotion symbol strings based on the feature values generated in the remote control, selects the control content based on the recognized symbol strings, and sends it to the device to be controlled.
  • the fourth method transmits the speech waveform as it is from the remote controller; the controlling apparatus recognizes the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string, selects a control means based on the recognized symbol strings, and the controlled device executes the selected control.
  • Such remote control technology may be introduced into a robot to perform home appliance control, or may be incorporated into a car navigation system to perform control.
  • any new control symbol string information is distributed to the operated device using markup languages such as RSS, HTML, and XML, or CGI described later, and phonemes, phoneme pieces, and speech waveforms are transmitted.
  • the remote control to be used may receive or transmit the updated phoneme symbol string information of the mobile terminal via infrared or wireless communication.
  • the dictionary that converts between phoneme sequences or phoneme-piece sequences and processing procedures is not limited to the terminal side or the distribution base station side. Even when symbol strings such as voice features and emotion identifiers are sent, received, and distributed using markup languages such as XML and HTML, or RSS and CGI, convenience can be achieved by combining them well.
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal.
  • the contents of the present invention may be implemented via a communication base station.
  • the user speaks into the mobile terminal.
  • the features of the spoken speech are extracted.
  • in the first method, these feature amounts are transmitted to the target device, and the device that has received them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature amounts, and the matching control means is selected and executed.
  • in the second method, a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string are generated in the mobile terminal, and the generated symbol strings are transmitted to the target device.
  • the controlled device selects and executes a matching control means based on the received symbol string.
  • the third method recognizes phonemes and/or phoneme pieces and emotion symbol strings based on the feature values generated in the mobile terminal, selects the control content based on the recognized symbol strings, and sends it to the device to be controlled.
  • the fourth method transmits the speech waveform as it is from the mobile terminal; the controlling device recognizes the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string, selects a control means based on the recognized symbol strings, and the controlled device executes the selected control.
  • note that, just as emotion identifiers can be extracted from voice and features can be extracted from symbols, acoustic and video features and identifiers such as environmental sounds can be handled in the same way.
  • whether the infrared port of the mobile device is used to control a DVD deck, TV, air conditioner, or other device, whether the IP address of the device is acquired using infrared or wireless LAN to control it, or whether the control information of the target device is acquired and controlled via the mobile Internet or an indoor LAN, voice control from the mobile terminal or mobile phone can be realized by acquiring the control list using the present invention.
  • any method may be used: the mobile device may send its own IP address or e-mail address to the target device, and the target device may connect to an arbitrary port based on that IP address and send the control information, or attach the control information to an e-mail sent to the portable terminal, or the control information may simply be acquired by exchanging infrared signals.
  • a search service may be implemented by performing phoneme recognition, phoneme-piece recognition, emotion recognition, environmental sound recognition, and scale recognition on input from the microphone of a mobile terminal.
  • the dictionary that converts between phoneme sequences or phoneme-piece sequences and processing procedures may be on the terminal side or on the distribution base station side, and is not limited to correction information or to new content, program genres, or actor names. Convenience can be achieved by combining symbol strings such as phoneme symbol strings, image features, voice features, and emotion identifiers, which can be sent, received, and distributed using markup languages such as XML and HTML (described later), RSS, and CGI.
  • a plurality of microphones, both low-performance and high-performance, may be provided so that high-quality audio is recorded for recognition by raising the recording sampling rate, while the voice for call transmission is converted to a lower sampling rate to form compressed voice information for the call.
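As an illustrative sketch of the dual-rate idea (the 16 kHz and 8 kHz rates are assumptions; any high/low pair would do), the high-rate stream could be kept for recognition while a downsampled copy feeds the call:

```python
import numpy as np
from scipy.signal import resample_poly

RECOGNITION_RATE = 16000   # record at a high rate for phoneme/emotion recognition
CALL_RATE = 8000           # telephone-band rate for the actual voice call

def split_streams(pcm_16k):
    """Keep the 16 kHz stream for recognition; downsample a copy for the call."""
    recognition_stream = pcm_16k
    call_stream = resample_poly(pcm_16k, CALL_RATE, RECOGNITION_RATE)
    return recognition_stream, call_stream

one_second = np.random.randn(RECOGNITION_RATE).astype(np.float32)  # stand-in audio
rec, call = split_streams(one_second)
print(len(rec), len(call))  # 16000 8000
```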
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a mobile phone information terminal, or a wearable information terminal.
  • the content of the present invention may be implemented via a communication base station.
  • detection and recording functions equivalent to those described above are performed using the image recognition function and voice recognition function associated with the attached imaging device, microphone, and recording device.
  • a robot or agent using the present invention can observe the identifiers and feature amounts extracted while the user browses content, together with the user's facial expressions and utterances. By observing the feature quantities and identifiers related to phonemes, phoneme pieces, and emotions, it becomes possible to observe the co-occurrence state of the user's feature quantities and identifiers with those of the content.
  • identifiers and feature quantities related to emotions and phonemes may be acquired from a content playback device using the present invention, or identifiers and feature amounts related to user emotions and phonemes may be extracted from the content using the indexing function in the device itself.
  • for example, a "comedy program user situation evaluation function" is composed of the feature values and identifiers collected while comedy programs are viewed. If the feature quantities and identifiers of the user and the content are close to the center of gravity of the features in this evaluation function, the robot or agent can express the emotion "fun" and thereby produce a pseudo-emotional performance. Of course, other emotions may also be learned in the same manner based on the co-occurrence state of the feature quantities and identifiers obtained from the user and the content.
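A minimal sketch of such a pseudo-emotion, treating the "comedy program user situation evaluation function" as a nearest-centroid test in a joint feature space (the centroids, threshold, and three-dimensional features are invented for illustration):

```python
import numpy as np

# Hypothetical joint feature vectors (content features plus user expression and
# voice features) collected while users watched comedy programs, reduced to one
# centroid per learned pseudo-emotion category.
centroids = {
    "fun":     np.array([0.8, 0.7, 0.2]),
    "boredom": np.array([0.1, 0.2, 0.9]),
}

def pseudo_emotion(observation, centroids, threshold=0.5):
    """Express the emotion whose centroid the current observation is closest to."""
    name, dist = min(
        ((n, float(np.linalg.norm(observation - c))) for n, c in centroids.items()),
        key=lambda item: item[1],
    )
    return name if dist <= threshold else None  # None: no emotion expressed

now = np.array([0.75, 0.65, 0.25])  # features observed from user and content
print(pseudo_emotion(now, centroids))  # -> "fun": robot/agent performs "fun"
```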
  • similarly, by learning feature amounts from the image of an object for which an identifier has been obtained, the sound when it is struck, and the sound when it is operated, properties such as the mass and weight of the object, whether it can be transported, whether it should be avoided in the event of a collision, and how the emotion should be expressed when it is presented to the user can be recorded and learned automatically by the device itself.
  • the reaction of the CG character, robot, or agent is performed according to the sensitivity and responses of the user, using a knowledge database for virtual personalities, and can be used to change the facial expression of the CG character, robot, or agent. Broadcast information such as TV can be acquired using external information such as EPG, BML, RSS, and text broadcasting, and information on entertainers and current circumstances matching the user's preference may be provided using a method that analyzes the number of recordings and the playback viewing time obtained by the aforementioned video search and recording means. The user's preference may thus be analyzed, and a robot using the present invention may acquire the control methods of surrounding devices via infrared communication, wireless LAN, or the like, improving the convenience of device control according to the user's voice, and may also acquire the identifiers and feature amounts of the currently displayed information.
  • the dictionary that converts between phoneme sequences or phoneme-piece sequences and processing procedures may be on the terminal side or on the distribution base station side; phoneme symbol strings related to correction information, new information, and the functions of the robot, together with symbol strings such as image features, audio features, and emotion identifiers, can be combined and sent, received, and distributed using markup languages such as XML and HTML, or RSS and CGI, so that convenience can be achieved.
  • the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal.
  • the contents of the present invention may be implemented via a communication base station.
  • an analyzer as a medical application will be described.
  • the facial expressions, utterances, and gestures of the subject of observation are captured using a pulse sensor, electroencephalogram sensor, muscle current sensor, skin resistance sensor, weight scale, sphygmomanometer, and thermometer, and the emotions associated with the user's utterances and the feature quantities obtained from sensors such as brain waves and pulse are recorded.
  • by learning the co-occurrence state of each identifier and feature, specifying each identifier or feature as a search condition with phonemes, phoneme pieces, or character strings, and performing indexing on content information, it becomes possible to search, record, distribute, and receive information based on those conditions. As a countermeasure against changes in the sound environment, and in consideration of the inclusion of overseas content, foreign-language phonemes and Japanese phonemes can be converted into one another to cope with differences in international pronunciation. The aim is to resolve such problems by adjusting search conditions and performing searches that convert information that co-occurs for humans, such as conversion between phoneme sequences and image features, between sound effects and phoneme sequences, and between emotions and character strings.
  • when the co-occurrence state is used, a new "troubled attitude" identifier can be constructed by combining video features and motion features, identifiers related to video and audio, identifiers such as chords and environmental sounds, and emotion identifiers, and learning the co-occurrence state of, for example, the voice features output when the facial expression shows trouble. It is also possible to configure and use multi-layer Bayes functions and multi-layer HMMs, with a layer that acquires voice and image identifiers, a layer that processes the co-occurrence state, and a layer that processes the time-series transitions of the co-occurrence state.
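A toy sketch of building such a composite identifier from windowed co-occurrence of lower-layer identifiers (the frame labels, window length, and hit threshold are illustrative assumptions):

```python
# Per-frame identifier outputs from independent recognizers (illustrative).
frames = [
    {"face": "neutral",  "voice_emotion": "neutral"},
    {"face": "troubled", "voice_emotion": "neutral"},
    {"face": "troubled", "voice_emotion": "sadness"},
    {"face": "troubled", "voice_emotion": "sadness"},
    {"face": "neutral",  "voice_emotion": "neutral"},
]

def troubled_state(frames, window=3, min_hits=2):
    """Fire the composite 'troubled attitude' identifier when a troubled facial
    expression and a negative voice emotion co-occur often enough in a window."""
    flags = []
    for i in range(len(frames)):
        chunk = frames[max(0, i - window + 1) : i + 1]
        hits = sum(
            1 for f in chunk
            if f["face"] == "troubled" and f["voice_emotion"] in ("sadness", "anger")
        )
        flags.append(hits >= min_hits)
    return flags

print(troubled_state(frames))  # [False, False, False, True, True]
```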
  • the feature of this identifier conversion is that phonemes and phoneme pieces are recognized in different language environments, and language-dependent phoneme notations are converted based on the co-occurrence information of international phoneme symbols or phoneme symbols with different language characteristics. Specifically, the output probability of the international phoneme symbol HMM is used as a feature amount and HMM learning is performed based on the language-specific phoneme symbols; conversely, for an HMM based on language-specific phoneme symbols, the output probabilities of the language-specific phoneme symbols are learned against international phoneme symbols. Similarly, phoneme-to-phoneme-piece and phoneme-piece-to-phoneme conversion can each be learned, and instead of HMMs, distances such as the Bayes discriminant function and the Mahalanobis distance, or methods using likelihood or probability, may be used. An application example using Japanese phonemes for the probability of belonging to an international phoneme is also shown.
  • the phoneme attribution probability in each language for each international phoneme symbol is obtained and a correspondence table is created; a phoneme is identified from the feature quantity using the international phoneme dictionary and converted into a phoneme symbol string dependent on each language. By obtaining the phoneme attribution probabilities between different languages and evaluating them in descending order, the speech features of people who speak other languages as their native language can be converted into the language features of the device being used. Note that phoneme and phoneme-piece conversion can be configured based on the co-occurrence of these identifiers, not only between the phonemes and phoneme pieces of different languages but also between image identifiers, and between image identifiers and phoneme or phoneme-piece string sequences.
  • the correspondence with international phoneme symbols can be related to UPA numbers, IPA symbols, and UCS code numbers by referring to the International Phonetic Alphabet guidebook of the International Phonetic Association, and these symbols and numbers may be used as identifiers for conversion management. Also, when converting phoneme symbols between different languages via international phoneme symbols, the phoneme probability table and the transition probabilities between preceding and following phonemes may be used, or the output probabilities may be re-learned and symbols converted using an HMM or the like; alternatively, an evaluation function such as a Euclidean distance function or a Bayes discriminant function may be constructed using the co-occurrence information of output probabilities and feature quantities and used as a symbol conversion function.
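A simplified sketch of conversion through such attribution-probability tables, picking the highest-probability symbol at each step (the probabilities and symbol inventories below are invented, not learned values):

```python
# Hypothetical attribution probabilities learned from shared utterances:
# P(international phoneme symbol | Japanese phoneme) and
# P(English phoneme | international phoneme symbol).
ja_to_intl = {
    "r": {"ɾ": 0.7, "l": 0.2, "ɹ": 0.1},
    "u": {"ɯ": 0.8, "u": 0.2},
}
intl_to_en = {
    "ɾ": {"r": 0.5, "l": 0.4, "d": 0.1},
    "l": {"l": 0.9, "r": 0.1},
    "ɹ": {"r": 1.0},
    "ɯ": {"u": 0.6, "ʊ": 0.4},
    "u": {"u": 1.0},
}

def convert(symbol, table):
    """Pick the target symbol with the highest attribution probability."""
    return max(table[symbol].items(), key=lambda kv: kv[1])[0]

def ja_phoneme_to_en(ja_symbols):
    """Japanese phonemes -> international symbols -> English phonemes."""
    intl = [convert(s, ja_to_intl) for s in ja_symbols]
    return [convert(s, intl_to_en) for s in intl]

print(ja_phoneme_to_en(["r", "u"]))  # -> ['r', 'u'] via ['ɾ', 'ɯ']
```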
  • Japanese speech → Japanese phoneme sequence → Japanese keywords → English translation → English phoneme sequence → English DB phoneme sequence search
  • identifiers such as phonemes and phoneme pieces differ depending on the language, and identifiers whose notation does not necessarily match are treated in the same way.
  • the identifier evaluation functions configured in each language environment perform recognition on the same utterance, the co-occurrence state of each identifier is observed, and symbols can be converted between identifiers by learning the identifiers output as recognition results together with their output probabilities, likelihoods, distances, feature quantities, and so on, as shown in the procedures of FIGS.
  • the utterance information in English and the utterance information in Japanese are converted into feature quantities in the same manner as in the indexing or searching.
  • Japanese and English phoneme and phoneme-piece recognition are executed.
  • voice information dependent on each language is thus indexed with identifiers through a recognition process dependent on that language.
  • the index formed by the implemented identifier sequences is observed, along with the co-occurrence state of each identifier and the transitions of the output probabilities.
  • co-occurrence information for Japanese phonemes as recognized by English recognizers can thus be constructed.
  • an evaluation function such as an HMM or a Bayes discriminant function is constructed as an English phoneme recognition function for utterances in a Japanese phoneme sequence, and the internal constants of the discriminant function can be saved as files on any storage medium and reused.
  • any speech waveform can be indexed simultaneously with phonemes and phoneme pieces, or indexed with Japanese phonemes, English phonemes, and international phoneme symbols at the same time.
  • when indexing with language-dependent phonemes and phoneme pieces, the co-occurrence state may be observed and a recognition function based on an HMM or Bayes function may be constructed.
  • this method identifies the current phoneme based on the output probability from the phoneme HMM or phoneme-piece HMM, or inputs it to a conversion HMM layer, or evaluates the probability outputs of multiple Bayes functions in parallel so that the array of distance information is configured as a feature quantity; a multi-layer Bayes method can also be used. The output probability of the source phoneme HMM is input to an HMM classified by international phoneme symbols and learned; based on this learning, output probabilities are evaluated and international phoneme symbols are assigned. In this case, the co-occurrence matrix and co-occurrence probabilities are used for learning, the output probability values and features are given as sample vectors for the Bayes function, and a teacher signal may also be obtained and used as an evaluation function.
  • for example, whether a transition is made from "silence" to an "A" utterance is determined from the output probability of the phoneme HMM in the current frame, the output probability of the phoneme HMM in the next frame, and the output probability of the previous frame: a frame with a high probability of silence is labeled "Pau", a frame with an increased output probability of "A" is labeled "A", and these symbols are arranged in time series so that a symbol based on the phoneme transition, such as "Pau-A-A", is assigned.
  • the first frame and the last frame are padded with the same identifier as the frame itself, because the preceding and following frames are missing.
  • symbols may also be assigned probabilistically: in the process of transitioning from "silence" to an "A" utterance, frames where the ratio of silence is high in the output probabilities of the speech-unit HMM in the current frame and the next frame are given "Pau", and frames where the ratio of the "A" phoneme symbol is high are given "A". For example, in the second frame, "Pau-A-A" may account for 60% and "A-A-A" for 20%, with the remaining 20% omitted from the notation.
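A small sketch of the first, deterministic variant of this labeling, assigning each frame a previous-current-next symbol from per-frame HMM output probabilities (the two-symbol inventory and probability values are illustrative):

```python
# Per-frame output probabilities of two phoneme HMMs while the speaker moves
# from silence into an "A" utterance (values are illustrative).
frame_probs = [
    {"Pau": 0.9, "A": 0.1},
    {"Pau": 0.6, "A": 0.4},
    {"Pau": 0.2, "A": 0.8},
    {"Pau": 0.1, "A": 0.9},
]

def best(probs):
    return max(probs, key=probs.get)

def transition_symbols(frame_probs):
    """Label each frame with previous-current-next best phonemes, padding the
    first and last frames with their own identifier."""
    labels = [best(p) for p in frame_probs]
    symbols = []
    for i, cur in enumerate(labels):
        prev = labels[i - 1] if i > 0 else cur
        nxt = labels[i + 1] if i < len(labels) - 1 else cur
        symbols.append(f"{prev}-{cur}-{nxt}")
    return symbols

print(transition_symbols(frame_probs))
# ['Pau-Pau-Pau', 'Pau-Pau-A', 'Pau-A-A', 'A-A-A']
```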
  • character strings in any language, such as Japanese, English, French, Spanish, German, Korean, Chinese, Hindi, Arabic, Hebrew, Aramaic, Vietnamese, or Greek, may be handled by constructing a phoneme sequence or phoneme-piece sequence based on the pronunciation of the character string, by converting it into phonetic notation such as Hiragana or Katakana, or by converting it into international phonetic notation. The string may also be converted into feature quantities so that the co-occurrence state can be confirmed and phoneme conversion between languages realized, or phonemes and phoneme pieces dependent on each language may be converted using international phoneme symbols as an intermediate form by the above method.
  • the present invention mainly refers to identifiers and emotion identifiers based on phoneme symbols and phoneme-piece symbols, but as described in "Prior Art", "Problems of the Prior Art", and "Means for Solving the Problems", the convenience of the embodiments of the present invention may be improved by executing the indexing method and search requests with combinations of identification symbols obtained by applying recognition or identification techniques to other feature amounts or identifiers.
  • for video and other parts to be selected or designated by the user's instructions, MPEG-4 and similar formats may be used, and the boundary of the selection range may be specified using the image outline of an image object or coordinate information in a 3D image. Boundaries detected from silent portions of the voice or frequency deviations may also be used, and indexing may be performed by selecting a display object in the image. It is also possible to advertise tourist information using location information such as the latitude and longitude of the shooting location in the program, and to carry out advertisement and promotion according to the recognized identifiers and extracted feature quantities, or to index content in order to run advertisements and promotions.
  • identifiers and feature quantities for search results and for content information indexed using the present invention may be added as markup language tags or attributes and distributed, in order to provide related content according to the user's operation, provide advertisements, sell products, or support content operations, content editing, and content use. Annotation processing that supplements or annotates content-related information using search results is also acceptable, and if content search is performed using the co-occurrence information used in the present invention, a bot system that autonomously collects and searches information on the network can be configured.
  • a phoneme piece is a phoneme symbol decomposed on the time axis into a central part, a front part, a rear part, or a plurality of segments, or information between phonemes, such as the transition state between a first phoneme and a second phoneme; it may be phoneme information with intermediate features based on the position at which the first phoneme changes into the second. Phoneme recognition dictionaries and phoneme templates may also be switched according to the detected emotion, environmental sound, or person.
  • the identifiers used in the present invention are identifiers extracted from emotion features (including the above-mentioned phonemes and phoneme pieces), image identifiers extracted from image features, and identifiers extracted from acoustic features.
  • a bias occurs in the feature amounts used for recognizing phonemes, phoneme pieces, and various identifiers; by learning this bias for each emotion within the same phoneme and re-learning the features, recognition of the emotions accompanying any phoneme and recognition of phonemes mixed with environmental sounds can be performed simultaneously, and the recognition rate may be improved. The intra-frame co-occurrence information of the content information and the inter-frame probability transition matrix may be used to search the content information or as an evaluation function for the content information.
  • depending on the user's positive responses and actions such as recording, detection information including EPG, BML, RSS, text broadcasting, image characteristics and identifiers, voice characteristics and identifiers, and various other identifiers and feature quantities may be produced as co-occurrence information in identifiers and feature quantities other than those the user specified for search, detection, and learning of information frequently recorded and played back. Such information may be collected autonomously, or the evaluation of the collected information may be presented by voice or text image to reflect the user's subjectivity.
  • identifiers such as emotions, scales, musical instrument sounds, and environmental sounds recognized from feature quantities obtained from speech, and/or identifiers such as shape, color, characters, and actions recognized from video, as well as program information identifiers, may be categorized by multivariate analysis based on quantification analysis classes I through IV and used as new identifiers in the present invention. They can also be used as an index for search results by evaluating, in three stages from the mean and variance, whether a sample belongs within 1σ, within 2σ, or within 3σ.
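A minimal sketch of the three-stage σ-band index (assuming a scalar score with known mean and variance; the numbers are illustrative):

```python
import math

def sigma_stage(value, mean, variance):
    """Three-stage index: 1 if within 1σ of the mean, 2 if within 2σ,
    and 3 for the 2σ-to-3σ band and beyond."""
    sigma = math.sqrt(variance)
    deviation = abs(value - mean)
    if deviation <= sigma:
        return 1
    if deviation <= 2 * sigma:
        return 2
    return 3

# Illustrative: category score of a search hit against the category statistics.
print(sigma_stage(7.5, mean=5.0, variance=1.0))  # deviation of 2.5σ -> stage 3
```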
  • the features in these processes may be composed of scalars, vectors, matrices, arbitrary-order tensors, multi-dimensional arrays, complex numbers, quaternions, octonions, or other hypercomplex numbers.
  • XML: eXtensible Markup Language
  • SOA: Service Oriented Architecture
  • RDF: Resource Description Framework
  • BML: Broadcast Markup Language
  • SMIL: Synchronized Multimedia Integration Language
  • Other markup languages may also be used, such as MathML (Mathematical Markup Language), XPath (XML Path Language), SML (Simple Markup Language), MCF (Meta Contents Framework), DDML (Document Definition Markup Language), DSSSL (Document Style Semantics and Specification Language), DSML (Directory Services Markup Language), DTD (Document Type Definition), GML (Geography Markup Language), and SGML (Standard Generalized Markup Language).
  • SOAP: Simple Object Access Protocol
  • UDDI: Universal Description, Discovery, and Integration
  • WSDL: Web Services Description Language
  • SVG: Scalable Vector Graphics
  • HTML: HyperText Markup Language
  • URI: Uniform Resource Identifier
  • WAP: Wireless Application Protocol
  • XQL: XML Query Language
  • VML: Vector Markup Language
  • URL: Uniform Resource Locator
  • EPG: Electronic Program Guide
  • DLNA: Digital Living Network Alliance
  • Various protocols such as these, and information processing language constructs such as markup language variables, schemas, attributes, arbitrary tags, and functions, may be used in any combination to implement the service.
  • correction information and new information may be expressed, written, and implemented using tags, variables, attributes, and instructions that indicate correction or new information, and convenience can be achieved by combining them with the device examples described above.
  • information input from the outside is not limited to voice or video; it may come from health management measuring instruments such as pulse meters and blood pressure monitors, taste sensors, olfactory sensors, human body sensors, heat sensors, humidity sensors, temperature sensors, illuminance sensors, and other environmental instruments, as well as Raman spectrometers, ultraviolet/infrared/visible spectrophotometers, laser-ablation inductively coupled plasma mass spectrometers, qualitative and quantitative analyzers, fluorescent X-ray elemental analyzers, light-scattering laser tomography instruments, Fourier transform infrared spectrophotometers, soft X-ray transmission devices, colorimeters, Spectrolino-type spectrometers, cap detectors, thermal analysis operation systems, simultaneous differential thermal and thermogravimetric measurement devices, differential scanning calorimeters, thermomechanical analyzers, thermal dilatometers, decomposition gas analyzers, automatic thermal analysis sample changers, humidity generators, plasma graft polymerizers, ultraviolet graft polymerizers, total organic carbon analyzers, gas chromatographs, liquid chromatographs, and the like.
  • such detections can be used as criteria, variables, and attributes for executing processes, and as criteria, variables, and attributes for robots and other behavior indicators; they may also be used for detecting and predicting risks occurring in the human body.
  • the present invention may be implemented on information processing devices, including information retrieval devices, artificial intelligence, and so-called artificial non-intelligence (simple chat bots), such as robots, personal computers, car navigation systems, backbone servers, and communication base stations, or on mobile terminals such as mobile phones, watches, accessory-type terminals, remote controls, PDAs, IC cards, intelligent RFIDs, and body-embedded terminals; if an information processing device is present, the present invention can be implemented on an apparatus including any information processing apparatus or an information distribution apparatus on a line.
  • information support based on location may be implemented in association with location information obtained by a combination of GPS and a geomagnetic location detection system; a co-occurrence matrix with arbitrary identifiers, or a distance function using feature values, may also be used.
  • user preference information may be configured and analyzed based on search conditions frequently used by the user, or the user preference information may be aggregated and subjected to multivariate analysis to create new preference categories; here too, a co-occurrence matrix with arbitrary identifiers or a distance function using feature quantities may be used.
  • advertisements may be provided by any means using a co-occurrence matrix, co-occurrence probabilities, or a distance function based on search conditions combining the above-mentioned arbitrary identifiers and feature amounts. Advertisements may be targeted on the basis of preference by evaluating the similarity of preference information with that of others, may be used for compatibility fortune-telling, and may be presented not only during search but also while waiting for user instructions, such as during learning, while presenting search results, or while the user waits.
  • the phoneme and phoneme-piece information, emotion information, environmental sound information, scale information, and musical instrument information of information distributed using the identifiers described above may be associated with one another, and further associated with image recognition information, face information, color space information, object information within images, and recognized character string information; information may then be registered in the database, the database searched, each content file corrected and changed, and the generation of attached files associated with the content files managed.
  • information registration and information retrieval can be realized easily and with high accuracy.
  • by statistically clustering the registered audio information and video information to be searched, an efficient service for registering recorded information and for browsing the registered contents can also be provided.
  • evaluation functions and HMMs for generating identifiers as described above, and for analyzing identifier categories to form categories, may be configured, and these evaluation functions and their configuration information may be distributed between users.
  • by associating phoneme and phoneme-piece information, emotion information, environmental sound information, scale information, musical instrument information, and the like on the basis of the associated voice information, and also linking image recognition information, face information, color space information, object information in images, motion information, recognized character string information, and recognized symbol information to the information database, setting search conditions from the database, and providing them to other information processing devices, arbitrary information registration and information retrieval can be realized easily and with high accuracy.
  • the above-mentioned operation characteristics may be information on the movement of sound sources outside the image, reflected-wave change information such as echo sounding, feedback or torque information from motors or pressure sensors, or robot operation information and contact information.
  • a symbol string or identifier using phonemes or phoneme pieces as described above may be transmitted to another device to change the processing content of that device, or a symbol string based on phonemes or phoneme pieces may be received from another device.
  • since the general recognition rate is about 60%, evaluation functions such as co-occurrence matrices, co-occurrence probabilities, Bayes functions, HMMs, probability functions, likelihood functions, and distance functions can be constructed based on evaluations against existing identifiers that show a match rate exceeding 60%.
  • new evaluation functions and identifiers may be configured, and arbitrary symbol string matching methods such as DP, CDP, and RIFCDP may be combined; learning efficiency may also be improved by combining these with neural networks, fuzzy logic, chaos, fractal, or genetic algorithms.
  • the information processing apparatus comprises, for example, an information storage unit including a main storage unit and an auxiliary storage unit, a calculation unit that performs information evaluation processing, a communication unit that exchanges information with external devices, an input unit that receives user instructions, and an output unit that presents processing results to the user; devices that can register and retrieve information on this basis include personal computers, backbone servers, and communication base stations. In addition, it is more preferable to use an apparatus that can analyze information using a program that statistically analyzes the information recorded in the database.
  • the service using the present invention and a billing system may be linked to provide added value to the user, realizing information distribution services and agent services that take into account the user's psychology and hobbies.
  • so that the user positively receives the results presented by the robot or agent, an algorithm may be constructed that increases the number of affirmations through a reinforcement learning algorithm and the evaluation function for search; a learning model may thus be constituted in which the robot or agent has a desire to be affirmed by the user and learns autonomously.
  • co-occurrence information based on learning results with low usage frequency may be automatically deleted on the condition of user evaluation and free space, or saved in an external storage device or a communication destination's storage device while the items in the device itself are deleted; alternatively, only an index or a simplified identification function may be kept, with the full information obtained from the outside via a communication line when necessary.
  • the portable information terminal is, for example, a mobile phone, a PDA (Personal Digital Assistant), a notebook computer, a wearable computer, a wristwatch computer, or an in-vehicle computer such as a car navigation system.
  • these information processing devices and portable information terminals include, in any combination necessary for execution, a feature extraction unit, a user information input unit, an information search unit, an information storage unit, and a query information transmission/reception unit.
  • information between these processes may be exchanged and mutually searched via communication networks such as the Internet and intranets, over wireless LAN, infrared communication, mobile phone networks, ordinary LANs, wired lines, or wireless lines; when a markup language is used, a markup language transmission/reception unit and a markup language interpretation unit may be added to the information input unit and the information output unit as necessary.
  • advertisement information may be acquired via a communication line and advertisements attached to content may be presented; the advertisement status may be recorded to verify advertising effectiveness. It is also possible to analyze search co-occurrence information with a high frequency of advertisement conversion, or to present advertisements whose co-occurrence information is highly similar to the co-occurrence information obtained at the time of indexing; such functions may be provided as a service.
  • arbitrary information in the storage unit may reside in the same device, may be acquired from another device via a communication line, or may be obtained from a content search service.
  • if the database and the index search evaluation unit are external to the information processing apparatus, the search system need not be included in the information processing apparatus; it can be realized by enabling communication by any means, whether wired or wireless.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Provided is an information retrieval device that can easily retrieve arbitrary content information by making use of co-occurrence information based on various pieces of input information. A feature quantity is extracted from the visual information, audio information, and character information of content information and from sensor information, and an identifier is created from the extracted feature quantity by means of a feature function. The feature quantity and/or the identifier are stored as index information in relation to the content or the position within the content. An input retrieval condition is converted into a feature quantity and/or an identifier, and the content or the position within the content is specified by using the converted feature quantity and/or identifier to detect the degree of fitness, based on the co-occurrence information between the index information and the retrieval condition, in the neighborhood within the content information.

Description

明 細 書  Specification
情報処理装置およびプログラム  Information processing apparatus and program
技術分野  Technical field
[0001] コンテンツ情報を獲得するコンテンツ情報獲得手段と、検索条件を入力する検索条 件入力手段と、前記コンテンツ情報獲得手段により獲得されたコンテンツ情報から、 前記検索条件入力手段により入力された検索条件に適合するコンテンツ情報又は 当該コンテンツ情報内の位置を特定する特定手段と、を備えた情報処理装置等に関 する。  [0001] Content information acquisition means for acquiring content information, search condition input means for inputting search conditions, and search conditions input by the search condition input means from the content information acquired by the content information acquisition means Content information conforming to the above or a specifying means for specifying a position in the content information.
背景技術  Background art
[0002] 従来、一般的な情報処理装置を用いたコンテンツ情報検索にお 、て、コンテンツ情 報の変化を検出する方法は特許文献 1のように提案されており、特徴量として音量の 変化を用い、一定の閾値を越える個所をハイライトシーンとして捉える方法が提案さ れている。  Conventionally, a method for detecting a change in content information in a content information search using a general information processing apparatus has been proposed as in Patent Document 1, and a change in volume as a feature amount is proposed. A method has been proposed that uses a scene that exceeds a certain threshold as a highlight scene.
[0003] ここで、特徴量とは入力された音声や動画などの情報に関し時系列的変化や隣接 画素との変化や指定した範囲内での色や音響周波数等の変化や割合を数量化した 値である。変化の割合を数値に変換する方法としては、種々の方法が考えられるが、 例えば、音声であればケプストラムや FFTを用いて周波数軸の変化に基づく数値に 変換したりする方法が考えられ、映像であれば時系列的変化や隣接画素における輝 度や色相の差分値や相対値や絶対値として数値にしたりする方法が考えられ、より 詳しくは変形例に別途後述する。  [0003] Here, the feature amount is a quantification of time-series changes, changes to neighboring pixels, changes in color, acoustic frequency, etc. within a specified range for information such as input audio and video. Value. Various methods can be considered for converting the rate of change into numerical values.For example, for audio, a method of converting to a numerical value based on changes in the frequency axis using a cepstrum or FFT can be considered. Then, it is possible to consider a method of making a numerical value as a time-series change, a difference value, a relative value, or an absolute value of luminance and hue in adjacent pixels, and will be described later in detail in a modification example.
[0004] また、コンテンツ情報に対し音声による検索を実行する場合、主人公の名前のよう な固有名詞は辞書に登録されていないことも多く検出を行うこが困難であったため非 特許文献 1のように語彙に依存しない音素認識を応用した検索技術として、任意のキ 一ワードを検索する方法が提案されており、この検索技術の基本となる音素認識や 応用技術である音素片認識は特許文献 2にあるように古くからの公知技術として用い られ、音素辞書を用いて装置を制御するユーザインタフェースとして非特許文献 2の ように音素辞書と音素認識により装置制御方法を辞書に登録する方法が説明されて いる。 [0004] Also, when performing a search by voice on content information, it is difficult to detect proper names such as the main character's name because they are not registered in the dictionary. As a search technology that applies vocabulary-independent phoneme recognition, a method of searching for arbitrary key words has been proposed. As described in Non-Patent Document 2, a method for registering a device control method in a dictionary by phoneme recognition and a phoneme dictionary as a user interface for controlling the device using a phoneme dictionary is described. The Yes.
[0005] また、このような技術の応用として非特許文献 3によれば、音素認識による音素記号 列や画像認識による検索する方法が提案されており、例えば、「静止画→単語集合 →テキスト→音声→動画」として画像に関連付けられた文字列を音素列や音素片列 に変換したり、音素列や音素片列を文字列に相互に変換して連鎖的に検索したりす る方法が提案されている。  [0005] Further, according to Non-Patent Document 3 as an application of such technology, a phoneme symbol string based on phoneme recognition and a search method based on image recognition have been proposed. For example, "still image → word set → text → A method has been proposed in which a character string associated with an image is converted to a phoneme sequence or a phoneme segment sequence, or a phoneme sequence or a phoneme segment sequence is converted into a character string and linked to each other as a `` voice → video ''. ing.
[0006] また、特許文献 3によれば、音素及び Z又は音素片による記号列を地理的な位置 情報と関連付けてデータベースに登録し、市街情報に多い固有名詞を伴う情報の検 索と提供を実現する情報配信装置と受信装置が提案されており、特許文献 4によれ ば音素片認識により索引付けされた音声情報の検索が提案されており、それらの引 用文献にも関連技術が提案されている。  [0006] Further, according to Patent Document 3, a phoneme and a symbol string based on Z or phoneme pieces are registered in a database in association with geographical position information, and search and provision of information with proper nouns that are common in city information is performed. An information distribution device and a receiving device have been proposed, and according to Patent Document 4, retrieval of speech information indexed by phoneme recognition is proposed, and related techniques are also proposed in these cited documents. ing.
[0007] また、他の認識技術に関しても、声の特徴情報から感情を認識する技術が特許文 献 5に開示されており、音階や楽器の検出技術に関しては非特許文献 4による提案 がなされており、動画像や静止画像を認識し、文字列などを検出することで、検出さ れた文字列に基づいて検索を実行する方法が特許文献 6により提案されており、特 許文献 7等によりジヱスチヤ認識や動作認識と呼ばれる画像力 動作を認識する方 法が提案されており、特許文献 8によれば顔画像の認識を行う方法が提案されてい るように、近年多様な入力に対する認識技術が提案 '発明されて ヽる。  [0007] Also, with respect to other recognition techniques, a technique for recognizing emotions from voice feature information is disclosed in Patent Document 5, and a technique for detecting scales and musical instruments has been proposed in Non-Patent Document 4. A method for performing a search based on a detected character string by recognizing a moving image or a still image and detecting a character string or the like has been proposed by Patent Document 6, and Patent Document 7 or the like. A method for recognizing image power and motion, called gesture recognition and motion recognition, has been proposed. According to Patent Document 8, a method for recognizing facial images has been proposed. Proposal 'Invented.
[0008] また、文章内の単語や文字の同一文中における同時出現頻度に基づいた共起関 係を共起確率や共分散行列を用いて計測し意味を推定するための文章特徴を抽出 する方法として特許文献 9やそれらの引用文献に基づく方法が提案されているが、複 数の認識に基づく情報に関し時系列的に近い情報を組合せることで特定のシーン特 徴を抽出'学習し検索によってコンテンッゃコンテンッ内の時間軸上の位置や表示 画面上の位置や音読上の位置を特定するために用いると!、う方法は提案されて 、な い。  [0008] Further, a method for extracting a sentence feature for estimating a meaning by measuring a co-occurrence relation based on the co-occurrence probability based on the co-occurrence frequency in the same sentence of words and characters in the sentence. Patent Document 9 and methods based on those cited documents have been proposed, but specific scene features can be extracted by combining information that is based on multiple recognitions in a time-series manner. There is no method proposed to specify the position on the time axis in the content, the position on the display screen, or the position on the reading aloud!
[0009] A state in which several different pieces of information occur in positional proximity to one another is generally known as "co-occurrence", and is also referred to as a "co-occurrence relation", "co-occurrence state", or "co-occurrence information". Information occurring in the vicinity of a given piece of information can be combined and used to evaluate the conditions under which arbitrary information occurs, and covariance matrices based on co-occurrence probabilities and co-occurrence information are used, for example, for estimating the meaning of sentences. In the present invention, positional proximity should be understood as spatiotemporal proximity based on time-series position, reading-aloud position, or display position. For example, in the sentence "a person is crying", "person" and "cry" appear in the same sentence and are therefore in positional proximity, so they can be said to be in a co-occurrence relation.
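As an illustration of the paragraph above, the following minimal Python sketch counts pairwise co-occurrence of tokens that share one unit of positional proximity (here a sentence; a frame or time window would work the same way). The tokenized sentences are hypothetical stand-ins, not data from the specification.

```python
from collections import Counter
from itertools import combinations

# Each sentence is one unit of "positional proximity"; tokens that
# appear in the same sentence are counted as co-occurring.
sentences = [
    ["person", "cry"],        # "a person is crying"
    ["person", "laugh"],
    ["explosion", "scream"],
]

cooccurrence = Counter()
for tokens in sentences:
    # Count each unordered pair of distinct tokens once per sentence.
    for a, b in combinations(sorted(set(tokens)), 2):
        cooccurrence[(a, b)] += 1

# "cry" and "person" co-occur once, so they stand in a co-occurrence relation.
print(cooccurrence[("cry", "person")])  # -> 1
```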
[0010] Patent Document 10 proposes a method of indexing content information in a sensitivity-word space, and Non-Patent Document 5 proposes a method of searching video and audio by giving them a character-string index based on utterance content. However, neither proposes constructing an evaluation function for search using co-occurrence relations based on the recognition results within the content information, or on the feature quantities and identifiers used for recognition.
[0011] Nor has any method been proposed that, in situations where humans respond flexibly to circumstances (product reputation surveys at call centers, taste-based searches of content information such as moving images, patient nursing in medical settings, or the reactions of the virtual personality of a robot or agent), or in imitations of such situations, performs evaluation using information based on co-occurrence relations constructed from multiple feature quantities and identifiers (symbols that discriminate features) obtained from the environment, performs detection based on the evaluation result, and provides information and processing highly convenient to the user.
[0012] By using the present invention, it is therefore possible, in an environment such as a call center handling telephone responses as in Patent Document 11, to extend the functions of a system that assigns an operator capable of smooth communication by evaluating the compatibility between operator and customer, or to improve the method of Patent Document 12, which extracts video feature quantities frame by frame and searches by evaluating whether the feature quantities match. To analyze such information, multivariate analysis using Patent Document 13 may be performed to analyze the co-occurrence relations.
[0013] Many conventional applications and documents confuse phonemes with syllables. In the present invention, taking the Japanese pronunciation "a-ka-sa-ta-na" (あかさたな) as an example, the syllable notation is "あ/か/さ/た/な" or "a/ ka/ sa/ ta/ na"; the phoneme notation is "a/ k/ a/ s/ a/ t/ a/ n/ a" or "a/ cl/ k/ a/ s/ a/ cl/ t/ a/ n/ a"; and the phoneme-piece notation, in the bigram case, is "a/ a-k/ k/ k-a/ a/ a-s/ s/ s-a/ a/ a-t/ t/ t-a/ a/ a-n/ n/ n-a/ a" or "a/ a-cl/ cl/ cl-k/ k/ k-a/ a/ a-s/ s/ s-a/ a/ a-cl/ cl/ cl-t/ t/ t-a/ a/ a-n/ n/ n-a/ a", while a trigram example is "a-a-a/ a-cl-cl/ cl-cl-cl/ cl-cl-k/ cl-k-k/ k-k-a/ a-a-a/ a-a-s/ s-s-s/ s-a-a/ ... / t-a-a/ a-a-n/ n-n-n/ n-a-a/ a-a-a". Phoneme pieces may also be obtained by division at arbitrary positions within a phoneme, such as its first, middle, and last parts. Here /cl/ denotes the silent or unvoiced portion before the release of an unvoiced plosive, and both phonemes and phoneme pieces may be written with different notation symbols under any refinement.
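The bigram phoneme-piece notation above can be produced mechanically from a phoneme sequence. The sketch below reproduces that expansion for the "a ka sa ta na" example; it is a symbolic illustration only, since actual phoneme pieces are derived from acoustic segmentation of the waveform rather than from the symbol string.

```python
def phoneme_bigrams(phonemes):
    """Expand a phoneme sequence into bigram-style phoneme pieces:
    each steady phoneme plus each transition piece between phonemes."""
    pieces = [phonemes[0]]
    for prev, cur in zip(phonemes, phonemes[1:]):
        pieces.append(f"{prev}-{cur}")  # transition piece
        pieces.append(cur)              # steady piece
    return pieces

phonemes = ["a", "k", "a", "s", "a", "t", "a", "n", "a"]
print("/".join(phoneme_bigrams(phonemes)))
# a/a-k/k/k-a/a/a-s/s/s-a/a/a-t/t/t-a/a/a-n/n/n-a/a
```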
[0014] To explain the difference between phoneme and phoneme-piece recognition and ordinary speech recognition: unlike general speech recognition, phoneme recognition and phoneme-piece recognition do not interpret meaning or content. More specifically, because they do not use a grammar-related language model, they do not capture meaning in the recognition result, do not convert it into meaning-bearing symbols such as kanji, do not disambiguate homophones or words with the same pronunciation but different written forms, and do not discriminate parts of speech, such as noun versus verb, according to context. They analyze the uttered sound using an acoustic model for each phonetic symbol and evaluate only the match between the uttered-sound symbols and the recognition symbols.
[0015] A "phoneme" refers to the vowels and consonants that are the elements composing speech, while a "phoneme piece" is an element obtained by dividing a single phoneme more finely, for example into the beginning of "a", the middle of "a", and the end of "a", or an intermediate sound such as the sound between "a" and "i". It is a notation that reflects how phonemes change within uttered speech, and these may also be written as "phoneme identifier" and "phoneme-piece identifier".
Patent Document 1: JP 2004-233541 A
Patent Document 2: JP 62-220998 A
Patent Document 3: JP 2004-54915 A
Patent Document 4: JP 2002-221984 A
Patent Document 5: JP 2002-91482 A
Patent Document 6: JP 2002-14973 A
Patent Document 7: JP 09-330400 A
Patent Document 8: JP 05-153581 A
Patent Document 9: JP 07-36883 A
Patent Document 10: JP 2005-107718 A
Patent Document 11: JP 2004-280158 A
Patent Document 12: JP 10-320400 A
Patent Document 13: Japanese Patent Application No. 2005-147048
Non-Patent Document 1: Masayuki Nakazawa, Takashi Endo, Kiyoshi Furukawa, Jun Toyoura, Takashi Oka (Real World Computing Partnership), "Study of speech summarization and topic summarization using phoneme-piece symbol sequences of speech waveforms", IEICE Technical Report, SP96-28, pp. 61-68, June 1996.
Non-Patent Document 2: "Research and Development on a Life-Support Interface for an Aging Society", Key Project Research Report of the Aomori Prefecture Industrial Research Center, Vol. 5, Apr. 1998 - Mar. 2001, 031.
Non-Patent Document 3: Takashi Oka, Hironobu Takahashi, Takuichi Nishimura, Nobuhiro Sekimoto, Yasuhide Mori, Masanori Ihara, Hiroaki Yabe, Hiroki Hashiguchi, Hiroshi Matsumura, "An algorithm map for pattern search: what supports 'CrossMediator'", SIG Notes, Vol. 1, pp. 1-6, Japanese Society for Artificial Intelligence, 2001.
Non-Patent Document 4: Masahiro Tani, "Integration of musical instrument sound features using Bayesian networks and its application to instrument identification", 2003 IEICE General Conference, "D-14 Speech and Hearing", D-14-21, p. 188, March 2003.
Non-Patent Document 5: Katashi Nagao, "Semantic Transcoding: Toward a More Practical Semantic Web", Survey Research on Human-Centered Intelligent Information Technology, VI-3.6, Advanced Information Technology Research Institute, Japan Information Processing Development Corporation, March 2003.
Disclosure of the Invention
Problems to be Solved by the Invention
[0016] Conventional search has generally relied on methods that search using character strings or audio information associated with images and video, or that evaluate identifiers and feature quantities obtained by a single recognition method or feature extraction method. There was therefore the problem that searches based on abstract concepts that are hard to express in language, searches based on sensory concepts such as the excitement of a scene, and searches reflecting individual tastes and subjectivity were difficult.
[0017] Although Non-Patent Document 3 performs search using phoneme symbols obtained by phoneme recognition as identifiers, no method has been proposed that combines, as co-occurrence information, identifiers and feature quantities from multiple recognition methods, such as image feature quantities obtained from image or video information, image identifiers from image recognition, motion identifiers from motion recognition, emotion identifiers from emotion recognition on audio information, and phoneme identifiers from phoneme recognition, constructs a covariance matrix from that information, and constructs new evaluation functions for indexing and search.
[0018] The inventors therefore considered that, by creating evaluation functions based on the co-occurrence relations of the identifiers and feature quantities obtained as the result of such diverse recognition, and performing search and indexing with them, previously impossible abstract searches such as the excitement of a scene would become possible. Furthermore, by letting users or producers give appropriate names to any evaluation function constructed as an analysis result, and generating phoneme sequences or phoneme-piece sequences from the named character string, a highly convenient search environment could be realized in which users specify search conditions using the evaluation functions and indexes they have constructed and named, and in which constructed evaluation functions can be exchanged and distributed.
[0019] As stated above, regarding techniques for the co-occurrence of information, Patent Document 9 and the documents it cites propose methods of extracting sentence features for estimating meaning by measuring, with co-occurrence probabilities and covariance matrices, co-occurrence relations based on the frequency with which words and characters appear together in the same sentence. The present invention, by contrast, is characterized by using co-occurrence information such as co-occurrence probabilities, covariance matrices, and co-occurrence matrices built from identifiers extracted by various recognition methods and from the feature quantities used to recognize them.
[0020] When various devices are examined in light of these problems, it is, for example, difficult to search for the "signature lines" commonly spoken of in movie series; it is even more difficult to judge whether a similar line is being used as material in a comedy program, and likewise difficult to discriminate such lines and record them automatically, or to browse while skipping to just the signature lines. It is difficult to judge whether the protagonist's name is being called in tears, in anger, or in joy, and to search according to the excitement of the story. Identifying words by speech recognition from the audio stream of a moving image is difficult for current speech recognition systems; even when phonemes are recognized from the audio stream, a character's name in content such as a movie can be turned into symbols, but it is difficult to symbolize the name of the actor associated with that character; and since character names and actor names in a video stream are character-string symbols, searching can be carried out only with character-string symbols. There was also the problem that the emotional excitement of a scene in video or audio could not be searched.
[0021] This problem arises from several factors: it had been assumed that the search and detection intended by the user could easily be achieved mainly by recognizing the spoken words and image information in the content, whereas actual content information is not something a single recognition result can capture; recognition was conventionally attempted at the word level, whereas in content, sounds that do not form words, such as screams and cries, affect the excitement of a scene; simple recognition, indexing, and search cannot narrow down the search results; the emotions recognizable in the voices occurring in a scene were not taken into account; and no method had been realized that recognizes and indexes phoneme sequences based both on environmental sounds such as mechanical sounds and explosions and on utterances, and detects, based on co-occurrence information, the intervals in which they occur almost simultaneously.
[0022] In addition, recognition of phonemes and phoneme pieces differs, and interpretation is biased, from language to language, and the notation of phoneme symbol strings cannot necessarily be unified across people with different native languages. International use is therefore not sufficiently practical; versatility when providing information to arbitrary terminals is low; and the differences between international phonetic symbols and the phoneme symbols of regional languages cannot be sufficiently absorbed.
[0023] Furthermore, in a CRM system that records and analyzes dialogues with consumers, it is difficult to grasp, record, and analyze customers' evaluations of products objectively and quantitatively while recording the voice features of dialogues at a consumer consultation desk, and it is difficult for the consultation-desk operator to obtain the manual of the product in question immediately from the state of the dialogue.
[0024] In karaoke and the like, when one wants to sing a song whose title one does not know, or when searching music data, it is difficult to search by the emotional excitement of the music or video, to search through enormous numbers of music and video titles, to search for the position where a specific keyword appears, or to search from the hook of the lyrics.

[0025] Conventional text searches over EPG, BML, RSS, teletext, and the like require cumbersome input. No search had been implemented that generates, from information extracted from video and audio streams, phoneme symbols and phoneme-piece symbols, emotion identifiers for identifying emotions, instrument identifiers for identifying musical instruments, scale identifiers for identifying musical scales, speech features for identifying language, phonemes, phoneme pieces, emotions, instruments, and scales, acoustic features for identifying the reverberation of indoor sounds and the positions of sounds, and image features and image recognition results for discriminating the shapes and movements of landscapes, people, objects, animals, characters, and so on, and that combines them to target abstract concepts such as the excitement of a scene. Searches with a high degree of freedom therefore could not be performed. These combinations are described in more detail later.
[0026] According to the prior art proposed as a general countermeasure to such problems, methods have been proposed that realize mutual search by converting between phoneme sequences and character strings. However, it was not possible to perform composite searches using indexes produced by recognition under different evaluation criteria, such as evaluating and learning co-occurrence states based on image recognition results, emotion recognition results, and phoneme recognition results, or carrying out more complex searches using the learning results.
[0027] In the prior art, emotion recognition based on facial expressions exists, but no method has been proposed for associating facial-expression images from image input with the phoneme sequences or phoneme-piece sequences and emotion identifiers obtained from speech input, and classifying, performing search evaluation on, and learning from them; nor is recognition by phonemes and phoneme pieces proposed there. Consequently, no use has been made of such techniques for searching and detecting relevant scenes, with their emotions, phoneme sequences, and image features, from content such as movies and dramas, starting recording based on a detection, playing back, skip-playing disliked parts, playing announcements, delivering e-mail, or generating RSS. The prior art therefore does not solve the problems addressed by the present invention concerning search, detection, and indexing involving voice input that takes account of user emotions and the emotions expressed in content. Moreover, since the present invention is not a device that generates or controls emotions or sensibilities, its field of invention as a device also differs.
[0028] Furthermore, a system such as that of Non-Patent Document 3 can segment images uniformly, expand word character strings statistically associated with the segmented image features into phonemes and phoneme pieces so as to search based on utterances, and search for the places in a video where speech occurs. It cannot, however, statistically classify on the basis of co-occurrence states that combine the specific image-feature, emotion-feature, and voice-feature tendencies accompanying recognition, construct evaluation functions and assign identifiers to them, associate the phoneme sequences and phoneme-piece sequences produced by uttering the name that denotes an identifier's target, or construct indexing evaluation functions for searching for those identifiers.
[0029] For this reason, it is impossible to search based on a trend analysis in which an uttered phoneme sequence or phoneme-piece sequence is associated with image features, or image features are associated with emotion identifiers; and since co-occurrence information including emotion identifiers is not used, searches touching the excitement of scenes in content information, such as "an explosion scene accompanied by screams" or "a scene where the protagonist's name is shouted through tears", could not be performed.
[0030] As described above, conventional search technology has had difficulty realizing searches with a high degree of freedom that take human senses, tastes, subjectivity, and emotions into account. An information gap known as the digital divide therefore arises between people who are poor at complex input to search devices and those who are good at it, which has become a general problem in the information society.
[0031] In view of the problems described above, an object of the present invention is to provide an information search device and the like that can easily search arbitrary content information by using co-occurrence information based on various kinds of input information.
Means for Solving the Problems
[0032] To solve the above problems, an information processing device according to a first invention comprises: content information acquisition means for acquiring content information; search condition input means for inputting a search condition; and specifying means for specifying, from the content information acquired by the content information acquisition means, content information that matches the search condition input by the search condition input means, or a position within that content information. The device further comprises: feature quantity extraction means for extracting a feature quantity from the content information; identifier generation means for generating an identifier from the feature quantity extracted by the feature quantity extraction means using an evaluation function; index information storage means for storing the feature quantity and/or the identifier as index information in association with the content or a position within the content; and search condition conversion means for converting the search condition input by the search condition input means into a feature quantity and/or an identifier. The specifying means has search specifying means for specifying the content or a position within the content by detecting a match between the index information and the search condition using the feature quantity and/or identifier converted by the search condition conversion means.
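As a concrete reading of the index information of the first invention, the following sketch stores extracted feature quantities and identifiers keyed to a content ID and a position within the content, and specifies positions whose entries match all identifiers converted from a search condition. The field names, example values, and the all-identifiers matching rule are illustrative assumptions, not part of the specification.

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    content_id: str
    position: float   # e.g. seconds from the start of the content
    identifier: str   # identifier produced by an evaluation function
    feature: tuple    # raw feature quantity the identifier came from

index = [
    IndexEntry("movie01", 12.0, "phoneme:a", (0.31, 0.12)),
    IndexEntry("movie01", 12.0, "emotion:anger", (0.88,)),
    IndexEntry("movie01", 95.5, "phoneme:k", (0.27, 0.44)),
]

def search(index, wanted_identifiers):
    """Return (content_id, position) pairs whose index entries contain
    every identifier converted from the search condition."""
    hits = {}
    for e in index:
        hits.setdefault((e.content_id, e.position), set()).add(e.identifier)
    return [k for k, ids in hits.items() if wanted_identifiers <= ids]

print(search(index, {"phoneme:a", "emotion:anger"}))  # [('movie01', 12.0)]
```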
[0033] An information processing device according to a second invention comprises: content information acquisition means for acquiring content information; search condition input means for inputting a search condition; and specifying means for specifying, from the content information acquired by the content information acquisition means, content information that matches the search condition input by the search condition input means, or a position within that content information. The device further comprises: feature quantity extraction means for extracting a plurality of different feature quantities from the content information; identifier generation means for generating a plurality of different identifiers from the plurality of different feature quantities extracted by the feature quantity extraction means using evaluation functions; index information storage means for storing the plurality of different feature quantities and/or identifiers as index information in association with the content or a position within the content; and search condition conversion means for converting the search condition input by the search condition input means into a plurality of different feature quantities and/or identifiers. The specifying means has search specifying means for specifying the content or a position within the content by detecting a match between the index information and the search condition using the plurality of different feature quantities and/or identifiers converted by the search condition conversion means.
[0034] In a third invention, in the information processing device of the first or second invention, the index information storage means further stores co-occurrence information, constructed based on the feature quantities and/or identifiers acquired from the content, in association with the content or a position within the content. The device further comprises search condition co-occurrence information construction means for constructing, as search condition co-occurrence information, co-occurrence information based on the feature quantities and/or identifiers converted from the search condition by the search condition conversion means. The search specifying means has co-occurrence search specifying means for specifying the content or a position within the content by detecting a match between the search condition co-occurrence information constructed by the search condition co-occurrence information construction means and the index co-occurrence information.
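A minimal sketch of the co-occurrence matching of the third invention: both the stored index side and the search-condition side are reduced to co-occurrence vectors over a shared identifier vocabulary, and a match is detected by similarity. The cosine measure, the 0.5 threshold, and the tiny vocabulary are illustrative assumptions.

```python
import numpy as np

VOCAB = ["phoneme:a", "phoneme:k", "emotion:joy", "emotion:anger"]

def cooccurrence_vector(identifiers):
    """Binary co-occurrence vector: which identifiers occur together
    in one unit (frame, scene, or search condition)."""
    return np.array([1.0 if v in identifiers else 0.0 for v in VOCAB])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

index_cooc = {  # per-scene co-occurrence information stored as index
    ("movie01", 12.0): cooccurrence_vector({"phoneme:a", "emotion:anger"}),
    ("movie01", 95.5): cooccurrence_vector({"phoneme:k", "emotion:joy"}),
}
query = cooccurrence_vector({"phoneme:a", "emotion:anger"})

for pos, vec in index_cooc.items():
    if cosine(query, vec) > 0.5:   # illustrative threshold
        print("match:", pos)       # -> match: ('movie01', 12.0)
```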
[0035] In a fourth invention, in the information processing device of any one of the first to third inventions, the content includes character information, and the identifier generation means generates identifiers based on the character information.
[0036] In a fifth invention, the information processing device of the fourth invention further comprises dictionary information storage means for storing character information and identifiers in association with each other as dictionary information, and the identifier generation means generates identifiers from the character information included in the content using the dictionary information.
[0037] In a sixth invention, the information processing device of any one of the first to fifth inventions further comprises standard pattern dictionary information storage means for storing, in the dictionary information storage means, the identifiers and standard patterns in association with each other as standard pattern dictionary information, and further has identifier feature quantity conversion means for converting an identifier into a feature quantity based on a standard pattern by using the standard pattern dictionary information.
[0038] In a seventh invention, in the information processing device of any one of the first to sixth inventions, the index information storage means further stores the feature quantities and/or identifiers in association with the content or a position within the content based on the real time of the content information, and the specifying means is means for detecting a match between the index information and the search condition from content distributed in real time.
[0039] In an eighth invention, the information processing device of any one of the first to seventh inventions is characterized by presenting, during a search of content information and/or with respect to the search results or detection results, advertisement information associated through the co-occurrence information and/or the index information.
[0040] In a ninth invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the phoneme information used for phoneme recognition of the content, or a phoneme identifier generated from the phoneme information.
[0041] In a tenth invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the phoneme-piece information used for phoneme-piece recognition of the content, or a phoneme-piece identifier generated from the phoneme-piece information.
[0042] In an eleventh invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the emotion information used for emotion recognition of the content, or an emotion identifier generated from the emotion information.
[0043] In a twelfth invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the auditory information used for recognition based on auditory information of the content, or an identifier generated from the auditory information.
[0044] In a thirteenth invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from the visual information used for recognition based on visual information of the content, or an identifier generated from the visual information.
[0045] In a fourteenth invention, in the information processing device of the second invention, the content includes character information, and at least one of the plurality of different feature quantities extracted by the feature quantity extraction means or of the identifiers generated by the identifier generation means is a feature quantity extracted from the character information or an identifier generated from the character information.
[0046] In a fifteenth invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means or of the plurality of different identifiers generated by the identifier generation means is a feature quantity extracted from program information or an identifier based on program information.
[0047] In a sixteenth invention, in the information processing device of the second invention, at least one of the plurality of different feature quantities extracted by the feature quantity extraction means or of the plurality of different identifiers generated by the identifier generation means is a feature quantity extracted from sensor information or an identifier based on sensor information.
[0048] In a seventeenth invention, the information processing device of the third invention comprises evaluation function reconstruction means for reconstructing the evaluation function from co-occurrence information constructed based on the feature quantities and/or identifiers acquired from the content.
[0049] In an eighteenth invention, the information processing device of the third invention comprises evaluation function reconstruction means for reconstructing the evaluation function from co-occurrence information constructed based on the feature quantities and/or identifiers converted from the search condition by the search condition conversion means.
[0050] In a nineteenth invention, the information processing device of the third invention comprises search result co-occurrence information construction means for constructing co-occurrence information based on the results of specifying content or positions within content by the co-occurrence search specifying means, and comprises evaluation function reconstruction means for reconstructing the evaluation function from the co-occurrence information constructed by the search result co-occurrence information construction means.
[0051] A twentieth invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with each other as an index, phoneme feature quantities for use in phoneme recognition extracted from the content and/or phoneme identifiers obtained by phoneme recognition, and emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0052] A twenty-first invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with each other as an index, phoneme-piece feature quantities for use in phoneme-piece recognition extracted from the content and/or phoneme-piece identifiers obtained by phoneme-piece recognition, and emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0053] A twenty-second invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with one another as an index: phoneme feature quantities for use in phoneme recognition extracted from the content and/or phoneme identifiers obtained by phoneme recognition; emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition; and first feature quantities for use in a first recognition extracted from the content and/or first identifiers obtained by the first recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0054] A twenty-third invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with one another as an index: phoneme-piece feature quantities for use in phoneme-piece recognition extracted from the content and/or phoneme-piece identifiers obtained by phoneme-piece recognition; emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition; and first feature quantities for use in a first recognition extracted from the content and/or first identifiers obtained by the first recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0055] A twenty-fourth invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with each other as an index, phoneme feature quantities for use in phoneme recognition extracted from the content and/or phoneme identifiers obtained by phoneme recognition, and first feature quantities for use in a first recognition extracted from the content and/or first identifiers obtained by the first recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0056] A twenty-fifth invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with each other as an index, phoneme-piece feature quantities for use in phoneme-piece recognition extracted from the content and/or phoneme-piece identifiers obtained by phoneme-piece recognition, and first feature quantities for use in a first recognition extracted from the content and/or first identifiers obtained by the first recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0057] A twenty-sixth invention is an information processing device comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene in the content, and specifying means for specifying contents that match the search condition from among the content stored in content storage means. The device comprises index recording means for recording, in association with each other as an index, emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition, and first feature quantities for use in a first recognition extracted from the content and/or first identifiers obtained by the first recognition. The specifying means has index specifying means for specifying, from the content, contents that match the search condition based on the index information recorded by the index recording means.
[0058] In a twenty-seventh invention, in the information processing device of any one of the twenty-second to twenty-sixth inventions, the first identifier and/or first feature quantity is an identifier and/or feature quantity based on auditory information and/or visual information and/or character information and/or sensor information.
[0059] Regarding the present invention, the inventors considered that evaluation using co-occurrence information, which makes it possible to combine information occurring in the vicinity of a given piece of information and use it for probabilistic evaluation of the conditions under which arbitrary information occurs, could be exploited so that such co-occurrence-relation information is used for search and learning, in order to apply it to searches grounded in estimating the meaning of content information.
[0060] For example, in an action movie, a "scream", an "explosion sound", and an "explosion image appearing as a screen change with radial movement in red and yellow image information" are evaluated and interpreted as information having co-occurrence characteristics, and search and learning are carried out accordingly, so as to solve the problems described above.
[0061] More specifically, combining the various recognition methods of the prior art, phonemes and phoneme pieces, emotions, and image features are recognized frame by frame from the audio stream and video stream of a moving image, and the moving image is indexed using the identifiers obtained as recognition results. At the same time, the co-occurrence probabilities of the identifiers are constructed for each frame, the transitions of the co-occurrence probabilities are aggregated over multiple frames based on the co-occurrence matrix, the covariance matrix is obtained, and an evaluation function is constructed from the eigenvalues and eigenvectors of the covariance matrix.
[0062] By then indexing the content information using the constructed evaluation function, indexing can be carried out according to the co-occurrence information of the diverse recognition results within the content information. At this point, the evaluation functions may be reconstructed by multivariate analysis so as to increase their number arbitrarily; evaluation function names may be defined manually from the image and audio tendencies detected by the added evaluation functions so that they can be selected as conditions at search time; the evaluation functions may be reconstructed based on the user's operations on the search results; and learning by HMM may be carried out instead of relying only on eigenvalues and eigenvectors.
[0063] With an index based on evaluation functions constructed in this way, the user can obtain previously unobtainable search results based on combinations of image features, acoustic features, and detected emotions, and the capability of reconstructing the evaluation functions to match the usage situation makes it possible to search for content information that better fits the user's own subjectivity.
[0064] Then, by indexing content based on the constructed evaluation functions and detecting arbitrary tendencies in content being distributed, searches impossible with conventional index search by phonemes or phoneme pieces can be carried out, enabling search matched to a person's tastes and preferences and the recording, collection, distribution, and reuse of information. The inventors considered that this overcomes previously insurmountable problems; the information processing device thereby realizes detection of content matching the user's preferences, scene search, product reputation surveys, consideration of a driver's emotions, and medical uses through the detection of moans and emotions.
[0065] Conventional search based on co-occurrence information has been used exclusively for word search in speech recognition and for search using the co-occurrence information of word strings within documents, and it has generally been used to interpret context and identify sentences from the co-occurrence probabilities of recognized character strings across a wide variety of texts.
[0066] In the present invention, however, attention is paid to this use of co-occurrence information: by constructing evaluation functions that evaluate the co-occurrence states not of combinations of conventional word information, but of the phoneme sequences and phoneme-piece sequences contained in the audio of content information, emotion identifiers from emotion recognition, and image features and image-related identifiers, search and detection are performed based on the image features and audio features that characteristically co-occur in a given scene of a video. This is the main point of the invention.
[0067] In this way, the present invention records, classifies, and accumulates the co-occurrence states of images, phonemes, and emotions that co-occur without humans being conscious of it, and makes it possible to construct identifiers again from the recorded, classified, and accumulated information and use them for search and detection, thereby solving problems that searches by conventional simple recognition could not. Since search based on co-occurrence relations can be performed even with auditory information, visual information, or character information alone, various applications become possible by using the co-occurrence relation between utterance recognition and emotion recognition that arises in the voice information of telephone support; it can be judged applicable to medicine, customer consultation, sales, and marketing, and it can also be used as a tool for robots and for video production and editing.
[0068] More specifically, in step with the time-series changes of moving image content information, a co-occurrence matrix with 250 elements in total, consisting of 30 phonemes (vowels, consonants, and silence), 4 emotions (joy, anger, sorrow, and calm), and the 216 colors of the Web Color color space (Web Color is also called the "web-safe colors" or "browser-common colors"), is constructed for each frame to obtain co-occurrence probabilities. These are aggregated over 90 frames (3 seconds) to construct the covariance matrix of the co-occurrence probabilities, and the eigenvalues and eigenvectors of the covariance matrix are obtained to construct evaluation functions.
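A numerical sketch of the computation described in paragraphs [0061] and [0068], with the per-frame co-occurrence information simplified to a 250-element binary occurrence vector (30 phonemes, 4 emotions, 216 Web Colors). The random stand-in data, the 30 fps assumption behind the 90-frame window, and the use of only the strongest eigenvector are illustrative, not from the specification.

```python
import numpy as np

N_ELEMENTS = 30 + 4 + 216   # phonemes + emotions + Web Colors = 250
WINDOW = 90                 # 90 frames = 3 seconds, assuming 30 fps

# Stand-in for recognition output: one binary occurrence vector per frame
# (1 = that phoneme / emotion / color element was detected in the frame).
rng = np.random.default_rng(0)
frames = (rng.random((WINDOW, N_ELEMENTS)) < 0.05).astype(float)

# Covariance of the per-frame occurrence vectors over the 90-frame window.
cov = np.cov(frames, rowvar=False)       # 250 x 250

# Eigen-decomposition; eigenvectors with the largest eigenvalues define
# the evaluation functions used for indexing and search.
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
evaluation_fn = eigvecs[:, -1]           # strongest co-occurrence axis

def evaluate(frame_vector):
    """Score one frame against the learned co-occurrence axis."""
    return float(evaluation_fn @ frame_vector)

scores = np.array([evaluate(f) for f in frames])
print("frames above threshold:", np.flatnonzero(scores > scores.mean()))
```

Re-indexing then amounts to scoring every frame with each such evaluation function and recording the frames whose scores exceed a threshold, which corresponds to the frame-by-frame indexing described in the next paragraph.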
[0069] By re-indexing the content information frame by frame with the evaluation function constructed in this way, indexing based on co-occurrence states becomes possible. Multivariate analysis is performed using the evaluation function constructed in this manner, and each classified piece of information can be given a function name or identifier manually, or a character string with a high probability of co-occurring with a function obtained by the multivariate analysis can be given as the function's name or identifier, so that it can be used in accordance with instructions from the user.
[0070] Then, on the basis of the phoneme sequences or phoneme-segment sequences of natural utterances and of emotion identifiers, it becomes possible to convert them into arbitrary word character strings, to search speech directly by phoneme sequence or phoneme-segment sequence, to convert registered keywords into phoneme sequences or phoneme-segment sequences, to register phoneme sequences absent from the dictionary, to convert identifiers obtained as image-feature recognition results into phoneme sequences or phoneme-segment sequences, and to construct a dictionary based on the co-occurrence information of the video- and emotion-related identifiers associated with those phoneme sequences and phoneme-segment sequences.
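A minimal sketch of the keyword/phoneme conversions described above, assuming a toy pronunciation dictionary; the entries and the character-level fallback for unregistered words are illustrative assumptions, not the patent's method:

```python
# Illustrative pronunciation dictionary: keyword -> phoneme sequence.
phoneme_dict = {"sakura": "s a k u r a"}

def keyword_to_phonemes(keyword: str) -> str:
    """Convert a registered keyword into a phoneme sequence; if the keyword
    is unknown, register a new entry (dictionary registration in [0070])."""
    if keyword not in phoneme_dict:
        # Hypothetical grapheme-to-phoneme fallback for unregistered words.
        phoneme_dict[keyword] = " ".join(keyword)
    return phoneme_dict[keyword]

def phonemes_to_keywords(phonemes: str) -> list[str]:
    """Reverse lookup: keywords whose registered phoneme sequence matches a
    sequence recognized from a natural utterance."""
    return [k for k, p in phoneme_dict.items() if p == phonemes]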
[0071] Then, by performing indexing, search, detection, and learning with an evaluation function constructed from a co-occurrence matrix of identifiers acquired by a plurality of recognition means, search based on the co-occurrence relations between emotion-laden unspecified words and image features becomes possible, which was previously impossible. Scene searches matched to the excitement of the content, such as a scene of violent image change accompanied by screams or a dark scene accompanied by crying, which could not be found by conventional simple methods such as word detection or image-change detection, become possible, and a piece of content, a position on its time axis, a position on the display screen, or a position in read-aloud text can be specified by search. In addition, a new evaluation function for identification can be constructed from the information used for indexing through learning based on the co-occurrence information.
[0072] That is, rather than searching by mutually converting phonemes and index character strings as in the prior art, the present invention performs search and detection through mutual conversion between identifier sequences consisting of phonemes or phoneme segments and the identifiers obtained by the evaluation functions used for recognition; constructs evaluation functions from co-occurrence matrices of phoneme or phoneme-segment identifiers and identifier sequences with other identifiers, identifier sequences, and feature quantities such as emotions and video, and performs indexing, search, detection, and learning with them; and, by carrying out this indexing, search, detection, and learning automatically or recursively on the basis of user instructions, constructs new evaluation functions or updates existing ones. Indexing, search, detection, and learning that reflect the user's intention are thereby realized, solving the problems.
[0073] The identifiers used are not limited to the phonemes and color information described above. The various identifiers, such as "emotion identifiers", "musical-scale identifiers", "environmental-sound identifiers", characters obtained by image recognition, and the "person identifiers" and "object identifiers" accompanying image recognition, denote symbols discriminated from audio or video by probability, likelihood, or distance using evaluation functions or HMMs, with feature quantities suited to each purpose: "emotion identifiers" from emotion recognition, "environmental-sound identifiers" from environmental-sound recognition, "characters" from image recognition, "person identifiers", "facial-expression identifiers", and "object identifiers" from face detection and image recognition, and "motion identifiers" of moving images. Techniques such as video segmentation, still-image segmentation, and audio segmentation may be used for this recognition, and program information, text information, sensor information, and the like may be combined.
[0074] An evaluation function that takes time-series changes into account may also be constructed from the transition probabilities of the co-occurrence probabilities, of the covariance matrices, or of the co-occurrence matrices based on such identifiers and feature quantities. Feature information and identifiers whose time-series changes span multiple frames can be represented in a single matrix space, and an evaluation function accounting for transition probabilities can be constructed by obtaining the eigenvalues and eigenvectors of that matrix space; alternatively, the same evaluation function can be applied to co-occurrence information based on frames at different points in the time series and the evaluation results used after multivariate analysis, so the problems can be solved more efficiently.
[0075] Explaining the solutions to the aforementioned related problems according to this problem-solving approach, with varying combinations of the necessary recognition means: names of entertainers, products, and the like in television programs often involve distinctive proper nouns, so conversion efficiency and accuracy are poor when voice input is converted into words, and character input by keys is difficult on portable information terminals. In view of these situations, rather than performing word-level speech recognition, which is prone to misrecognition, symbol sequences closer to the speech waveform, such as speech features and phoneme features, that is, "phoneme sequences" and "phoneme-segment sequences", are used as the speech information for detecting specific proper nouns. Combining this with the detection of emotional features in the content using the "emotion identifiers" of emotion recognition, and using the co-occurrence information of identifiers and feature quantities based on the related video and audio, realizes efficient information search.
[0076] Furthermore, names can be given to the identifiers and feature quantities obtained from emotional changes (the excitement of a scene in a movie, the audience's reception in comedy, the emotional ups and downs of customers at a consumer consultation desk), from environmental sounds such as explosions and wind, from the prosody of the music being played, from the features and change-features of synchronously displayed images, and from the character strings obtained as recognition results of image features, none of which could be handled by conventional index search using phonemes or phoneme segments. By searching for such a name with a phoneme sequence or phoneme-segment sequence, search, recording, collection, distribution, and reuse of information matched to people's tastes and preferences become possible, overcoming previously insoluble problems.
[0077] In the present invention, content information is automatically indexed with identifiers and feature quantities for emotions, images, and the like together with phonemes and phoneme segments, and searches are performed with combinations of those identifiers. For a comedy routine, for example, it becomes possible to detect places where a feature quantity identifiable as "laughter" appears among the surrounding feature quantities and where the phoneme or phoneme-segment sequence of a specific line appears. This provides a search device that conventional video search systems cannot realize, and realizes an information processing device that can automatically record programs having such characteristic tendencies and deliver e-mail upon detection. A "laughing state" identifier or discriminant function may also be constructed by performing face detection and facial feature extraction simultaneously with the laughter emotion identifier and learning the co-occurrence information of the identifiers and feature quantities.
[0078] Further, by providing means for constantly extracting features from the voices of consumers and operators at a consumer consultation desk, performing phoneme recognition, and identifying the product from the recognized phonemes, together with means for recording the detected emotion along with the identified product name, the users' emotional evaluations of a specific product can be recorded and used for analyzing product quality, or the manual of the relevant product can be displayed on the operator's terminal screen in response to the utterance of the identified product name, thereby solving the problems.
[0079] Further, by combining scale features, phoneme features, and emotion features, music can be searched using the scale of the "hook" recognized from the song's vocals or from the user's own singing, the phoneme sequence of the lyrics, and emotion identifiers; an input character string can be expanded into a phoneme symbol sequence; and songs with high similarity can be found by comparing scale transition states and the appearance frequencies of emotional features. This makes possible music searches matched to one's tastes that did not previously exist, solving the problems.
[0080] Further, the user's utterance is converted into a phoneme sequence, actor names appearing in EPG, BML, RSS, or teletext are likewise converted into phoneme sequences, an actor-name phoneme sequence matching the user's uttered phoneme sequence is searched for, and the role (cast) name associated with the actor name of the matched phoneme sequence is detected. Here, the phoneme sequence may also be obtained by expanding a word or keyword entered as text.

[0081] Then, while phoneme recognition is performed on the audio synchronized with the distributed moving image, a phoneme-sequence index is constructed, and locations matching the phoneme sequence of the role name derived from the actor name detected in the EPG, BML, RSS, or teletext are searched for. At this time, the emotional features contained in the audio signal accompanying the role name, and the program genre, may also be evaluated.
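The matching in paragraphs [0080] and [0081] could be sketched as follows; the dictionaries and the exact sub-sequence match are simplifying assumptions (a practical system would convert names to phonemes with a grapheme-to-phoneme module and match approximately against the recognition output):

```python
def contains(seq: list[str], sub: list[str]) -> bool:
    """True if the phoneme sequence `sub` occurs contiguously in `seq`."""
    return any(seq[i:i + len(sub)] == sub
               for i in range(len(seq) - len(sub) + 1))

# Illustrative data derived from EPG/BML/RSS/teletext: actor-name phoneme
# sequences and the role (cast) name associated with each actor.
ACTOR_PHONEMES = {"yamada tarou": "y a m a d a t a r o u".split()}
ACTOR_ROLES = {"yamada tarou": "kachou"}

def role_from_utterance(uttered: list[str]) -> str | None:
    """[0080]: find the actor name matching the user's uttered phoneme
    sequence and return the associated role name."""
    for actor, phonemes in ACTOR_PHONEMES.items():
        if contains(uttered, phonemes):
            return ACTOR_ROLES[actor]
    return None

def search_phoneme_index(index: list[tuple[float, str]],
                         role: list[str]) -> list[float]:
    """[0081]: scan a (time, phoneme) index built from the broadcast audio
    and return the times at which the role-name phoneme sequence occurs."""
    times = [t for t, _ in index]
    symbols = [p for _, p in index]
    n = len(role)
    return [times[i] for i in range(len(symbols) - n + 1)
            if symbols[i:i + n] == role]
```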
[0082] After this processing, recording is started upon detecting that the phoneme sequence based on the role name coincides with the user-specified emotional feature, playback proceeds while skipping to only the target ranges, or a ranking based on the degree of match is compiled into a list and output as a search result prompting the user's operation, realizing a highly convenient search and solving the problems.
[0083] The problems may also be solved by classifying, with multivariate analysis methods, the symbol sequences of phonemes or phoneme segments recognized from feature quantities obtained from audio, identifiers such as emotions, musical scales, instrument sounds, and environmental sounds, and/or identifiers such as shapes, colors, characters, and motions recognized from feature quantities obtained from video, and using the results as new identifiers in the present invention.
[0084] Further, the feature quantities of information that the user frequently records or skip-plays may be learned, and upon detection of the learned feature quantities, recording may be started automatically, skip playback may be started, or arbitrary processing such as delivering e-mail or RSS upon detection may be carried out to solve the problems.
[0085] On this basis, the present invention is characterized in that indexing and search are performed by combining not only the identifiers conventionally associated with speech, but also the identifiers and feature quantities extracted from video and audio (emotion identifiers, environmental-sound identifiers, and instrument identifiers recognized from audio, together with video identifiers, motion identifiers, and shape identifiers) to obtain search results; that the co-occurrence states of the identifiers and feature quantities in those processes are learned; that information on phonemes, phoneme segments, emotion identifiers, and the other identifiers described in this embodiment is distributed; and that search and detection are performed on the basis of the distributed information.
[0086] Also, unlike systems such as those of Non-Patent Documents 12 and 13, this system does not syntactically parse text information but evaluates only co-occurrence states based on simple word appearance frequencies, and therefore does not discriminate parts of speech. Even when co-occurrence information between words is used, the search employs co-occurrence states not at the semantic level of, say, kanji, but at the level of phonetic symbols expanded into phonemes or phoneme segments; syntactic analysis may be applied to the search results obtained.
[0087] Thus, unlike conventional search techniques, which merely searched for the character strings associated with the audio and images expressed in content information while converting between them, the present invention combines the various feature-extraction and recognition techniques described in the prior art to perform indexing that combines symbols, identifiers, characters, and the like based on the recognition of multiple audio, video, image, and text feature quantities; performs search and detection based on their co-occurrence information, together with arbitrary processing accompanying detection; and performs learning of identifiers and reconstruction of feature quantities using the co-occurrence states of the various identifiers according to the user's selection results and reuse status. This makes possible search processing over complex, highly individual expressions of content information that take human subjectivity and emotion into account, which was previously impossible; enables abstract searches related to words, including the adjectives and adverbs contained in utterances and character strings; and solves the problems by reducing the complexity of using information processing devices that underlies the digital divide.
Effects of the Invention
[0088] In this way, arbitrary identifiers and feature quantities that were difficult to handle in the prior art (emotions including proper nouns, environmental sounds, image features, and motion features) are recorded, learned, and discriminated through association based on co-occurrence states; the co-occurrence states are learned by associating phoneme sequences, phoneme-segment sequences, and emotion identifiers with those identifiers; each identifier or feature quantity can be specified in search conditions by phonemes, phoneme segments, or character strings; and indexing is applied to the content information. This not only realizes search, recording, distribution, and reception of information under complex subjective conditions, but also accommodates international differences in pronunciation and provides users with simple means of searching for and obtaining information through HDD recorders, personal computers, portable terminals, car navigation systems, robots, and the like. Information distribution devices, information terminals, and information processing devices that improve the convenience of information procurement in daily life are thus realized, reducing the problems associated with the digital divide.
[0089] Further, by indexing content information that expresses hard-to-verbalize adjectives and adverbs on the basis of the feature quantities of various kinds of recognition and/or the co-occurrence information of identifiers, meta co-occurrence retrieval (Meta-occur Retrieval), or abstract co-occurrence retrieval (Abstracts Co-occur Retrieval), is realized. By constructing annotation information through the extraction of ontologies and semantics based on the co-occurrence information of identifiers from various kinds of recognition (the images and video, the audio and acoustics, the phoneme sequences and phoneme-segment sequences, and the emotions of the content information), grounding of content information based on multidimensional identifiers centered on phoneme sequences, phoneme-segment sequences, and emotions is achieved, and by reusing these, knowledge sharing about information search methods can be realized.
Brief Description of Drawings
[Fig. 1] A diagram showing a basic configuration example of the device in this embodiment.
[Fig. 2] A diagram showing the basic indexing procedure.
[Fig. 3] A diagram showing the operation of identifier generation by feature-quantity-to-identifier conversion.
[Fig. 4] A diagram showing a configuration example of video index data.
[Fig. 5] A diagram showing a configuration example of video index data in the unit-time designation method.
[Fig. 6] A diagram showing the operation of index co-occurrence state learning.
[Fig. 7] A diagram showing the procedure of an example of learning from an index.
[Fig. 8] A diagram showing an example of a co-occurrence matrix of emotions, phonemes, and video.
[Fig. 9] A diagram showing an example of a covariance matrix of emotions, phonemes, and video.
[Fig. 10] A diagram showing the basic search procedure.
[Fig. 11] A diagram showing the operation of the identifier-to-feature-quantity conversion unit.
[Fig. 12] A diagram showing an example of learning from basic search conditions.
[Fig. 13] A diagram showing the operation of the basic detection procedure.
[Fig. 14] A diagram showing a configuration example of the index information generation device.
[Fig. 15] A diagram showing a configuration example of the search device.
[Fig. 16] A diagram showing the operation of the indexing method.
[Fig. 17] A diagram showing the operation of the search method.
[Fig. 18] A diagram showing the operation procedure of a basic character-string search request and its execution method.
[Fig. 19] A diagram showing an example of search processing.
[Fig. 20] A diagram showing an example of the usage environment in this embodiment.
[Fig. 21] A diagram showing an example of the processing procedure on the transmitting side.
[Fig. 22] A diagram showing an example of the processing procedure on the receiving side.
[Fig. 23] A diagram showing the state transitions of search processing.
[Fig. 24] A diagram showing an example of the configuration of the control dictionary.
[Fig. 25] A diagram showing an example of a basic procedure for acquiring external information.
[Fig. 26] A diagram showing an example of a search and arbitrary-processing method using EPG information.
[Fig. 27] A diagram showing the state transitions in a product-reliability survey application based on consumer emotion.
[Fig. 28] A diagram showing an example of a search procedure for language phoneme symbols.
[Fig. 29] A diagram showing an example of a phoneme-symbol search procedure for language-specific character strings.
[Fig. 30] A diagram showing a configuration example of a symbol conversion function.
[Fig. 31] A diagram showing an example of a conversion procedure for international phoneme symbols.
[Fig. 32] A diagram showing an example of a Japanese-phoneme-to-international-phoneme-symbol conversion dictionary.
[Fig. 33] A diagram showing an example of conversion from international phonemes to Japanese phonemes.
[Fig. 34] A diagram showing an example of conversion from phonemes to phoneme segments.
[Fig. 35] A diagram showing an example of conversion from phoneme segments to phonemes.
[Fig. 36] A diagram showing an example of a search procedure for international phoneme symbols.
[Fig. 37] A diagram showing an example of a search procedure for international phoneme symbols.
[Fig. 38] A diagram showing an example of a search procedure for international phoneme symbols.
Explanation of Reference Numerals
10 Information processing unit
102 Index search evaluation unit
104 Co-occurrence information learning unit
106 Dictionary extraction unit
108 Index information generation unit
110 Index symbol string synthesis unit
112 Control unit
114 Meta-symbol extraction unit
116 Feature quantity extraction unit
118 Identifier-to-feature-quantity conversion unit
120 Feature-quantity-to-identifier conversion unit
122 Evaluation list output unit
20 Storage unit
22 Information recording and accumulation unit
202 Content information storage unit
204 Evaluation function storage unit
206 Index information storage unit
208 Feature quantity storage unit
210 Program storage unit
212 Co-occurrence learning storage unit
214 Dictionary information storage unit
216 Advertisement information storage unit
30 Information input unit
40 Information output unit
50 Communication line unit
BEST MODE FOR CARRYING OUT THE INVENTION
[0092] Next, an example of an information search device to which the present invention is applied will be described.
[Configuration]
First, a specific configuration example of the device according to the present invention is described. As in the basic configuration example of Fig. 1, the device comprises an information processing unit 10, a storage unit 20, an information input unit 30, an information output unit 40, and a communication line unit 50. The device may incorporate a display device such as a television or monitor, or use an external one.
[0093] The communication line unit 50 communicates with other information processing devices, whether by wire or wirelessly, and is configured to allow mutual communication and control with them. For example, devices embodying the present invention may search, browse, and provide information to one another via communication lines.
[0094] The communication line unit 50 has the function of acquiring and distributing arbitrary information. More specifically, it is configured by combining, as needed, facilities such as Ethernet (registered trademark), ATM (Asynchronous Transfer Mode), Fibre Channel, wireless LAN, and infrared communication, and arbitrary communication protocols such as IP, TCP, UDP, and the IEEE 802 family can be used.

[0095] The information input unit 30 is configured by devices enabling the input of information, such as a keyboard, a pointing device, a moving-image capture device, a television-broadcast-related information receiving circuit, and a microphone input, and may have the functions of storing quantized information in the storage unit according to instructions from the information processing unit and of outputting information to the information output unit based on the processing and instructions of the information processing unit.
[0096] The information input unit 30 may further include, combined as necessary, other input devices such as a motion capture device, camera, RFID reader, barcode reader, image scanner, switch panel, OCR, card reader, and the sensors described later, or terminals for connecting such input devices.
[0097] The information output unit 40 is configured by devices enabling the output of information, such as an image display device and a speaker output, and may have the functions of storing quantized information in the storage unit and reproducing it according to instructions from the information processing unit, and of outputting information as processed or instructed by the information processing unit.
[0098] The information output unit 40 may also include, combined as necessary, other output devices such as a printer, an arbitrary drive machine, a shaping device, or a milling machine, or terminals for connecting such output devices; a poster may be printed by outputting information based on search results, or a resin product may be produced as shaped output.
[0099] The information processing unit 10 is configured by arithmetic circuits based on electronic circuits such as a CPU, and processes the information acquired from the information input unit 30 and the storage unit 20. The processed results are stored in the storage unit 20, reproduced, or processed and output to the information output unit 40 or the storage unit 20; transmission and reception for exchanging information with other information processing devices are performed via the communication line unit 50, and information is received and distributed. As shown in Fig. 1, the information processing unit 10 may also be configured by program module code realizing the various processes necessary for search, or by dedicated electronic circuits for executing them.
[0100] The information processing unit 10 is typically configured by a combination of a DSP, reconfigurable processor, FPGA, ASIC, or the like, and the storage unit 20 is known to be configured by RAM, ROM, flash memory, hard disks, optical disks, removable disks, and the like.

[0101] The information processing unit 10 comprises: an index search evaluation unit 102 that searches by evaluating the degree of match between index information and search conditions consisting of feature quantities and index identifiers; a co-occurrence information learning unit 104 that learns the co-occurrence information obtained from feature quantities, search conditions, and search results; a dictionary extraction unit 106 that extracts the information for a target conversion from the dictionary information storage unit; an index information generation unit 108 that determines identifiers from extracted feature quantities by recognition processing and performs indexing; an index symbol string synthesis unit 110 that synthesizes index information with content information; a control unit 112 that controls each functional unit; a meta-symbol extraction unit 114 that extracts the commands, variables, and attributes in arbitrary symbol information after acquiring index information such as MPEG-7 from content information, acquiring markup-language information such as RSS or XML from the communication line unit, or acquiring EPG information from broadcast waves received by the information input unit; a feature quantity extraction unit 116 that extracts feature quantities from content information processable by the device, such as natural information obtained from outside via the information input unit and the video, images, and audio acquired from the communication line unit or the storage unit; an identifier-to-feature-quantity conversion unit 118 that converts identifiers (those recognized by the user, those acquired from outside via storage media or communication, and those extracted internally from content) into the standard feature quantities of those identifiers; a feature-quantity-to-identifier conversion unit 120 that converts feature quantities acquired from content information or user input into identifiers; and an evaluation list output unit 122 that outputs search results as an evaluation list. Search, detection, and indexing are performed by combining these as needed.
[0102] Content information may include music as audio information, meta information attached to the content, documents as text information and EPG or BML as program information, musical scales as score information, general still images and moving images, polygon data, vector data, texture data, and motion data as three-dimensional information, still images and moving images of visualized numerical data, and content information intended for promotion and advertising. It is composed of visual information, auditory information, text information, and sensor information, and a position within it may be a point in a time series, coordinate information on a display, a read-aloud position within a text, a recording order or identification-number order of figures and tables, or spatio-temporal coordinates based on positions or coordinates calculated from visual or auditory information; co-occurrence information may be constructed from the neighborhood of such a position.

[0103] The storage unit 20 includes an information recording and accumulation unit 22 for accumulating and recording each kind of information under the control of the information processing unit 10. The information recording and accumulation unit 22 may be configured with semiconductor storage devices such as RAM or flash memory, with external hard disks, optical disks, or magnetic disks via an arbitrary interface, or with exchangeable storage media.
[0104] As shown in Fig. 1, the storage unit 20 secures areas for: a content information storage unit 202 that stores the moving images, still images, audio, and documents to be searched; an evaluation function storage unit 204 that stores recognition templates such as HMMs, Bayesian discriminant functions, and arbitrary distance functions as the evaluation functions associated with identifiers; an index information storage unit 206 that stores the identifiers and arbitrary symbol strings serving as indexes for searching content information; a feature quantity storage unit 208 that stores the feature quantity information extracted from content information; a program storage unit 210 that stores the code and parameters of the program modules realizing the various processes necessary for search; a co-occurrence learning storage unit 212 that stores HMMs and evaluation functions such as the recognition templates of identifiers learned by the co-occurrence information learning unit and the recognition templates of identifiers relearned using the present invention; a dictionary information storage unit 214 that stores dictionary information constituting conversion tables for mutually converting an arbitrary identifier or feature quantity and another arbitrary identifier or feature quantity; and an advertisement information storage unit 216 that stores advertisement information to be presented, for instance during content information search, according to instructions from the information processing unit.
[0105] Target content information is described in more detail in 'Examples of content information'; the feature quantities and identifiers used, in 'Examples of feature quantities and identifiers'; and the dictionaries used for the mutual conversion of identifiers and feature quantities, in 'Examples of dictionary configuration'. To use the information processing device 1 as a search device, the following are generally necessary: a step of inputting content information into the device and performing indexing; a step of constructing, from the user's input, the query identifier sequence (query) used for search; a step of narrowing down results by consulting the index according to the query identifier sequence (query); and a step of outputting a list of results based on the search. The functions these require are detailed in 'Basic indexing processing example' and 'Basic search processing example', and the procedure for learning the co-occurrence states of the index information used for such searches is detailed in 'Example of co-occurrence state learning processing procedure'.

[0106] A server-client model may also be introduced: arbitrary processing units and storage units may be divided between a server and clients connected by communication, and equivalent services, infrastructure, search, indexing, detection, and arbitrary processing accompanying detection may be provided by exchanging information between server and client, as detailed in 'Procedure examples of information processing devices used in terminals and base stations'.
[0107] Although parts of this embodiment are implemented in hardware, it is well known that the same effects can be obtained with software; programs performing the same processing as each processing unit may be executed by the CPU or DSP of the information processing unit, or the functions and devices may be divided into arbitrary parts and implemented by linking a plurality of information processing devices through communication.
[0108] [Operation Examples]
«Basic indexing processing example»
First, the basic operation (processing procedure) of the indexing means is outlined along the operation flow of Fig. 2. When various information, such as natural information (video or audio) based on content information, text information entered by the user, text information extracted from the index information or meta information related to the content information, and program information or sensor information received from outside, is input from the information input unit 30 or acquired from the communication line unit 50 or the content storage unit 202 (step S0201), feature quantity extraction processing (S0202) is executed by the feature quantity extraction unit 116 to extract the feature quantities of the natural information and text information based on the input visual information, auditory information, and sensor information.
[0109] Here, natural information means auditory information, visual information, and sensor information. It is acquired as content information or advertisement information through external devices connected to the information input unit 30, through external information distribution devices via the communication line unit 50, or through exchangeable external storage media, and is also provided as the content information stored in the content information storage unit 202 and the advertisement information stored in the advertisement information storage unit 216.
[0110] The feature quantity extraction processing (step S0202) extracts feature quantities from the input natural information: for example, when audio is input, processing such as an FFT is applied, and for an image, feature quantities are extracted by quantizing the color space within the image. Since feature extraction can take many forms, as noted separately, the method may depend on the implementation, as described later.
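As one concrete reading of step S0202, the sketch below extracts a magnitude spectrum from an audio frame via FFT and a 216-bin color histogram by quantizing each RGB channel to six levels, matching the web-safe color space mentioned elsewhere in this description; the window and frame sizes are assumptions:

```python
import numpy as np

def audio_features(samples: np.ndarray) -> np.ndarray:
    """Magnitude spectrum of one audio frame (e.g. 512 samples) via FFT."""
    windowed = samples * np.hanning(len(samples))
    return np.abs(np.fft.rfft(windowed))

def color_features(image: np.ndarray) -> np.ndarray:
    """Normalized histogram over an RGB color space quantized to 6 levels
    per channel, i.e. the 216 bins of the web-safe color space."""
    q = np.clip(image.astype(int) // 43, 0, 5)           # 0..255 -> 0..5
    bins = q[..., 0] * 36 + q[..., 1] * 6 + q[..., 2]    # 6*6*6 = 216 bins
    hist = np.bincount(bins.ravel(), minlength=216)
    return hist / hist.sum()
```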
[0111] Next, the feature-quantity-to-identifier conversion unit 120 executes identifier generation processing by feature-quantity-to-identifier conversion (step S0203): the extracted feature quantities are supplied to a plurality of evaluation functions so as to evaluate particular identifiers among the identifiers of the same field, and the identifier with the highest similarity within that field is selected. The feature-quantity-to-identifier conversion used for identifier generation is described later with reference to Fig. 3.
[0112] The identifier generation processing (step S0203) may also be executed without evaluation functions, by directly using as identifiers the character strings of the meta information attached to the content or the text information of program information such as BML or EPG, or by converting character strings into IDs with the dictionary function formed by the dictionary information storage unit 214 and the dictionary extraction unit 106.
[0113] As for identifiers of the same field: taking phoneme recognition as an example, the same field of phoneme identifiers includes vowels, consonants, and silence; more specifically, vowels can be classified into identifiers such as "a/i/u/e/o", and roughly 30 kinds of phoneme identifiers are generally known for Japanese.
[0114] Within a single field there may be several thousand identifiers in the case of phoneme segments; for character recognition there are identifiers for each character and for each character component; for face recognition there are as many identifiers as registered persons; and for musical instruments, environmental sounds, figures, and motions there are as many identifiers as are registered in the dictionary information.
[0115] As described above, in order to recognize several different kinds of information such as phonemes, phoneme segments, characters, images, faces, musical instruments, environmental sounds, figures, and motions, these identifiers are classified by the field to be recognized, with several different feature quantities extracted according to the purpose.
[0116] Based on the identifiers converted by the feature-quantity-to-identifier conversion unit 120 in this way, the index information generation unit 108 executes indexing processing, which indexes the content information in time series to generate an index (step S0204). The indexing processing may associate and record not only the identifiers and feature quantities obtainable from audio and video, but also the aforementioned text information entered by the user, the index information and meta information related to the content information, the extracted text information, the program information and sensor information received from outside, other content information, advertisement information, and the like.

[0117] Based on the generated index, a record is then made in a database (step S0205a), an MPEG file is modified (step S0205b), or index information is recorded (step S0205c).
[0118] Next, the feature-quantity-to-identifier conversion executed by the feature-quantity-to-identifier conversion unit 120 is described with reference to Fig. 3. First, when an extracted feature quantity is input (step S0301), evaluation function processing is executed (step S0302). Evaluation function processing evaluates the likelihood of the input feature quantity with an evaluation function such as a distance function. It is then determined whether all target evaluation functions have been evaluated for the feature quantity (step S0303); if evaluation functions remain, evaluation function processing is executed for the remaining ones (step S0303; No → step S0302).
[0119] When evaluation with all the target evaluation functions has finished (step S0303; Yes), the identifier with the highest likelihood among the evaluation results is selected (step S0304). By then executing the symbol identifier output step (step S0305), which outputs the selected identifier, the optimal identifier can be obtained as the evaluation result of the plurality of evaluation functions.
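Steps S0302 to S0305 amount to scoring the feature quantity with every evaluation function of one field and emitting the identifier with the highest likelihood. A minimal sketch, assuming each evaluation function is a callable returning a likelihood-like score (here, negative distance to a stored template):

```python
from typing import Callable, Mapping
import numpy as np

def to_identifier(feature: np.ndarray,
                  evaluators: Mapping[str, Callable[[np.ndarray], float]]) -> str:
    """Evaluate all evaluation functions of one field (e.g. the ~30 Japanese
    phoneme identifiers) and select the most likely identifier
    (steps S0302-S0304); returning it corresponds to step S0305."""
    scores = {ident: f(feature) for ident, f in evaluators.items()}
    return max(scores, key=scores.get)

# Usage sketch with distance-based evaluators built from stored templates:
templates = {"a": np.ones(8), "i": np.zeros(8)}          # illustrative templates
evaluators = {ident: (lambda x, t=tmpl: -float(np.linalg.norm(x - t)))
              for ident, tmpl in templates.items()}
print(to_identifier(np.full(8, 0.9), evaluators))        # -> "a"
```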
[0120] An index based on identifier information recorded in association through the audio and video identifier recognition processing can be recorded, for example, by setting an appropriate unit time and recording an identifier for each unit time, or by grouping identifiers to some extent and storing the occurrence time and disappearance time of each identifier. As in Fig. 4, utterance phonemes and image identifiers and feature quantities can be recorded in association with the time axis and scene names of the content information; as in Fig. 5, whistle sounds and explosion sounds recognized as environmental sounds occurring within a scene, as well as utterance phonemes, can be indexed following the changes in the video, and feature quantities can be indexed as well. Further, by using the aforementioned text information entered by the user, the index information and meta information related to the content information, the extracted text information, and the program information and sensor information received from outside, a search index for specifying positions within the content information can be constructed.
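The two recording schemes mentioned above (one identifier per unit time, or one record per identifier holding its occurrence and disappearance times) could be represented as follows; the field names are illustrative, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    kind: str         # identification type: "phoneme", "emotion", "color", ...
    identifier: str   # e.g. phoneme symbol "a", emotion "joy", "explosion"
    start: float      # occurrence time within the content (seconds)
    end: float        # disappearance time within the content (seconds)

def merge_unit_times(kind: str, per_frame: list[str],
                     frame_sec: float = 1 / 30) -> list[IndexEntry]:
    """Collapse per-unit-time identifiers into (identifier, start, end)
    records, the second recording scheme described above."""
    entries, run_start = [], 0
    for i in range(1, len(per_frame) + 1):
        if i == len(per_frame) or per_frame[i] != per_frame[run_start]:
            entries.append(IndexEntry(kind, per_frame[run_start],
                                      run_start * frame_sec, i * frame_sec))
            run_start = i
    return entries

# e.g. merge_unit_times("phoneme", ["a", "a", "k", "k", "k"]) yields an
# entry for "a" covering frames 0-1 and an entry for "k" covering frames 2-4.
```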
[0121] More specifically, the feature quantities extracted in the feature quantity extraction processing (step S0202) and the identifiers generated from them in the identifier generation processing (step S0203) are acquired for both video and audio, and the identifiers obtained by the indexing of the indexing processing (step S0204) are stored in the recording steps (S0205a, S0205b, S0205c). In the example of Fig. 4, the phoneme symbols and phoneme recognition feature quantities are recorded in the row whose identification-type item is "phoneme", in association with the time axis information of the content information; in the example of Fig. 5, phoneme identifiers are recorded under the phoneme symbols and the feature quantities for phoneme recognition under the audio feature quantities, likewise associated with the time axis information. In this way, "index co-occurrence information" is recorded as index information of positional neighborhoods within the content information, based on the plural identifiers and feature quantities accompanying the recognition and feature extraction of the auditory, visual, and emotion information of the content.
[0122] At this time, the identifiers recognized for video and emotion, the feature quantities used for their recognition, the aforementioned text information entered by the user, the text information extracted from the index information and meta information related to the content information, and the program information and sensor information received from outside can also be recorded in association with the time axis information of the content information, so that they can be used as index information for search; and by combining the aforementioned phonemes with the emotion, visual, and auditory information, the "index co-occurrence information" used in the learning described later can be generated and recorded.
[0123] These pieces of index information can be realized by writing them into a file as text character strings; in the MPEG modification (step S0205b), the index symbol synthesis unit 110 may synthesize the index information into the meta information description area extracted from the MPEG file by the meta-symbol extraction unit 114. The index information need not be character string information: it may be a hash ID formed from a character string, or a numeric ID in one-to-one correspondence with it, such as an ASCII code converted from the character string.
[0124] «Example of co-occurrence state learning processing procedure»
Next, the processing procedure for indexing by learning co-occurrence states is described with reference to Fig. 6. In the co-occurrence state learning procedure, the co-occurrence states of the identifiers recorded in association are learned according to Fig. 6, co-occurrence information is constructed as in Fig. 8 and Fig. 9 described later, and an evaluation function based on the co-occurrence information is created and used to index the content information. Although the number of frames aggregated to construct this co-occurrence matrix is fixed in this embodiment, it may take any value; the unit time of aggregation may be chosen freely with reference to the vicinities of 12 Hz, 24 Hz, 60 Hz (16 ms), and 110 Hz (9 ms), which are said to affect human perception, or to synchronization signals such as those of television.

[0125] Fig. 6 shows the basic processing procedure of the index co-occurrence state learning processing. By extracting, over positional neighborhoods, the index information applied to the content information by the plural recognition methods of the aforementioned indexing, "index co-occurrence information" can be acquired as index information of positional neighborhoods constructed from the plural identifiers and feature quantities accompanying the recognition and feature extraction of the auditory, visual, and emotion information of the content.
[0126] More specifically, an index of auditory information is extracted using the phoneme identifiers, consisting of phoneme symbols, that the indexing means recorded frame by frame (step S0601). Next, an index of visual information is extracted by deriving color identifiers from the feature quantities of the image data of the same frame as the detected phoneme (step S0602). Further, an index of emotion information is extracted from the emotion identifiers of the emotion recognition of the same frame (step S0603). Then, from each piece of extracted index information, the per-frame co-occurrence matrix (Fig. 8) constituting the co-occurrence information is constructed (step S0604). This yields the "index co-occurrence information" as index information of positional neighborhoods composed of plural identifiers and feature quantities. The width of the extracted frames may be specified arbitrarily; index co-occurrence information may be constructed around the boundary values of roughly 14 Hz, 27 Hz, 55 Hz, and 110 Hz, at which humans perceive continuity, and the text information used in ordinary text-based search may be included in the index co-occurrence information.
[0127] Then, from the co-occurrence matrix constructed from the index information in step S0604, "index co-occurrence information" based on the identifiers and feature quantities of the positional neighborhood is formed, and a learning process using this "index co-occurrence information" is executed (step S0605). The aggregation of the feature quantities and identifiers used for learning (steps S0605a, S0605b) may be executed at a fixed interval, for example every 90 frames (3 seconds) of a 30 frames-per-second moving image as an example neighborhood; it may be executed up to the point where a statistical test finds a fixed deviation from the past average; it may be performed over a range, or at boundaries, where information detected by a known detection technique is constant; or it may be performed over a range in which designated teacher information is identical. When the aggregation range ends, an evaluation function is constructed (steps S0605c, S0605d). Through this learning process an evaluation function is generated or reconstructed, and the resulting evaluation function is stored as learning information in the co-occurrence learning storage unit 212 (step S0606).
[0128] The learning process (step S0605) is now described in detail. First, the co-occurrence information of the identifiers is aggregated for each frame (step S0605a). As the time width for aggregation, the co-occurrence information is totaled every predetermined number of frames or span of time; for example, the identifier co-occurrence information is totaled every 90 frames (3 seconds) to generate inter-frame co-occurrence information (step S0605b). Next, a covariance matrix is generated from the inter-frame co-occurrence information, and the eigenvalues and eigenvectors of the co-occurrence matrix are calculated from that covariance matrix to produce learning information (step S0605c). A standard template for the evaluation function is then generated from the calculated eigenvalues and eigenvectors, yielding the learning result (step S0605d). Executing these steps constructs the evaluation function.
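A minimal numerical sketch of steps S0605a through S0605d, assuming the per-frame co-occurrence rows have already been flattened into feature vectors (NumPy; the window length, the number of retained eigen-axes, and all names are assumptions, not taken from the specification):

```python
import numpy as np

def learn_standard_template(frame_vectors, window=90):
    """Aggregate per-frame co-occurrence vectors over a window (e.g. 90 frames
    = 3 s at 30 fps), then derive an eigen-basis as a 'standard template'."""
    templates = []
    for start in range(0, len(frame_vectors) - window + 1, window):
        block = np.asarray(frame_vectors[start:start + window])  # S0605a/b
        cov = np.cov(block, rowvar=False)                        # S0605c: covariance
        eigvals, eigvecs = np.linalg.eigh(cov)                   # S0605c: eigen-decomposition
        order = np.argsort(eigvals)[::-1]
        # S0605d: keep the mean and the leading eigen-axes as the template.
        templates.append({
            "mean": block.mean(axis=0),
            "basis": eigvecs[:, order[:8]],      # 8 leading axes: arbitrary choice
            "variances": eigvals[order[:8]],
        })
    return templates

# Usage: 300 frames of 10-dimensional toy co-occurrence vectors -> 3 templates.
rng = np.random.default_rng(0)
frames = rng.poisson(2.0, size=(300, 10)).astype(float)
print(len(learn_standard_template(frames)))  # 3
```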
[0129] The width of the aggregated frames and the time length of one frame may be specified arbitrarily according to the device configuration; the co-occurrence information may be constructed around the boundary values of 14 Hz, 27 Hz, 55 Hz, and 110 Hz at which humans perceive continuity, and the aggregated inter-frame information may itself be used as "index co-occurrence information".
[0130] The standard template (function parameters) of the constructed evaluation function is then saved to a storage medium so that it can be reused (step S0606). Specifically, the evaluation function and related data generated in step S0605d are stored in the co-occurrence learning storage unit 212. By using the evaluation function constructed in this way for the feature-quantity-to-identifier conversion of step S0203 in FIG. 2, the indexing procedure can be carried out again to re-index the content, so that the evaluation function based on the co-occurrence information becomes usable for indexing the content information.
[0131] The co-occurrence information based on the index information used for this learning is described concretely with reference to FIG. 8 and FIG. 9. The identifiers consist of 30 phonemes (5 vowels, 24 consonants, 1 silence), 4 emotions (joy, anger, sorrow, pleasure), and identifiers indicating the number of displayed pixels of each of the 216 Web Colors (also called "web-safe colors" or "browser-common colors"); combining these yields a 250-by-250 co-occurrence matrix and covariance matrix.
[0132] This configuration may be extended as needed: items may be added to the co-occurrence matrix according to the types of sensor information, in order to use sensor information based on sensor inputs associated with the content in time series; items may be added according to the character information in the index information or meta information related to the content; and that character information may be used as the name when a standard pattern of the evaluation function constructed from the co-occurrence information is set as a search condition.
[0133] FIG. 8 shows an example of co-occurrence information. The same elements appear on the horizontal and vertical axes, and at each intersection is the number of appearances relating to the image and audio within a frame of the moving image. The appearance count is a value indicating how many times a given identifier appears in a frame, evaluated by how many instances of a given phoneme, pixel, or emotion identifier occur within the short time span of one frame.
[0134] For example, in the matrix shown, the co-occurrence count of the emotion "joy" and the vowel "A" is "0", while the count of the emotion "joy" with red as an image identifier is "6". Since these values are extracted from content information, they are not necessarily constant; the appearance counts of the identifiers recognized within a frame may be normalized per identifier type into probability values, and a probability transition matrix between frames may be constructed based on the within-frame appearance probabilities.
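A small sketch of the per-type normalization and inter-frame transition construction mentioned above, assuming a flat count vector over the identifier axis with known type slices (the layout is an assumption for illustration):

```python
import numpy as np

def normalize_by_type(counts, type_slices):
    """Normalize a frame's identifier counts into probabilities per type.

    counts: 1-D array of appearance counts over the full identifier axis.
    type_slices: e.g. {"phoneme": slice(0, 4), "emotion": slice(4, 6)}.
    """
    probs = np.zeros_like(counts, dtype=float)
    for sl in type_slices.values():
        total = counts[sl].sum()
        if total > 0:
            probs[sl] = counts[sl] / total
    return probs

# Two consecutive toy frames over a 6-entry axis (4 phonemes + 2 emotions).
slices = {"phoneme": slice(0, 4), "emotion": slice(4, 6)}
f1 = normalize_by_type(np.array([2, 0, 1, 0, 1, 0]), slices)
f2 = normalize_by_type(np.array([0, 3, 0, 0, 0, 2]), slices)
# Outer product of successive frame probabilities: one building block of an
# inter-frame probability transition matrix.
transition = np.outer(f1, f2)
print(transition.shape)  # (6, 6)
```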
[0135] FIG. 9 shows an example of a covariance matrix over emotion features, phoneme features, and video features. Here the horizontal and vertical axes carry the names of the respective feature quantities, and the matrix shows how much the feature quantities acquired from several seconds' worth of frames of the moving image scatter around their average over those frames. For example, for the emotion features it shows how much variance each of the four emotions (joy, anger, sorrow, pleasure) has, and for the phonemes and images it likewise shows how far each distance evaluation result deviates from its average.
[0136] In this example, the covariance of the fourth and first emotion parameters is "0.42", and the correlation between changes in the first video parameter and the first emotion parameter is "0.32"; since these values are extracted from content information, they are not necessarily constant.
[0137] Thus, a characteristic of the present invention is to take the co-occurrence conditions a person specifies for search, the co-occurrence information detected at indexing time, and the co-occurrence information in results that users frequently select, and from them construct a co-occurrence matrix over identifiers of plural differing natures, co-occurrence probabilities based on that matrix, and a covariance matrix based on feature quantities of plural differing natures, building an evaluation function for use in search and detection. Since the example matrices in the figures are assumed to be square symmetric matrices, the lower triangular portion is omitted in the figures.
[0138] In constructing the evaluation function, a standard pattern may be extracted by training on feature quantities from inputs of natural information whose identifiers have been specified in advance, with the evaluation function constructed from the extracted standard pattern; alternatively, a standard pattern may be extracted using identifiers constructed by self-organization through multivariate analysis. The obtained standard pattern is stored in the evaluation function storage unit 204 as needed, and standard pattern dictionary information is stored in the dictionary information storage unit 214 as association information for converting between identifiers and standard patterns.
[0139] The standard pattern is used in combination with the evaluation function to identify identifiers. It is composed of the mean and variance of a population formed from sample feature quantities whose identifiers are unspecified at input together with feature quantities attributed to a specific identifier; the evaluation function is used to evaluate Euclidean distance or Mahalanobis distance, and the standard pattern is sometimes called a standard template, standard parameters, or evaluation function parameters.
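The following sketch shows this kind of distance evaluation, under the simplifying assumption of a standard pattern reduced to a per-dimension mean and variance (diagonal covariance; the pattern values are hypothetical):

```python
import numpy as np

def euclidean(x, mean):
    return float(np.linalg.norm(x - mean))

def mahalanobis_diag(x, mean, var, eps=1e-9):
    # Mahalanobis distance under a diagonal-covariance standard pattern.
    return float(np.sqrt(np.sum((x - mean) ** 2 / (var + eps))))

# A hypothetical standard pattern for an "explosion sound" identifier.
pattern = {"mean": np.array([0.8, 0.1, 0.6]), "var": np.array([0.05, 0.02, 0.1])}

sample = np.array([0.75, 0.15, 0.5])
print(euclidean(sample, pattern["mean"]))
print(mahalanobis_diag(sample, pattern["mean"], pattern["var"]))
```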
[0140] The standard pattern may also be produced by generating parameters for the evaluation function from input feature quantities by methods such as multivariate analysis, and any identifier evaluation function, such as an HMM, a Bayes discriminant function, Mahalanobis distance, or Euclidean distance, may be used based on the generated parameters. Since it is generally known that the parameters composing these evaluation functions are constructed by mathematical methods such as multivariate analysis, the extraction and learning methods depend on the implementation.
[0141] At this time, by performing multivariate analysis with the evaluation functions and classifying through self-organization so as to provide plural evaluation functions, a name or identifier may be given manually to each classified evaluation function, or a character string contained in content information that co-occurs with high probability with an evaluation function obtained by multivariate analysis may be given as the function's name or identifier, so that users can employ evaluation functions for search and detection by specifying them by name.
[0142] The identifier information recorded in association may be: symbols of phonemes or phoneme pieces; names or designations, identifiers, or identifier strings given to the populations used to construct the identifier evaluation functions; representative feature averages themselves; not only phonemes and phoneme pieces but also the separately described identifiers and feature quantities relating to images, audio, and emotion, or combinations thereof; or character information input by the user, character information extracted from index information or meta information related to the content, or externally received program information and sensor information.
[0143] Then, at any point where the identifiers or feature quantities of content information and those of advertisement information indexed by the same method are evaluated as similar by the aforementioned evaluation function, a step of associating the advertisement may be executed, or advertising may be performed during indexing; an arbitrary advertisement, or one associated by the evaluation function, may be played only while playback of the content information is paused; and these evaluation functions may be reconstructed using the "Example of identifier reconstruction" and "Identifier learning based on search, detection, and indexing" described later.
[0144] Meta information or EPG information recorded with the content may also be used as identifiers in order to construct and search an evaluation function that evaluates the co-occurrence state through acquisition of the identifiers used for the index. As shown in FIG. 7, a process for acquiring program information such as EPG or BML (step S0701) may be added, and the co-occurrence state may be constructed, and indexing performed, using the EPG or BML program information acquired during broadcasting.
[0145] Since this figure differs from FIG. 6, a supplementary note: the EPG acquired as character information in step S0701 is used directly as a program information identifier, while the other identifiers and feature quantities undergo processing equivalent to steps S0601 through S0603; with the program information as an identifier, the co-occurrence matrices of FIG. 8 and FIG. 9 are constructed using the other identifiers and feature quantities that stand in a co-occurrence relation within the same program information. At this time, by using a program-genre name string based on the character information or program information as the name of the evaluation function, indexing that associates the character information and program information may be performed.
[0146] As a result, the learning process of steps S0703 through S0705, corresponding to step S0605, is executed based on the acquired co-occurrence information; an evaluation function for constructing identifiers can be built, and, if necessary, the content may be indexed again using the acquired function.
[0147] «Example of basic search processing»
Next, the search procedure is described with reference to FIG. 10. First, when a search condition such as a captured image, uttered speech, or a character string is input by the user (step S1001), query generation processing is executed based on the input search condition (step S1002) and a query is generated. For example, for speech, a query is generated from a phoneme string or phoneme-piece string based on phoneme recognition or phoneme-piece recognition of the user's utterance; for a character string, a query is generated from the phoneme string or phoneme-piece string converted from the text input; and for a captured image, a query is generated through image recognition. In this way a search condition is generated by each recognition method, as sketched below for the text case.
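A minimal sketch of step S1002 for the text-input case, assuming a small hypothetical pronunciation dictionary mapping words to phoneme strings (the patent's dictionary units are not specified at this level of detail):

```python
# Hypothetical pronunciation dictionary: word -> slash-delimited phoneme string.
PRONUNCIATION = {
    "explosion": "e/k/s/p/l/o/u/zh/a/n",
    "rain": "r/e/i/n",
}

def text_to_phoneme_query(text: str):
    """Convert a text search condition into phoneme-string query terms."""
    terms = []
    for word in text.lower().split():
        if word in PRONUNCIATION:
            terms.append(PRONUNCIATION[word].split("/"))
    return terms

print(text_to_phoneme_query("explosion rain"))
# [['e', 'k', 's', 'p', 'l', 'o', 'u', 'zh', 'a', 'n'], ['r', 'e', 'i', 'n']]
```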
[0148] At this time, the search condition is composed of identifiers acquired by plural recognition methods applied to the input character strings, visual information, and auditory information. By constructing a co-occurrence matrix from the co-occurrence relations of the simultaneously specified identifiers and character strings, "search condition co-occurrence information" is constructed from the search condition in the same way as the "index co-occurrence information", and can be used in the query as the "search condition co-occurrence information" for similarity evaluation against the "index co-occurrence information" of the present invention.
[0149] Also, by converting the input character strings and the identifiers obtained from each recognition result into associated character strings, identifiers, or standard patterns through the dictionary function based on the dictionary information storage unit and the dictionary extraction unit, "search condition co-occurrence information" can be constructed based not only on the co-occurrence relations within the search condition entered by the user, but also on the co-occurrence relations between "information associated with information recognized from the search condition" and "information entered as the search condition", between "information recognized from the search condition" and "information entered as the search condition", and between "information recognized from the search condition" and "information associated with the information entered as the search condition". This can be used as a query and can also serve as the "index co-occurrence information" used in the learning of the present invention.
[0150] In entering this query, character information input by the user, character information extracted from index information or meta information related to the content, externally received program information, or sensor information may be used; symbol strings such as character strings, phoneme strings, or phoneme-piece strings indicating emotion identifiers, image identifiers, and the like may be entered by text input, menu selection, or voice; and these symbol strings may be converted, based on dictionary information, into other identifiers, feature quantities, or identifier strings to perform the search and specify a position within the content information.
[0151] Then, among the content information stored in the content storage unit 202, the search is repeated over the content information to be searched, and a search process that evaluates the match between index and query is executed for all the content information (step S1003). Through execution of this search process, the "index co-occurrence information" based on the identifiers or feature quantities of the content information being searched is compared with the "search condition co-occurrence information", and search results are obtained.
[0152] In this comparison, the match between the "index co-occurrence information" and the "search condition co-occurrence information" may be evaluated by DP or by a distance function; similarity, identity, or degree of match may be compared by evaluating each set of co-occurrence information with an evaluation function and assessing the match of the obtained identifiers or the nearness of the distances; or, instead of evaluating all identifiers and feature quantities, similarity, identity, or degree of match may be compared and evaluated using only a subset of identifiers of the same kind.
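One simple realization of such a comparison, a cosine-style similarity between flattened co-occurrence matrices with an optional same-kind row subset, is sketched below; the patent leaves the concrete function open (DP, distance functions, or per-type partial comparison), so this is only one possible choice:

```python
import numpy as np

def cooccurrence_similarity(index_matrix, query_matrix, rows=None):
    """Compare index and search-condition co-occurrence matrices.

    rows: optional list of axis indices, to compare only a subset of
    same-kind identifiers as the text allows; None compares everything.
    """
    a, b = np.asarray(index_matrix, float), np.asarray(query_matrix, float)
    if rows is not None:
        a, b = a[rows], b[rows]
    a, b = a.ravel(), b.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Usage: rank stored scenes by similarity to the query's co-occurrence matrix.
rng = np.random.default_rng(1)
query = rng.random((7, 7))
scenes = [rng.random((7, 7)) for _ in range(5)]
ranked = sorted(range(5), key=lambda i: -cooccurrence_similarity(scenes[i], query))
print(ranked)
```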
[0153] Based on the obtained search results, the degree of match of the search evaluation results is evaluated and the results are ranked (step S1004). Further, an evaluation result list display process (step S1005) is executed that creates and displays a list of evaluation results based on the ranked search evaluation results. At this time, advertisement information held in the storage unit may be displayed to the user, advertisements acquired via a communication line may be presented, or advertisement content associated through the earlier indexing may be obtained from the storage unit or communication line unit and presented to the user.
[0154] If, at this time, content that is being distributed in real time and has not yet been indexed with "index co-occurrence information" is to be used, then, as shown in FIG. 13, the following are executed: step S1301, which acquires the content in a time-divided manner; step S1302, which checks whether content acquisition has finished; step S1303, which performs indexing while extracting feature quantities and generating identifiers as the content is acquired; step S1304, which compares the "search condition co-occurrence information" with the "index co-occurrence information" and detects matching locations; step S1305, which branches according to the detection; and step S1306, an arbitrary process as described later, such as starting a recording, switching channels, issuing a notification, delivering mail, or changing a robot's behavior.
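A schematic loop for FIG. 13, with the acquisition, indexing, similarity, and action steps left as caller-supplied stubs (all function names and the threshold are hypothetical):

```python
def monitor_stream(get_chunk, build_index, query_cooccurrence, similarity,
                   on_detect, threshold=0.8):
    """Watch a real-time stream and trigger an action on matches (FIG. 13).

    get_chunk: returns the next time-divided content chunk, or None at end.
    build_index: chunk -> index co-occurrence information        (S1303)
    similarity: (index, query) -> match score                    (S1304)
    on_detect: arbitrary action, e.g. start recording, send mail (S1306)
    """
    while True:
        chunk = get_chunk()                             # S1301
        if chunk is None:                               # S1302: acquisition done
            break
        index = build_index(chunk)                      # S1303
        score = similarity(index, query_cooccurrence)   # S1304
        if score >= threshold:                          # S1305: branch on detection
            on_detect(chunk, score)                     # S1306

# Usage with trivial stubs:
chunks = iter([0.3, 0.9, 0.5])
monitor_stream(lambda: next(chunks, None),
               build_index=lambda c: c,
               query_cooccurrence=1.0,
               similarity=lambda i, q: i * q,
               on_detect=lambda c, s: print("detected", c, s))
```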
[0155] As a result, whereas a conventional search cannot perform co-occurrence-based search, since it performs neither the co-occurrence-based indexing described in the indexing section nor indexing by an evaluation function constructed from co-occurrence states, in the present invention the input search condition is matched against the "index co-occurrence information" applied to content by the method described above. An input character string may be converted, by consulting a dictionary, into feature quantities and identifiers related to the string for use in the search; phoneme strings, phoneme-piece strings, and other feature quantities and identifiers generated from input speech may be used in the search directly, or converted via a dictionary into related feature quantities and identifiers; and feature quantities and identifiers extracted or generated from input images, video, or sensors may likewise be used directly, or converted via a dictionary into related feature quantities and identifiers. By constructing "search condition co-occurrence information" from the identifiers and feature quantities thus converted from the input search condition, the index given by the "index co-occurrence information" on the content information can be compared with the search condition given by the user's "search condition co-occurrence information", making it possible to search for and detect targets where the two match, and to specify the content and a position on its time axis, a position on the display screen, or a position in reading aloud.
[0156] As methods for evaluating the degree of match of search evaluation results, it is well known to use an HMM; to use probabilities or distances such as a Bayes discriminant function; to evaluate the degree of attribution to a population clustered by multivariate analysis; or to use symbol-string matching methods such as DP or CDP. Details are given in "Examples of methods for evaluating the match between feature quantities and between identifier strings".
[0157] When a search is performed on feature quantities, the identifiers used in a query generated from an input character string, input speech, or input image are converted into feature quantities by the identifier-to-feature-quantity conversion process executed by the identifier feature quantity conversion unit 118. This conversion process is described with reference to FIG. 11.
[0158] First, when an identifier (or identifier string) to be converted into feature quantities is input (step S1101), target symbol extraction processing is executed (step S1102). The target symbol extraction process selects and extracts, using dictionary information, the feature quantities associated with the input identifier (identifier string) in order to convert the identifier into feature quantities.
[0159] At this time, it is determined whether subdivision of the identifier, such as dividing phonemes into phoneme pieces, is necessary (step S1103). If it is determined that further subdivision is needed (step S1103; Yes), symbol subdivision processing is executed (step S1104), and after subdivision the target symbol extraction process is executed again. For example, when the identifier is a phoneme, feature quantities suited to the subdivided information can be obtained by further subdividing it into phoneme pieces and then executing the target symbol extraction process.
[0160] If it is determined that subdivision is not necessary (step S1103; No), the feature quantities are output, based on the selected feature quantities, so that distance evaluation between feature quantities can be performed according to the identifier (step S1105). By executing the identifier-to-feature-quantity conversion process described above, input identifiers and identifier strings are converted into feature quantities, enabling search by feature quantity.
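A sketch of the FIG. 11 flow, under the assumption of a two-level dictionary (phoneme-piece feature vectors plus a phoneme-to-piece subdivision table; all tables are hypothetical):

```python
# Hypothetical dictionaries: feature vectors per phoneme piece, and a
# subdivision table from phonemes to phoneme pieces.
PIECE_FEATURES = {"a1": [0.9, 0.1], "a2": [0.7, 0.3], "k1": [0.2, 0.8]}
SUBDIVIDE = {"a": ["a1", "a2"], "k": ["k1"]}

def identifier_to_features(identifier: str):
    """Convert an identifier into feature vectors (FIG. 11).

    S1102: look the symbol up directly; S1103/S1104: if absent, subdivide
    (phoneme -> phoneme pieces) and extract again; S1105: output features.
    """
    if identifier in PIECE_FEATURES:              # S1102 hit
        return [PIECE_FEATURES[identifier]]
    if identifier in SUBDIVIDE:                   # S1103: Yes -> S1104
        feats = []
        for piece in SUBDIVIDE[identifier]:
            feats.extend(identifier_to_features(piece))
        return feats
    return []                                     # unknown identifier

print(identifier_to_features("a"))  # [[0.9, 0.1], [0.7, 0.3]]
```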
[0161] Furthermore, by executing the index co-occurrence state learning method of FIG. 6 using such search conditions and search results, the biases of a user's tastes and preferences can be statistically analyzed and extracted. As shown in FIG. 12, by adding to the normal search procedure a process that learns the co-occurrence state of the search conditions (step S1202), a process that learns the co-occurrence state of the search results (step S1206), and a process that learns the co-occurrence state of the search results the user selects (step S1209), it becomes possible to learn the co-occurrence information accompanying searches that reflect the user's intentions and tastes, and an indexing evaluation function tailored to the user can be constructed.
[0162] In the process of FIG. 12, a search condition is first entered by the user through voice input, character string input, or image input (step S1201). Then, co-occurrence information is acquired and learned (step S1202) from the input character string as search condition information, from the phoneme string or phoneme-piece string obtained from the utterance and the feature quantities and identifiers obtained from the image, and from the related feature quantities and identifiers extracted by the dictionary information extraction unit 106 from the dictionary information storage unit 214 based on that search condition information. An evaluation function is constructed based on the learned co-occurrence information, and the evaluation function is saved (step S1203).
[0163] More concretely, when the utterance recognized from the utterance phoneme string is "search, bokaan, explosion", the search process is selected via the command dictionary by the keyword "search"; "bokaan" sets the onomatopoeic phoneme string "b/o/k/a/a/a/a/n/n/n/n/n" as the search condition phoneme string; and "explosion" sets as search conditions the identifier of an explosion-sound evaluation function built from collected explosion-sound feature quantities, together with the image feature "an image in which the warm-color area increases over time". In this way a co-occurrence state of plural identifiers and feature quantities can be constructed. [0164] If, as above, the user's input is "bokaan", the search condition may also be constructed so that related identifiers, feature quantities, and identifier strings, and not merely related onomatopoeia, can be searched, for example by adding "d/o/k/a/a/a/a/n" for the similar explosion onomatopoeia "dokaan"; alternatively, the input may be converted through a dictionary into associated feature quantities, identifiers, or identifier strings from different recognition methods and added as search conditions.
[0165] Then, "search condition co-occurrence information" analogous to the "index co-occurrence information" is constructed from the phoneme string, the identifier of the explosion-sound evaluation function, and the image feature, and by constructing an evaluation function according to the aforementioned "co-occurrence state learning procedure", the search condition can be learned. Note that the feature "the warm-color area spreads" can be measured by evaluating whether the on-screen area occupied by warm colors such as red and yellow increases over time.
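The warm-color measurement just described can be sketched as counting the fraction of reddish/yellowish pixels per frame and checking that it grows over time; the RGB thresholds below are an assumption, not taken from the specification:

```python
import numpy as np

def warm_fraction(frame_rgb):
    """Fraction of pixels judged warm (red/yellow dominant over blue)."""
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    warm = (r > 128) & (b < r) & (b < np.maximum(g, 1))
    return float(warm.mean())

def warm_area_increasing(frames):
    """True if the warm-color area grows monotonically across the frames."""
    fractions = [warm_fraction(f) for f in frames]
    return all(x < y for x, y in zip(fractions, fractions[1:]))

# Toy frames: black -> half warm -> fully warm.
h = w = 4
black = np.zeros((h, w, 3), dtype=np.uint8)
half = black.copy(); half[:, :2] = [255, 200, 0]
full = np.full((h, w, 3), [255, 200, 0], dtype=np.uint8)
print(warm_area_increasing([black, half, full]))  # True
```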
[0166] At this time, if the character string or phoneme string / phoneme-piece string input in step S1201 is stored in the dictionary information storage unit 214, it may be converted into other identifiers or feature quantities based on the information extracted from the dictionary information storage unit 214 by the dictionary extraction unit 106 before being used as co-occurrence information for learning, or the identifier feature quantity conversion unit 118 may be used to convert identifiers into feature quantities for use in the search.
[0167] Subsequently, step S1204 is executed as a search based on the co-occurrence information specified as the search condition, and search results with a high degree of match to the search condition are obtained. Then, for example for search results whose match rate exceeds 80%, step S1206 is executed to learn the co-occurrence information of the feature quantities and identifiers used in the index information attached to the target scenes obtained from the content information. The result is saved as a learning result in step S1207. Conditions such as being within the top 10 results, or having a match rate of 90% or higher, may be imposed on the co-occurrence information to be learned.
[0168] Next, when the user selects a search result (hereinafter, a search result selected by the user is called a "selected search result") (step S1208; Yes), co-occurrence information is learned based on the selected search result (step S1209). The evaluation function is then reconstructed based on the co-occurrence information learned in step S1209 and saved (step S1210). When the user again selects a search result to use from among the search results (step S1211; Yes), processing is executed again from step S1209.
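The selection-driven relearning loop of steps S1206 through S1210 can be sketched as follows, using the match-rate and top-N conditions from the text and hypothetical learn/reconstruct helpers supplied by the caller:

```python
def relearn_from_results(results, learn, reconstruct,
                         min_match=0.8, top_n=10):
    """results: list of (co_occurrence_info, match_rate, user_selected).

    S1206/S1207: learn from well-matching results (e.g. match rate > 80%,
    optionally only the top N); S1209/S1210: learn from user selections
    and reconstruct the evaluation function.
    """
    candidates = sorted(results, key=lambda r: -r[1])[:top_n]
    for cooc, rate, _ in candidates:
        if rate > min_match:
            learn(cooc)                      # S1206, saved in S1207
    for cooc, _, selected in results:
        if selected:
            learn(cooc)                      # S1209
            reconstruct()                    # S1210

# Usage with printing stubs:
relearn_from_results(
    [("sceneA", 0.92, True), ("sceneB", 0.7, False)],
    learn=lambda c: print("learn", c),
    reconstruct=lambda: print("reconstruct evaluation function"),
)
```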
[0169] The content information searched in this way may be a classification such as a content genre or category, a single image as content, a photo collection of gathered images, a piece of music, the chorus section of a piece of music, a work such as a movie or video, a single scene within a work, or a range having image and audio features common to works in a specific field. Since search results can be obtained based on the co-occurrence tendencies of specific identifiers and feature quantities in the content information, scene search and title search of content by user instruction become possible.
[0170] Then, it is selected whether the user will input another search, that is, whether to end the process. If an operation that enters a search condition again is performed (step S1212; No), processing transitions to step S1201 and is executed. If no search condition is entered (step S1212; Yes), this process ends.
[0171] As a result, by recording, searching, and detecting the identifiers and feature quantities determined by evaluation functions based on the co-occurrence information of the constructed plural feature quantities and plural identifiers, in association with content information, searches matched to tastes, preferences, and interests more complex than before can be realized, and the convenience of information search using the co-occurrence states of phonemes, phoneme pieces, and/or emotions, and/or other identifiers, and/or their feature quantities can be improved.
[0172] Then, using the search results, search conditions, and index-based co-occurrence information obtained in this way, or the evaluation functions learned from the co-occurrence information, advertisements and promotions may be associated; billing conditions may be changed according to the association conditions or evaluation functions; the indexes of advertisement information stored in the advertisement information storage unit 216 may be evaluated to present highly similar advertisements; feature quantities and identifiers associated with keywords or keyword phoneme strings may be used as advertising conditions; advertising fees for frequently used search conditions may be set higher; and the shape data, texture data, or position of displayed two-dimensional or three-dimensional images may be changed.
[0173] Co-occurrence information is constructed from the phoneme symbol strings, phoneme-piece symbol strings, and various identifier strings obtained as search results or search conditions, in the same manner as at indexing time; a match evaluation between identifier strings may be performed to search for content information of high similarity according to the search condition. Alternatively, combined index information may be constructed in the same manner as the aforementioned indexing, using the co-occurrence information of the user's uttered phonemes, the emotion identifier selected by an emotion-word phoneme string registered in the speech recognition dictionary based on the user's uttered phonemes, and the color identifier selected by a color-word phoneme string registered in the speech recognition dictionary based on the user's uttered phonemes.
[0174] Identifiers may also be constructed by analyzing, classifying, and learning, with multivariate analysis techniques, the symbol strings of phonemes and phoneme pieces recognized from feature quantities obtained from audio, identifiers such as emotion, musical scale, instrument sounds, and environmental sounds, and/or identifiers such as shape, color, characters, and motion recognized from feature quantities obtained from video, together with the feature quantities associated with the identifiers described above and below; new identifiers may be constructed and used in the course of doing so. Details are given in "Example of identifier reconstruction".
[0175] To convert an input character string or input phoneme string into other identifiers or feature quantities, mutual information conversion may be performed as detailed in "Example of dictionary configuration"; conversion from identifiers to feature quantities and from feature quantities to identifiers can be configured arbitrarily, as described in the respective sections below.
[0176] The discriminant function information configured in this way can also be exchanged, distributed, or shared based on the "Example of an information sharing procedure between users", and convenience may be improved by users reusing one another's evaluation functions. As detailed in "Example of procedures for an information processing device used in terminals and base stations", processing may be divided between server and client under a server-client model, with the devices connected by communication and information exchanged between server and client, to implement equivalent services and infrastructure, search, indexing, detection, and the arbitrary processing accompanying detection.
[0177] If sensor information is used, for example, a temperature sensor may be attached to a surveillance camera or the like to detect ambient temperature changes together with changes in image features. In the co-occurrence information described above, the phoneme string recognized when an explosion occurs serves as phoneme identifiers, the increase in the number of warm-color pixels on screen serves as a feature quantity, and the temperature rise is added to the co-occurrence matrix as temperature sensor information and recorded, so that the co-occurrence information accompanying an explosion can be learned, indexed, and searched.
[0178] Video input, audio input, and sensor input may each come from plural channels. Using the offsets between the inputs from each channel, feature quantities and identifiers based on stereo images or stereo audio may be constructed to estimate position or motion; and the fact that one event always occurs in connection with another, even with a time-series lag, may be detected by evaluating co-occurrence relations between the identifiers and feature quantities of different channels over a time-series width of several seconds to several minutes or more.
[0179] Thus, the first feature of the present invention lies in the indexing of content information described later: performing diverse indexing by combining phoneme information and phoneme-piece information and/or emotion information and/or auditory information, visual information, character information, program information, sensor information, and the like; learning co-occurrence information based on the attached indexes; and performing searches based on the co-occurrence information. The second feature, as in the search processing example of the present invention, lies in executing searches on voice input, image input, and character string input using a dictionary that assigns phoneme strings and phoneme-piece strings to the identifiers and feature quantities used in the present invention, based on their respective names.
[0180] In the present invention, "continuous phonemes" and "continuous phoneme pieces" — information indicating how these elements change, as information representing the continuous state of phonemes and phoneme pieces — may also be considered. A "phoneme string" or "phoneme-piece string" refers to an information sequence in which these phonemes or phoneme pieces are arranged as symbols or identifiers, and may also be written "phoneme identifier string" or "phoneme-piece identifier string"; for the various identifiers, match evaluation as identifier strings can likewise be considered.
[0181] For this reason, the identifier recognition methods, feature extraction methods, identifier-string match evaluation methods, information classification methods, information learning methods, communication procedures, types of storage media, types of communication media, configurations of the information processing device, configurations of terminals and distribution base stations, device shapes, device sizes, installation locations, and sensors used in the device may be combined arbitrarily as needed to build the device or implement the program. By using in combination the co-occurrence states of phonemes and phoneme pieces, image-related feature quantities, emotion identifiers, and acoustic information, which were handled independently in conventional search, the characteristic of the information processing device based on the present invention is search with improved convenience in search and detection, achieved by learning those co-occurrence states, indexing, relearning using search results, and constructing identifier co-occurrence dictionaries and mutual conversions. Content information is indexed by plural recognition methods and plural feature extraction methods according to the time axis or the browsing order of the information, and, using the co-occurrence states of the emotions detected from utterances, the phoneme strings and phoneme-piece strings, and their identifiers and feature quantities, search and detection, learning of co-occurrence states using multivariate analysis, and the application of learning results to search and detection can be performed.
[0182] Advertisements and promotions implemented based on the present invention may be combined arbitrarily with conventional inventions; fees may be changed according to the frequency of access to advertisements, the frequency of content use, or the quality, size, and duration of advertisements; prizes may be offered through quizzes or questionnaires; and advertisement results for targets detected using the present invention can be processed statistically to execute interactive advertising.
[0183] Also, by using the search specification function, which searches for and specifies content according to search conditions, to detect information matching the conditions from content distributed in real time, it is possible to deliver mail, switch to a channel matching the conditions, start recording or playback, have a robot or agent begin speaking, play back recorded content of another channel going back to the detection time, change device settings, construct shortcuts containing links to detection results, or aggregate the detected information and present it to the user.
[0184] The present invention further lies in: executing indexing using the other feature quantities and identifiers described later; learning new identifiers and reconstructing identifiers based on the co-occurrence states of those indexes; setting search conditions that use co-occurrence states; learning new identifiers and reconstructing existing identifiers based on search conditions specified by the user; learning new identifiers and reconstructing existing identifiers based on the co-occurrence states in the search results obtained according to the search conditions; and performing search and detection based on co-occurrence information combining new identifiers and reconstructed existing identifiers, as well as search and detection based on identifiers and feature quantities constructed through multivariate analysis and learning of the co-occurrence information.
[0185] Also, as in the search processing example of the present invention, by using phoneme strings, phoneme-piece strings, or emotion identifiers based on the names of the identifiers and feature quantities used in the present invention, or hash values of those symbol strings, conversion dictionaries and/or co-occurrence dictionaries are constructed between the internal IDs and name strings associated with the identifiers and feature quantities obtained from recognition of input character strings, input speech, and input images given as search or detection conditions, and the symbol strings of phoneme strings and phoneme-piece strings used for recognition. After extracting identifiers and feature quantities based on the input speech, input character string, or input image given as the search or detection condition, the necessary targets are selected using the identifier conversion dictionary, the co-occurrence dictionary, evaluation functions based on the covariance matrix of the feature quantities, and the like. The invention thus lies in search, detection, indexing, learning of co-occurrence information, and identifier reconstruction using: condition generation processing with co-occurrence information that combines identifiers and feature quantities other than character strings, drawing on the identifiers and feature quantities obtained from recognition of the input character strings, input speech, and input images used as the aforementioned search and detection conditions; image identifiers and image feature quantities, environmental sound identifiers and environmental sound feature quantities, and emotion identifiers and emotion feature quantities associated with the uttered phoneme strings and/or uttered phoneme-piece strings obtained through phoneme recognition or phoneme-piece recognition; program identifiers based on distributed program information; program identifiers based on the co-occurrence states of program identifiers, image identifiers, and acoustic identifiers; and program feature quantities based on the co-occurrence feature quantities of image feature quantities and acoustic feature quantities.
[0186] As methods for saving and recording the identifiers and feature quantities in association with content: they may be recorded together with time information in a dedicated database; saved to an index file used as a separate file alongside the video and audio information; inserted into a video stream such as an MPEG file, updating the MPEG file's free areas, comment areas, and meta information description areas; or distributed using program information in a markup language such as EPG or BML, or teletext, then received by the user and saved to a storage medium by the methods described above, so that the index information according to the present invention can be used.
[0187] <Application examples of the present invention>
The ranges and technologies to which the present invention can be applied are described below as modifications: "Examples of content information" as the target content information; "Examples of feature quantities and identifiers" usable for co-occurrence information; "Example of dictionary configuration" for converting identifiers and feature quantities into phoneme or phoneme-piece symbol strings and for converting identifiers into one another; "Example of methods for converting natural information into feature quantities" and "Example of methods for converting feature quantities into identifier strings", for constructing dictionaries and converting content information into identifiers; "Example of information indexing methods", for indexing using various kinds of recognition; "Example of methods for converting identifier strings into feature quantities", for performing feature-quantity search based on identifiers; "Example of methods for evaluating the match between feature quantities and between identifier strings", for evaluating similarity in order to detect target ranges in a search; "Example of information search methods" based on the present invention; "Example of arbitrary processing accompanying identifier detection", which performs processing according to information detected by the search function of the present invention; "Example of identifier learning based on search, detection, and indexing", which performs learning using search results and indexes; and "Example of identifier reconstruction", which uses that learning.
[0188] <<Examples of Content Information>>
First, the content and content information subject to the search and indexing performed using the present invention are described. It is generally well known that "content" refers to movies, dramas, photographs, news reports, animation, illustrations, paintings, music, promotional videos, novels, magazines, games, papers, textbooks, dictionaries, books, comics, catalogs, posters, broadcast program information, and the like; in the present invention, however, it may also include public information, map information, product information, sales information, advertising information, reservation status, viewing status, road conditions, questionnaires, surveillance camera video, satellite photographs, blogs, models, dolls, and the camera and microphone inputs of robots.
[0189] The content may also be text that anticipates time-series changes in video, time-series changes in audio, or time-series changes in the reader's reading-aloud position, electronic information written in a markup language such as HTML, or search index information generated from them; the reading-aloud position may be interpreted as a time axis, and clauses, sentences, and passages may be treated as frames.
[0190] The content may further include meta-information attached to the content; EPG and BML as document information and program information based on character information; musical scales as score information; general still images and moving images; polygon data, vector data, texture data, and motion data as three-dimensional information; still images and moving images based on visualized numerical data; and content information intended for promotion and advertising. It is composed of natural information including visual information, auditory information, character information, and sensor information.
[0191] <<Examples of Features and Identifiers>>
Next, the identifiers and features considered as modifications of the present invention are described. The features and identifiers used in the present invention are defined mainly over auditory information, visual information, and sensor information as natural information; indexing is performed by associating phonemes, phoneme pieces, and emotions with the auditory, visual, and sensor information, and search is performed by evaluating the co-occurrence state of that information. [0192] First, for features and identifiers based on auditory information, the following can be used: frequency and volume features such as the FFT, cepstrum, mel-cepstrum, and directional patterns used for speech and sound; changes in those features over time; and features obtained by known feature extraction methods such as differences in the volume, phase, and frequency components of sound recorded at different positions. Identifiers recognized from such features include phonemes and phoneme pieces; emotion identifiers indicating joy, anger, sorrow, and pleasure; person identifiers based on voice quality; scale identifiers; instrument identifiers distinguishing pianos and guitars; and environmental sound and sound effect identifiers for explosions, rain, the sound of a pachinko parlor, wind, waves, machine sounds, and noise. The features extracted from speech waveforms are classified into populations collected for each name, and evaluation functions such as distance functions obtained by multivariate analysis or HMM functions obtained by learning are constructed from the classified populations. Identifiers based on audio information, such as environmental sound identifiers, noise identifiers, and machine sound identifiers, can then be constructed from the name phoneme sequences, name phoneme-piece sequences, character string IDs, and numeric IDs associated with the evaluation functions.
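As a sketch of how such per-name populations and their evaluation-function statistics might be prepared, labeled feature vectors can be grouped per identifier name and reduced to the mean and covariance from which a distance function is later built. The NumPy code and all names below are assumptions for illustration; the real feature pipeline is implementation-dependent.

    import numpy as np

    def build_populations(samples):
        """Group (name, feature-vector) pairs into per-identifier populations
        and compute the mean and covariance used by a distance function."""
        groups = {}
        for name, vec in samples:
            groups.setdefault(name, []).append(vec)
        stats = {}
        for name, vecs in groups.items():
            arr = np.asarray(vecs, dtype=float)
            stats[name] = (arr.mean(axis=0), np.cov(arr, rowvar=False))
        return stats

    # Hypothetical 2-D features for two environmental-sound identifiers.
    samples = [("rain", [0.2, 1.1]), ("rain", [0.3, 0.9]), ("rain", [0.25, 1.0]),
               ("wind", [0.8, 0.2]), ("wind", [0.7, 0.3]), ("wind", [0.9, 0.25])]
    print(build_populations(samples)["rain"][0])  # mean vector of "rain"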
[0193] Next, for features and identifiers based on visual information, identifiers recognized using known moving-image and still-image features such as the luminance differences, color differences, and motion vectors used for images and video include: landscape identifiers indicating urban areas, green spaces, coasts, mountains, deserts, weather, facial expressions, and the way daylight changes with time and season; object identifiers indicating objects such as cars, people, faces, flowers, animals, and plants; image identifiers indicating image features such as luminance, color, and contour; motion identifiers indicating actions such as the movement speed of an object, changes in its motion, and changes in state accompanying its behavior; and display position identifiers indicating where image-related identifiers appear within the image area. The features extracted from moving and still images are classified into populations collected for each name, and evaluation functions such as distance functions obtained by multivariate analysis or HMM functions obtained by learning are constructed from the classified populations. Identifiers based on moving and still images, such as landscape identifiers, object identifiers, and motion identifiers, can then be constructed from the name phoneme sequences, name phoneme-piece sequences, character string IDs, and numeric IDs associated with the evaluation functions.
[0194] As for emotion identifiers serving as emotion information recognized from auditory, visual, or character information, not only general emotions such as joy, anger, sorrow, and pleasure inferred from facial expressions and tone of voice, but also words indicating the emotions and mental states described in psychology-related books may be detected and recognized and used for identification.
[0195] These identifiers and features are not limited to the color appearance frequencies and phonemes within a single frame as in the preceding embodiments; they may be identifiers and features spanning multiple frames, identifiers and features based on the transition information of identifiers and features across multiple frames, features and identifiers carrying coordinate information within the display screen, features and identifiers carrying coordinate information in a spatial coordinate system obtained by arithmetic position estimation using visual or auditory information, or features and identifiers extracted in association with the time axis. They may also use the depth restored from detected features by arithmetic spatial calculation or the depth given as coordinate information of three-dimensional image information; the area, mass, volume, and velocity calculated from the restored depth or from the depth given as coordinate information, or the area, mass, volume, and velocity given as numerical information; and the weight and mass estimated from the calculated or numerical area, mass, volume, and velocity, or the weight and mass given as attribute information.
[0196] Accordingly, identifiers may be expressed as identifiers that associate character string IDs or numeric IDs with evaluation functions based on the features of speech, moving images, or still images, or as arbitrary character strings recognized from speech, moving images, or still images; identifiers may be combined and used as identifier strings; combinations of arbitrary evaluation values obtained by recognizing identifiers with evaluation functions, such as identifier co-occurrence probabilities, feature covariance matrices, HMM output probabilities, HMM transition probabilities, distance values of distance functions, and DP match evaluation values, may be used as features for reconstructing HMMs and evaluation functions; and identifier strings may be constructed from the time-series changes of identifiers.
[0197] Identifiers may also be assigned to populations learned by self-organization in multivariate analysis and used for search, recognition, detection, and indexing; such identifiers may be used as search conditions, or the features of multiple images, videos, and sounds used in the multivariate analysis may be combined and used as features for evaluating the identifiers of the learned populations.
[0198] Hash values obtained arithmetically from the feature averages and variances of an arbitrary identifier, its name string, or the phoneme or phoneme-piece sequences of its name may also be used for indexing. Symbols and identifiers carrying length information, such as "discernment-long" and "discernment-short", may be used based on the number of consecutive occurrences or duration of a phoneme and length information classified into several types such as "long, medium, short"; symbols and identifiers carrying position information, such as "phoneme-front" and "phoneme-rear", may be used based on position within the range of a single phoneme; these identifiers and symbols may be combined into symbol strings and identifier strings to construct new identifiers; and the average of the distances or likelihoods output by the evaluation functions over intervals where phonemes, phoneme pieces, and other identifiers continue as evaluation results may be used as the length information for classifying the identifiers described above or as weight information according to identifier length.
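One way to realize the length-tagged symbols mentioned above (for example "discernment-long" and "discernment-short") is to bucket each phoneme's duration into a few classes and append the class to the symbol. The following sketch assumes hypothetical duration thresholds and names.

    def length_tag(duration_sec, short=0.08, long=0.16):
        """Classify a duration into short/medium/long (thresholds are assumptions)."""
        if duration_sec < short:
            return "short"
        if duration_sec > long:
            return "long"
        return "medium"

    def tag_phonemes(timed_phonemes):
        """Turn (phoneme, duration) pairs into length-annotated symbols."""
        return ["%s-%s" % (p, length_tag(d)) for p, d in timed_phonemes]

    print(tag_phonemes([("a", 0.05), ("n", 0.12), ("o", 0.20)]))
    # -> ['a-short', 'n-medium', 'o-long']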
[0199] In this case, using the features described in the prior art mentioned above, the features cited in those documents, or any image recognition technology, features may be extracted from the faces of television performers such as celebrities and recognized on the basis of the extracted features. By associating the recognized identifiers with character information used as program information, such as EPG, BML, RSS, teletext, and subtitles, with phoneme symbol strings from the utterances of users or the audio accompanying the video, or with phoneme and phoneme-piece sequences converted from character information, identifiers for discriminating displayed persons and displayed contents can be provided; identifiers can likewise be provided for displayed objects, and the present invention may be modified in this way.
[0200] Text and character information based on character strings may be combined with any document processing method, and may be realized by combining the feature extraction methods for character strings found in the related patents and literature; the information evaluation methods using co-occurrence states, such as the "embodiment of search and arbitrary processing with multiple identifiers and multiple search conditions" described later, may also be used. Character spacing, character counts, character appearance frequencies, and character co-occurrence frequencies in character string information, word spacing, word counts, word appearance frequencies, and word co-occurrence frequencies in text information, and symbol spacing, symbol counts, symbol appearance frequencies, and symbol co-occurrence frequencies in text information may be combined into features, and identifiers obtained by text analysis and recognition based on those features may be used.
[0201] As an application of the present invention, environmental sounds may be processed as onomatopoeia and evaluated on the basis of phoneme or phoneme-piece recognition; after constructing a co-occurrence matrix between environmental sound features, environmental sound identifiers, or sound effect identifiers and phoneme or phoneme-piece identifiers, the features may be learned to construct new features and new identifiers as onomatopoeic features and onomatopoeic identifiers. Person identifiers based on voice quality and its changes, or emotion identifiers based on voice quality and its changes, may also be used to switch the acoustic model used for recognition and thereby improve the recognition rate.
[0202] Identifiers specified in an arbitrary protocol may also be used in association with the names of the articles they identify. For example, in interface standards such as MIDI, the scheme called General MIDI associates IDs directly with instruments, which can be used to map instrument numbers to instrument names, and a co-occurrence matrix of those IDs and features may be constructed. Similarly, since JAN codes can uniquely fix a target by manufacturer code and item code, ordinary barcodes, two-dimensional barcodes, RFID tags, teletext, closed captions, EPG, subtitles, BML, RSS, and the like may also be used.
[0203] Co-occurrence information based on features and identifiers constructed from co-occurrence states is information based on the fact that arbitrary identifiers, arbitrary features, or the distance and probability information output as recognition results for arbitrary identifiers occurred simultaneously within a specified time range. For example, when a "smile" is recognized in an image, it is a feature newly expressed by the co-occurrence probabilities of the "laughter" phoneme identifier, the "laughing" emotion identifier, and the "laughing" motion identifier recognized in its temporal vicinity and by the co-occurrence covariance matrix of the motion features; based on such newly expressed features, a "laughing state" identifier and its evaluation function or evaluation HMM may be constructed.
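A minimal sketch of gathering such co-occurrence information follows: identifiers detected within a shared time window are counted into a co-occurrence table, which could later back a "laughing state" style composite identifier. The event format, window size, and identifier names are assumptions for illustration.

    from collections import Counter
    from itertools import combinations

    def cooccurrence_counts(events, window_sec=2.0):
        """Count pairs of identifiers whose detection times fall within
        window_sec of each other. events: list of (time_sec, identifier)."""
        counts = Counter()
        for (t1, a), (t2, b) in combinations(sorted(events), 2):
            if abs(t1 - t2) <= window_sec:
                counts[tuple(sorted((a, b)))] += 1
        return counts

    events = [(10.0, "SMILE"), (10.5, "LAUGH_PHONEME"),
              (11.0, "EMOTION_LAUGH"), (40.0, "SMILE")]
    print(cooccurrence_counts(events))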
[0204] In this case, rather than a general time expression, the specified time range may be determined on the basis of units that take into account time-series transitions, such as the number of frames (fields) in a time-series moving image, the degree of deviation of a feature from the average of neighboring frames, or, for reading text aloud, the number of characters, character position, number of words, number of sentences, number of passages, and number of chapters or pages; the character information may include text information and program information.
[0205] Solid (stereoscopic) information may also be generated from 2.5-dimensional and three-dimensional features based on image features obtained from multiple imaging inputs, and the degree of match may be evaluated by distance evaluation between the generated solid information and solid information consisting of polygon information and texture information, by 3D shape match evaluation in stereoscopic image and video search, or by distance evaluation using the coordinate positions relative to the centroid of pseudo-three-dimensional or three-dimensional information and the eigenvalues and eigenvectors of the coordinate groups.
[0206] The scale type is scale information such as "do re mi fa sol la si do" and may include octave information; it may also include tempo, rhythm, and chord information derived from the transitions of scale identifiers along the time axis. It is known that instrument types can be recognized by learning the acoustic information of instruments collectively, and according to the known literature, single-tone recognition accuracy exceeding 90% is achievable.
[0207] Environmental sound types can be recognized from frequency and volume features such as the FFT, cepstrum, mel-cepstrum, directional patterns, and formant extraction; changes in those features over time; audio features based on differences in the volume, phase, and frequency components of sound recorded at different positions; sound source positions based on left-right phase and volume differences; and timbre based on frequency distribution characteristics and pitch transitions. As with instrument identification, sounds such as waves and wind can be recognized by grouping biased information and applying evaluation functions, and as an application this can also be used for machine sound types. More specifically, features and identifiers can be constructed from information such as engine sounds, exhaust sounds, the sound of a steam locomotive, the sound of running on rails, wind, animals, insects, birdsong, waves, trees, horns, screams, shouts, crying, laughter, ground rumbling, and other natural sounds, machine sounds, sounds made by living things, and explosions. Acoustic identifiers may include scale identifiers, volume identifiers, timbre identifiers, and chord identifiers; sound position and sound source direction identifiers can discriminate whether a sound comes from above, below, left, or right; echo state identifiers can discriminate the size of a room from the speed of indoor reflections; instrument types can discriminate trumpet and piano sounds; machine sound identifiers may cover machine sounds, engine sounds, tappet sounds, screw sounds, exhaust sounds, tool sounds, furniture sounds, flight sounds, and noise; natural sound identifiers may cover wind, waves, roars, and explosions, along with environmental sound and sound effect identifiers; and speech identifiers may include language identifiers, speech rate identifiers, exclamation identifiers, cheer identifiers, and jeer identifiers. Combinations of features specialized for these identifiers are conceivable.
[0208] Image type identifiers begin with features such as contours based on luminance differentiation, hue differences, color densities, and their differences, and include: face types for recognizing human faces; expression types and person types recognized from the shape of a face; person types discriminated from gait, clothing, and physique; landscape types detected from the color and shape components of an image that can distinguish deserts, seas, and cities; action types based on sign language, gestures, dance, animal behavior, and machine operation extracted from the temporal transitions of image features; and display position types indicating where those features lie within the display area, so that, for example, a numeral in the upper right of the screen is detected as time information because of its high co-occurrence with time character-image information, or the position is related to the orientation of an object in the display area. Features and identifiers can be constructed from such information: image identifiers such as luminance, saturation, hue, contour, motion, image position, speed, and movement direction identifiers; object identifiers such as animal, plant, machine, tool, furniture, person, material, sign, and landscape identifiers; shape identifiers such as face, expression, mouth shape, clothing, hairstyle, skin, body shape, posture, and waveform shape identifiers; and character identifiers such as language, font, character size, and symbol type identifiers. Combinations of features specialized for these identifiers are conceivable.
[0209] These identifiers and features are processed in the feature extraction units and feature-identifier conversion units in Figs. 14 and 15; although not all identifiers and features are listed for the feature-identifier conversion, character-string/feature-identifier conversion, search result generation, and feature extraction units, they may be regarded as implemented. Features are extracted from natural information according to the feature extraction methods, identifiers are determined from the extracted features using evaluation functions, and features and identifiers are obtained by conversion from input phonetic symbol strings; these can be used for indexing, search, and learning. Since the identifiers and features used together with phonemes, phoneme pieces, and emotion identifiers can be changed according to the implementation, the individual recognition technologies are not the subject of the present invention, whose purpose is the indexing, learning, and search of co-occurrence states accompanying image, audio, and emotion recognition results, the use of search results, and the generation of search conditions from the phonemes and phoneme pieces of the names of the various identifiers.
[0210] Features and identifiers can also be constructed from: character and symbol types distinguished by the meanings and sounds given to symbols; sign types whose meanings are discriminated as graphical symbols; shape types that discriminate corners, curves, and contours as elements of the image features described above; graphic symbol types that discriminate figures and figure elements whose combined meanings are fixed to some extent; and program types for broadcast programs and distributed content, such as guides using EPG, teletext, BML, and RSS for discriminating the contents of broadcast programs as program information. With EPG the program content can be obtained as character strings, and with BML not only the program content but also changes in the situation as the program progresses.
[0211] As for sensor information, temperature sensors, gas sensors, and motion sensors may be added to the present invention, and identifiers may be constructed by classifying whether the input information from those sensors indicates possible danger to human life; co-occurrence information between such identifiers and images and sounds may be collected and used for protective evaluation for human safety by robots and for the safety evaluation of the apparatus itself. Heart rate sensors, electroencephalographic sensors, muscle current sensors, and skin resistance sensors may be combined to constitute a medical psychoanalysis apparatus. In combination with inventions related to pedestrian navigation and car navigation, position identifiers may be acquired from position information such as GPS and associated to perform searches or to learn co-occurrence states, and services and apparatuses based on the co-occurrence states of these features and identifiers may be configured using multilayer Bayes models, multilayer HMMs, multilayer neural networks, and the like for recognition, classification, discrimination, and evaluation.
[0212] Identifiers may also be constructed from different characteristics, such as sounds collecting only a specific noise, sounds collecting only a specific instrument, pianos and drums, dogs and cats, the machine sounds of automobiles and factories, cheers, and scales. Feature extraction and recognition may likewise be applied to video input from outside the apparatus according to the present invention to identify persons and facial expressions from faces, to identify articles, characters, figures, symbols, and signs from shapes and colors, and to identify actions from inter-frame differences and changes in sound source position; these may be recorded in association with the video and audio and used for indexing, and in the future, indexing may be performed on environments, chemical components, and physical properties such as smell, taste, temperature, humidity, weight, hardness, viscosity, density, and size.
[0213] The information processing unit includes: a feature extraction unit that extracts features from the natural information obtained from outside via the information input unit and from content information processable by the information processing apparatus, such as video, images, and audio acquired from the communication line unit or the storage unit, music as audio information, documents, music as score information, still and moving images, polygon data and vector data, and still and moving images based on numerical data; a co-occurrence information learning unit that learns the co-occurrence information obtained from features, search conditions, and search results; an index information generation unit that determines identifiers from the extracted features by recognition processing and performs indexing; an index search evaluation unit that performs searches by evaluating the degree of match between index information and search conditions consisting of features and index identifiers; an evaluation list output unit that outputs search results as an evaluation list; a feature-identifier conversion unit that converts features acquired from content information and user input into identifiers; an identifier-feature conversion unit that converts identifiers recognized by the user, identifiers acquired from outside via storage media or communication, and identifiers extracted internally from content and the like into the standard features of those identifiers; a dictionary extraction unit that extracts information for the desired conversion from the dictionary information storage unit; and a meta-symbol extraction unit that acquires index information such as MPEG7 from content information, acquires information in markup languages such as RSS and XML from the communication line unit, or acquires EPG, BML, RSS, and teletext information from broadcast waves received by the information input unit, and then extracts the commands, variables, and attributes in arbitrary symbol information. Search, detection, and indexing may be performed by combining these as necessary.
[0214] For this reason, identifiers may be constructed from different characteristics, such as sounds collecting only a specific noise, sounds collecting only a specific instrument, pianos and drums, dogs and cats, the machine sounds of automobiles and factories, cheers, and scales; feature extraction and recognition may likewise be applied to video input from outside the apparatus according to the present invention to identify persons and facial expressions from faces, to identify articles, characters, figures, symbols, and signs from shapes and colors, and to identify actions from inter-frame differences and changes in sound source position; these may be recorded in association with the video and audio and used for indexing, and in the future, indexing may be performed on environments, chemical components, and physical properties such as smell, taste, temperature, humidity, weight, hardness, viscosity, density, and size.
[0215] Methods for storing and recording identifiers and features in association with content include: recording them together with time information in a dedicated database; saving an index file as a separate file used together with the video and audio information; inserting them into a video stream such as an MPEG file to update the free areas, comment areas, and meta-information areas of the MPEG file; and distributing them using a markup language such as EPG, BML, RSS, or teletext so that the user receives them and stores them by the methods described above.
[0216] Discriminant functions and HMMs may also be constructed by arbitrarily combining: phonemes and/or phoneme pieces and/or emotion identifiers and/or scale symbols and/or instrument identifiers and/or environmental sound identifiers extracted from the audio information obtained from search conditions and search results; moving-image features and/or face identifiers and/or person identifiers and/or object identifiers and/or expression identifiers and/or motion identifiers and/or display position identifiers extracted from moving images and/or still images; and character strings and character-string co-occurrence information extracted from EPG, teletext, BML, RSS, and websites related to the content. Identifiers corresponding to the constructed HMMs and discriminant functions may be constructed, and using the distances, degrees of match, and HMM output probabilities obtained as discrimination results as features, the learning of co-occurrence information and the construction of identifiers described in the "Examples of Identifier Learning Based on Search, Detection, and Indexing" and the "Examples of Identifier Reconstruction" may be performed; the various features described above may also be combined, classified by multivariate analysis, and given identifiers to construct arbitrary classification evaluation functions.
[0217] In this case, templates for recognizing information such as phonemes, phoneme pieces, emotion identifiers, scale identifiers, instrument identifiers, and environmental sound identifiers, as well as feature extraction algorithms, symbol string match evaluation algorithms, and symbol recognition algorithms, may be acquired and distributed via communication lines using protocols and markup languages such as HTML, XML, RSS, and CGI described later, scripts, programming languages, and binary code; this is detailed in the "Example Procedure for Information Sharing between Users".
[0218] <<Examples of Dictionary Configuration>>
Next, the dictionary function that mutually converts the identifiers and features described above is explained with reference to the dictionary information storage unit 214 of the storage unit 20 and the dictionary extraction unit 106 of the information processing unit 10. These dictionaries can be implemented with general-purpose programs such as databases and with information processing and storage methods based on common algorithms such as hash buffers and map buffers; it is generally well known that the dictionary information used by the dictionary function can be a group of information items related by an index stored on a storage medium, and since this can be implemented arbitrarily by known methods, it depends on the implementation.
[0219] More specifically, a dictionary can be configured by: a step of inputting an identifier as described above and a step of selecting and outputting another identifier associated with the input identifier; a step of inputting an identifier and a step of selecting and outputting a discriminant function associated with the input identifier; a step of inputting an identifier string and a step of selecting and outputting an identifier associated with the input identifier string; a step of inputting an identifier string and a step of selecting and outputting an identifier string associated with the input identifier string; or a step of inputting an identifier and a step of selecting and outputting the standard pattern of another identifier associated with the input identifier or the average value of the identifier group used for the standard pattern. Any of these can be implemented using the method known as an associative array, and by combining them, information conversion between an arbitrary identifier and its associated identifiers, identifier strings, identifier groups, and standard patterns becomes possible.
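Each of the steps above reduces to an associative-array lookup. A sketch chaining two such dictionaries, identifier to identifier and identifier to standard pattern, follows; the dictionary contents and names are hypothetical.

    # Hypothetical associative arrays: each dictionary step is one lookup.
    id_to_id = {"MOTION_NOD": "STATE_NODDING"}           # identifier -> identifier
    id_to_pattern = {"STATE_NODDING": [0.1, 0.9, 0.4]}   # identifier -> standard pattern

    def convert(identifier):
        """Chain lookups: input identifier -> associated identifier -> its pattern."""
        related = id_to_id.get(identifier)
        return related, id_to_pattern.get(related)

    print(convert("MOTION_NOD"))  # -> ('STATE_NODDING', [0.1, 0.9, 0.4])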
[0220] An identifier is information for classifying the information recognized from features by an evaluation function; an identifier string is information in which identifiers of the same family are arranged in time series; and an identifier group is information in which a plurality of arbitrary identifiers are gathered, preferably in a co-occurrence relation.
[0221] First, these dictionaries are indexed by arbitrary keywords, IDs, and the like. More specific examples are composed of symbols, identifiers, variables, and features, as in the control dictionary and the Japanese-phoneme-to-international-phonetic-symbol conversion dictionary. Arbitrary combinations with the identifiers and features described above are also conceivable, such as a Japanese word-to-phoneme-sequence conversion dictionary, a motion-identifier-name-to-phoneme-sequence conversion dictionary, and a face-image-identifier-name-to-phoneme-sequence conversion dictionary. A Japanese word-to-phoneme-sequence conversion dictionary converts the character string "日本語" into the phoneme sequence "n/i/h/o/n/g/o"; a motion-identifier-name-to-phoneme-sequence conversion dictionary converts between an identifier indicating a "nodding motion" and the phoneme sequence symbol "u/n/a/z/u/k/u"; and a face-image-identifier-name-to-phoneme-sequence conversion dictionary converts between an identifier indicating "Taro's face" and the phoneme sequence symbol "t/a/r/o/u".
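The examples above can be written directly as lookup tables; the sketch below copies the entries named in this paragraph, while the identifier names and the reverse-lookup step are assumptions for illustration. The reverse lookup lets phoneme sequences recognized from speech be mapped back to identifiers (many-to-one mappings would need lists of identifiers per key).

    # Word-to-phoneme-sequence dictionary (entry from the example above).
    word_to_phonemes = {"日本語": "n/i/h/o/n/g/o"}

    # Motion- and face-identifier name dictionaries from the same examples.
    motion_to_phonemes = {"MOTION_NOD": "u/n/a/z/u/k/u"}
    face_to_phonemes = {"FACE_TARO": "t/a/r/o/u"}

    # Reverse lookup: phoneme sequence -> motion identifier.
    phonemes_to_motion = {v: k for k, v in motion_to_phonemes.items()}

    print(word_to_phonemes["日本語"])            # -> n/i/h/o/n/g/o
    print(phonemes_to_motion["u/n/a/z/u/k/u"])  # -> MOTION_NOD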
[0222] In this way, one-to-one, one-to-many, and many-to-one correlations are quantitatively recorded and stored, and processing such as conversion between identifiers and conversion between identifier strings and identifiers becomes possible based on the stored information. These dictionaries are composed of groups of reference information for conversion; when a dictionary is many-to-one, the dictionary information may be composed of association information based on co-occurrence information; dictionaries of evaluation functions and identifiers may be composed using the eigenvalues and eigenvectors of co-occurrence information as in the "Examples of Identifier Reconstruction"; dictionaries may be configured so that features and identifiers can be converted via phoneme sequences, phoneme-piece sequences, numeric IDs, and character string IDs; and dictionaries may be configured to convert phoneme sequences, phoneme-piece sequences, numeric IDs, and character string IDs into evaluation functions and identifiers.
[0223] Of course, as an arbitrary dictionary combining the identifiers and features described above, for example, a motion-identifier-to-Japanese conversion dictionary can convert the "nodding motion" identifier into the Japanese name "unazuku", after which the Japanese phoneme dictionary can be consulted to obtain "u/n/a/z/u/k/u", and by observing the co-occurrence state of these features and identifiers an original "nodding state" identifier can be constructed. Then, by evaluating the co-occurrence state of the information of the "Taro's face" identifier obtained by face image recognition and the "nodding state" identifier, a new "Taro nods" identifier may be constructed.
[0224] In this way, by constructing a dictionary associating identifiers with language-dependent words for each of the features and identifiers described above, a dictionary combining arbitrary identifiers and features may be built, and it may be rebuilt again using previous reconstruction results. By associating them with abstract words, adverbs, adjectives, and unknown nouns, the co-occurrence states of features for those words and phoneme sequences can be learned and used as identifiers and features for search. Hash values may be computed by arithmetic processing such as MD5 or CRC from the phoneme and phoneme-piece sequences associated with those identifiers, and the information recorded in the database may be stored in association with the phoneme and phoneme-piece sequences and the hash values, enabling efficient search within the dictionary by identifiers and features related to phonemes and phoneme pieces. The dictionary may be configured to allow mutual conversion between different identifiers, between identifiers and features, between identifiers and phoneme or phoneme-piece sequences, between features and phoneme or phoneme-piece sequences, and between phoneme or phoneme-piece sequences and phonemes, phoneme pieces, or other phoneme or phoneme-piece sequences, and DP that evaluates hash values against one another may also be used.
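For the DP matching mentioned above, a standard edit-distance dynamic program over phoneme symbols is one concrete realization. The following is a sketch; the unit cost scheme is an assumption.

    def dp_distance(seq_a, seq_b):
        """Edit distance between two phoneme sequences by dynamic programming;
        unit costs for insertion, deletion, and substitution are assumed."""
        n, m = len(seq_a), len(seq_b)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[n][m]

    print(dp_distance("unazuku", "unaziku"))  # -> 1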
[0225] Further, by using the indexes employed in the present invention, which associate moving images, still images, and audio with the co-occurrence information of detected identifiers and with the names of identifiers written as phonemes or phoneme pieces, as dictionary indexes, a dictionary can be constructed from the correlations of arbitrary features and identifiers extracted from video and audio. By combining dictionaries, new dictionary information can be constructed as conversion tables that convert between image, acoustic, and emotion identifiers and phoneme sequences, phoneme-piece sequences, and character strings, or that evaluate the co-occurrence states of identifiers and features related to images, sound, speech, and emotion. In this example, phoneme sequences are used for simplicity of notation, but a dictionary structure using phoneme-piece sequences is also possible; since these co-occurrence dictionaries depend on combinations of features and identifiers extracted by known techniques from video, audio, and other sensors, they depend on the implementation.
[0226] Then, based on the phoneme sequences, phoneme-piece sequences, and emotion identifiers obtained by recognizing natural utterances, an arbitrary word string can be selected and converted by the conversion dictionary into an arbitrary identifier or feature associated with that word string for search; a search can be performed using an identifier evaluation function associated by the conversion dictionary with an arbitrary phoneme sequence, phoneme-piece sequence, or identifier; speech can be searched directly by phoneme sequences, phoneme-piece sequences, and emotion identifiers; keywords registered in the dictionary can be converted into phoneme or phoneme-piece sequences and used in utterance phoneme recognition; phoneme sequences, phoneme-piece sequences, and identifiers not registered in the conversion dictionary can be registered in it; and conversion dictionaries can be constructed from their co-occurrence information.
[0227] These dictionaries are not limited to phonemes and phoneme-piece sequences; they may be co-occurrence dictionaries constructed from the co-occurrence states of the arbitrary identifiers and features described elsewhere. By having the user assign an arbitrary name to a co-occurrence state, conversion from the arbitrary name to co-occurrence information, or conversion into phoneme or phoneme-piece sequences based on words of an arbitrary language associated with the co-occurrence information, can be performed. Speech may be synthesized from recognized phoneme or phoneme-piece sequences, searches may be performed based on phoneme or phoneme-piece sequences, and the phonetic characters and words associated with phoneme or phoneme-piece sequences may be displayed to the user to ask for the user's judgment.
[0228] Note that the conversion between identifiers and features may be performed in both directions using the "Example Methods for Converting Natural Information into Features" and the "Example Methods for Converting Features into Identifier Strings".
[0229] <<Example Methods for Converting Natural Information into Features>>
Next, the feature extraction function that converts natural information into the features required for indexing and search is explained with reference to the feature extraction program stored in the program storage unit 210 of the storage unit 20 and the feature extraction unit 116 of the information processing unit 10. These feature extraction functions can be implemented by general-purpose programs based on a variety of known, common algorithms, and are essentially implementation-dependent. [0230] For moving and still images, motion features are extracted from features used in character and image recognition, such as luminance distributions, hue extraction, mesh extraction such as spider-net extraction, displacement patterns of local autocorrelation between frames, changes in image shape and moving images, and inter-frame differences; these are extracted as video and image features and can be combined with autocorrelation coefficient extraction, higher-order autocorrelation extraction, and the like. For audio features, frequency and volume features and their change features can be extracted using the FFT, cepstrum, mel-cepstrum, directional patterns, formant extraction, rhythm extraction, harmonics extraction, autocorrelation coefficient extraction, and higher-order autocorrelation extraction.
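As one concrete instance of such frequency-feature extraction, short frames of a waveform can be reduced to log power spectra whose frame-to-frame differences give the change features mentioned above. The NumPy sketch below is illustrative only; the frame length, hop size, and windowing choice are assumptions.

    import numpy as np

    def log_power_frames(signal, frame_len=256, hop=128):
        """Split a waveform into frames and compute log power spectra (FFT)."""
        frames = [signal[i:i + frame_len]
                  for i in range(0, len(signal) - frame_len + 1, hop)]
        window = np.hanning(frame_len)
        spectra = [np.log(np.abs(np.fft.rfft(f * window)) ** 2 + 1e-10)
                   for f in frames]
        return np.array(spectra)

    # Synthetic test signal: a 440 Hz tone sampled at 8 kHz.
    t = np.arange(8000) / 8000.0
    feats = log_power_frames(np.sin(2 * np.pi * 440 * t))
    deltas = np.diff(feats, axis=0)  # simple time-transition (change) feature
    print(feats.shape, deltas.shape)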
[0231] For audio, usable features include frequency components, frequency distributions, volume, sound source direction, their differences, and multi-order difference features such as differences of differences, as well as values based on the mean, variance, and standard deviation of this information and the exponent parts of those values. For images, they include color distributions, luminance distributions, saturation distributions, color differential and integral values, luminance differential and integral values, saturation differential and integral values, similarly analyzed RGB values, HSV values, YR-YB-Y values, and YCM values, the frequency component distributions of each, multi-order difference features such as differences of color, luminance, and frequency and differences of differences, and values based on the mean, variance, and standard deviation of this information and the exponent parts of those values; features based on the display position of an object within the image area for recognized image-related identifiers and features; for moving images, the time-axis transitions of the image features mentioned above; and for stereoscopic images, 2.5-dimensional features, the various three-dimensional features restored from 2.5-dimensional features, three-dimensional image coordinate information used in CG, three-dimensional texture information, three-dimensional motion information, three-dimensional color change information, three-dimensional light source change information, and three-dimensional hardness texture information, arbitrary image recognition and 2.5-dimensional image feature extraction, and combinations of such feature extraction methods. Identifiers such as time information, weather information, season information, regional information, and cultural information recognized from those features can also be used.
[0232] Many methods have been proposed for capturing changes in information along the time axis, the spatial axis, physical quantities, visual changes, auditory changes, and human subjective axes and observations, and using them as features and identifiers; the features described in the prior art mentioned above and combinations of the various features cited in those documents can also be used, depending on the implementation. This processing of converting natural information into features corresponds to a step of inputting natural information and a step of converting it into features.
[0233] «Example of a method for converting feature quantities into identifier strings»
Next, the feature-quantity/identifier conversion function (or recognition function), which evaluates the similarity between a feature quantity and an arbitrary identifier by probability and/or distance or likelihood, as required for indexing and searching, will be described with reference to the feature-quantity/identifier conversion program stored in the program storage unit 210 of the storage unit 20 and the feature-quantity/identifier conversion unit 120 of the information processing unit 10. These functions can likewise be implemented by general-purpose programs embodying a variety of well-known, general algorithms, and are essentially implementation-dependent.
[0234] A number of such methods have previously been proposed. For example, feature quantities classified under the same identifier may be given to an HMM, whose transition probabilities and output probabilities are trained so that the HMM can be used as an evaluation function. Alternatively, distance functions may be used: after obtaining the mean, variance, and covariance matrix of the feature quantities classified under the same identifier, the eigenvalues and eigenvectors are computed to construct a distance function, and a Bayes discriminant function or a Mahalanobis distance function is used to obtain the distance between the centroid of the identifier's information group and an input sample; or simply a Euclidean distance function between the input sample and the mean vector of the identifier group may be used. Since these procedures are implementation-dependent, any such method can be employed.
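As a minimal sketch of the distance-function approach just described, the following Python fragment constructs one Mahalanobis evaluator per identifier from its classified samples and selects the identifier with the nearest centroid; the data in training_sets and input_feature are hypothetical placeholders, not material from the disclosure.

```python
import numpy as np

def make_mahalanobis_evaluator(samples):
    # samples: (n, d) array of feature vectors classified under one identifier
    mean = samples.mean(axis=0)                         # centroid of the identifier group
    inv_cov = np.linalg.pinv(np.cov(samples, rowvar=False))  # pseudo-inverse for stability
    def distance(x):
        d = np.asarray(x) - mean
        return float(np.sqrt(d @ inv_cov @ d))          # distance from the centroid
    return distance

# Hypothetical training data: two identifiers with 2-D feature samples.
rng = np.random.default_rng(0)
training_sets = {
    "phoneme_a": rng.normal([0.0, 0.0], 0.5, (100, 2)),
    "phoneme_i": rng.normal([3.0, 3.0], 0.5, (100, 2)),
}
evaluators = {k: make_mahalanobis_evaluator(v) for k, v in training_sets.items()}

input_feature = [2.8, 3.1]
# The identifier whose centroid is nearest is taken as the recognition result.
print(min(evaluators, key=lambda k: evaluators[k](input_feature)))
```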
[0235] With such evaluation functions, the similarity between a feature quantity extracted from the input natural information and a symbol or identifier can be evaluated numerically. The identifier associated with the function whose centroid is closest to the feature quantity of the input sample, the identifier associated with the HMM of highest likelihood, the identifier associated with whichever distance function indicates the shortest distance, or the identifier associated with the population to which the sample most probably belongs, is recognized as the evaluation result of the evaluation function. In this way, recognition of words, phonemes, phoneme pieces, objects, characters, faces, visemes and facial expressions, emotions, sounds, musical instruments, and motions is carried out, and an identifier symbol string can be obtained that tracks the time-series changes of these recognition results.
[0236] With such a method, selecting the correct identifier for an input feature quantity from among a plurality of identifiers is a similarity evaluation in which the distance between the evaluation function of the identifier to be selected and the input feature quantity being compared becomes minimal, or the output probability becomes maximal. For an input feature quantity V that is known in advance to have identifier X, similarity evaluation is performed with identifier evaluation functions X, Y, and Z; if the evaluation function that outputs the value judged most similar is X, the recognition of the identifier can be judged successful.
[0237] As recognition methods using probabilistic models, the following can be used: an uncorrelated normal probability distribution (diagonal normal probability distribution) that considers only the diagonal components of the covariance matrix; a fully correlated normal probability distribution that considers all components of the covariance matrix, which is exact but whose model parameters are difficult to estimate accurately when data are scarce; a mixture normal probability distribution (uncorrelated or fully correlated) that uses a model expressed as a sum of several normal distributions; and a discrete probability distribution that partitions the feature-vector space using vector quantization (VQ). With these, the distance from the centroid of each population or the membership probability is evaluated for the input feature quantity, and an identifier is obtained as the recognition result.
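A minimal sketch of the diagonal (uncorrelated) normal model mentioned above might look as follows; the variance floor and the demo data are illustrative assumptions, and a mixture model would combine several such weighted components per identifier.

```python
import numpy as np

def make_diag_gaussian_scorer(samples):
    # Uncorrelated (diagonal) normal model: only per-dimension mean and variance,
    # i.e. the diagonal components of the covariance matrix.
    mean = samples.mean(axis=0)
    var = samples.var(axis=0) + 1e-9          # variance floor avoids division by zero
    def log_likelihood(x):
        d = np.asarray(x) - mean
        return float(-0.5 * np.sum(d * d / var + np.log(2 * np.pi * var)))
    return log_likelihood

# The identifier whose model yields the highest log-likelihood is recognized.
scorer = make_diag_gaussian_scorer(np.random.default_rng(1).normal(0.0, 1.0, (200, 3)))
print(scorer([0.1, -0.2, 0.3]))
```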
[0238] Phonemes, phoneme pieces, emotion identifiers, and other arbitrary identifiers are generally evaluated using evaluation functions that obtain a likelihood via the various distance functions and probability functions described above. Indexing can then be performed by segmenting the content along its time axis or display positions and evaluating each segment in turn to assign an identifier, or by dividing the time axis into arbitrary unit times and evaluating each frame in turn so that time-series feature-quantity identifiers are assigned; in this way the feature quantities for indexing are converted into identifiers.
[0239] Here, for audio, the data of one frame of FFT, cepstrum, mel-cepstrum, or directional patterns may be a vector of any dimensionality; for image features of moving or still images, one frame may consist of any pixel size. Inter-frame error vectors and inter-pixel error vectors may be given in any dimensionality, and frame-difference feature quantities of any width, or feature quantities obtained by accumulating frame differences, may be used. How the feature quantities are taken at this point is implementation-dependent, so any method may be used.
[0240] Of course, besides the Euclidean distance, distance calculation may use any measure that can serve as a distance: the Mahalanobis distance, the output of a Bayes discriminant function, probability values based on the reciprocal of a probability or on a base such as the natural logarithm, the exponent part of values to such a base, the city-block distance, the chessboard distance, the octagonal distance, the Hetas distance, and the Minkowski distance, as well as similarity measures and distances obtained by weighting such distances, distance calculation methods using combinations of eigenvalues and eigenvectors, and combinations of eigenvalue/eigenvector norms, maximum eigencomponents, and the like.
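By way of illustration, a few of these interchangeable distance functions could be sketched as follows (assuming vector inputs; the parameter choices are arbitrary):

```python
import numpy as np

def city_block(a, b):                 # L1 distance
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))

def chessboard(a, b):                 # L-infinity (Chebyshev) distance
    return float(np.max(np.abs(np.asarray(a) - np.asarray(b))))

def minkowski(a, b, p=3.0):           # generalizes city block (p=1) and Euclidean (p=2)
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p))

def weighted_euclidean(a, b, w):      # per-dimension weighting of a distance
    d = np.asarray(a) - np.asarray(b)
    return float(np.sqrt(np.sum(np.asarray(w) * d * d)))

print(city_block([0, 0], [3, 4]), chessboard([0, 0], [3, 4]), minkowski([0, 0], [3, 4], 2.0))
```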
[0241] More specifically, in the step in which natural information is input, the output of, for example, a sensor device with A/D conversion for audio or images is received. Next, in the step of extracting feature quantities from the natural information, feature quantities are extracted by the method best suited to the identifier: for audio, FFT, cepstrum, mel-cepstrum, directional patterns, and the like; for images, delta information of luminance and saturation, contour information, delta information from time-axis differences, and the like.
[0242] Next, a step of evaluating the feature quantities by recognition using Bayes classifiers, HMMs, or distance functions is carried out, and on the basis of the evaluation, a step of selecting the identifier with the highest probability or the shortest distance is carried out. By outputting the selected symbol or identifier as the recognition result, phoneme and phoneme-piece symbols, emotion identifiers, image identifiers, face IDs, recognized characters, environmental-sound IDs, mechanical-sound identifiers, landscape identifiers, musical-scale identifiers, and the like are obtained and used for indexing. These procedures are carried out as steps in which identifiers are evaluated, selected, and output using a plurality of evaluation functions, through an evaluation-function processing step and a step of confirming that the evaluation functions have been exhausted.
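One possible shape for this loop over a plurality of evaluation functions is sketched below; the registry layout mapping identifier types to scoring functions is an assumption for illustration, not the patent's fixed structure.

```python
# Hypothetical registry layout: identifier type -> {identifier: distance function}.
# Each distance function (e.g. the Mahalanobis evaluators sketched earlier)
# returns a smaller value for a more similar feature quantity.
def recognize_all(feature, evaluator_registry):
    results = {}
    for id_type, evaluators in evaluator_registry.items():   # e.g. "phoneme", "emotion"
        results[id_type] = min(evaluators, key=lambda ident: evaluators[ident](feature))
    return results            # e.g. {"phoneme": "a", "emotion": "joy"}

demo_registry = {"phoneme": {"a": lambda f: abs(f - 1.0), "i": lambda f: abs(f - 5.0)}}
print(recognize_all(2.0, demo_registry))   # {'phoneme': 'a'}
```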
[0243] Here, if the processor can handle analog values, analog values may be input directly; if the processor can evaluate analog values, evaluation calculations, matching, and the associated recognition and search processing may be performed on the analog values as they are, or digital values may be converted into analog values for the evaluation calculations.
[0244] For indexing by audio-related identifiers, it is possible to use instrument-type identifiers and feature quantities that measure the distance to populations of sounds collected for each instrument; mechanical-sound-type identifiers and feature quantities that measure the distance to populations of sounds collected for each mechanical sound, such as engine sounds, exhaust sounds, and door sounds; and environmental-sound-type identifiers and feature quantities that measure the distance to populations of sounds collected for each environmental sound, such as the sound of wind, the sound of waves, and the cries of birds and animals.
[0245] For indexing by image-related identifiers, based on the image type, identifiers may be used for indexing by employing: person types based on face type, clothing, and build, and motion types underlying gestures and facial expressions, in order to distinguish the people in a video; landscape types and/or image-position types; character/symbol types obtained by recognizing characters written on signboards or building surfaces; sign types for distinguishing traffic restrictions indicated by road signs; shape types such as cars, ships, desks, and telephones; and graphical-symbol types such as toilets and emergency exits.
[0246] «Example of an information indexing method»
Next, indexing by an apparatus according to the present invention will be described. As an indexing method, one could index by recognition at every search, but since index information, once constructed, can be reused any number of times as long as the content does not change, indexing may be carried out at any timing: when the content is first registered in the storage unit, when it first becomes a search target, or when the frequency of external use of the apparatus itself drops after initial registration. The content information may also be made to appear registered for handling by external apparatuses only after its indexing has finished.
[0247] This indexing is not limited to indexing recorded content by performing recognition-based indexing at an appropriate unit time (for example, every 16 milliseconds) at the time of recording; index information may also be distributed in real time, indexing a live broadcast program simultaneously with its broadcast.
[0248] First, the indexing apparatus according to the present invention executes the audio/video input step (S0201), thereby acquiring content information from outside. The content acquired here is not limited to video and audio as described above; it may be arbitrary content information such as still images, document information, BML, EPG, recognized subtitles, and character strings contained in video.
[0249] Next, the indexing procedure will be described. Content information acquired through the information input unit 30, the communication line unit 50, or a storage unit using an exchangeable storage medium is subjected to the feature-quantity extraction step S0202, in which the feature-quantity extraction unit 116 converts it into numerical data serving as feature quantities.
[0250] For the feature quantities used in this conversion step S0202, extraction methods for moving images, still images, audio, and text have been proposed, as described above in 'Examples of converting natural information into feature quantities', 'Examples of feature quantities and identifiers', and 'Prior art'. In the feature-quantity extraction unit 116, feature quantities are extracted by feature classification and feature extraction methods such as a still-image feature extraction unit and a moving-image feature extraction unit based on visual information; an emotion feature extraction unit, a phoneme feature extraction unit, and a phoneme-piece feature extraction unit based on auditory information; and a program-information extraction unit based on character information.
[0251] More specifically, the feature quantities may be cepstra for speech waveforms, delta signals of luminance or hue for image features, co-occurrence probabilities of characters or words for text, or symbol strings of phonemes or phoneme pieces expanded from EPG or BML data, and any known feature-quantity extraction method may be used.
[0252] Next, the feature quantities are symbolized and assigned identifiers by the feature-quantity/identifier conversion unit 120 in step S0203. By recognizing the feature quantities as described above in 'Example of a method for converting feature quantities into identifier strings', step S0204 is executed, which indexes the content by associating arbitrary feature quantities, or identifiers recognized using them, of the conventional kinds listed in 'Examples of feature quantities and identifiers' and 'Prior art', with the time series of the content information, thereby constructing the index information.
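A minimal sketch of such a time-series indexing step might look as follows, reusing the hypothetical recognize_all helper sketched earlier; the 16-millisecond frame interval follows the example given above for recording-time indexing.

```python
# Minimal sketch of step S0204: associate recognized identifiers with the
# content time series. FRAME_SECONDS and recognize_all (sketched earlier)
# are illustrative assumptions, not the patent's fixed design.
FRAME_SECONDS = 0.016                      # e.g. one recognition per 16 ms frame

def build_index(frame_features, evaluator_registry):
    index = []                             # list of (time in seconds, {type: identifier})
    for i, feature in enumerate(frame_features):
        index.append((i * FRAME_SECONDS, recognize_all(feature, evaluator_registry)))
    return index
```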
[0253] More specifically, the identifiers may be phonemes, phoneme pieces, emotion identifiers, and environment identifiers for speech waveforms; shape identifiers, face identifiers, facial-expression identifiers, character identifiers, object identifiers, and motion identifiers for moving images and still images; and word identifiers or word co-occurrence-state identifiers for text. Together with the process of associating content with such identifiers, advertisements whose feature quantities or identifiers are similar may also be associated.
[0254] The constructed index information is then made available for the user's searches by having the index symbol-string synthesis unit 110 record it in the MPEG information as an additional stream or as an addition or modification to existing MPEG-7 information, by recording the index information as a separate file in the information recording/accumulation unit 22, or by recording it in a dedicated database composed of the information recording/accumulation unit 22 and the information processing unit.
[0255] Through such indexing processing, symbol strings based on several types of feature quantities and identifiers are generated in association with the content and can be organized as 'index co-occurrence information', and content information accompanied by metadata using this 'index co-occurrence information' can be constructed.
[0256] A figure showing the feature-quantity/identifier symbol conversion unit in more detail reveals that a plurality of identifiers and feature quantities are evaluated in association with one another. That is, the co-occurrence information in the present invention recognizes, in association with each other, changes of emotion accompanying audio, changes of sound accompanying changes of video, changes of emotion accompanying changes of video, and changes of subtitles, EPG, BML, RSS, and teletext accompanying changes of video and audio. The content is indexed by phonemes and/or phoneme pieces and/or emotion identifiers, and likewise by identifiers such as other musical scales, environmental sounds, recognized character strings, and image identifiers. The invention is characterized in that this information is constructed on the basis of correlated changes in the content and is used to perform searches and to learn feature quantities extracted from search conditions and search results so as to construct new identifiers.
[0257] In the step of learning co-occurrence states, the co-occurrence states of the various identifiers and feature quantities may be learned while indexing the content, classified autonomously by methods such as quantification analysis type IV, and indexed cluster by cluster; the user may then assign an arbitrary character string or phoneme/phoneme-piece string to each classified cluster and use it for searching.
[0258] «Example of a method for converting identifier strings into feature quantities»
Next, a method for converting the identifiers required for searching and dictionary construction into feature quantities will be described.
[0259] First, a step is carried out in which symbol strings or identifier strings requiring conversion are input by the user or within the apparatus. A target extraction step is then executed that extracts, from that information, the necessary tags and attributes in the case of a markup language, or, in the case of an ordinary input character string, phoneme strings, phoneme-piece strings, or arbitrary identifiers via a conversion dictionary based on the input words.
[0260] Next, if necessary, identifier subdivision processing is executed that converts the obtained identifiers from phonemes into phoneme pieces and from images into image elements. An image element here is a partial element of an image: taking a face image as an example, the face image represents the whole face, whereas face image elements are elements assigned identifiers based on a classification in which arbitrary image tendencies, such as the parts composing the face (eyes, nose, mouth), are separated out as components.
[0261] Next, to convert an identifier into a feature quantity, an identifier-average setting step is carried out using the sample mean of the corresponding identifier, and a feature quantity constituted by that mean value is output. Since the feature quantity given by the mean value converted in accordance with the identifier is always a value representing the centroid of the population, giving it to the identifier evaluation function always yields a distance of zero between the identifier's centroid and the feature quantity, so it is recognized correctly. [0262] This conversion makes it possible to evaluate the distance between a feature quantity Y converted from an arbitrary identifier X and a feature quantity W converted from a different arbitrary identifier V, realizing a distance evaluation that goes beyond mere symbol matching between identifiers. It thus becomes possible to evaluate the distance between identifiers that use the same feature quantities, and to construct a conversion dictionary from identifiers to feature quantities.
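A minimal sketch of such an identifier-to-feature-quantity conversion dictionary, under the assumption that each identifier's classified samples are available, might be:

```python
import numpy as np

# Each identifier maps to the mean vector (population centroid) of its samples,
# so that even different identifiers can be compared by distance.
def build_conversion_dictionary(training_sets):   # training_sets is hypothetical
    return {ident: np.asarray(s).mean(axis=0) for ident, s in training_sets.items()}

def identifier_distance(dictionary, ident_x, ident_v):
    # Distance between identifier X and identifier V via their mean features.
    return float(np.linalg.norm(dictionary[ident_x] - dictionary[ident_v]))

d = build_conversion_dictionary({"maru": [[1.0, 0.0], [1.2, 0.2]],
                                 "batsu": [[0.0, 1.0], [0.2, 1.2]]})
print(identifier_distance(d, "maru", "batsu"))
```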
[0263] For audio, identifiers related to arbitrary sounds, not only speech information tied to language, such as musical scales, environmental sounds, noise, laughter, and emotion features obtained from the voice, can be converted from identifiers into feature quantities by using mean values: for a scale identifier, the mean feature quantity of that scale; for an environmental sound, that of the sound type concerned, such as the sound of waves or of wind; and for an emotion identifier, that of the feature type accompanying the emotion.
[0264] For images, registered identifiers related to arbitrary shapes, not only basic figures such as circles ('maru'), crosses ('batsu'), triangles, and squares, but also road signs, human faces, fingerprint images, landscape images, vehicle types, buildings, and characters, as well as motion identifiers related to movement direction and speed, can be converted from identifiers into feature quantities by using the mean values of the feature quantities used for recognition.
[0265] Then, by selecting an image identifier from a character string such as 'maru' (circle) or 'batsu' (cross) through a conversion dictionary, converting it into the phoneme or phoneme-piece string associated with that image identifier, and turning that into an audio feature quantity, a search can be performed for the places where 'maru' or 'batsu' is uttered; or, from the image feature quantities associated with the image identifiers of 'maru' or 'batsu', a search can be performed for the places where a circle or a cross is displayed. In this way, searches that mutually convert between identifiers serving different purposes become possible.
[0266] When an arbitrary image is the search target, that image may be subdivided and identifiers such as image-element symbol strings or image-fragment symbol strings may be constructed according to the surrounding shapes and hues. An optimal identifier string may be constructed by obtaining identifier transition probabilities for the arrangement of identifiers in accordance with spatial front/back/left/right changes in the image features, or the motion identifiers may be changed into an identifier string with an optimal spatial and time-series arrangement according to the time-series changes of those features, before the feature quantities are constructed.
[0267] «Example of a method for evaluating matches between feature quantities or between identifier strings»
Next, a method for evaluating the matches between feature quantities and between identifiers, as required for searching, will be described.
[0268] First, the use of a distance function is well known as a method for evaluating feature quantities against each other; since feature quantities are generally constituted as vectors, the Euclidean distance between them is measured. More specifically, for a first input vector and a second input vector obtained by the same feature extraction method, the distance is obtained as the accumulation of the squared differences of each element of the feature vectors. Other distance functions are described separately, but in this way the inter-vector distance can be measured by giving the distance function two vectors of the same dimensionality obtained by the same feature extraction method.
[0269] In general, when measuring the distance between a feature quantity and an identifier, a standard pattern is used in which the mean vector of the feature quantities classified into the same population serves as the evaluation reference. The method of evaluating the distance to the population centroid by measuring the distance between the input feature vector under evaluation and the standard pattern serving as the evaluation reference is generally well known. Any method may be used depending on the implementation, and membership in the population may also be judged by establishing a 3σ boundary, a statistical test boundary, or an empirically determined boundary from the mean and variance of the population.
[0270] Thus, while the distance between feature quantities can easily be obtained by any known method, it cannot naively be used to judge whether the identifiers associated with the feature quantities match, so the user needs to set an arbitrary threshold: for example, if the input feature quantity under evaluation deviates from the mean feature quantity of the samples classified into the same population by more than 3σ of the standard deviation, it is judged a mismatch, and if by less, a match. In this way the match or mismatch of feature quantities, and of the identifiers accompanying them, can be determined, and the match or similarity between the 'index co-occurrence information' and the 'search condition co-occurrence information' can also be evaluated.
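One possible reading of this 3σ rule is sketched below; since the exact threshold is explicitly left to the user, the per-dimension test here is an illustrative assumption.

```python
import numpy as np

def matches_within_3_sigma(feature, samples):
    # Judge a match if the feature deviates from the population mean by no
    # more than 3 standard deviations in every dimension.
    samples = np.asarray(samples)
    mean = samples.mean(axis=0)
    sigma = samples.std(axis=0) + 1e-9        # floor avoids zero-variance dimensions
    return bool(np.all(np.abs(np.asarray(feature) - mean) <= 3.0 * sigma))
```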
[0271] Next, DP matching and the like are well known as methods for evaluating the match or mismatch between identifier strings; the correct identifier string can be selected from identifier strings of arbitrary length, made up of any combination of identifiers, by comparing distances or probabilities. More specifically, 'a,a,a,a,b,b,b,b' and 'a,a,a,a,a,a,b,b' match 100% in the symbols that appear and their order, while 'a,a,a,a,b,b,b,b' and 'a,a,a,c,c,b,b,b' are evaluated as a 75% match. For the match evaluation of identifier strings, arbitrary matching functions such as CDP, Shift-CDP, mp-CDP, RIF-CDP, and Self-applicative CDP may be used in the implementation as needed.
[0272] With DP matching (dynamic programming), the similarity can be computed efficiently while establishing correspondences (alignment) between the elements of two symbol strings, so the match rate between the searched symbol string and the search-request symbol string can be expressed as a percentage.
[0273] Here, for identifier strings consisting of a plurality of frames, the evaluation result is formed as '0' if the identifiers of a frame match and '1' if they do not, and the accumulation of the evaluation results over the frames is generated. If all frames match, the accumulated value is '0' and the mismatch degree is 0%; if all frames mismatch, the accumulated value equals the number of frames and the mismatch degree can be evaluated as 100%.
[0274] Since sample frame lengths generally vary, the difference in length can be corrected by dividing the accumulated distance resulting from DP matching by the sum of the frame counts of both strings. By sequentially matching and evaluating the sample against standard templates corresponding to arbitrary identifier types, the identifier whose matching-function result gives the smallest distance (the smallest accumulated distance), that is, the highest match rate, can be output as the recognition result.
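A minimal sketch of both evaluations follows: the per-frame mismatch accumulation of paragraph [0273] and a DP alignment whose accumulated distance is normalized by the sum of both frame counts as described here, using the example strings from paragraph [0271]. The unit insert/delete/substitute costs are an illustrative assumption.

```python
def frame_mismatch_rate(seq_a, seq_b):
    # Paragraph [0273]: 0 per matching frame, 1 per mismatching frame,
    # accumulated over equal-length identifier strings.
    assert len(seq_a) == len(seq_b)
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)

def dp_distance(seq_a, seq_b):
    # Paragraph [0274]: DP alignment with unit insert/delete/substitute costs,
    # normalized by the sum of both frame counts to correct length differences.
    n, m = len(seq_a), len(seq_b)
    d = [[i + j if i * j == 0 else 0 for j in range(m + 1)] for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[n][m] / (n + m)

a = "a,a,a,a,b,b,b,b".split(",")
b = "a,a,a,c,c,b,b,b".split(",")
print(1 - frame_mismatch_rate(a, b))   # 0.75: the 75% match of the example above
print(dp_distance(a, b))               # 0.125: two substitutions over 8 + 8 frames
```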
[0275] Here, when the identifiers output frame by frame along the time axis are the same in succession, indexing may be performed by detecting that the identifier has changed between frames in the time series and consolidating the consecutive identifiers; the number of consecutive frames may be used to weight the match evaluation, with identifiers judged to match if the difference in weight between the same identifiers is small; a match-degree evaluation function may be constructed using the transitions of a plurality of identifier distances in the time series, taking the distance from the population centroid of the time-series identifiers as a feature quantity; identifier information of one frame per 120 seconds may be reduced to one frame per 20 seconds or, conversely, increased to one frame per 240 seconds; matching may be performed using feature-quantity means and variances, or hash values of designation character strings, phoneme strings, or phoneme-piece strings, or DP between such hash values; and the distances or distance means output from the distance evaluation function over consecutive intervals of an identifier may be used to evaluate the identifier's boundaries.
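The consolidation of consecutive identical frame identifiers mentioned above could be sketched as follows, with the run length available as a weight:

```python
from itertools import groupby

def consolidate(frame_labels):
    # Consolidate consecutive identical frame identifiers into
    # (identifier, run_length) pairs; run_length can serve as a weight.
    return [(ident, len(list(g))) for ident, g in groupby(frame_labels)]

print(consolidate(["a", "a", "a", "b", "b", "a"]))  # [('a', 3), ('b', 2), ('a', 1)]
```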
[0276] That a correct identifier string is selected means that the distance between the correct identifier string X and the identifier string V being compared becomes minimal, or the probability becomes maximal. For a sample identifier string V known in advance to be the specific string X, match evaluation is performed against identifier strings X, Y, and Z with the matching function; if the identifier string selected as the best match is X, the recognition is judged successful.
[0277] Further, if the method described above, of converting identifiers into feature quantities and evaluating them by distance, is used to judge whether identifiers match, the distance between apparently different identifiers can also be evaluated, and a search can be realized by evaluating the distance between identifiers converted into continuous feature quantities through the accumulation of distances. If the distance between feature quantities is small, a value close to '0' signifies a match; conversely, a large number signifies a mismatch. Normalization becomes possible by dividing by the number of consecutive frames, enabling quantification. Of course, the evaluation method can be changed according to the implementation, for example by judging a match if the value is within 3σ of the sample mean and variance, by taking the reciprocal in the calculation, or by using a match evaluation method with the inverted logical structure in which a match yields '1'.
[0278] Generally well-known methods include DP and CDP, as well as search methods and match evaluation methods specialized for voice, music, and video. Various application examples and patent applications exist for these methods, and any of them can be selected depending on the implementation.
[0279] The time-series changes of the identifiers are then output by a match-degree evaluation procedure such as DP or CDP; using the obtained evaluation values, the degrees of match may be displayed on the screen, ranked, and presented as a list, or announced by speech synthesis.
[0280] «Example of an information search method»
Next, searching by an apparatus according to the present invention will be described.
[0281] The search apparatus according to the present invention assumes that the various contents have been indexed as described above. This indexing may be of real-time distribution information, such as a recorded television broadcast program indexed at an appropriate unit time (for example, every 16 milliseconds) at the time of recording; it may be condensed so that only the points where changes occur between frames are recorded; the index information may be distributed via EPG, BML, RSS, teletext, or the like, or written alongside and associated with DVD files; and for a text file, index information may be constructed for each word, sentence, section, or chapter. The indexed information is then searched by converting the user's input into identifiers matching those used for the indexing.
[0282] Next, the search apparatus executes the voice/character-string input step and specifies search conditions for the indexed content. The specification of search conditions falls broadly into those by voice, those by character string, and those by moving or still image. For voice search, there are: a method of recognizing phonemes, phoneme pieces, and emotion identifiers from the user's utterance or the audio used for the search and executing the search directly with the phoneme or phoneme-piece strings; a method of referring to the identifier conversion dictionary with the recognized phonemes and phoneme pieces and including other feature quantities and identifiers associated with the phoneme or phoneme-piece strings in the search conditions; and a method of referring to the command dictionary on the basis of the recognized phonemes and phoneme pieces and searching with other feature quantities and identifiers associated with the phoneme or phoneme-piece strings after removing the detected commands. Processing that takes the user's emotion into account may also be performed on the basis of the recognized emotion identifiers.
[0283] Search by a search character string comprises: a method of executing the search directly from the search character string; a method of referring to the identifier conversion dictionary with the search character string and including other feature quantities and identifiers associated with it in the search conditions; and a method of referring to the command dictionary on the basis of the search character string and searching with other feature quantities and identifiers associated with the search character string after removing the detected commands. The search character string may also be converted into phoneme or phoneme-piece strings using the identifier conversion dictionary before searching, and processing that takes the user's emotion into account may be performed on the basis of recognized emotion identifiers.
[0284] Search by moving or still images comprises: a method of recognizing the image identifiers and motion identifiers used for the search from video captured by the user or from moving or still images, and executing the search directly with those identifiers; a method of referring to the identifier conversion dictionary with the recognized image and motion identifiers and including other feature quantities and identifiers associated with them in the search conditions; and a method of referring to the command dictionary on the basis of the recognized image and motion identifiers and searching with other feature quantities and identifiers associated with them after removing the detected commands. Recognized character strings, image-related identifiers, and motion identifiers may also be converted into phoneme or phoneme-piece strings using the identifier conversion dictionary before searching, and processing that takes the user's emotion into account may be performed on the basis of recognized emotion identifiers.
[0285] What these methods of constructing search conditions have in common is that information not yet symbolized or converted into identifiers is first symbolized and converted into identifiers, then converted via the identifier conversion dictionary into other associated identifiers, and added to the search conditions. If necessary, an identifier obtained from the conversion dictionary can be converted into the mean feature quantity of that identifier and used for a search based on feature quantities. For example, by presenting a face image of Taro and performing a voice search based on the recognized name, one can find scenes where Taro is called by someone; by adding the condition that the voice calling Taro has Hanako's voice quality, one can find scenes where Hanako is calling Taro. For the conversion methods using dictionaries, refer to the sections 'Example of dictionary construction', 'Identifier-to-feature-quantity conversion', and 'Feature-quantity-to-identifier conversion' above. The search conditions acquired here are information input at the user's instruction, and feature quantities and identifiers may be constructed using not only video and audio but also information such as still images, document information, EPG, BML, RSS, and teletext.
[0286] Next, the search procedure will be described. First, step S1001 is executed, which inputs search conditions suited to the search: a search identifier string or character string acquired through the information input unit 30, the communication line unit 50, or a storage unit using an exchangeable storage medium is converted, by reference to the dictionary extraction unit, into an identifier string usable for the search, or converted into feature quantities in accordance with 'Example of a method for converting identifier strings into feature quantities' above.
[0287] For search conditions given as natural information, such as an utterance or a sample search image, features are extracted, or identifiers are recognized from the extracted feature quantities. By composing information usable for the search in step S1001, identifiers and feature quantities for the user-specified search conditions are selected on the basis of the same indices as the content-information index, and the query generation step S1002, which constructs the search conditions, is executed. Here, the various identifiers and feature quantities available for searching may be combined and converted into a search condition consisting only of an ordinary character string, with conditioning applied.
[0288] More specifically, for voice, after the voice information from an utterance or an audio file has been converted into or recognized as a phoneme or phoneme-piece string, the phoneme/phoneme-piece command conversion dictionary is consulted to extract and delete the utterance portion corresponding to a command from the search conditions, and the remaining phoneme or phoneme-piece string is used for the search. For video, an image designated from a camera or a file is converted into or recognized as image identifiers and image feature quantities and then used as search-condition information. For sentences and words, the remainder after extracting control-command words is converted into phonemes or image identifiers for searching. In this way, 'search-condition co-occurrence information' is constructed as a search condition combining different kinds of information, such as visual and auditory information, and given to the search apparatus.
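A minimal sketch of extracting a command portion from a recognized phoneme string via a command dictionary might look as follows; the dictionary contents and the romanized phoneme notation are illustrative assumptions, not data from the disclosure.

```python
# Hypothetical command dictionary: recognized phoneme prefix -> command name.
COMMAND_DICTIONARY = {
    "o n s e i k e n s a k u": "voice_search",   # "onsei kensaku" (voice search)
}

def split_command(phoneme_string):
    for phonemes, command in COMMAND_DICTIONARY.items():
        if phoneme_string.startswith(phonemes):
            query = phoneme_string[len(phonemes):].strip()
            return command, query        # remaining phonemes become the query
    return None, phoneme_string          # no command found; whole input is the query

# "bakuhatsuon" (explosion sound) remains as the search query.
print(split_command("o n s e i k e n s a k u b a k u h a t s u o n"))
```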
[0289] For a character-string search condition, when the apparatus is instructed with the character string 'sea image search', containing the string 'sea' and the command string 'image search', the 'search-condition co-occurrence information' may be constructed by: building a search condition from the co-occurrence information of color features and motion features, using the image feature quantities associated with the string 'sea' after excluding the command string; building a search condition with an evaluation function composed of the co-occurrence information of color identifiers and motion identifiers in order to detect 'sea'; or, if indexing has been performed with a 'sea' evaluation function, building the search condition by conversion into the 'sea' identifier.
[0290] For a voice search condition, when the user speaks 'voice search, I shall return, explosion sound', the phoneme/phoneme-piece string of the 'I shall return, explosion sound' portion, after excluding the uttered phoneme string 'voice search' registered in the command dictionary, can be used with conventional methods to detect and search for utterance locations accompanied by an explosion sound in the content. It is likewise possible to detect and search for lines co-occurring with the emotion of sadness, such as 'I will not die!' or 'to be or not to be'; and for a serial drama broadcast weekly, if the changes in musical scale show a tendency toward specific co-occurrence information, a comparison with the theme song may be made and a high degree of match evaluated as a highlight scene.
[0291] By using the information employed in such search conditions and the combinations of search conditions used at the same time, 'search-condition co-occurrence information' can be constructed and used as a search condition for evaluating matches and similarity with the 'index co-occurrence information'; such 'search-condition co-occurrence information' can also be collected from a plurality of users, and evaluation functions can be constructed using the collected 'search-condition co-occurrence information'.
[0292] Next, the index information is read from the information recording/accumulation unit of the storage unit, and the search step (S1003) is executed: the read index information and the search-condition information are evaluated with DP, distance functions, and the like, and a search in accordance with 'Example of a method for evaluating matches between feature quantities or between identifier strings' is performed on the basis of the stored index information, at the points where the degree of match is high, in order to select content and positions within the content.
[0293] Then, for each content item, frame locations highly similar to the search conditions and index locations highly similar to the search conditions are detected for each identifier and feature quantity, and the ranking step (S1004) is executed, which ranks the search results based on the search-result evaluation, ordering the positions within the content where a plurality of identifiers and feature quantities show high similarity according to condition settings based on sums or logical expressions in the search conditions. The similarity may be evaluated by combining the similarity evaluation methods described above, such as the match degree by DP, distance evaluation methods, and probability evaluation methods.
[0294] Conceivable forms of this evaluation include: an unranked evaluation list with no particular index; an evaluation list ranked simply by the maximum or minimum of the sum of the evaluation distances or evaluation probabilities of each identifier; an evaluation list ranked by values narrowed down and selected on the basis of logical expressions such as OR and AND expressions; and an evaluation list ranked by values computed according to logical expressions. As an example of an evaluation list based on values computed according to a logical expression, the condition '(blue or green) and video with large motion' can be expressed by the following function:
A = ((b − B) + (g − G)) × (m − M)
where:
A: in-screen feature score
b: blue feature
B: blue feature average
g: green feature
G: green feature average
m: motion feature
M: motion feature average
[0295] By expressing the co-occurrence state of in-screen image features as a formula in this way, logical structures such as AND, OR, exclusive OR, and NOT can be replaced with arithmetic: AND as multiplication, OR as addition, exclusive OR as taking the larger value, exclusive AND as taking the smaller value, and NOT as multiplication by −1. Search results can thus be ranked and presented by arithmetically evaluating the co-occurrence state: evaluating feature quantities through such formulas, obtaining and evaluating the Mahalanobis distance based on the covariance matrix of the individual feature quantities, or evaluating similarity by co-occurrence probabilities and distance functions.
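Under this encoding, the example condition '(blue or green) and large motion' reduces to ordinary arithmetic. A minimal sketch follows, with the archive-wide averages assumed to be precomputed elsewhere:

```python
# "(blue or green) and large motion": OR -> addition, AND -> multiplication.
# b, g, m are per-frame features; B, G, M are archive-wide averages.
def in_screen_score(b, g, m, B, G, M):
    return ((b - B) + (g - G)) * (m - M)   # A = ((b-B)+(g-G)) x (m-M)

# Example: a frame bluer and more active than average scores high.
print(in_screen_score(b=0.40, g=0.10, m=0.70, B=0.20, G=0.15, M=0.30))
```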
[0296] A distance evaluation function may also be constructed using the co-occurrence state as a co-occurrence matrix or covariance matrix; a probability function may be constructed based on co-occurrence probabilities; and a plurality of pieces of co-occurrence information may be combined, enabling searches that evaluate similarity based on co-occurrence information. Since similarity is considered high when a distance value is small or when a probability value is large, ranking according to a plurality of identifiers and feature quantities can be realized as the evaluation of the search results.
[0297] The blue feature in this example is the frequency of appearance, over the whole screen, of pixels whose hue falls within ±15 degrees of blue; the blue feature average may be taken as the average of the blue feature over the whole content archive, and the same applies to green and red. Since this is implementation-dependent, any method may be used. An intuitive example of the associations used in co-occurrence dictionaries of words and feature quantities, or word/feature-quantity conversion dictionaries, obtained perceptually from the words a user inputs and from image tendencies, is classifying images by representative color features associated with sensibility, following the seasonal frequency of colors in nature: pale green and cherry-blossom pink for spring, deep green and blue for summer, yellow and orange for autumn, and white and gray for winter. Combinations of feature quantities along such lines are conceivable.
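A minimal sketch of such a hue-band frequency, assuming blue sits at 240 degrees on the hue circle and pixels are given as RGB triples in the range 0 to 1 (both illustrative assumptions):

```python
import colorsys

def hue_band_frequency(rgb_pixels, center_deg=240.0, width_deg=15.0):
    # Fraction of pixels whose hue lies within +/- width_deg of center_deg.
    count = total = 0
    for r, g, b in rgb_pixels:
        h, _, _ = colorsys.rgb_to_hsv(r, g, b)      # h in 0..1
        diff = abs(h * 360.0 - center_deg)
        if min(diff, 360.0 - diff) <= width_deg:    # wrap-around on the hue circle
            count += 1
        total += 1
    return count / total if total else 0.0

print(hue_band_frequency([(0.1, 0.2, 0.9), (0.9, 0.1, 0.1)]))  # -> 0.5
```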
[0298] The motion feature may be a feature based on the time-axis delta of the video, or the magnitude of the motion vectors used in MPEG-4 and the like; it may be a feature based on image-change information occurring at an arbitrary time interval, such as ±15 frames from the current frame, together with the average of those features over the content archive. These may be normalized or corrected in any way, and since the construction of evaluation expressions based on these feature quantities is implementation-dependent, any combination may be used.
[0299] In this case, not only color but also image recognition and speech recognition technologies may be combined, and arbitrary evaluation functions may be constructed by combining the obtained face IDs, motion IDs, image IDs, and identifiers based on phonemes or phoneme pieces. Distance evaluation between identifiers is possible using the aforementioned DP matching and the like, distance evaluation between feature quantities is possible with an arbitrary distance function, and similarity evaluation between identifiers and feature quantities is possible with HMMs, distance functions, and so on, as detailed in the earlier descriptions of identifiers and feature quantities and their mutual conversion methods. Of course, performance can also be improved by performing efficient classification through combinations of various evaluation methods such as multi-layer Bayes and neural networks.
[0300] Next, an evaluation result listing step (S1005) is executed, which lists the search results obtained here in descending order of similarity, presents them to the user, and allows the user to view the similarity values as a ranking index. After the list of search results is output to the output unit and displayed on screen, or transmitted to the user terminal via the communication line unit and presented to the user, a user processing continuation confirmation step (S1006) is executed to evaluate whether the user has requested a search again.
[0301] In this way, a composite search using co-occurrence information, based on the evaluation of matches or similarity between the "index co-occurrence information" and the "search condition co-occurrence information" of the present invention, is carried out to obtain search results. At this time, the evaluation function used for the search may be constructed from co-occurrence probabilities or co-occurrence matrices that combine the "index co-occurrence information" and the "search condition co-occurrence information" with co-occurrence information based on the individual search results or on the neighboring feature quantities obtained as search results.
[0302] Learning based on the co-occurrence state in which an input character string is converted into arbitrary identifiers or feature quantities and a search is executed may learn identifiers by using the "example of identifier reconstruction" or the search results, or by using the co-occurrence information of search results and auxiliary information through association with EPG, RSS, HTML, XML, BML, or teletext as auxiliary information. A service that takes an arbitrary configuration in server-client form and executes searches by selectively using arbitrary identifiers and feature quantities may also be realized.
[0303] A character string for search may also be acquired from the broadcast receiving unit, from the information line unit connected to the Internet, or from recorded information in the storage unit, by arbitrary means such as XML, HTML, MPEG-7, RSS, teletext, BML, or EPG, and the search may be carried out by converting those character strings into the feature quantities or identifier strings that serve as search indices. This may likewise be realized as a service that takes an arbitrary configuration in server-client form and executes searches by selectively using arbitrary identifiers and feature quantities; search conditions can thus be generated from search character strings.
[0304] Search by character string is carried out by selecting and using the identifiers associated with an arbitrary character string, or the feature quantities of those identifiers, by means of the character-string-to-identifier conversion dictionary and the identifier-to-feature conversion dictionary associated with each feature extraction method; new identifiers constructed in the "example of identifier reconstruction" described later may also be used. For example, a performer's name may be converted into a phoneme sequence or phoneme-piece sequence to search content. Alternatively, from the word "action movie", the appearance frequency of explosion sounds within content classified as action movies may be obtained, the average of the explosion-sound appearance frequency over multiple action movies may be computed to construct an action movie evaluation function, and indexing may be performed with that function so that content search based on the action movie function can be carried out.
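A hedged sketch of the "action movie evaluation function" example, with invented identifier labels and running times; it averages the explosion-sound rate over known action titles and scores new content by closeness to that average:

def explosion_rate(identifiers, duration_min):
    return identifiers.count("explosion") / duration_min

# Illustrative archive: (environmental-sound identifiers, running time in minutes)
action_titles = [(["explosion"] * 42 + ["speech"] * 300, 110),
                 (["explosion"] * 55 + ["speech"] * 280, 125)]
genre_avg = sum(explosion_rate(ids, t) for ids, t in action_titles) / len(action_titles)

def action_movie_score(identifiers, duration_min):
    # 1.0 at the genre average, decaying as the rate departs from it
    return 1.0 / (1.0 + abs(explosion_rate(identifiers, duration_min) - genre_avg))

print(action_movie_score(["explosion"] * 40 + ["speech"] * 320, 105))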
[0305] There is also a step of obtaining the co-occurrence state of arbitrary feature quantities and identifiers based on the search results. This co-occurrence state can be constructed using co-occurrence probabilities, co-occurrence matrices, and covariance matrices; for example, by conditioning on co-occurrence information within the top ten results having a match rate of 70% or more under certain conditions, co-occurrence states can be selected and used for learning. When co-occurrence information constructed in this way is viewed repeatedly by a user, or is used repeatedly from outside through the information sharing method described later, that co-occurrence information is judged to have high utility value. By assigning a specific identifier to frequently used co-occurrence information, an evaluation function based on the co-occurrence state can be constructed, and the co-occurrence information and evaluation function of the new identifier and feature quantities are recorded in the co-occurrence learning storage unit and the evaluation function storage unit.
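A small sketch of the selection rule just mentioned (top ten results, match rate of at least 70%); the record layout is invented for illustration:

results = [{"rank": r, "match": m, "cooc": c}
           for r, (m, c) in enumerate(
               [(0.93, {"scream", "blood"}),
                (0.71, {"scream"}),
                (0.40, {"music"})], 1)]

# Keep co-occurrence samples only from qualifying results
training = [r["cooc"] for r in results if r["rank"] <= 10 and r["match"] >= 0.70]
print(training)  # only these co-occurrence sets are used for learning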
[0306] With a function of the co-occurrence state of the aforementioned "blue feature and green feature", videos with tendencies such as "forest and blue sky" or "sea and coast" can be obtained as search results. Taking large motion into account, searches for videos with large motion, such as "a forest with trees swaying strongly", "a forest and blue sky with fast-moving clouds", or "a coastline with a forest visible over rough waves", become possible by combining these with identifiers that use motion features such as those of MPEG-4.
[0307] When learning is performed on these search results based on the information the user selects, a bias toward "sea" in the selections produces a corresponding bias toward the "sea" feature quantities; exploiting this, the discriminant function can be reconfigured as in the "example of identifier reconstruction" and reflected in the learned co-occurrence information. For a video in which the horizon lies at the center, blue colors increase in the lower half of the image together with wave motion, so an evaluation function based on the video features of "sea" and "coast" can be constructed.

[0308] At this time, by sorting the search results based on items the user did not select or on user deletion instructions carrying a negative connotation toward the results, a discriminant function for the group of information to be excluded from the search target can be newly constructed. Excluded results can be removed from the earlier target search results, identifiers can be sought that have a high co-occurrence probability with another identifier despite a low co-occurrence probability under a given target identifier or condition, and by adding or deleting identifiers and feature quantities in the search conditions, unnecessary items can be removed from the search results and search results can be presented more efficiently.
[0309] A user interface for evaluating search results may also be provided so that the user can improve performance. Search efficiency may be improved by combining character-string search with content attributes such as content title, genre, and director; arbitrary names may be given to co-occurrence states based on search conditions, identifiers, and feature quantities so that they can be used for repeated search, detection, and instruction; and those search conditions and search expressions may be exchanged or distributed via a communication line.
[0310] As examples of using EPG, BML, RSS, and teletext: the aforementioned program genre discriminant function may be constructed from the program genre extracted from broadcast auxiliary information such as EPG, BML, RSS, and teletext, together with the feature quantities extracted from the video and audio within the program, the appearance frequency of uttered phoneme sequences, and the appearance frequency of environmental sound identifiers. Performer names may be associated with face IDs obtained by face recognition, and a co-occurrence matrix may be created by associating performer names with face IDs recognized together across different programs, so as to construct an evaluation function that detects a specific performer. An evaluation function associating a performer's name with a face image may be constructed by associating frequently appearing face IDs with the order of entries in the performer lists of EPG, BML, RSS, or teletext. Names based on the phoneme sequences or phoneme-piece sequences uttered by people may be detected from EPG, BML, RSS, or teletext to perform recording or skip playback. Arbitrary markup language information such as HTML, XML, RSS, or BML may be converted into the aforementioned identifiers, such as phonemes, environmental sound identifiers, and image identifiers, using the phoneme symbol conversion dictionary in the dictionary information storage unit of the storage unit, and arbitrary processing accompanying search and detection may be carried out. The usage situation of these may also be recorded, and identifiers may be re-learned using the co-occurrence information of frequently used search conditions according to the recorded results.
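A sketch of the face-ID-to-performer association described above, assuming face IDs from a recognizer and cast lists from the EPG; all names and IDs below are illustrative:

from collections import Counter

# (face IDs recognized in the program, cast list from the EPG)
programs = [({"face_01", "face_07"}, ["Yamada", "Suzuki"]),
            ({"face_01"},            ["Yamada"]),
            ({"face_07", "face_03"}, ["Suzuki", "Tanaka"])]

cooc = Counter()
for face_ids, cast in programs:
    for f in face_ids:
        for name in cast:
            cooc[(f, name)] += 1

# face_01 pairs with "Yamada" in every program it appears in, so that pair
# dominates the counts and can seed a performer-detection evaluation function.
print(cooc.most_common(3))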
[0311] The discriminant functions, search results, and co-occurrence state information of the search results constructed in this way may be reused using technologies such as P2P software so that other devices can browse and acquire them via the communication line unit, as in the "example of an information sharing procedure between users", or published on an arbitrary site using CGI or arbitrary Web technologies, so that arbitrary users may use them with billing, or they may be sold on storage media.
[0312] At this time, the usage fee may be varied according to the precision of the information used, the granularity of its content, the processing speed, the number of uses, the usage time, and so on; a charge may be made for the act of using search results obtained through the present invention, or the amount may be changed; and the information may be encrypted to protect its value.
[0313] Frequently reused co-occurrence state information, evaluation functions, and evaluation parameters may be stored in the storage unit of the device itself, or acquired from outside via the communication line unit as necessary, and meta-information generated using the acquired evaluation functions and identifiers may be presented to other users or sold.
[0314] Since a search takes some amount of time, advertisements that can be judged to have high similarity to the content the user habitually uses may be presented during the search, during list creation, or during search condition input, based on general advertisements, on combinations of feature quantities and identifiers the user frequently uses day to day, or on combinations of identifiers and feature quantities associated with the search keywords.
[0315] «Example of arbitrary processing accompanying identifier detection»
Next, arbitrary processing accompanying detection by the device according to the present invention will be described.
[0316] First, the user inputs a detection condition that triggers arbitrary processing, in the same manner as a search condition.
The input may be speech, video information, a character string, an identifier obtained by the present invention, or a combination of these. In accordance with this input, the present invention executes a step of constructing a co-occurrence state from combinations of feature quantities and identifiers and setting the detection condition, using the same procedure as for search.
[0317] Next, while programs obtained from broadcast waves, networks, or imaging devices are acquired based on the configured detection condition, the information is recorded in the storage unit within the device while being indexed with feature quantities and with identifiers based on those feature quantities. The indexed recorded information is then compared with the co-occurrence information of the detection condition at the same time as recording, and the degree of match is evaluated. For this evaluation, any of the aforementioned methods that evaluate the distance between identifiers, the degree of match between identifier strings, or the distance between feature quantities may be used, such as Bayes, HMM, Mahalanobis distance, Euclidean distance, or DP.
[0318] As a result of this evaluation, the registered arbitrary processing is executed on the condition that the distance from the centroid for a specific identifier, identifier string, or feature quantity based on the detection condition falls within 1σ, that the probability of being a specific identifier, identifier string, or feature quantity is 60%, or that the degree of match between identifier strings exceeds 60%. The value of 60% derives from the fact that, in phoneme recognition, emotion recognition, and image recognition, practical application can generally be considered when the recognition result is 60% or higher; it may be changed to an arbitrary rate depending on the user environment. If the recognition rate stays below 20% continuously, processing such as stopping the current process or setting a flag indicating that the material is a target for fast-forwarding or deletion may be performed.
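A sketch of the trigger logic with the thresholds named in this paragraph (1σ for distance, 60% for probability or identifier-string match, and a sustained sub-20% rate for stopping or flagging); the function names are invented:

def should_trigger(distance_sigma=None, probability=None, string_match=None):
    # any one satisfied condition starts the registered arbitrary processing
    if distance_sigma is not None and distance_sigma <= 1.0:
        return True
    if probability is not None and probability >= 0.60:
        return True
    if string_match is not None and string_match > 0.60:
        return True
    return False

def should_abort(recent_rates, floor=0.20):
    # e.g. stop processing, or flag for fast-forward/deletion, when the
    # recognition rate remains continuously below the floor
    return len(recent_rates) > 0 and max(recent_rates) < floor

print(should_trigger(probability=0.64))   # True: run the registered process
print(should_abort([0.12, 0.08, 0.15]))   # True: stop or flag the material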
[0319] Rather than only detecting scenes of interest to the user and executing processing, the detection function based on co-occurrence information may also be used to avoid scenes unpleasant for the user: by detecting the feature quantities present when internal organs or blood are displayed on screen together with screams, violent scenes in horror movies or scenes offensive to public order and morals can be fast-forwarded, or processing such as a mosaic can be applied to the video.
[0320] In this way, information acquired from broadcasting stations, networks, and imaging devices is recognized, whether or not the content information is what the user intends is detected, and by performing device control from the control unit in accordance with the detection, it becomes possible to record, play back, fast-forward, search, notify the user on another terminal, display a notification on the screen being viewed, move the device, play an announcement, deliver e-mail, generate RSS, and perform bookmarking.
[0321] Next, although described in more detail later, this will be explained as a product application example. First, the user inputs a search condition. From the input search condition, a cast name associated with the information the user entered is obtained by referring to EPG, BML, RSS, or teletext, and a phoneme or phoneme-piece search is executed by the method described above. As a result, while constant recording is carried out, content may be saved going back one hour into the past from the point where the cast name is uttered; the range subject to deletion may be determined per program, per commercial break, or per change in the screen feature quantities according to EPG, BML, RSS, or teletext; or a boundary may be placed within the content at each such change and used as an index for the user to give instructions.

[0322] In this way, by performing detection using the co-occurrence states of various identifiers in content, it becomes possible to construct a designated range from multiple detection points in the content and classify it into material to be saved and material to be deleted, to save video and audio information going back from the point of detection, to carry out skip playback of disliked scenes learned from co-occurrence information according to the user's designation, or to rewind a detected point by several seconds and play it back. This example is also explained in more detail in the later product applications.
[0323] Using the co-occurrence state detection technology according to the present invention, advertisements for works involving actors or directors obtained from EPG, BML, RSS, teletext, or MPEG-7 may be presented, advertisements may be shown during skip playback, or advertisements may be replaced, under arbitrary co-occurrence state detection conditions, with new ones or with ones appropriate to the season or the time of day.
[0324] «Example of identifier learning based on search, detection, and indexing»
Next, identifier learning based on search, detection, and indexing will be described.
[0325] Through the aforementioned device configuration and the step of learning based on the co-occurrence states in indices, search results, and search conditions, the "co-occurrence information based on search results", the "co-occurrence information extracted by indexing", and the "co-occurrence information based on user-specified detection conditions and/or search conditions", which are co-occurrence states of arbitrary identifiers and/or arbitrary feature quantities including those of the "example of identifier reconstruction", are obtained as co-occurrence matrices or covariance matrices. These are organized by methods such as probability evaluation functions based on co-occurrence probabilities, distance evaluation functions based on eigenvalues and eigenvectors, learning by HMMs, and classification and evaluation-function construction by multivariate analysis, and by defining an ID or a user-specified character string for the result, a new identifier can be learned.
[0326] First, when indexing is being carried out, a step of collecting the feature quantities and/or identifiers recorded in temporally adjacent entries of the dedicated index database, of index files, or of the index/attribute areas of content files, and a step of constructing co-occurrence probabilities, co-occurrence matrices, and covariance matrices based on the co-occurrence states of the collected feature quantities and identifiers, are executed. What counts as adjacent frames can be specified arbitrarily according to the implementation, as defined by the user: if fine granularity is needed, a single video frame of about 16 ms may serve as the unit; conversely, a time unit of 3 seconds (180 frames) may be used as the division, or the section running up to a frame in which a statistically distant feature is detected may be used. Co-occurrence information is constructed from the information acquired in this step, and after learning with HMMs or covariance matrices, or constructing distance functions, the evaluation functions obtained by the learning or the distance functions are stored in the co-occurrence learning storage unit and the evaluation function storage unit.
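A sketch of constructing an identifier co-occurrence matrix from temporally adjacent index entries, using the 3-second (180-frame) window mentioned above; the labels and per-frame data are illustrative:

import numpy as np
from itertools import combinations

ids = ["scream", "blood", "music", "speech"]
idx = {name: i for i, name in enumerate(ids)}
frames = [["speech"], ["scream", "blood"], ["scream"], ["music"]] * 45  # per-frame labels

window, C = 180, np.zeros((len(ids), len(ids)))
for start in range(0, len(frames), window):
    # identifiers present anywhere within this window co-occur
    present = {idx[label] for f in frames[start:start + window] for label in f}
    for a, b in combinations(sorted(present), 2):
        C[a, b] += 1
        C[b, a] += 1

print(C)  # row-normalizing C yields empirical co-occurrence probabilities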
[0327] When search results, search conditions, or detection conditions are used, a step is executed of collecting, as samples, the co-occurrence information of the identifiers and feature quantities in the content the user selected, with respect to the information presented as search results and the information specified as search or detection conditions. Co-occurrence information of identifiers and feature quantities is then acquired from the samples obtained by this collection. Many combinations of co-occurrence information are conceivable, as described elsewhere and in the later "embodiment of indexing, search, and arbitrary processing with multiple identifiers and multiple search conditions". Learning by Bayes or HMM is then performed based on the co-occurrence state, and the learning parameters and distance functions obtained as learning results are stored in the co-occurrence learning storage unit and the evaluation function storage unit of the storage unit. Similarly, for search and detection conditions, learning samples can be obtained by collecting the way those conditions are specified, and an evaluation function can be constructed from the learning samples.
[0328] At this time, arbitrary learning algorithms such as neural networks, fuzzy logic, genetic algorithms, chaos, and fractals may be combined; for co-occurrence information that is reused, the co-occurrence information among pieces of co-occurrence information may be used recursively to construct evaluation functions; each element of the co-occurrence matrix used in the search evaluation conditions may be weighted according to its co-occurrence probability or the magnitude of its value; and each element of the co-occurrence matrix used in search and detection conditions may be used as an enable/disable flag of a genetic algorithm.
[0329] As methods of specifying the range of a co-occurrence matrix, the image and audio feature quantities of one work or one program, of a range in which arbitrary identifiers or feature quantities co-occur, or of a range designated by segments based on the appearance of a specific identifier may be classified, analyzed, and subjected to multivariate analysis to evaluate the appearance times of those feature quantities; evaluation functions may be created by constructing co-occurrence matrices or co-occurrence probability covariance matrices from the classified information; the appearance frequency of the identifiers obtained as evaluation results may be obtained; or scene features and evaluation functions may be constructed and evaluated from histograms of the appearance of those identifiers per unit time. Any such method may be used. With respect to search results extracted by search conditions using them, methods are conceivable in which identifiers and feature quantities outside the search conditions that have a high co-occurrence probability (for example, 70% or more) or a short distance (for example, within 3σ of the distance average) are taken as new targets for learning in the co-occurrence information construction step, or conversely those with low membership probability (for example, farther than 3σ) are excluded from learning in the co-occurrence information construction step.
[0330] The feature quantities used for this identifier reconstruction are constructed arbitrarily from values such as the output value of an evaluation function, the output probability of an HMM, and the similarity between identifier strings. In this embodiment, the co-occurrence states of feature vectors, such as the aforementioned color appearance frequencies and emotion identifier appearance frequencies, image feature quantities such as human actions, gestures, gait, and facial expressions, and feature quantities such as phonemes, phoneme pieces, musical scales, and chord codes, may be combined and used as covariance matrices, or co-occurrence matrices of identifiers may be constructed. Using such methods makes it possible to realize searches like the later-described "embodiment of search and arbitrary processing with multiple identifiers and multiple search conditions".
[0331] Then, by arbitrarily labeling the co-occurrence information of the feature quantities and identifiers obtained in this way, a character string is given to the evaluation function, which is stored in the storage unit as a learning result. The character string given to an identifier or feature quantity may be used as a tag name in a markup language such as new XML; the given character string itself may be converted into an identifier symbol string such as phonemes or phoneme pieces so that speech input from the user can be handled; or an evaluation function may be constructed in association with facial expression identifiers, shape identifiers, motion identifiers, and the like so that video input from the user can be handled.
[0332] More specifically, for a search condition under which the user has repeatedly made selections from the presented result list and browsed the content, when the distance evaluation result with respect to the search condition is within 3σ as seen from the centroid of that co-occurrence information, or the probability evaluation result is 80% or more, the co-occurrence information of the indices in the target range of the selected content is treated as a co-occurrence matrix or co-occurrence probability, and a new evaluation function is constructed based on the identifiers and feature quantities used in those indices. The evaluation function may be, for example, a Bayes discriminant function, a Mahalanobis distance function, or an HMM function; from these newly constructed evaluation functions, likelihoods such as membership probabilities and evaluation distances can be obtained.
[0333] Thus, the features of the present invention lie not in the prior-art recognition of various identifiers, feature extraction methods, specification of frame widths and time widths, range selection methods, or identifier string matching methods, but in: indexing based on the co-occurrence information of phonemes and phoneme pieces, emotion identifiers, and the other acoustic and image identifiers built on them; search and detection using that indexing; processing such as recording and playback started by detection; learning of the co-occurrence information in indexing; learning of co-occurrence information based on the usage situation of search results; the new identifiers and new feature quantities obtained by that co-occurrence learning; and the identifier conversion dictionary that allows those identifiers and feature quantities to be specified as search conditions using phoneme sequences and phoneme-piece sequences.
[0334] «Example of identifier reconstruction»
Next, the identifier reconstruction method according to the present invention will be described.
[0335] To reconstruct an identifier with the device according to the present invention, several values, such as the output DP match degree, the HMM output probability, the output value of a Bayes discriminant function, the distance values of other distance functions for evaluating feature quantities, and the identifiers and feature quantities associated with those search results actually used by the user, are combined into a set of feature quantities, and new Bayes discriminant functions, HMM probability evaluation functions, distance evaluation functions, probability evaluation functions, likelihood evaluation functions, and the like are constructed. Such an identifier reconstruction method can combine, according to the implementation, arbitrary learning and recognition methods based on the above feature quantities, such as multi-layer Bayes, multi-layer neural networks, and multi-layer HMMs.
[0336] At this time, co-occurrence information obtained by associating identifiers and feature quantities may be used to construct the discriminant function, and co-occurrence information may be combined in configurations such as the following: learning that uses identifier co-occurrence probabilities as feature quantities; learning that uses the covariance matrix of feature quantities as feature quantities; learning that uses both identifier co-occurrence probabilities and the covariance matrix of feature quantities as feature quantities; learning that uses the outputs of distance functions as feature quantities; learning that uses the output probabilities of HMMs evaluating identifiers as feature quantities; and learning that uses the transition probabilities of HMMs evaluating identifiers as feature quantities. These may be combined and given as HMM learning parameters; evaluation function parameters may be learned by constructing a covariance matrix and obtaining eigenvalues and eigenvectors to build the evaluation function; or the parameters of an evaluation function used for distance evaluation may be learned by obtaining mean values. Learning based on arbitrary identifiers and feature quantities can thereby be carried out to perform reconstruction of identifiers, reconstruction of identifiers using the co-occurrence information of the identifiers and feature quantities accompanying the search and detection conditions the user frequently specifies, and reconstruction of identifiers using the co-occurrence information of the identifiers and feature quantities accompanying search results that continue to be used long after user selection.
[0337] For example, in the case of emotion identifiers and phoneme pieces, recognition results are obtained for four emotion identifiers (joy, anger, sorrow, and pleasure) and roughly 400 phoneme-piece identifiers. Next, the phoneme-piece sequence is searched by DP matching for portions uttering "k/o/r/a". As a result, the emotion identifiers occurring around the portions uttering "k/o/r/a" can be acquired and co-occurrence information can be constructed, so it becomes possible to learn the co-occurrence state of the anger emotion and the phoneme-piece sequence "k/o/r/a", or the feature quantities in that co-occurrence state, and to construct new identifiers such as an "angry [k/o/r/a]" identifier or a "joyful [k/o/r/a]" identifier. The information used for learning by reconstruction may be the DP matching rate together with a ratio of emotion features or emotion identifiers, or the likelihoods, probabilities, and distances from the evaluation function of the phoneme or phoneme-piece sequence and the evaluation function of the emotion identifiers. Combinations of associations by feature extraction method, such as video features, image features, moving-image features, still-image features, musical-scale features, and environmental-sound features, are possible here; for example, facial expression identifiers accompanied by emotion may be constructed from the facial feature quantities of a person extracted along with emotion and utterance.
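A sketch of this example: locate "k/o/r/a" in a recognized phoneme-piece sequence and gather the emotion identifiers around each hit as co-occurrence samples. Plain edit distance stands in here for the DP matching of the disclosure, and all sequences are invented:

def edit_distance(a, b):
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[-1][-1]

phonemes = ["a", "k", "o", "r", "a", "s", "u", "k", "o", "r", "a"]
emotions = ["calm"] * 4 + ["anger"] * 7            # frame-aligned emotion identifiers
query, hits = ["k", "o", "r", "a"], []
for i in range(len(phonemes) - len(query) + 1):
    if edit_distance(phonemes[i:i + len(query)], query) <= 1:  # near match
        hits.append(set(emotions[max(0, i - 2):i + len(query) + 2]))

print(hits)  # e.g. an "angry [k/o/r/a]" sample when 'anger' dominates a hit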
[0338] The range of information used for re-learning an identifier may be constructed based on specified boundary conditions such as: the boundaries of identifiers; cases where a value departs from the feature quantity average by 3σ or more; cases where the temporal and spatial change of a feature quantity departs by 3σ or more from the average temporal and spatial change at temporally or spatially different information positions; cases with significant divergence in other statistical tests; cases where information lies within 3σ of the average including the information surrounding the searched information; or cases where an arbitrary user-specified time width is used.
[0339] Examples of identifier associations include: program information with display position; program information with emotion; program information with phonemes and phoneme pieces; program information with landscape images; program information with text; program information with environmental sounds; program information with musical scale and tempo, chords and chord progressions; program information with facial expression images; program information with object images; program information with motion information; display position with emotion; display position with phonemes and phoneme pieces; display position with landscape images; display position with text; display position with environmental sounds; display position with musical scale and tempo, chords and chord progressions; display position with facial expression images; display position with object images; display position with motion information; emotion with phonemes and phoneme pieces; emotion with landscape images; emotion with text; emotion with environmental sounds; emotion with musical scale and tempo, chords and chord progressions; emotion with facial expression images; emotion with object images; emotion with motion information; phonemes and phoneme pieces with landscape images; phonemes and phoneme pieces with text; phonemes and phoneme pieces with environmental sounds; phonemes and phoneme pieces with musical scale and tempo, chords and chord progressions; phonemes and phoneme pieces with facial expression images; phonemes and phoneme pieces with object images; phonemes and phoneme pieces with motion information; landscape images with text; landscape images with environmental sounds; landscape images with musical scale and tempo, chords and chord progressions; landscape images with facial expression images; landscape images with object images; landscape images with motion information; text with environmental sounds; text with musical scale and tempo, chords and chord progressions; text with facial expression images; text with object images; text with motion information; environmental sounds with musical scale and tempo, chords and chord progressions; environmental sounds with facial expression images; environmental sounds with object images; environmental sounds with motion information; musical scale and tempo, chords and chord progressions with facial expression images; musical scale and tempo, chords and chord progressions with object images; musical scale and tempo, chords and chord progressions with motion information; facial expression images with object images; facial expression images with motion information; object images with motion information; and image information with acoustic information, as well as associations with any of the aforementioned identifiers and feature quantities. These associations are carried out by the step of learning co-occurrence states from them, are stored in the co-occurrence learning storage unit, and are subjected to distance evaluation by the Mahalanobis distance, probability evaluation by HMMs, distance evaluation by Bayes discriminant functions, and likelihood evaluation by combinations of these; they may also be used for the other discriminant functions in the identifier/feature-quantity conversion unit of the feature extraction unit, and for the other match-degree evaluation in the composite search result generation processing.
[0340] According to the evaluation results of such combinations, for example: by collecting speech sections such as screams, explosion sounds, laughter, and exclamations, a "scream discriminant function", an "explosion sound discriminant function", a "laughter discriminant function", and an "exclamation discriminant function" that identifies utterances such as "hee" can be constructed; those original discriminant functions can be combined so that phoneme recognition, moving-image features, and emotion features are indexed simultaneously and searched; a "smiling face function" or "crying face function" can be created to enable similar searches; learning of co-occurrence states from the discrimination results of those discriminant functions can be enabled; an evaluation function can be constructed that recognizes and detects a specific program from the title image features of the program's first few seconds and from program title utterance recognition by phoneme recognition; and the presence or absence of identifiers with high co-occurrence frequency based on the co-occurrence state may be used for the gene flag specification of a genetic algorithm.
[0341] It then becomes possible to analyze the image and audio tendencies associated with a program's genre from the frequency and bias of the identifiers and feature quantities appearing within one program, and by learning co-occurrence information based on the analysis results and constructing a "horror movie discriminant function", an "action movie discriminant function", a "comedy program discriminant function", and a "trivia program discriminant function", new identifiers and discriminant functions can be built, realizing unprecedented search and detection such as the later-described "embodiment of search and arbitrary processing with multiple identifiers and multiple search conditions".
[0342] Next, as a concrete method of autonomously adding to and reconstructing search conditions to raise search efficiency: when the feature quantities and identifiers input as search conditions show high similarity (for example, 80%) to the content obtained as search results, and other identifiers and feature quantities not specified in the search conditions but associated with the same content also show high similarity (for example, 80%), those other identifiers and feature quantities are recorded in the co-occurrence information storage unit together with the specified search conditions.
[0343] Then, when the accumulation of information associated on the basis of such co-occurrence states exceeds a certain value (for example, 1000 items, or n times the number of evaluation dimensions), a co-occurrence matrix is constructed from the co-occurrence information, covariance matrices and co-occurrence probabilities are obtained, and learning by distance evaluation functions or HMMs is carried out to reconstruct the evaluation function. At this time, information with large variance or low probability may be excluded from the computation to reduce the number of evaluation dimensions and raise computational efficiency. In the case of fixed phrases or specific words such as command control, the evaluation function templates for identifying phonemes and phoneme pieces may be updated using the recognized phoneme or phoneme-piece sequence, in accordance with the user's affirmative or negative response to the device's recognition, rather than the phoneme or phoneme-piece sequence expanded from the character string.

[0344] More specifically, in the case of phonemes and phoneme pieces, when explosion sounds are detected as environmental sounds within several seconds before or after an identifier sequence recognized as "waa" in 80% or more of 1000 searches, the "waa" phonemes and phoneme pieces become targets of learning as co-occurrence information, and through reconstruction of the evaluation function a phoneme sequence saying "waa" also comes to be evaluated when searching for "explosion sounds". Likewise, when radial motion features are detected as image features in 80% or more of 1000 searches, the motion feature quantities are also used as learning targets for the co-occurrence information, and a discriminant function may be constructed from the co-occurrence state of the "waa" phonemes and phoneme pieces, the "radial" motion features, and the "explosion sound" environmental sound and sound effect identifiers, to search for explosion scenes. A character string written "explosion scene" may be associated with the evaluation function to execute search requests by character string, or an identifier by phoneme or phoneme-piece sequence such as "b/a/k/u/h/a/ts/u/sh/i/i/n" may be given so that search requests can be executed by spoken utterance.
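A sketch of the accumulation-and-promotion rule in this passage, using the 1000-sample threshold and 80% co-occurrence rate named above; the identifier labels and the simulated log are illustrative:

from collections import Counter

THRESHOLD_SAMPLES, PROMOTE_RATE = 1000, 0.80
counts, n_searches = Counter(), 0

def record_search(co_occurring_ids):
    global n_searches
    n_searches += 1
    counts.update(set(co_occurring_ids))

# Simulated log: "waa" co-occurs with the explosion identifier in 90% of
# searches, the radial motion feature in only 50%.
for i in range(1200):
    ids = ["env:explosion"]
    if i % 10 != 0:
        ids.append("phoneme:waa")
    if i % 2 == 0:
        ids.append("motion:radial")
    record_search(ids)

if n_searches >= THRESHOLD_SAMPLES:
    promoted = [k for k, v in counts.items() if v / n_searches >= PROMOTE_RATE]
    print(sorted(promoted))  # 'env:explosion' and 'phoneme:waa' are promoted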
[0345] In the case of emotion identifiers, when an identifier sequence of phonemes or phoneme pieces recognized as "waa" is recognized at the same time as the emotion identifier "sorrow", a method of constructing a co-occurrence matrix of the identifiers as co-occurrence information and obtaining co-occurrence probabilities, or a method of obtaining eigenvalues and eigenvectors from the covariance matrix of the feature vectors and constructing a Bayes discriminant function or Mahalanobis distance, can be used. By constructing a likelihood evaluation function over the content information to be searched, the presence or absence of the phoneme sequence "waa" can be evaluated when searching for "sad scenes"; and even when the same "waa" utterance is detected, depending on whether or not the emotion "joy" is recognized, it can be excluded from the results as a scene different from the user's "sad scene search", so that search results differing in quality according to emotion can be provided. Here, notations called emoticons in the input search condition character string, such as "(一)" or "(;;)", may be used as emotion identification character strings for "joy" or "sorrow", converted into emotion feature quantities or emotion identifiers via the character-string-to-identifier conversion dictionary, and used for the search.
[0346] The likelihood evaluation function constructed in this way has its parameters and templates stored in the co-occurrence learning storage unit and the evaluation function storage unit, and the relationships between the specified character strings or words and the phoneme or phoneme-piece sequences based on their utterance are registered in the dictionary unit. To evaluate the utility value of search conditions, the evaluation may use, as learning samples, the frequency of third-party use of search conditions available via communication lines. The "output probabilities of various identifiers" and/or the "co-occurrence probabilities of various identifiers" and/or the "transition probabilities of various identifiers" and/or "various feature quantities" may be combined into one set of "feature quantities": eigenvalues and eigenvectors may be obtained from their covariance matrix to construct an evaluation function, they may be given to an HMM as feature quantities for learning, or clustering by various multivariate analyses may be performed to form populations and create population membership evaluation functions. In a genetic algorithm, identifiers and feature quantities with high co-occurrence probability arising during frequently used searches, in search results, or during indexing processing, identifiers and feature quantities whose divergence from the average of the distances obtained from discriminant functions exceeds 3σ, and identifiers and feature quantities whose co-occurrence probability and/or appearance probability is particularly high relative to the average probability may be used as gene flags.
[0347] Dictionary switching according to the co-occurrence state may also be performed, such as switching the phoneme and phoneme-piece recognition dictionary in accordance with recognized emotion, switching the phoneme and phoneme-piece recognition dictionary in accordance with changes in the recognized environmental sounds, switching the image recognition dictionary for displayed objects in accordance with the recognized landscape image, or switching the phoneme and phoneme-piece recognition dictionary in accordance with the recognized image. Information based on the co-occurrence relationships obtained by the present invention may also be treated as sensibility information and used for searching content information.
[0348] Since the cases given here describe examples for carrying out the present invention, searches, detection, and search results involving multiple identifiers and multiple indexing and search conditions other than the above are also contemplated; details are described separately later as arbitrary processing examples and product application examples.
[0349] <Application examples of the present invention>
As application examples of using a device based on the present invention, the following are described: "Example procedures for an information processing device used in terminals and base stations," which considers a server-client environment; "Example procedure for information sharing between users," which considers improved convenience through information exchange and sharing between users; and "Examples of user interfaces" using the present invention.
[0350] 《Example procedures for an information processing device used in terminals and base stations》
First, a server-client processing system involving a base station and terminals is described. The device and terminals are configured as shown in FIG. 20, comprising user terminals, a distribution base station, devices such as robots controlled by the terminals or base station, and the remote controllers that control them; a remote controller or robot may also be used as one form of terminal or one form of base station. The user speaks to the terminal, and the terminal or base station executes one of the following processing procedures for recognition.
[0351] In the first method, feature quantities are extracted from the speech obtained by utterance or from the captured video, and those feature quantities are transmitted to the target relay point or base station apparatus; the base station apparatus that receives them generates phoneme symbol strings and/or phoneme-piece symbol strings, emotion symbol strings, and other image identifiers according to the feature quantities. A matching control means is then selected and executed based on the generated symbol strings.

[0352] In the second method, feature quantities are extracted from the speech obtained by utterance or from the captured video, identifiers accompanying recognition, such as phoneme symbol strings and/or phoneme-piece symbol strings, emotion symbol strings, and other image identifiers, are generated within the terminal, and the generated symbol strings are transmitted to the target relay point or base station apparatus. The controlled base station apparatus then selects and executes a matching control means based on the received symbol strings.

[0353] In the third method, feature quantities are extracted from the speech obtained by utterance or from the captured video, phoneme strings and/or phoneme-piece symbol strings, emotion symbol strings, and other image identifiers are recognized based on the feature quantities generated within the terminal, control content is selected based on the recognized symbol strings, and the control method is transmitted to the base station apparatus being controlled or to the apparatus relaying information distribution.

[0354] In the fourth method, the speech waveform or image of the utterance or captured video is transmitted from the terminal as-is to the controlling base station apparatus; within the controlling apparatus, phoneme symbol strings and/or phoneme-piece symbol strings, emotion symbol strings, and other image identifiers are recognized, a control means is selected based on the recognized symbol strings, and the controlled relay point or base station apparatus executes the selected control. Emotion identifiers can likewise be extracted as features from speech and symbolized, and the same applies to features and identifiers of sounds such as environmental sounds and of video.
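The four methods above differ only in how far along the recognition pipeline the terminal proceeds before transmitting. The following is a minimal sketch of that split; the stage names and the callback functions (`extract`, `recognize`, `select_control`, `send`) are illustrative assumptions, not part of the specification:

```python
from enum import Enum

class SplitLevel(Enum):
    RAW_WAVEFORM = 1   # fourth method: send the waveform/image as-is
    FEATURES = 2       # first method: send extracted feature quantities
    IDENTIFIERS = 3    # second method: send recognized symbol strings
    CONTROL = 4        # third method: send the selected control procedure

def terminal_pipeline(audio, level, extract, recognize, select_control, send):
    """Run the pipeline on the terminal up to `level`, then transmit;
    the base station completes whatever stages remain."""
    if level is SplitLevel.RAW_WAVEFORM:
        return send(audio)
    feats = extract(audio)                # feature quantities
    if level is SplitLevel.FEATURES:
        return send(feats)
    symbols = recognize(feats)            # phoneme/phoneme-piece/emotion IDs
    if level is SplitLevel.IDENTIFIERS:
        return send(symbols)
    return send(select_control(symbols))  # control selected on the terminal
```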
[0355] Here, the terminal may simply transmit only the waveform, transmit feature quantities, transmit recognized identifier strings, or transmit processing procedures such as commands and messages associated with the identifier strings; a client-server model may be implemented by changing the configuration of the distribution base station to match the transmitted information. The transmitting side takes the configuration of FIG. 21 and the receiving side that of FIG. 22, and mutual transmission and reception between them is also possible.

[0356] The command dictionary that converts input phoneme strings or phoneme-piece strings into their associated processing procedures may reside either on the terminal side or on the distribution base station side; symbol strings such as phoneme symbol strings, image identifiers, and emotion identifiers relating to new control commands, media types, format types, and device names may be transmitted, received, and distributed using markup languages described later, such as XML and HTML, or RSS and CGI.

[0357] Next, a more specific procedure is described. First, by extracting feature quantities and identifiers and constructing evaluation functions, information is exchanged with other terminals and devices in an environment connected to an arbitrary communication line.
[0358] Next, terminal-side processing is described taking the use of phoneme pieces as an example. The user provides a speech waveform to the terminal or device by speaking. The terminal-side device analyzes the given speech and converts it into feature quantities. The converted feature quantities are then recognized and converted into identifiers by recognition techniques such as HMMs or Bayes classifiers.

[0359] Here, the converted identifiers mean phonemes, phoneme pieces, emotion identifiers, and various image identifiers; as noted elsewhere, for audio they may also be phonemes, environmental sounds, or musical scales, and for images they may be identifiers based on the image or on motion. Then, based on the obtained identifiers, a dictionary of phoneme and phoneme-piece symbol strings is consulted by DP matching to select a processing procedure, and the selected processing procedure is transmitted to the target device to execute control. The present invention thus makes it possible to use a mobile terminal as a remote controller or to have a robot control home appliances; a dialogue device for persons with disabilities may also be configured, with a display of emotion indices that detects the expression and emotion in the face and voice of the party at the other end of the communication to facilitate smooth communication, a display of utterance notation, or a braille output unit.
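The DP matching step against a phoneme/phoneme-piece command dictionary can be pictured with the following sketch; the dictionary entries, phoneme spellings, and unit edit costs are illustrative assumptions:

```python
def dp_distance(a, b):
    """Edit distance between two phoneme-symbol sequences (DP matching)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(a)][len(b)]

# Hypothetical command dictionary: phoneme sequence -> control procedure.
COMMANDS = {
    ("s", "a", "i", "s", "e", "i"): "PLAY",   # "saisei" (play)
    ("t", "e", "i", "s", "i"): "STOP",        # "teishi" (stop)
}

def select_command(recognized):
    """Pick the control procedure whose phoneme entry is closest to the
    recognized phoneme string."""
    best = min(COMMANDS, key=lambda k: dp_distance(tuple(recognized), k))
    return COMMANDS[best]
```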
[0360] Depending on the terminal's CPU performance, information processed by such a procedure may be transmitted as original natural information such as video and audio without conversion to feature quantities, transmitted after conversion only as far as feature quantities, transmitted after conversion only as far as identifiers, or transmitted after proceeding as far as selection of control information; any conversion level can be selected. The receiving side is configured as a receiving apparatus capable of processing information in any of these states, and based on the acquired information it may transmit to a distribution station or control apparatus, or carry out arbitrary processing such as search, recording, mail distribution, machine control, or device control.

[0361] Then, as shown in the state transition diagram of the search process in FIG. 23, identifier strings, character strings, and feature quantities serving as queries are transmitted as appropriate to the distribution-side base station, and information matching the query is obtained. Advertisements may be displayed during communication and search wait times; when performing control by voice, control dictionaries such as the control dictionary configuration example of FIG. 24 may be exchanged and acquired so that control items can be selected via communication.

[0362] Furthermore, by composing this control command dictionary from phonemes, phoneme pieces, emotion identifiers, or any of the aforementioned arbitrary identifiers and feature quantities together with device control information, its contents can be freely updated and reused; and by replacing or reconstructing the dictionary information for search that associates arbitrary identifiers with feature quantities, trending search keywords can be kept up to date.
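One conceivable in-memory layout for such an exchangeable control command dictionary, including version information for network updates and an infrared fallback code for legacy devices, is sketched below; every field name and value here is an assumption for illustration only:

```python
from dataclasses import dataclass, field

@dataclass
class ControlEntry:
    """One exchangeable entry associating recognition identifiers with
    device control information (all field names assumed)."""
    phonemes: tuple              # phoneme/phoneme-piece symbol string
    emotion: str | None          # optional co-occurring emotion identifier
    action: str                  # device control information, e.g. "POWER_ON"
    ir_code: bytes | None = None # infrared fallback for legacy devices

@dataclass
class ControlDictionary:
    version: str                 # checked when updating over the network
    entries: list = field(default_factory=list)

remote = ControlDictionary(version="1.0", entries=[
    ControlEntry(("d", "e", "N", "g", "e", "N"), None, "POWER_ON",
                 ir_code=b"\x12\x34"),   # "dengen" (power)
])
```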
[0363] In the control command dictionary, infrared control information for transmission to products controllable by a conventional infrared remote controller may be selected as the device control information; a series of operations may be carried out continuously, batch-style, by combining such pieces of control information; or, depending on the device's CPU performance, only feature quantity information may be transmitted to the voice-controlled information processing device without recognizing identifiers.

[0364] In this way, even for conventional devices that cannot be voice-controlled, infrared remote-control signals can be provided from voice information via the conversion dictionary by combining infrared remote control; for voice-controllable devices, commands can be recognized and controlled based on feature quantities or speech waveforms, the control dictionary can be replaced as performance improves, the control dictionary's version information can be confirmed, and the state of the device can be checked.

[0365] A server-client model may also be introduced in this manner: by dividing the processing at any step between a server and a client, connecting them by communication, and exchanging arbitrary information between server and client, equivalent services, infrastructure, search, and indexing may be implemented.

[0366] Furthermore, information acquired from a backbone server at the communication destination by client terminals such as DVD recorders, network TVs, STBs, HDD recorders, music recording/playback devices, and video recording/playback devices may be provided to mobile terminals and mobile phones via infrared communication, FM or VHF band communication, or wireless communication such as 802.11b, Bluetooth, ZigBee, WiFi, WiMAX, UWB (Ultra Wide Band), and WUSB (Wireless USB), so that EPG, BML, RSS, and teletext data broadcasts, television video, and teletext can be used on mobile terminals and mobile phones; the control content of the client terminal may be designated by voice input, character string input, or an operation of shaking the mobile terminal or phone; or the mobile terminal or phone may be used for client terminal operation as a general-purpose remote controller.
[0367] 《Example procedure for information sharing between users》
First, in an environment such as that of FIG. 20, the user selects the search condition expressions constructed on his or her own device, together with the identifiers, feature quantities, and/or function parameters used in those expressions, and provides them to third parties via a communication line and/or a storage medium. The search condition expressions and/or identifiers and/or feature quantities and/or function parameters may be sold or provided to third parties by publishing them on any server, or shared using P2P software. Search conditions and combinations of identifiers, feature quantities, and function parameters based on the tastes and values of celebrities, specialist magazines, experts, and the like may also be sold via a communication line or as magazine supplements.

[0368] As a result, by copying another person's search condition expressions and/or function parameters from a storage medium or downloading them via a communication line by the procedure shown in FIG. 25, those search condition expressions become usable on one's own device, provided the feature extraction method used for indexing and the identifiers selected by the discriminant function have the same configuration. Measures may also be taken to prevent viruses from being embedded in this distributed information.
[0369] When identifiers or feature quantities differ from device to device, information such as the evaluation functions and search conditions involved in the search may be acquired or converted, so that a user can obtain, on another device, search condition expressions by the same method as other users. In this conversion, identifiers may be converted into one another based on co-occurrence information, as in the conversion between international phoneme symbols and language-dependent phoneme symbols described later; or, to convert other identifiers into phoneme symbols, conversion in the information space may be performed using identifier co-occurrence matrices or evaluation functions such as HMMs, Bayes classifiers, and membership probabilities.
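Conversion between the identifier sets of two devices by way of co-occurrence counts might look like the following sketch; the identifier names, the counts, and the alignment material from which such counts would be gathered are all assumptions:

```python
import numpy as np

# Rows: source-device identifiers; columns: target-device identifiers.
# co[i][j] counts how often source identifier i co-occurred with target
# identifier j in shared material.
src_ids = ["phA", "phB", "phC"]              # placeholder identifiers
dst_ids = ["ph1", "ph2", "ph3", "ph4"]
co = np.array([[40,  2,  1,  0],
               [ 3, 55,  4,  1],
               [ 0,  5, 30,  9]], dtype=float)

# Normalize rows into conditional probabilities P(target | source).
p = co / co.sum(axis=1, keepdims=True)

def convert(identifier):
    """Map a source identifier to its most probable target identifier."""
    row = src_ids.index(identifier)
    return dst_ids[int(np.argmax(p[row]))]

print(convert("phB"))   # -> "ph2" under the assumed counts
```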
[0370] Here, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside either on the terminal side or on the distribution base station side; phoneme symbol strings, image features, and emotion identifiers relating to new control commands, media types, format types, and device names may be expressed using markup languages described later, such as XML and HTML, or RSS and CGI, and information so constructed may be transmitted, received, and distributed.
[0371] Next, a more specific procedure is described with reference to FIG. 20. First, terminal A, the first user's device, attempts to connect to communicable information processing devices such as another terminal C or base station B via the Internet. If connection is possible, it checks, using conventional protocols, RSS, or CGI, whether information that other devices can use for search is being distributed. If it is, a step of acquiring a list is executed.

[0372] Next, terminal A executes an evaluation function acquisition step for acquiring, via a communication line or infrared, detailed information on the target search execution method. As a result, terminal A can acquire the numerical information, identifier symbol strings, evaluation expressions, and other information needed to construct the functions and to perform the search.

[0373] When phoneme or phoneme-piece recognition is considered, the information needed for this search is as follows: for a Bayes function, numerical information and identifier symbols such as the eigenvalues, eigenvectors, mean values, and prior probabilities based on the feature quantities of each phoneme or phoneme piece; for matching by DP or the like, identifier symbol strings consisting of phonemes or phoneme pieces in the same notation symbol group, serving as search indices; and for an HMM, standard template data for each phoneme or phoneme piece. Depending on the recognition target and the identifiers, this information is changed as appropriate to image recognition templates, acoustic recognition templates, environmental sound templates, motion recognition templates, and so on, together with their respective identifier strings and evaluation functions.
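For the Bayes case, the distributed per-phoneme payload could plug into a discriminant such as the following sketch, a Gaussian log-likelihood in eigen-space plus a log prior; the payload layout is an assumption consistent with the quantities listed above:

```python
import numpy as np

def bayes_score(x, mean, eigvals, eigvecs, prior):
    """Log-posterior-style score for one phoneme class, built from the
    distributed payload: mean, eigenvalues/eigenvectors of the class
    covariance, and prior probability."""
    centered = eigvecs.T @ (x - mean)
    log_like = -0.5 * (np.sum(centered ** 2 / eigvals)
                       + np.sum(np.log(eigvals)))
    return log_like + np.log(prior)

def classify(x, payload):
    """`payload` maps phoneme symbol -> (mean, eigvals, eigvecs, prior);
    returns the phoneme symbol with the highest score for vector `x`."""
    return max(payload, key=lambda ph: bayes_score(x, *payload[ph]))
```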
[0374] Next, if the device's own storage capacity is insufficient, an evaluation function switching step is executed: infrequently used discriminant functions, DP data, and HMMs are deleted, and new evaluation functions are registered in the device's storage unit based on the information just acquired, so that they can be reused without being fetched by communication every time.

[0375] Of course, depending on the embodiment, the evaluation function may instead be acquired by communication each time, stored in the storage unit, and deleted when the service ends or the power is turned off; or it may be acquired from a distributed storage medium.

[0376] As shown in FIG. 20, the targets for information exchange are not limited to base stations and other terminals; any embodiment is conceivable as long as the device, such as a robot or remote controller using the present invention, incorporates an information processing unit, an information input/output unit, and a storage unit in a configuration related to the present invention.
[0377] 《Examples of user interfaces》
Next, application to user interfaces is described.

[0378] A control method is acquired by a method such as the above example procedures for an information processing device used in terminals and base stations, a dictionary is provided for converting between the phoneme-string symbols of the commands to be input and the control commands, and voice operation is realized by phoneme-recognizing a person's utterance so that the intended command is executed. Emotion may also be analyzed from the voice information; processing means may then be implemented such as selecting a consoling context when the detected result is the emotion "sadness" and phonemes or phoneme pieces associated with a crying utterance such as "eeen" are detected, or selecting a soothing context when the emotion "anger" is detected and phonemes or phoneme pieces associated with a scolding utterance such as "kora" are detected.
[0379] Here, when the user's emotion involves anger, an apologetic message may be presented to the user by voice or text; a camera or the like may be added and the combination of feature extraction and recognition processing described in the example procedures for an information processing device used in terminals and base stations may be employed; recognition may be performed based on any of the aforementioned identifiers and feature quantities, such as phonemes, phoneme pieces, emotion identifiers, and image identifiers, with processing selected or changed according to the combination of identifiers; and recognition results such as emotion identifiers, instrument identifiers, scale identifiers, and environmental sound identifiers may additionally be used.

[0380] Furthermore, reinforcement learning may be performed by having users themselves evaluate the preferences and subjectivity extracted by the search device according to the present invention, thereby improving the accuracy of the extracted information. For example, reinforcement learning may be carried out when the recognition result of the emotion, phoneme string, or phoneme-piece string accompanying the user's utterance at evaluation time is a phoneme or phoneme-piece symbol string of words of praise associated with positive connotations, such as "ii ne" (nice), or when identifiers of emotions associated with positive connotations, such as "joy" or "relief," are detected. Conversely, when the recognition result is a phoneme or phoneme-piece symbol string of words associated with negative connotations, such as "dame ne" (no good), or emotions associated with negative connotations, such as "sadness," "anger," or "dejection," the item may be removed from the next reinforcement learning targets, or a new feature group with negative connotations may be established and reinforcement learning performed to learn rejection targets.
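The positive/negative feedback loop described in this paragraph could be sketched as follows; the keyword lists, emotion labels, and additive update rule are illustrative assumptions:

```python
POSITIVE = {("i", "i", "n", "e"), "joy", "relief"}          # praise / emotions
NEGATIVE = {("d", "a", "m", "e", "n", "e"), "sadness", "anger", "dejection"}

def update_preference(weights, item, phonemes, emotion, lr=0.1):
    """Reinforce or suppress a search-preference weight for `item`
    according to the user's spoken reaction (assumed additive rule)."""
    if tuple(phonemes) in POSITIVE or emotion in POSITIVE:
        weights[item] = weights.get(item, 0.0) + lr     # strengthen
    elif tuple(phonemes) in NEGATIVE or emotion in NEGATIVE:
        weights[item] = weights.get(item, 0.0) - lr     # suppress / exclude
    return weights

prefs = update_preference({}, "jazz_playlist",
                          ("i", "i", "n", "e"), "joy")
print(prefs)   # -> {'jazz_playlist': 0.1}
```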
[0381] Keywords for operable processes may also be displayed on the screen so that the user can select from, or speak, a list of phoneme strings or phoneme-piece strings, and hidden commands that are not displayed may also exist. In this way, a voice user interface based on phoneme/phoneme-piece recognition accompanied by emotion, without using general-purpose speech recognition, can be realized.

[0382] Here, the dictionary that converts between phoneme strings, phoneme-piece strings, or emotion identifiers and processing procedures may reside either on the terminal side or on the distribution base station side; symbol strings such as phoneme symbol strings, image features, and emotion identifiers relating to new control commands, media types, format types, and device names may be transmitted, received, and distributed using markup languages described later, such as XML, HTML, and RDF, or RSS and CGI, and convenience can be achieved by combining them appropriately.
[0383] <Examples of combining co-occurrence information>
The procedure for using combinations of co-occurrence information based on multiple identifiers and feature quantities, which is fundamental to the present invention, is now described more concretely. First, as a framework, an example search processing procedure using multiple types of identifiers and an example of arbitrary processing based on search using multiple types of identifiers are presented, followed by concrete examples of combinations involving the respective identifiers. Combinations of these identifiers and feature quantities may consist of two or three as needed, or of four or more, or a dozen or more; by referring to a co-occurrence dictionary constructed from the co-occurrence probabilities of these identifiers and the covariance matrices of the feature quantities, and constructing search conditions according to the user's instructions, searches not previously possible are realized.

[0384] The co-occurrence state or co-occurrence information in the present invention is based on natural information consisting of auditory information, visual information, and sensor information. It is fundamentally information constructed using identifiers and feature quantities acquired from video and/or audio, and is a plurality of associated pieces of information that also use distributed character information and detected sensor information, characterized in that those identifiers and feature quantities occur simultaneously within a unit time appropriate to the use. It may be constructed along the time transitions of multiple pieces of co-occurrence information; it may be covariance matrices or co-occurrence probabilities constructed from their means and variances; and state transition models of co-occurrence information may be constructed using their probability transition matrices. It is used as "index co-occurrence information" employed in the index information of content and as "co-occurrence search condition information" constructed from the search conditions input by the user.
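Constructing index co-occurrence information over a unit-time window, as defined above, could look like the following sketch; the window length and the event stream format are assumptions:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(events, window=1.0):
    """`events` is a list of (time_sec, identifier) pairs from any
    recognizer (phoneme, emotion, image, sensor, ...). Identifiers that
    fall in the same unit-time window count as co-occurring."""
    buckets = {}
    for t, ident in events:
        buckets.setdefault(int(t / window), set()).add(ident)
    counts = Counter()
    for idents in buckets.values():
        for pair in combinations(sorted(idents), 2):
            counts[pair] += 1
    return counts

events = [(0.2, "explosion_sound"), (0.4, "warm_colors"),
          (0.7, "radial_motion"), (1.5, "wave_sound")]
print(cooccurrence_counts(events))   # pairs within the same 1-second window
```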
[0385] 《Example of a search processing procedure using multiple types of identifiers》
When specifying search conditions or detection conditions for executing search processing with multiple types of identifiers, the boundary of the range over which identifiers and feature quantities are evaluated may be a number of frames dividing the time axis, a point at which the divergence of feature quantities obtained by any discrimination method exceeds or falls below a threshold, or an identifier boundary obtained by any detection or discrimination method.

[0386] In addition to examining the bias in which identifiers co-occur within a given range, the distributed information may be indexed using EPG, BML, RSS, teletext, and characters contained in subtitles and video; the line-up of performers, the title, the names of the director and producer, and the family and personal relationships among the actors' roles may be used as identifiers; and identifiers and feature quantities may be classified to construct a co-occurrence dictionary.
[0387] Then, by converting a character string or identifier ID associated with an identifier obtained as a search result into another identifier or identifier string using a conversion dictionary, and searching for locations where that identifier or identifier string matches the content information, other identifiers and identifier strings associated, via the character string or identifier ID, with the identifier obtained as a search result can be extracted; this enables searches based on co-occurrence relationships that use identifier IDs and character strings as an intermediate code system.

[0388] More concretely, a performer's name is acquired from character information such as EPG, BML, RSS, teletext, or character strings recognized in subtitles and video; a performer's name matching a phoneme string spoken or input by the user is detected; and the locations in the video information where that name is uttered or where it is displayed in subtitles are detected.

[0389] As a result, the detected location is judged to be a scene relevant to the user's purpose, and the content information may be played back, recorded, or skipped, or recording may be started upon a specific title image feature; in such processing, methods such as narrowing down the search targets using co-occurrence matrices and co-occurrence probabilities through statistical processing may be employed, and identifiers and feature quantities may be classified to construct a co-occurrence dictionary.

[0390] Furthermore, using as identifiers the program information obtained from EPG, MPEG7, BML, RSS, XML, websites, and character strings recognized in subtitles and video, such as the line-up of performers, titles, directors, producers, sports team names, and family and personal relationships among the actors' roles, searches such as "scenes where the protagonist and the antagonist co-occur" or "scenes where the protagonist and the lover co-occur" can be handled by multivariate analysis of the image features, the emotions expressed in the scene, the phoneme strings and phoneme-piece strings accompanying the voices uttered in the scene, and the changes in video features, assigning identifiers; indexing, search, detection, and learning are then performed using the phoneme strings or phoneme-piece strings, the program information, and the image features or image identifiers.
[0391] 《Example of arbitrary processing based on search using multiple types of identifiers》
For example, a query is constructed by converting an input character string into a symbol string of phonemes or phoneme pieces, or by using symbol information based on phonemes or phoneme pieces from the user's spoken voice together with identifiers recognized from emotions, environmental sounds, and image features, and recording of broadcast content into the information storage device based on the present invention is started.

[0392] During recording, the symbol strings are evaluated simultaneously and matched against pre-registered symbol strings; when the match exceeds a certain ratio, the hour before and after that point is registered as a long-term retention target, and after a fixed time has elapsed, information not included in the long-term retention targets is deleted from the information storage unit, thereby removing unnecessary information within a finite storage capacity and realizing efficient information retention. Here too, methods such as narrowing down detection targets using co-occurrence matrices and co-occurrence probabilities through statistical processing may be used, and identifiers and feature quantities may be classified to construct a co-occurrence dictionary.
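The retention rule in this paragraph might be sketched as follows; the one-hour margin follows the text, while the 0.8 match threshold and the data layout are assumptions:

```python
def retention_ranges(match_times, match_ratios, threshold=0.8, margin=3600):
    """Return (start, end) ranges in seconds to keep: one hour before and
    after every recording position whose match ratio exceeds `threshold`
    (the threshold value itself is an assumption)."""
    return [(max(0, t - margin), t + margin)
            for t, r in zip(match_times, match_ratios) if r > threshold]

def purge(segments, keep_ranges):
    """Drop recorded (start, end) segments not overlapping any keep range,
    freeing finite storage as described in the text."""
    return [s for s in segments
            if any(s[0] < k[1] and s[1] > k[0] for k in keep_ranges)]

keep = retention_ranges([7200.0], [0.9])
print(purge([(0, 1800), (5400, 9000)], keep))   # -> [(5400, 9000)]
```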
[0393] 《Example of search by character string and identifier》
For example, input content information is indexed by the emotions and environmental sounds recognized from its audio and by the image features, motion identifiers, and object identifiers recognized from its video, and is recorded as a database according to the present invention. Next, the voice or character string input by the user is converted into a symbol string of phonemes or phoneme pieces and given to the recorded database as a query; a search is performed, and the search results detected as the target information are presented to the user.

[0394] Sounds generally called onomatopoeia, such as "wan wan" (a dog's bark) and "dokaan" (a boom), are also recognized as relatively approximate phonemes or phoneme pieces, and may therefore be used in the search as a search index supporting the environmental sound identifiers. Emotion identifiers used for the search may also be selected from character strings: for an emoticon acquired in a text-input query, such as a crying face "(; ;)", the emotion identifier may be set to "joy" or "sorrow" accordingly, and the search condition constructed from it. Through detection of these emotion identifiers, the search technology of the present invention may be used as artificial intelligence for chat, agents, and robots in dialogue between devices and humans; identifiers and feature quantities may also be classified to construct a co-occurrence dictionary.
[0395] 《Example of search involving emotion and proper nouns》
For example, by converting a proper noun into phoneme or phoneme-piece symbols and detecting that proper noun, then evaluating the emotion features and emotion identifiers near the locations where the proper noun occurs, or evaluating the appearance probability of the emotion features and emotion identifiers that the voice of the speaker who uttered the proper noun exhibits around the utterance time, the bias of the user's emotions toward a given proper noun can be evaluated from the frequency with which emotions appear alongside it, enabling searches that reflect the user's preferences.
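Evaluating the emotional bias accompanying a proper noun could be sketched as follows; the window width and emotion labels are assumptions:

```python
from collections import Counter

def emotion_bias(noun_times, emotion_events, window=5.0):
    """Count emotion identifiers detected within `window` seconds of each
    utterance of the proper noun; the normalized counts approximate the
    user's emotional bias toward that noun."""
    hits = Counter()
    for t in noun_times:
        for et, emo in emotion_events:
            if abs(et - t) <= window:
                hits[emo] += 1
    total = sum(hits.values()) or 1
    return {emo: n / total for emo, n in hits.items()}

bias = emotion_bias([12.0, 95.5],
                    [(10.0, "joy"), (96.0, "joy"), (300.0, "anger")])
print(bias)   # -> {'joy': 1.0}: the noun tends to co-occur with joy
```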
[0396] 《Example of search involving emotion and images》
By combining the image features obtained by a face detection algorithm with emotion identifiers from emotion recognition, the feature quantities of facial expressions under a specific emotion can be detected, and by statistically learning those feature quantities, searches that discriminate facial expressions can be executed. Alternatively, after converting a face to a fixed orientation and size using feature quantities based on 3D or 2.5D, regions exhibiting change or movement may be learned as separate items and given identifiers, separating parts of the face as eyes and mouth and learning changes in expression; bodies, machines, and devices for use in other searches may be classified by similar methods.

[0397] Furthermore, by searching for the co-occurrence of detection of the protagonist's face with the phoneme string of the protagonist's name and emotion identifiers, the excitement of a scene, which conventionally could be searched only by volume, can be searched based on the emotion carried by the voice calling the protagonist's name; and by performing co-occurrence-based searches, such as detecting scenes where large characters appear on screen together with the excitement conveyed by the phoneme strings or phoneme-piece strings of cheers and by excitement emotion identifiers, scoring scenes in sports and highlight scenes in movies can be found.

[0398] At this time, by associating with arbitrary tags and names in EPG, BML, RSS, and teletext, detecting from the EPG that the program is a sports program, detecting score changes from the BML, and moving the playback position to locations where excitement is detected in the emotion features around the time the score change was displayed, sports highlight scenes can be detected; by learning the image features in their temporal vicinity, such highlight scenes may then become detectable from image information alone. Videos attached to blogs may be analyzed, organized in association with the blog text, and made searchable; playback may fast-forward to the detected locations; when, through such learning, the user is found to frequently perform negative operations such as fast-forwarding over a range, that range may be treated as a disliked scene, a scene of little interest, or a scene offensive to public order and morals, its feature quantities extracted, and skip playback performed automatically; or services such as delivering notice by e-mail or RSS that the score or the described content has changed may be provided.
[0399] 《Example of search involving images and environmental sounds》
For example, suppose that when scene features varying between frames are extracted as video feature quantities, the partial motion features are large and their directions of motion are not parallel, warm-color features such as red and yellow occupy much of the screen, radial motion is detected, and an audio feature quantity identified as an explosion sound is detected; the scene is then treated as an explosion scene and index information is recorded in synchronization with the moving image. Similarly, when much of the screen is blue and the sound of waves is detected, the scene is indexed as a seaside scene; when slowly moving white masses are detected within the blue and the sound of wind is detected, it is indexed as a sky scene. With such indexing in place, the frequency with which each index appears relative to the overall length of the video is obtained, and by evaluating the similarity of those frequencies, bias in the on-screen presentation is detected; by analyzing the user's viewing behavior in the same way, searches based on the user's viewing behavior and the frequency of identifiers appearing in the content are realized. Also, by analyzing the feature quantities of score display screens through image recognition and identifying the accompanying emotional features in the audio and environmental sounds such as cheers, searches for specific scenes using co-occurrence states may be performed.
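The explosion/seaside/sky rules above reduce to conjunctions of per-window detector outputs; a sketch follows, in which the detector names and boolean thresholds are assumptions:

```python
def label_scene(d):
    """`d` maps detector name -> boolean result for one video window.
    Conjunctions of co-occurring audio and image identifiers yield the
    scene index, mirroring the rules described in the text."""
    if d["warm_colors"] and d["radial_motion"] and d["explosion_sound"]:
        return "explosion_scene"
    if d["mostly_blue"] and d["wave_sound"]:
        return "seaside_scene"
    if d["mostly_blue"] and d["slow_white_mass"] and d["wind_sound"]:
        return "sky_scene"
    return None

window = {"warm_colors": True, "radial_motion": True,
          "explosion_sound": True, "mostly_blue": False,
          "wave_sound": False, "slow_white_mass": False,
          "wind_sound": False}
print(label_scene(window))   # -> "explosion_scene"
```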
[0400] 《Example of search involving environmental sounds and program information》
For example, while recording a moving image whose genre, obtained from broadcast BML, EPG, or the like, is classified as action, video and audio feature quantities and identifiers are generated and recorded. The information is then subjected to multivariate analysis based on the recorded feature quantities and identifiers, and the appearance frequency of each identifier in action movies is obtained and analyzed. As a result, arbitrary distance evaluation functions and recognition functions such as HMMs can be constructed from the analyzed feature quantities; for example, an evaluation function for evaluating explosion sounds and rapid changes in screen features can be built. Through such feature learning, it becomes possible to combine character information from EPG, BML, RSS, and teletext, character information recognized in subtitles and video, and the evaluation results of one's own evaluation functions; based on their co-occurrence states, evaluation functions and evaluation-result thresholds matched to the user's hobbies and tastes can be set, so that arbitrary processing such as recording and playback of content information, or search, can be carried out. Here, the line-up of performers, titles, and the names of directors and producers, as well as family and personal relationships among the actors' roles, may be used as identifiers, or these may be expanded into phonemes or phoneme pieces so that the degree of identifier match can be evaluated jointly.
[0401] 《Example of search involving emotion and musical scale》
For example, music offered for sale is indexed by the various methods described above using emotion features and emotion identifiers, scale features and scale identifiers, and phoneme or phoneme-piece symbol strings, and is registered in a database; by evaluating the distance or match rate between the index information consisting of the identifiers and feature quantities obtained from music the user designates as a favorite and the index information consisting of the identifiers and feature quantities of the music registered in the database, music information can be searched based on the user's hobbies and interests.
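Matching a favorite track's index against the database reduces to a distance or match-rate computation over the index vectors; a cosine-similarity sketch follows, in which the three-component vector layout (emotion, scale, and phoneme-derived features) is an assumption:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two index vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_music(query_index, database):
    """`database` maps title -> index vector; returns titles ordered from
    most to least similar to the query index."""
    return sorted(database,
                  key=lambda t: cosine(query_index, database[t]),
                  reverse=True)

db = {"song_a": np.array([0.9, 0.1, 0.3]),
      "song_b": np.array([0.1, 0.8, 0.5])}
print(rank_music(np.array([1.0, 0.2, 0.4]), db))   # -> ['song_a', 'song_b']
```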
[0402] 《Search examples using other combinations》
Regarding instrument types, scenes and pages in which a given instrument is being played or displayed can be searched from the co-occurrence information of instrument name and acoustic features, or of instrument name and image features. To search for a movie featuring a piano, one may pronounce "piano [p/i/a/n/o]" and perform a phoneme-string search, or, based on the phoneme string, perform a co-occurrence search using an image feature evaluation function constructed from image information showing pianos and an instrument evaluation function constructed from acoustic features collected from piano sounds, and then search audio and video streams according to those features; audio and video streams detected by the search instruction may be recorded or skip-played, among other arbitrary processing. If a piano maker is mentioned in the EPG, BML, RSS, or teletext, a URL or the like may be acquired and information obtained by connecting to the Web; an instruction to switch the instrument timbre of the music being played may be issued; and identifiers and feature quantities may be classified to construct a co-occurrence dictionary.
[0403] Regarding machine-sound types, scenes like those described above can likewise be searched using automobile tappet and engine sounds or locomotive exhaust sounds, and the names of those sounds may be converted into phoneme strings or phoneme-piece strings for use in search. For the search condition "engine sound," only scenes in which an engine sound is heard may be retrieved; for an engine scene, scenes having both engine image feature quantities and engine sound may be retrieved.

[0404] Regarding environmental sound types, natural sounds such as wind and waves may be added to the several examples above; sounds that vary with the environment, such as animal and insect cries, office sounds, the sounds of drinking establishments, sports cheering, and station ticket gates, may be collected, the co-occurrence states of their feature quantities observed, and evaluation functions constructed. Noise classifications such as automobile noise and factory noise may be used for scene search in movies and dramas, as with instruments, and noise types such as white noise and pink noise may be used to generate test noise for test equipment for devices such as amplifiers; the names of those sounds may be converted into phoneme strings or phoneme-piece strings for use in search.

[0405] Regarding face types, it is possible, for example, to search for images serving as indices for facial-expression identifiers associated with emotions by associating facial feature quantities with emotion identifiers; the relevant names may likewise be converted into phoneme strings or phoneme-piece strings for use in search.
[0406] Regarding person types, images serving as indices for facial-expression identifiers associated with emotions can be searched by associating facial feature quantities with the phoneme strings or phoneme-piece strings of names; and by constructing image feature quantities from information such as clothing, physique, and hairstyle for use in a city surveillance system, video in which the name of a person being tracked was recorded can be searched. The names of those persons, clothing, and physiques may be converted into phoneme strings or phoneme-piece strings for use in search.

[0407] Regarding expression types, when expression types are based on the aforementioned face types and emotion types, associating them with person types enables scene searches that take a given person's emotional behavior into account, via keyword presentation with phoneme or phoneme-piece symbol strings; the names of those expressions and emotions may be converted into phoneme strings or phoneme-piece strings for use in search.

[0408] Regarding motion types, when expression types are based on the aforementioned face types and emotion types, associating them with person types enables scene searches that take into account a given person's emotional behavior, gestures, motions, and manner of walking. By associating motion identifiers with phoneme strings or phoneme-piece strings, processing is conceivable such as detecting sign-language information in input video and uttering it by speech synthesis, or converting an utterance into a phoneme string and displaying sign language by playing back in CG the motions associated with the phoneme string; the names of those motions may be converted into phoneme strings or phoneme-piece strings for use in search.

[0409] Regarding landscape types, natural images and urban images can be classified by the co-occurrence information of image features such as color features and the existence probability of straight lines and curves per unit area; feature quantities can be derived from phoneme strings based on scene names, and indexing and search can be performed using the phoneme strings or phoneme-piece strings of what was uttered while viewing a scene. Using position information and associating landscape types with phoneme strings, information on any region can be searched by voice, based on arbitrary image features, from large archives of movie and broadcast video; travel guides can be constructed based on the image features of the locations used in famous movie scenes, and similar landscapes can be detected. The names of those landscapes and place names may be converted into phoneme strings or phoneme-piece strings for use in search.
[0410] 表示位置種別に関しては、画面内のどの位置にどのような画像があるかを評価する と共に、その範囲を指定して表示し、利用者に名称を呼んでもらうことで、本発明の 装置が表示内容を学習するための指標にするといつた方法が考えられ、一般的な顔 検出技術を用いて顔の位置を検出したあとで、検出した複数の位置に数字を表示し 、順に「1番はだれ?」、「2番はだれ?」として、利用者に名前を呼んでもらい学習し たり、「この人は〇〇さん?」と学習した音素列や音素片列力 発話して確認をとると V、つた方法を用いたり、「わ力もな 、[w/a/k/a/r/a/n/a 」 t 、う特定の制御のための キーワードに関連付けられた音素列や音素片列が検出された場合は学習対象から はずしたり、特徴量だけ学習し名称や呼称との関連付けが保留にされたフラグを立て たりと!/、つた方法を用いて学習効率を改善しても良 、し、それらの表示位置の呼称を 音素列や音素片列に変換して検索に利用できるようにしても良 、。 [0410] With regard to the display position type, it is possible to evaluate what kind of image is in which position in the screen, specify the range, display it, and let the user call the name. When the device is used as an index for learning the display content, it is conceivable to use a method.After detecting the position of the face using general face detection technology, numbers are displayed at the detected positions. “Who is No. 1”, “Who is No. 2”, asks the user to call his name, learns, and speaks and confirms the phoneme string and phoneme string train V, using the Tatsu method, or “w / a / k / a / r / a / n / a” t, phoneme sequences and phonemes associated with keywords for specific control. When a single row is detected, it is removed from the learning target, or only the feature amount is learned and the association with the name or name is put on hold. You can improve the learning efficiency by using the above-mentioned method! It is also possible to convert it to a phoneme string or phoneme string string so that it can be used for searching.
[0411] 画像種別に関しては、前述のいくつかの例にカ卩ぇ楽器や車種、機種、動植物の種 類といったものを弁別するための名称の音素列や関連する音響特徴量と関連付けて 検索することで、前述のピアノであればピアノが表示されて 、て且つ音楽が鳴って ヽ るシーンを検索したりしてもよいし、ピアノメーカのカタログをウェブサイト経由で取得 しても良いし、それらの音の呼称を音素列や音素片列に変換して検索に利用できる ようにしても良く任意の製品や商品の呼称を用いてもょ 、。  [0411] With regard to image types, search in association with the phoneme string of names and related acoustic features for distinguishing between the above-mentioned examples such as musical instruments, car types, models, and animal and plant types. Thus, if the piano is the above-mentioned piano, the piano may be displayed and the scene where the music is heard may be searched, the catalog of the piano manufacturer may be obtained via the website, The names of these sounds can be converted into phoneme strings or phoneme string strings so that they can be used for searching, or any product or product name can be used.
[0412] 文字記号種別に関しては、認識処理により識別された文字列を音素列や音素片列 に変換し検索の対象としたり、静止画であればクリックしたり範囲指定したところの単 語に関連する音声や映像を表示したり検索したりすることが可能となるとともに、それ らの文字やフォントの呼称を音素列や音素片列に変換して検索に利用できるようにし ても良い。  [0412] Regarding the character symbol type, the character string identified by the recognition process is converted into a phoneme string or phoneme string string to be searched, and if it is a still image, it is related to the word that was clicked or range specified. It is possible to display and search the voice and video to be used, and convert the characters and font names into phoneme strings and phoneme string strings so that they can be used for the search.
[0413] With regard to sign types, signs may be used for searches based on phonemes and phoneme pieces in guides such as car navigation systems; signs detected while driving may be announced by speech synthesis using phonemes and phoneme pieces; and the meaning of foreign signs appearing in distributed news and the like may be rendered as subtitles. The names of such signs may also be converted into phoneme strings or phoneme-piece strings and made available for search.
[0414] With regard to shape types, identifying round, square, and pointed objects makes it possible to detect objects that obstruct a robot's movement or pose a danger to people, or to execute a search with the phoneme string or phoneme-piece string of an abstract keyword based on the associated image features and detect matching items. A search is also possible that associates fixed video, such as the opening title of a given program, with the phoneme string or phoneme-piece string of a fixed utterance such as an opening announcement. The names of such shapes may be converted into phoneme strings or phoneme-piece strings and made available for search, and by using waveform shape types, changes in brain waves or pulse waves extracted from multiple sites may be statistically analyzed, given identifiers, and made available for search.
[0415] With regard to graphic symbol types, graphics and symbols appearing in movie scenes may be searched and used as cues for inserting subtitles for symbols and signs when a movie is distributed in another language; graphics such as abstract circle and cross marks or correct-answer and incorrect-answer icons may be detected and used to detect scenes in quiz programs. Using these at editing time can simplify meta-information annotation work, and the names of such graphics and symbols may be converted into phoneme strings or phoneme-piece strings and made available for search.
[0416] With regard to broadcast program types, since program information such as performers, authors, hosts, and program titles can be acquired, biases in screen composition and acoustic features per program genre can be extracted and used as indicators for analyzing program tendencies; the names of such program genres and categories may be converted into phoneme strings or phoneme-piece strings and made available for search.
[0417] Furthermore, even if it becomes possible in the future to record and reproduce arbitrary sensations such as taste, smell, touch, pitch perception, humidity, and texture, their feature quantities and identifiers may be added to the index on the recording medium of the present embodiment for the convenience of the user.
[0418] As a result, detection of information based on diverse co-occurrence information, previously impossible, becomes feasible, and recording, search, skip playback, digest playback, mail delivery, messenger messages, and RSS delivery triggered by such detection become possible.
[0419] <Application examples as products>
The product examples described below show examples of products and service solutions that can be realized by combining the constituent and implementation elements of the present invention, using the implementation and configuration requirements based on the novelty described above, namely "Basic search device configuration and technology" and "Indexing, search, and arbitrary processing with multiple identifiers and multiple search conditions". For each field, a co-occurrence dictionary of identifiers may be constructed based on the term tendencies, image tendencies, acoustic tendencies, and control dictionaries described in the corresponding examples, or search conditions and detection conditions may be constructed using dictionaries that convert between identifiers and phoneme and/or phoneme-piece strings, between identifiers and character strings, and between identifiers and feature quantities.
[0420] 《Example of a broadcast recording, video recording/playback, and video search system》
As a combined application of the example of search based on images and environmental sounds, the example of search based on environmental sounds together with EPG, BML, RSS, and teletext, the example of audio/video search with multiple identifiers, and the example of arbitrary processing triggered by identifier detection, FIG. 26 is used as an illustration.
[0421] First, a video recording device such as a video camera is installed, and by extracting and analyzing audio from multiple microphones and converting it into phonemes, methods become possible such as pointing the camera in the direction from which a specific keyword was uttered, or starting recording in response to a keyword. Also, when a user hums a tune, the lyrics may be converted into phonemes while the melody is simultaneously extracted, so that specific music can be selected and recorded, or already recorded content can be played back. Furthermore, by executing a video search involving emotion, the climax of a scene may be detected, or music whose melody carries a specific emotion may be detected; a scene highly similar to a scene designated by the user with a pointing device or remote control may also be searched for and detected.
[0422] In this way, phoneme symbols and emotion symbols are indexed simultaneously at recording time, and in conjunction with services using markup languages described later such as EPG, BML, RSS, and teletext, or CGI, the recording range and search range may be determined, unnecessary portions deleted, or scenes skipped automatically during playback. For this purpose, a specific keyword is converted into phonemes, recording is performed as a temporary file while phoneme matches are checked, and when the target keyword is detected, emotion features are extracted while an index is constructed.
[0423] Also, regarding a device that uses EPG, BML, RSS, and teletext to classify and play back or record files comprising association information about file names, target moving images, still images, audio, and text, and their chronological presentation order, the convenience of the user may be improved by constructing phoneme strings or phoneme-piece strings for the target information to be designated, by distributing phoneme strings or phoneme-piece strings via EPG, BML, RSS, or teletext, or by searching, recording, and playing back recorded content and recording targets using phoneme strings or phoneme-piece strings based on received EPG, BML, or RSS.
[0424] Of course, the device executing these services may be a desktop information processing device or a portable information terminal, and the content of the present invention may be implemented via a communication base station using them; it may also be realized by calling a home device using the present invention from a portable terminal, or by mailing information recognized by the portable terminal to a home device using the present invention.
[0425] As a result, the following becomes feasible using the present invention. For example, suppose a celebrity named "有名夫 (ありなお [/a/r/i/n/a/o/])" is to appear on television and a user learns of this on the day itself. Even if the user does not know on which channel or at what time the appearance will occur, as long as the appearance has not already ended, the user can give the keywords "有名夫 (ありなお [/a/r/i/n/a/o/]), record (ろくが [/r/o/k/u/g/a/])" to a home device using the present invention. The device then starts recording all receivable channels and, while expanding into phonemes and storing the speech of the keyword excluding the command portion, executes detection on the recorded content by searching for the phoneme symbol string.
[0426] Next, in the present embodiment, where a device using the present invention detects the target keyword, the matching degree is set to 60%; content is recorded while a save-flag boundary is set every minute, and for sections where 60% is not exceeded within one minute, the recorded content information is marked for deletion one hour later. Conversely, from a point where a portion matching the keyword at 60% or more is detected, the range up to, for example, one hour before, and/or up to a program boundary given by EPG, BML, RSS, or teletext, is marked for preservation.
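This retention rule may be sketched as follows; the one-minute segments, the 60% threshold, and the one-hour preservation window follow the description above, while the segment representation and match_score() are hypothetical simplifications.

# Sketch of the retention rule of [0426]: record in one-minute segments,
# and preserve any segment with a keyword match of 60% or more together
# with the preceding hour; everything else becomes a deletion candidate
# one hour later. Segments and match_score() are hypothetical.

SEGMENT_SECONDS = 60
MATCH_THRESHOLD = 0.60
KEEP_BEFORE_SECONDS = 3600

def segments_to_keep(segments, keyword_phonemes, match_score):
    """segments: chronological list of per-minute audio chunks."""
    keep = set()
    for idx, audio in enumerate(segments):
        if match_score(audio, keyword_phonemes) >= MATCH_THRESHOLD:
            first = max(0, idx - KEEP_BEFORE_SECONDS // SEGMENT_SECONDS)
            keep.update(range(first, idx + 1))  # match plus preceding hour
    return keep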
[0427] As a result, for a broadcast in which the word "有名夫 (ありなお [/a/r/i/n/a/o/])" appears, roughly one hour around the occurrence of that word is automatically preserved, so recording can be performed automatically even without knowing on which channel or at what time it will be broadcast. Videos recorded by the present invention may be ranked according to the number of occurrences and the matching degree of the word, and displayed as a list.
[0428] At the same time, face detection may be performed and learning that associates actors' names with facial features may be repeated, so as to learn whether a specific person is on screen. In this case, by having the user indicate at playback time which of the output facial features matches the recorded name, learning efficiency can be improved and the device itself can autonomously improve its performance in automatic detection and recording. In addition, the device may learn while autonomously evaluating the degree of matching between an actor's name and that person's facial features using EPG, BML, RSS, teletext, and the like.
[0429] Also, where the celebrity (ありなお) is an actor, he may be called by a different name within a video or audio work. In this case, an in-program search can be executed, for example, by the following procedure. The actor's name in the performer list of a given program, obtained from EPG, BML, RSS, or teletext, is searched from the user's utterance using information obtained by converting the kanji, kana, or English words into a symbol string of phonemes or phoneme pieces, or the actor's name is searched by conventional text input, and the target actor's name is extracted. Next, the role name associated with the actor's name is extracted.
[0430] Next, based on the role name, a symbol string of phonemes or phoneme pieces is constructed with reference to a phoneme/phoneme-piece dictionary. Then, a search by that phoneme or phoneme-piece symbol string is executed against video and audio work information indexed by phoneme or phoneme-piece symbol strings. As a result, it becomes possible to search for scenes associated with the role name of the target actor, and searches linked to EPG, BML, RSS, and teletext, which were impossible with conventional generic phoneme or phoneme-piece search, become possible, improving the convenience of search in video and audio works.
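The lookup chain of paragraphs [0429] and [0430] may be sketched as follows; the two dictionaries stand in for EPG-derived cast data and a pronunciation dictionary, and search_index() is a hypothetical phoneme-indexed scene search.

# Sketch of the actor-name -> role-name -> phoneme-search chain of
# [0429]-[0430]. Both dictionaries and search_index() are hypothetical
# stand-ins for EPG-derived cast data, a pronunciation dictionary, and a
# phoneme-indexed scene search.

def find_role_scenes(actor_name, cast_list, pronunciations, search_index):
    role = cast_list[actor_name]           # actor name -> role name
    role_phonemes = pronunciations[role]   # role name -> phoneme string
    return search_index(role_phonemes)     # scenes indexed by phonemes

cast_list = {"有名夫": "Taro"}
pronunciations = {"Taro": "/t/a/r/o/"}
print(find_role_scenes("有名夫", cast_list, pronunciations,
                       lambda p: [("scene", p)]))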
[0431] Also, video information in which the time recorded under identifiers such as an explosion-sound identifier, an index by phoneme symbol strings associated with abusive words, intense prosody, and music appears more frequently than other identifiers such as laughter and cheers can be judged to be an action program, and these can be aggregated into an evaluation function that evaluates and searches the degree of being an action program. Similarly, a function can be created that evaluates video information as a horror program when the appearance frequency of indexes combining dark video time with phoneme or phoneme-piece symbol strings and emotion identifier strings associated with screams, relative to the total video duration, is detected to be higher than the average appearance frequency of scream-associated indexes across many other video/audio works, so that the degree of being a horror program can be evaluated and searched. Applied to the recording of meeting information, such methods enable a search device that can classify the emotional ups and downs and the changes of content in a meeting.
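One possible form of such an evaluation function is sketched below; the identifier names and weights are hypothetical examples of the aggregation described above, not fixed values of the embodiment.

# Sketch of a genre-degree evaluation function: aggregate per-identifier
# detection times into a normalized score. Identifier names and weights
# are hypothetical illustrations of the aggregation described above.

ACTION_WEIGHTS = {"explosion": 1.0, "abusive_word": 0.5,
                  "intense_music": 0.5, "laughter": -0.5, "cheering": -0.5}

def action_degree(identifier_seconds, total_seconds):
    """identifier_seconds: dict mapping identifier -> seconds detected."""
    score = sum(ACTION_WEIGHTS.get(name, 0.0) * secs
                for name, secs in identifier_seconds.items())
    return score / total_seconds   # comparable across works of any length

print(action_degree({"explosion": 120, "laughter": 30}, total_seconds=3600))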
[0432] Environmental sounds such as explosion sounds, wind, and waves can also be decomposed chronologically by the identifier reconstruction processing of the present invention to construct environmental-sound pieces. Similarly, visemes may be decomposed chronologically and treated as viseme pieces; for moving images, changes in the video may be treated as motion elements or motion-element pieces; and for image information, images may be treated as image elements or image-element pieces, thereby reconstructing new indexes for search.
[0433] Then, once features such as screams and explosion sounds have been learned, it is also possible to configure, and use for security purposes, a surveillance camera that starts recording in response to the occurrence of such danger-indicating information, or a surveillance recording system that, for recorded content older than 24 hours, deletes everything except the hour before and after a scream or explosion sound and continues recording.
[0434] In this way, whereas conventionally only information related to speech utterances was a search target via symbol strings of phonemes or phoneme pieces, using feature quantities and identifiers obtained by multiple methods as in the present invention makes it possible to realize information search that follows program content. Of course, the present invention may be implemented by a reduced-function device that applies these techniques to audio only and executes them on radio recordings. Applied to a surveillance camera, breakage of a window or door may be detected by detecting that the image-feature evaluation distance of a discriminant function identifying window and door images has deviated from its average, or crime may be prevented by detecting that a person has remained in front of a locked door for a long time making only small movements. Scene boundaries of moving images may be detected for use in a video editing machine. By using a markup language, or by converting character strings into phonemes or phoneme pieces for search, the weather may be detected from image features together with audio or other identifiers so that indoor equipment is controlled to manage ventilation and lighting, and personal authentication using names, passwords, or face recognition, and billing settlement by spoken amounts, may also be performed.
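The surveillance retention policy of paragraph [0433] may be sketched as follows, assuming scream and explosion detections are available as timestamps; the 24-hour age limit and the one-hour windows follow the description above.

# Sketch of the surveillance retention policy of [0433]: keep everything
# younger than 24 hours, and older material only within one hour of a
# scream or explosion detection. Times are in seconds; event detection
# itself is assumed to happen elsewhere.

DAY, HOUR = 24 * 3600, 3600

def should_keep(segment_time, now, event_times):
    if now - segment_time < DAY:
        return True
    return any(abs(segment_time - t) <= HOUR for t in event_times)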
[0435] In this case, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side; correction information and symbol strings such as phoneme symbol strings, image features, audio features, and emotion identifiers relating to new programs, actor names, program genres, and distribution station names may be transmitted, received, and distributed using markup languages described later such as XML and HTML, or RSS and CGI, and convenience can be achieved by combining these.
[0436] Of course, the device executing these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented via a communication base station using them.
[0437] 《Example of a product quality analysis system based on consumer emotion》
A CRM (Customer Relationship Management) system using the present invention is described with reference to FIG. 27, as an application of the above-described search example using proper nouns and emotion identifiers and the example of executing arbitrary processing by audio/video search with multiple identifiers.
[0438] First, utterances accompanied by consumer emotion are analyzed and indexed using the multiple analysis devices and identification devices of the present invention. By searching the phonemes, phoneme pieces, and emotion identifiers obtained as a result and determining their frequencies, the reputation of a product as seen by consumers can be analyzed quantitatively from the phoneme string indicating a product of a specific model number and the accompanying emotions such as anger and sadness, based on the number of occurrences of those emotion features and of the phoneme symbol strings that identify the product. The results may be displayed using a markup language described later such as HTML or XML, or CGI, or the manual of the identified product may be displayed.
[0439] More specifically, a consumer requests a dialog with a help desk operator by telephone or at a store. At this time, the feature quantities of the voices of both the operator and the consumer are extracted, and emotions, phonemes, and phoneme pieces are recognized from the extracted feature quantities.
[0440] At this time, the phonemes, phoneme pieces, and emotions recognized by the above-described method are stored in an information storage device. Next, for the stored information, the relationship is evaluated between audio information in which phonemes or phoneme pieces associated with a product name appear and audio information in which emotion identifiers of anger or sadness are recognized.
[0441] As a method of evaluating this relationship, audio information in which emotions of anger or sadness persist for a long time when a specific product model number is detected may be taken to indicate a low consumer evaluation. By evaluating the phoneme symbol strings recognized in the audio information and the distribution of emotion identifiers in this way, consumers' feelings toward a product can be evaluated quantitatively, and analysis of product reliability can be performed quantitatively.
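This relevance evaluation may be sketched as a per-product ratio of negative-emotion time to total call time, as below; the call record format is a hypothetical simplification.

# Sketch of the relevance metric above: among calls whose phoneme index
# contains a given product model number, measure the share of call time
# carrying anger or sadness identifiers. The call record format is a
# hypothetical simplification.

NEGATIVE = {"anger", "sadness"}

def negative_emotion_ratio(calls, product_phonemes):
    """calls: dicts with 'phonemes' (str), 'duration' (s), and
    'emotion_spans' as (identifier, seconds) pairs."""
    total = negative = 0.0
    for call in calls:
        if product_phonemes not in call["phonemes"]:
            continue
        total += call["duration"]
        negative += sum(s for ident, s in call["emotion_spans"]
                        if ident in NEGATIVE)
    return negative / total if total else 0.0  # higher -> lower evaluation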
[0442] As a result, the following becomes feasible using the present invention. For example, when a consumer inquiry about a product with model number "1X5 (いちえつくすご [/i/ch/i/e/cl/k/u/s/u/g/o/])" is received, the help desk operator repeats the name and executes a search on a device using the present invention. As a result, the manual of the retrieved "1X5" is displayed on the operator's screen, and the operator can answer the consumer's question. At this time, by recognizing the consumer's emotion and storing it in association in the information storage device, the emotional evaluation of a given product can be recorded quantitatively.
[0443] At this time, when the device using the present invention executes a search for the target product name, the criterion for the matching degree of the phoneme or phoneme-piece symbol string may be set to 60%, a list of products exceeding 60% constructed and displayed, and the operator may select the manual of the target product from that list.
[0444] Then, the emotion features, phoneme symbol strings, and phoneme-piece symbol strings associated with the word "1X5 (いちえつくすご [/i/ch/i/e/cl/k/u/s/u/g/o/])" can be recorded and analyzed. Product reliability can be evaluated quantitatively by analyzing the emotion appearance time across the group of audio information relating to the same recorded product number.
[0445] In this case, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side; correction information and symbol strings such as phoneme symbol strings, image features, audio features, and emotion identifiers relating to new product names and product genres may be transmitted, received, and distributed using markup languages described later such as XML and HTML, or RSS and CGI, and convenience can be achieved by combining these.
[0446] Of course, the device executing these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal; the psychological states of the operator and the customer may be analyzed to check that stress does not become excessive; and the content of the present invention may be implemented via a communication base station using them.
[0447] 《Example of web browser operation》
First, the user utters speech to the user's browser, and the feature quantities of the uttered speech are extracted. In the first method, these feature quantities are transmitted to the target device, and the device that receives them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature quantities. Then, based on the generated symbol strings, the matching control means is selected and executed.
[0448] In the second method, the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are generated within the user's browser, and the generated symbol strings are transmitted to the target device. The controlled device then selects and executes the matching control means based on the received symbol strings.
[0449] In the third method, phonemes and/or phoneme-piece symbols and emotion symbol strings are recognized based on feature quantities generated within the user's browser, the control content is selected based on the recognized symbol strings, and the control method is transmitted to the device to be controlled.
[0450] In the fourth method, the speech waveform is transmitted as-is from the user's browser to the controlling device; the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are recognized within that device, the control means is selected based on the recognized symbol strings, and the controlled device executes the selected control.
[0451] At this time, when the user's emotion is accompanied by anger, an apologetic message may be presented to the user by voice or text. Emotion identifiers can likewise be obtained from speech by feature extraction and symbolization, and the same applies to sound and video features and identifiers such as environmental sounds.
[0452] Then, a method is conceivable in which a new variable or attribute named, for example, "pronunciation" is added to the reference tag indicating a link, the speaker's utterance is converted into phonemes, the web page is searched, and the browser moves to the matching page.
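This link-matching idea may be sketched as follows, assuming each link carries a hypothetical "pronunciation" attribute holding its phoneme string; the use of difflib as the similarity measure is an illustrative placeholder for a proper phoneme-matching method.

# Sketch of matching a spoken query against links that carry a
# hypothetical "pronunciation" attribute holding a phoneme string.
# difflib is used here only as a placeholder similarity measure.

import difflib

def best_link(spoken_phonemes, links):
    """links: (href, pronunciation) pairs extracted from the page."""
    def similarity(a, b):
        return difflib.SequenceMatcher(None, a, b).ratio()
    return max(links, key=lambda link: similarity(spoken_phonemes, link[1]))

links = [("/news", "/n/y/u/u/s/u/"), ("/sports", "/s/u/p/o/o/ts/u/")]
print(best_link("/n/y/u/s/u/", links)[0])  # -> "/news"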
[0453] In this way, by using markup languages described later such as XML and HTML, or CGI, voice operation can easily be realized in systems such as RSS, blogs, and catalog sales on the web, by searching for phoneme matches without recognizing meaning or context.
[0454] In this case, matching between symbol strings, feature quantity extraction, and symbol string recognition may be performed in the browser-side information processing terminal by a background process such as a service or daemon, without the browser processing them directly.
[0455] Also, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side; correction information and symbol strings such as phoneme symbol strings, image features, audio features, and emotion identifiers relating to new tags, variables, and attributes may be transmitted, received, and distributed using markup languages described later such as XML and HTML, or RSS and CGI, and convenience can be achieved by combining these.
[0456] Of course, the device executing these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented via a communication base station using them, or realized in combination.
[0457] 《Example of a car navigation device》
As an application of the example of search based on emotion and proper nouns and the example of audio/video search with multiple identifiers, combined with information distribution technology such as VICS, multivariate analysis of the dialog between the car navigation system and the user, based on position and accompanied by phoneme symbol strings, phoneme-piece symbol strings, and emotion identifiers, makes it possible to detect that a person's tone becomes calm or emotionally heightened at a specific location. By evaluating this together with traffic accident conditions, the occurrence of traffic accidents attributable to the user's emotional state can be analyzed, and by executing searches accordingly, a service can be implemented that announces a warning to the user in advance to call attention. In this regard, emotional stability may also be detected by evaluating the variation of emotion features in the voice features of frequently uttered words, and by analyzing these, danger prediction based on the emotional tendencies of users caught in traffic congestion, or monitoring of vehicle operation status, may be performed.
[0458] Also, when the phoneme string for "事故状況 (accident situation) [/j/i/k/o/j/o/u/k/y/o/u/]" is detected in in-vehicle speech and an accident vehicle is also detected by image recognition with an in-vehicle camera, that information may be transmitted to a base station, received via VICS, a mobile phone, or any other communication means, and a service implemented that changes the route selection; information transmitted from each vehicle may also be captured by roadside equipment such as Orbis units and relayed to the base station.
[0459] In this case, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side; symbol strings such as phoneme symbol strings, image features, and emotion identifiers relating to new place names, titles, addresses, and roads may be transmitted, received, and distributed using VICS, markup languages described later such as XML and HTML, or RSS and CGI, and convenience can be achieved by combining these.
[0460] Of course, the device executing these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented by searching via a communication base station or by searching standalone.
[0461] 《Example of a karaoke song selection and music search system》
As an application of the example of search based on emotion and musical scale, the example of audio/video search by character strings and identifiers, and the example of executing arbitrary processing by audio/video search with multiple identifiers, an embodiment in a karaoke and music sales system is described.
[0462] According to the present invention, song titles and hook-line lyrics are recorded as phoneme strings, phoneme-piece strings, and scale strings, and by searching for matches among them, they can be used for title search in karaoke. In addition to a karaoke-style feature configuration, it is possible to compare the appearance frequencies of scale symbols and search for items with a high matching rate, or to compare and search appearance distribution structures and appearance position distributions.
[0463] More specifically, for "a sad song by band XX", from the list extracted from the full song list by matching the performer name against the phoneme string or phoneme-piece string of "band XX", songs with a high in-song appearance frequency of the "sad" emotion identifier are retrieved; for "a sad song in the style of band XX", songs with a high appearance frequency of the "sad" identifier are retrieved among songs whose musical features resemble those of "band XX". User preferences may also be learned from the co-occurrence information obtained by such searches: after the user selects and plays a song, repeated selection or listening to the end may be interpreted as the user affirming the search result, while a single play or an immediate move to the next song may be interpreted as a negative judgment. Here, the query "band XX" may be spoken by voice and used in natural language processing, expanded into phonemes from character string input and searched, or searched as a character string while the similarity of musical features and emotion features is evaluated.
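The first query form above may be sketched as a filter-then-rank operation, as below; the song record fields and the matches() predicate are hypothetical.

# Sketch of the "sad song by band XX" query: filter songs by performer
# phoneme match, then rank by per-song frequency of the "sad" emotion
# identifier. Song record fields and matches() are hypothetical.

def sad_songs_by(artist_phonemes, songs, matches):
    """songs: dicts with 'artist_phonemes' and 'emotion_counts'."""
    hits = [s for s in songs
            if matches(artist_phonemes, s["artist_phonemes"])]
    return sorted(hits,
                  key=lambda s: s["emotion_counts"].get("sad", 0),
                  reverse=True)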
[0464] It is also possible to compare the appearance frequencies, appearance distribution structures, and appearance position distributions of emotion identifiers and search for items with a high degree of matching. Furthermore, by comparing the appearance frequencies and appearance position distributions of phoneme symbols and phoneme pieces, music with a similar lyric composition or containing specific keywords can be searched, and a service that sells music based on these search results can also be implemented. In addition, note transition information and chord or chord-progression transition information may be used as feature quantities to evaluate the degree of matching of musical structure, or feature quantities may be extracted from such transition information to construct a discriminant function so that identifiers can be determined.
[0465] Also, taking advantage of the fact that emotion recognition results tend to differ from one piece of music to another, the tendencies of the emotion identifiers generated per piece may be extracted by statistical processing per music genre and subjected to multivariate analysis to form music genre identifiers, or the similarity of emotion identifier appearance tendencies between pieces may be evaluated as a distance. A search according to the user's sensibility parameters may then be performed based on the present invention, and music with a close sensibility tendency retrieved and presented to the user, enabling a service that recommends music matching the user's tastes.
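The distance evaluation of emotion identifier appearance tendencies may be sketched as a comparison of normalized identifier histograms; the Euclidean distance below is one illustrative choice of measure.

# Sketch of distance evaluation between the emotion identifier
# distributions of two pieces: compare normalized identifier histograms.
# Euclidean distance is one illustrative choice of measure.

import math

def emotion_distance(counts_a, counts_b):
    keys = set(counts_a) | set(counts_b)
    na = sum(counts_a.values()) or 1
    nb = sum(counts_b.values()) or 1
    return math.sqrt(sum(
        (counts_a.get(k, 0) / na - counts_b.get(k, 0) / nb) ** 2
        for k in keys))  # smaller distance -> closer sensibility tendency

print(emotion_distance({"sad": 8, "joy": 2}, {"sad": 7, "joy": 3}))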
[0466] In this way, by combining emotion identifiers, phoneme symbol strings, and phoneme-piece symbol strings with search and sales methods that conventionally used recognized scale information alone, it becomes possible to search for musical works whose lyrics, melodic tendencies, emotional tendencies, and voice quality tendencies suit the user's tastes.
[0467] In this case, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side; symbol strings such as phoneme symbol strings, scale symbol strings, and emotion identifiers relating to new song titles, lyrics, and melodies may be transmitted, received, and distributed using markup languages described later such as XML and HTML, or RSS and CGI, and convenience can be achieved by combining these.
[0468] Of course, the device executing these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented via a communication base station using them.
[0469] Note that the prior-art search by humming and lyrics is separated into an act of humming and an act of uttering lyrics, and therefore differs from the search based on co-occurrence information in the present invention.
[0470] 《Example of a product search and order system》
This is an application of voice operation according to the present invention: the user utters speech to an information terminal and/or a terminal-side browser, and the feature quantities of the uttered speech are extracted. In the first method, these feature quantities are transmitted to the target device, and the distribution device that receives them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature quantities. Then, based on the generated symbol strings, the matching control means on the distribution device side is selected and executed.
[0471] In the second method, the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are generated within the information terminal and/or terminal-side browser, and the generated symbol strings are transmitted to the target distribution device. The distribution device side then selects and executes the matching control and distribution means based on the received symbol strings.
[0472] In the third method, phonemes and/or phoneme-piece symbols and emotion symbol strings are recognized based on feature quantities generated within the information terminal and/or terminal-side browser, the control content is selected based on the recognized symbol strings, and the control method is transmitted to the distribution device that executes it. The distribution device that receives the control method performs the intended processing based on it and provides the information.
[0473] In the fourth method, the speech waveform is transmitted as-is from the information terminal and/or terminal-side browser to the controlling device; the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are recognized on the controlling distribution device side, the control means is selected based on the recognized symbol strings, and the distribution device side executes the selected control.
[0474] At this time, when the user's emotion is accompanied by anger, an apologetic message may be presented to the user by voice or text. Emotion identifiers can likewise be obtained from speech by feature extraction and symbolization, the same applies to sound and video features and identifiers such as environmental sounds, and searches may be performed by combining methods such as those of the karaoke music search example.
[0475] In this case, phoneme symbol strings may be embedded in CGI or HTML for the displayed products, and by performing search evaluation based on those symbols, the system may move to a matching page, order a product, or display product details. These search targets may be any items bearing proper nouns, such as books, AV content, digital materials, cosmetics, pharmaceuticals, foods, and industrial products such as automobiles.
[0476] A method is also conceivable in which each proper noun is uttered by multiple speakers so that the same phoneme is given multiple phoneme and phoneme-piece recognition templates, thereby improving the search rate for the phoneme strings of the pages used. An applied system such as an expert system may also be constructed using part of the processing procedure of such an ordering system.
[0477] In this case, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side, and symbol strings such as phoneme symbol strings, image features, audio features, and emotion identifiers relating to new products and product genres may be transmitted, received, and distributed using markup languages described later such as XML and HTML, or RSS and CGI.
[0478] Of course, these services themselves may be content distribution services for movies, photographs, novels, and the like, or digital material distribution services or product sales services; the device executing them may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented via a communication base station using them.
[0479] 《Example of a voice service》
For example, when executing a service that reads books aloud in connection with book sales, any line of dialog or text position can be searched by using phonemes and phoneme pieces, or by evaluating the embedded emotion with recognition-based identifiers.
[0480] In this case, when speech synthesis is used for the reading, a service may be implemented in which the utterance dictionary or template is switched to the voice of a favorite celebrity by changing the speaker's per-phoneme-piece speech synthesis template, and the utterance dictionary or template for the speech synthesis parameters of the reading may be varied with changes in emotion; combining these provides convenience.
[0481] Also, by applying this voice service, templates and parameters for speech synthesis of robots and agents may be distributed so that the user's robot or agent speaks with emotion in the voice of a celebrity matching the user's tastes, or controls home appliances; and by applying this voice service to compare the user's utterances with utterances provided by the service, a conversation learning service can also be realized.
[0482] 《Example of a remote control enabling voice operation》
First, the user utters speech to the remote control, and the feature quantities of the uttered speech are extracted. In the first method, these feature quantities are transmitted to the target device, and the device that receives them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature quantities. Then, based on the generated symbol strings, the matching control means is selected and executed.
[0483] In the second method, the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are generated within the remote control, and the generated symbol strings are transmitted to the target device. The controlled device then selects and executes the matching control means based on the received symbol strings.
[0484] In the third method, phonemes and/or phoneme-piece symbols and emotion symbol strings are recognized based on feature quantities generated within the remote control, the control content is selected based on the recognized symbol strings, and the control method is transmitted to the device to be controlled.
[0485] In the fourth method, the speech waveform is transmitted as-is from the remote control to the controlling device; the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are recognized within that device, the control means is selected based on the recognized symbol strings, and the controlled device executes the selected control.
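Taking the second method ([0483]) as an example, the matching on the controlled device side may be sketched as a control-dictionary lookup, as below; the command table, the phoneme strings, and the 0.6 threshold are hypothetical illustrations, and difflib is a placeholder matcher.

# Sketch of the second method ([0483]): the remote control sends a
# recognized phoneme string, and the controlled device matches it against
# a control dictionary. Command table, phoneme strings, and the 0.6
# threshold are hypothetical; difflib is a placeholder matcher.

import difflib

CONTROL_DICT = {"/d/e/N/g/e/N/": "power_toggle",   # "dengen" (power)
                "/o/N/r/y/o/u/": "volume"}         # "onryou" (volume)

def dispatch(phonemes, threshold=0.6):
    best, best_score = None, 0.0
    for key, command in CONTROL_DICT.items():
        score = difflib.SequenceMatcher(None, phonemes, key).ratio()
        if score > best_score:
            best, best_score = command, score
    return best if best_score >= threshold else None  # ignore poor matches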
[0486] At this time, when the user's emotion is accompanied by anger, an apologetic message may be presented to the user by voice or text. Emotion identifiers can likewise be obtained from speech by feature extraction and symbolization, and the same applies to sound and video features and identifiers such as environmental sounds.
[0487] Such remote control technology may be introduced into a robot to control home appliances, or incorporated into a car navigation system for control. In this case, arbitrary new control symbol string information may be distributed to the operated device using markup languages described later such as RSS, HTML, and XML, or CGI, and a remote control or portable terminal that uses phonemes, phoneme pieces, or speech waveforms may receive or transmit updated phoneme symbol string information via infrared or radio.
[0488] In this case, the dictionary that converts between phoneme strings or phoneme-piece strings and processing procedures may reside on the terminal side or on the distribution base station side; correction information and symbol strings such as phoneme symbol strings, image features, audio features, and emotion identifiers relating to new functions may be transmitted, received, and distributed using markup languages described later such as XML and HTML, or RSS and CGI, and convenience can be achieved by combining these.
[0489] Of course, the device executing these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented via a communication base station using them.
[0490] 《Example of use in a portable terminal》
First, the user utters speech to the portable terminal, and the feature quantities of the uttered speech are extracted. In the first method, these feature quantities are transmitted to the target device, and the device that receives them generates a phoneme symbol string and/or phoneme-piece symbol string and an emotion symbol string according to the feature quantities. Then, based on the generated symbol strings, the matching control means is selected and executed.
[0491] In the second method, the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are generated within the portable terminal, and the generated symbol strings are transmitted to the target device. The controlled device then selects and executes the matching control means based on the received symbol strings.
[0492] In the third method, phonemes and/or phoneme-piece symbols and emotion symbol strings are recognized based on feature quantities generated within the portable terminal, the control content is selected based on the recognized symbol strings, and the control method is transmitted to the device to be controlled.
[0493] In the fourth method, the speech waveform is transmitted as-is from the portable terminal to the controlling device; the phoneme symbol string and/or phoneme-piece symbol string and the emotion symbol string are recognized within that device, the control means is selected based on the recognized symbol strings, and the controlled device executes the selected control.
[0494] At this time, when the user's emotion is accompanied by anger, an apologetic message may be presented to the user by voice or text. Emotion identifiers can likewise be obtained from speech by feature extraction and symbolization, and the same applies to sound and video features and identifiers such as environmental sounds.
[0495] Also, when the infrared function of a portable terminal is used to control target devices such as DVD recorders, televisions, and air conditioners, and the IP address of such a device is acquired via infrared or wireless LAN so that its control information can be acquired and the device controlled via the mobile Internet or an indoor LAN, voice control from a portable terminal or mobile phone can be realized by acquiring a control list based on the present invention.
[0496] Of course, any of the following methods may be used: the portable terminal transmits its own IP address or mail address to the target device, and the target device connects to an arbitrary port based on that IP address and transmits the control information; the target device attaches the control information to a mail and sends it to the portable terminal; or the control information is acquired simply by exchanging infrared signals. A search service may also be implemented by performing phoneme recognition, phoneme-piece recognition, emotion recognition, environmental sound recognition, and scale recognition on input from the portable terminal's microphone.
[0497] Here, the dictionary that converts phoneme strings or phoneme-piece strings into processing procedures may reside either on the terminal side or on the distribution base station side, and symbol strings such as correction information, phoneme symbol strings for new content, program genres, and actor names, together with image features, voice features, and emotion identifiers, may be transmitted, received, and distributed using a markup language described later, such as XML or HTML, or using RSS or CGI; combining these improves convenience.
[0498] Furthermore, by processing calls on the mobile terminal as they occur and evaluating emotional fluctuations and utterance content, services become possible such as: when emotions such as anger or sadness, or fatigue, are frequently observed in the speech of the terminal user during a conversation, presenting the user after the call with content intended to cheer them up, for example a good neighborhood restaurant or uplifting music, illustrations, or video works; or running advertisements based on the phonemes being uttered.
[0499] The mobile terminal may also be provided with several microphones, both low-performance and high-performance ones, so that high-quality audio can be recorded for recognition; alternatively, the sampling rate at recording time may be raised for recognition while the signal is converted down to a lower sampling rate for voice-call transmission, forming the call voice information and generating the compressed voice information used for the call.
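A minimal numpy sketch of this dual-rate idea: the full-rate microphone signal is kept for recognition while a crudely low-passed, decimated copy is produced for call transmission. The rates and the moving-average filter are illustrative assumptions; an actual terminal would use a proper anti-aliasing filter and a speech codec.

```python
import numpy as np

def downsample_for_call(x, src_rate=48_000, dst_rate=8_000):
    """Crude moving-average low-pass followed by decimation."""
    factor = src_rate // dst_rate            # 6 with these example rates
    kernel = np.ones(factor) / factor        # rough anti-aliasing filter
    return np.convolve(x, kernel, mode="same")[::factor]

rng = np.random.default_rng(0)
mic = rng.standard_normal(48_000)            # one second of dummy 48 kHz audio
recognizer_input = mic                       # full rate goes to phoneme recognition
call_audio = downsample_for_call(mic)        # 8 kHz copy goes to the voice call
print(len(recognizer_input), len(call_audio))  # 48000 8000
```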
[0500] Of course, the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a mobile phone information terminal, or a wearable information terminal, and the content of the present invention may be implemented through such devices via a communication base station.
[0501] 《Examples of use in robots and agents》
For example, a robot or computer agent interface can perform detection equivalent to the detect-and-record function described above using the image recognition and voice recognition functions associated with its attached camera, microphone, and recording device. While watching television together with the user, it can execute arbitrary processing in response to a particular entertainer, a particular keyword, or a particular emotion, so that, for example, the robot laughs or cries together with the user, or controls other nearby devices in accordance with the user's preferences. As with the other processes, a method may also be used in which feature extraction and symbolization are performed and a request is then sent to a core server.
[0502] More specifically, while the user is viewing content, a robot or agent using the present invention observes the identifiers and feature quantities extracted from the content together with the phoneme, phoneme-piece, and emotion features and identifiers derived from the user's facial expressions and utterances, and can thereby observe the co-occurrence of the user's features and identifiers with those of the content. In doing so, identifiers and feature quantities relating to emotions and phonemes may be acquired from a content playback device using the present invention, or they may be extracted from the content and the user by an indexing function within the robot or agent itself.
[0503] In this way, a "comedy program user situation evaluation function" can be constructed from the feature quantities and identifiers collected, for example, during comedy programs. When the content exhibits "comedy program features", the user exhibits "delighted features", and the feature quantities and identifiers of both user and content lie close to the centroid of the feature quantities in this evaluation function, the robot or agent can be made to express the emotion "fun", producing a staged pseudo-emotion. Of course, other emotions such as joy, anger, sorrow, and pleasure may likewise be learned as situations from the co-occurrence of the feature quantities and identifiers obtained from the user and the content.
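A toy sketch of this centroid test follows. The centroid, feature dimensions, and threshold are invented placeholders; in practice they would be learned from the co-occurrence of content and user identifiers collected during comedy programs.

```python
import numpy as np

comedy_centroid = np.array([0.8, 0.7, 0.9, 0.6])  # hypothetical learned centroid
THRESHOLD = 0.5                                    # hypothetical closeness bound

def pseudo_emotion(content_features, user_features):
    """Express 'fun' when the joint observation lies near the centroid."""
    joint = np.concatenate([content_features, user_features])
    return "fun" if np.linalg.norm(joint - comedy_centroid) < THRESHOLD else "neutral"

print(pseudo_emotion(np.array([0.75, 0.72]), np.array([0.88, 0.61])))  # fun
```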
[0504] Furthermore, using an RFID tag, JAN code, or barcode attached to an arbitrary object as an identifier, the device may learn as feature quantities the image of the object from which the identifier is obtained, the sound it makes on impact, and the sound it makes when operated, and may record and learn, in association with the identifier, properties such as the object's mass and weight, whether it can be carried, whether it should be avoided in a collision, and what emotion the user shows when it is presented; in this way the device learns automatically and autonomously improves its behavior and operating efficiency. The device may also recognize a specific person in video content and, when the frequency or intensity of the emotional features that person expresses deviates from the average of the emotions expressed by other people by, for example, 3σ or more, identify that person's personality traits and use them as correction values in dialogue and communication with that person, or guess at that person's character. It may further extract facial feature quantities as emotions change and autonomously learn the types of facial expressions; record the symbol strings obtained by phoneme or phoneme-piece recognition of speech uttered at the same time and analyze the phoneme and phoneme-piece tendencies contained in emotional utterances; or simultaneously recognize environmental sounds and learn the changes in facial expression caused by the user's reaction to external sounds such as noise or explosions.
[0505] Based on information learned by such methods, the knowledge database of a virtual personality such as a CG character, robot, or agent can be used to vary its reactions according to the user's sensibilities and responses, or to change its facial expressions. Broadcast information such as television schedules may be acquired using external information such as EPG, BML, RSS, or teletext so as to provide entertainer and current-affairs information matching the user's tastes; the user's preferences may be analyzed from the playback counts and viewing times of the information recorded by the video search-and-record means described above; and, as in the mobile phone embodiment, a robot using the present invention may acquire the control methods of surrounding devices by infrared communication or wireless LAN and control those devices according to the user's voice, improving the convenience of device control, or may acquire the identifiers and feature quantities of the information currently being displayed.
[0506] By evaluating and learning information based on the co-occurrence of multiple identifiers and feature quantities in this multi-dimensional way, the information can serve as a knowledge database for robots and agents. Learning the information needed for communication with people from video and music makes it possible to realize more versatile robot and agent interfaces. The co-occurrence of the various sensor inputs, image information, acoustic information, and voice information accompanying the robot's movements may also be evaluated so that the robot learns autonomous movements, acts autonomously according to the learning results, or gives instructions to people according to the learning results; the results may likewise be used as the knowledge database of a character or NPC serving as a virtual personality within a game.
[0507] Here too, the dictionary that converts phoneme strings or phoneme-piece strings into processing procedures may reside either on the terminal side or on the distribution base station side, and symbol strings such as phoneme symbol strings, image features, voice features, and emotion identifiers relating to the robot's correction information, new information, and functions may be transmitted, received, and distributed using a markup language described later, such as XML or HTML, or using RSS or CGI; combining these improves convenience.
[0508] Of course, the device that executes these services may be a desktop information processing device, an in-vehicle terminal, a portable information terminal, or a wearable information terminal, and the content of the present invention may be implemented through such devices via a communication base station.
[0509] 《Example of a medical analysis device》
Next, an example of an analysis device for medical applications will be described. In addition to the phonemes, phoneme pieces, characters, facial expressions, and gestures extracted from voice and images according to the present invention, a pulse sensor, electroencephalograph, myoelectric sensor, skin resistance sensor, scale, sphygmomanometer, and thermometer are used, and the emotions accompanying the observed person's utterances, together with the feature quantities obtained from these sensors such as brain waves and pulse, are recorded while being indexed according to the present invention.
[0510] Next, using the present invention, the co-occurrence of the feature quantities and identifiers is observed, analyzed, and learned; multivariate analysis of the electroencephalogram, blood pressure, body temperature, body weight, pulse, and facial expression tendencies accompanying a specific emotion is performed to classify them and extract biases, and experts assign identifiers for psychological states based on the tendencies learned and classified according to the present invention.
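One way such per-emotion classification of sensor tendencies might be realized is sketched below: sensor vectors (here pulse, systolic blood pressure, and body temperature, all invented dummy data) are grouped by the co-occurring emotion identifier, a mean and covariance are fitted per emotion, and new observations are scored by squared Mahalanobis distance.

```python
import numpy as np

rng = np.random.default_rng(1)
observed = {  # emotion identifier -> sensor vectors (pulse, blood pressure, temp)
    "anger": rng.normal([95, 140, 36.9], 1.0, size=(50, 3)),
    "calm":  rng.normal([65, 115, 36.4], 1.0, size=(50, 3)),
}
models = {e: (x.mean(axis=0), np.linalg.inv(np.cov(x, rowvar=False)))
          for e, x in observed.items()}

def classify(v):
    """Pick the emotion whose fitted distribution lies closest to v."""
    def sq_mahalanobis(e):
        mu, icov = models[e]
        d = v - mu
        return float(d @ icov @ d)
    return min(models, key=sq_mahalanobis)

print(classify(np.array([93.0, 138.0, 37.0])))  # expected: anger
```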
[0511] By executing the re-learning used in the present invention with indexes added on the basis of the information classified in this way, a device that performs psychological tendency analysis can be constructed. By pairing with the user on the basis of the analyzed tendencies, one can also construct a device that takes records as emotions, utterances, brain waves, pulse, skin resistance, myoelectric currents, blood pressure tendencies, body temperature tendencies, and body weight tendencies change, and that uses this information as reference material for counseling or psychoanalysis; used together with various sensors, it can evaluate co-occurrence information concerning body fluids such as urine and blood, physical condition, and components such as skin, hair, and excreta so as to organize diagnostic reference information; and a device can be constructed that observes the moans and behavior of seriously ill patients and detects changes in their condition.
[0512] Likewise, by extracting and analyzing biases in human movement with the same method, abnormalities of the spine or of gait can be detected, and the state of improvement after treatment of injuries such as fractures can be recorded and analyzed, so that a device can be constructed that quantitatively evaluates the effect of medical treatment. Such information may also serve as the knowledge information database used by the virtual personality of an information processing device that attempts counseling or psychoanalysis through dialogue with people; control information for medical devices such as prosthetic legs and prosthetic hands may be constructed and extracted from co-occurrence information on body movement using image information and acoustic information; and a health management system may be constructed based on changes in information related to the health of patients and users.
[0513] 《International phoneme symbols, language-specific phoneme symbols, and examples of symbol conversion between phonemes and phoneme pieces》
Next, in searches that evaluate such co-occurrence, phoneme feature tendencies differ from language to language when searching for content spoken in a foreign language, and a technique to compensate for this is expected to become necessary as a future issue. A phoneme symbol conversion method using co-occurrence is therefore described.
[0514] The same combinations of co-occurrence information can also be used to convert between different languages, for example by evaluating the co-occurrence of Japanese phonemes and English phonemes. To resolve the changes and biases in phoneme symbols caused by language differences, a conversion template is maintained for each language as a standard template that takes attribution probabilities into account relative to the international phoneme notation standard; by implementing these templates as HMMs or as distance evaluation functions, the phoneme information spaces can be converted into one another, solving the problem through improved convenience.
[0515] By learning co-occurrence states, by specifying the respective identifiers and feature quantities as search conditions using phonemes, phoneme pieces, or character strings, and by indexing content information, not only can information be searched, recorded, distributed, and received based on complex subjective conditions, but international differences in pronunciation can also be accommodated.
[0516] In these content searches, considering that overseas content is also included, a method of converting between foreign-language phonemes and Japanese phonemes is realized as a countermeasure against changes in the pronunciation environment. The problem is addressed not only by phoneme conversion but also by preparing search conditions and performing searches through conversions between pieces of information that co-occur for humans, such as conversion between phoneme strings and image features, between sound effects and phoneme-piece strings, and between emotions and character strings.
[0517] Of course, since co-occurrence states are used, it is also possible to combine video features with each other, motion features with each other, identifiers related to video and audio, identifiers such as chords and environmental sounds, and emotion identifiers, so as to learn, for example, the co-occurrence of the voice features produced when the user shows a troubled expression and thereby construct a new "troubled attitude" identifier; it is likewise possible to construct and use a multi-layer Bayes classifier or multi-layer HMM with a layer that acquires voice and image identifiers, a layer that processes their co-occurrence, and a layer that processes the time-series transitions of the co-occurrence states.
[0518] Then, by classifying the outputs of such diverse recognition by ethnicity or language and providing an evaluation layer based on culture, identifiers derived from recognition results with different backgrounds can be converted into one another; conversion using probabilistically biased co-occurrence information makes it possible to convert information that is similar despite arising from different backgrounds.
[0519] The characteristic of this identifier conversion lies in performing cross-language phoneme conversion of language-dependent phoneme notation, using a phoneme conversion dictionary or phoneme-piece conversion dictionary based on co-occurrence information between international phoneme symbols, or between phoneme symbols with different language characteristics obtained by recognizing phonemes and phoneme pieces under different language environments, and in performing similar co-occurrence-based identifier conversion for identifiers other than those associated with language.
[0520] As in the "example of identifier reconstruction using the present invention" described above, if, for example, notation in international phonemes is used, HMM learning based on language-specific phoneme symbols is performed using the output probabilities of the international phoneme symbol HMMs as feature quantities. Conversely, for HMMs based on language-specific phoneme symbols, the output probabilities of the language-specific phoneme symbols are learned with reference to the international phoneme symbols. Similarly, methods that learn the conversion from phoneme pieces to phonemes and from phonemes to phoneme pieces may be used; instead of HMMs, methods using distances such as a Bayes discriminant function or the Mahalanobis distance, or methods using likelihoods or probabilities, may also be used. An application example using Japanese phonemes for the attribution probabilities to international phonemes is also shown.
[0521] Alternatively, the attribution probabilities of each language's phonemes with respect to the international phoneme symbols may be determined and a correspondence table created, so that phonemes are identified from feature quantities using the international phoneme dictionary and converted into phoneme symbol strings dependent on each language; or the attribution probabilities of phonemes between different languages may be determined and evaluated in descending order of attribution probability, converting the utterance features of another language, or of a speaker whose native language is another language, into the language features used by the device. Note that not only phoneme and phoneme-piece conversion and cross-language phoneme and phoneme-piece conversion, but also conversion between image identifiers and conversion between image identifiers and phoneme or phoneme-piece strings, can be configured based on the co-occurrence of these identifiers and implemented via the dictionary function described above.
[0522] The correspondence with international phoneme symbols may be associated with UPA numbers, IPA symbols, or UCS code numbers with reference to materials such as the International Phonetic Association's handbook of the International Phonetic Alphabet, and these symbols and numbers may be used as identifiers for managing phoneme conversion. When converting phoneme symbols between different languages into international phoneme symbols, one may use an attribution probability table together with the transition probabilities between preceding and following phonemes, re-learn the output probabilities and perform the symbol conversion with an HMM or the like, or construct an evaluation function such as a Euclidean distance function or a Bayes discriminant function from the co-occurrence information of output probabilities and feature quantities and use it as a symbol conversion function.
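A minimal sketch of such table-based conversion follows. The attribution probabilities below are invented for illustration and are not the values of FIG. 32; each source-language phoneme maps to candidate international phoneme symbols with a probability assumed to have been learned from co-occurrence.

```python
# Hypothetical attribution-probability table: Japanese phoneme -> candidates.
JA_TO_IPA = {
    "a": [("a", 0.90), ("ɐ", 0.10)],
    "r": [("ɾ", 0.70), ("l", 0.20), ("r", 0.10)],
    "u": [("ɯ", 0.80), ("u", 0.20)],
}

def convert(phonemes, table, n_best=1):
    """Symbol-by-symbol conversion keeping the n best candidates,
    evaluated in descending order of attribution probability."""
    return [sorted(table[p], key=lambda c: -c[1])[:n_best] for p in phonemes]

print(convert(["a", "r", "u"], JA_TO_IPA))
# [[('a', 0.9)], [('ɾ', 0.7)], [('ɯ', 0.8)]]
```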
[0523] In doing so, mutual conversion between phonemes and phoneme pieces within the same language may be realized using phoneme-piece-to-phoneme and phoneme-to-phoneme-piece conversion tables, and conversion between regional-language phoneme symbols and international phoneme symbols, conversion between regional-language phoneme-piece symbols and international phoneme-piece symbols, and the respective phoneme-piece/phoneme conversions may also be performed. For example, this processing allows identifier conversion from Japanese phoneme symbols to English phoneme symbols via international phoneme symbols, or directly from Japanese phoneme symbols to English phoneme symbols; identifiers of phonemes and phoneme pieces with different language dependencies can thus be converted into one another, identifier conversion can be performed using the phoneme symbol conversion table described above, and searches may be executed using the converted identifier strings. These conversion templates may take temporal transitions into account, using not only monograms (unigrams) but phoneme HMMs and phoneme-piece HMMs with bigram, trigram, or general N-gram structure.
[0524] With such methods, pipelines such as the following become possible:
image → English name → English phoneme string indexing → Japanese phoneme string conversion → Japanese utterance input → Japanese phoneme string search
Japanese utterance → Japanese phoneme string → Japanese keyword → English translation → English phoneme string → English-DB phoneme string search
These allow language-dependent utterances to be searched while being converted into other languages: a database of "actor photographs" compiled in the English-speaking world can, for example, be searched with a phoneme string based on the Japanese pronunciation of an actor's name, realizing arbitrary meta-co-occurrence searches. Of course, instead of actors, something like a sales catalog of merchandise such as cars, tools, flowers, and cosmetics may be constructed, and the results may also be used for list displays for searching.
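A toy sketch of the second pipeline above is shown next. Every lookup table is an invented placeholder standing in for a real recognizer, translator, phoneme converter, or indexed database.

```python
ja_to_keyword = {"t a n a k a": "田中"}          # Japanese phonemes -> keyword
keyword_to_en = {"田中": "Tanaka"}               # keyword -> English translation
en_to_phonemes = {"Tanaka": "t ah n aa k ah"}    # English name -> English phonemes
english_db = {"t ah n aa k ah": "photo_0042"}    # phoneme-indexed English DB

def cross_lingual_search(japanese_phonemes):
    keyword = ja_to_keyword[japanese_phonemes]
    return english_db[en_to_phonemes[keyword_to_en[keyword]]]

print(cross_lingual_search("t a n a k a"))      # photo_0042
```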
[0525] Next, a more specific procedure for constructing the conversion function will be described.
[0526] First, FIG. 28 and FIG. 29 describe a simple language-dependent search procedure using speech and character strings. According to this description, an input character string or speech waveform is converted in a language-dependent manner into identifiers used for the query, and the search is performed against a database indexed with language-dependent identifiers.
[0527] However, since identifiers such as phonemes and phoneme pieces differ from language to language, their notations do not necessarily match, and identifiers in general are just as varied. To make such diverse identifiers mutually searchable, the identifier symbol strings must be converted. For this identifier conversion, the identifier evaluation functions constructed for each language environment are applied to recognize the same utterance, the co-occurrence of the respective identifiers is observed, and by learning the identifiers, output probabilities, likelihoods, distances, and feature quantities produced as recognition results, symbol conversion between the identifiers can be realized, as in the procedures of FIG. 30 and FIG. 31.
[0528] To explain in more detail: utterance information in English and utterance information in Japanese, for example, are converted into feature quantities in the same way as at indexing or search time. Next, Japanese and English phoneme and phoneme-piece recognition is performed based on the feature quantities. As a result, the speech information dependent on each language is indexed with identifiers through the recognition process dependent on each language.
[0529] Next, the resulting identifier-string indexes are observed, together with the co-occurrence of the identifiers and the transitions of the output probabilities. As a result, for each phoneme or phoneme piece, the co-occurrence of Japanese phonemes and English phonemes when English is recognized with Japanese models can be extracted; similarly, co-occurrence information for Japanese phonemes recognized with English models can be constructed. Based on the co-occurrence information thus obtained, an evaluation function such as an HMM or a Bayes discriminant function is constructed as an English phoneme recognition function for utterances in Japanese phoneme strings, and the internal constants of the discriminant function are saved to a storage medium such as a file so that they can be reused.
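The co-occurrence table described here might be estimated as sketched below, assuming frame-aligned identifier sequences from the two recognizers (the sequences are invented); the normalized table is then persisted for reuse, as the paragraph suggests.

```python
import json

ja = ["s", "s", "a", "a", "N", "N"]      # dummy Japanese-model index, per frame
en = ["s", "s", "ae", "ah", "n", "n"]    # dummy English-model index, same frames

counts = {}
for j, e in zip(ja, en):                 # count frame-wise co-occurrences
    counts.setdefault(j, {})
    counts[j][e] = counts[j].get(e, 0) + 1

table = {j: {e: c / sum(row.values()) for e, c in row.items()}
         for j, row in counts.items()}   # normalize each row to probabilities

with open("ja_en_cooccurrence.json", "w") as f:
    json.dump(table, f, ensure_ascii=False, indent=2)   # saved internal constants
print(table["a"])                        # {'ae': 0.5, 'ah': 0.5}
```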
[0530] Using the discriminant function obtained in this way, English utterance information is recognized by English-adapted Japanese phoneme identification, making phoneme symbol conversion between different languages possible. FIG. 32 shows an example of conversion between international phoneme symbols and Japanese based on co-occurrence probabilities for the co-occurrence information used in such conversion; not only English and Japanese but also Chinese, German, French, Vietnamese, Spanish, and other languages may be combined, international phoneme symbols may be used as the intermediate phonemes, and evaluation functions capable of mutual conversion may be constructed.
[0531] Similarly, by indexing an arbitrary speech waveform simultaneously with phonemes and phoneme pieces, or simultaneously with Japanese phonemes, English phonemes, and international phoneme symbols, that is, by indexing with phonemes and phoneme pieces dependent on different languages, the co-occurrence can be observed and an HMM or Bayes discriminant function for recognition can be constructed.
[0532] Furthermore, by observing the co-occurrence of phonemes and phoneme pieces, methods such as constructing a multi-layer HMM as in FIG. 33 or constructing a multi-layer Bayes function are also possible; as an application, identifier conversion across different identifier characteristics, such as phoneme to phoneme piece and phoneme piece to phoneme as in FIG. 34 and FIG. 35, becomes feasible.
[0533] This may be a method that identifies the current phoneme from the output probabilities of the phoneme HMMs or phoneme-piece HMMs, or feeds those probabilities into a conversion HMM layer; or it may be a multi-layer Bayes method that evaluates the probability-exponent outputs of multiple Bayes functions in parallel and forms the array of distance information into a feature quantity.
[0534] Specifically, in the case of FIG. 33, to construct a phoneme conversion HMM that takes as input the transition states of the Japanese output probabilities, indexing is first performed with both phoneme evaluation functions, and then the output probabilities of the source phoneme HMMs are fed into HMMs classified by international phoneme symbol and trained. Based on this training, the output probabilities are evaluated and international phoneme symbols are assigned. For this learning, co-occurrence matrices and co-occurrence probabilities may be used; alternatively, the output probability values and feature quantities may be given as sample vectors of a Bayes function, a covariance matrix constructed from the multiple sample vectors, and the eigenvalues and eigenvectors determined to form the evaluation function.
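The covariance-and-eigenvector route mentioned at the end of [0534] might look like the following sketch: output-probability vectors observed for one target symbol (random dummy data here) are treated as sample vectors, and a Gaussian log-score built from the eigendecomposition serves as the evaluation function.

```python
import numpy as np

rng = np.random.default_rng(2)
# Dummy output-probability vectors observed when the target symbol is "a".
samples = rng.normal([0.7, 0.2, 0.1], 0.05, size=(100, 3))

mu = samples.mean(axis=0)
eigval, eigvec = np.linalg.eigh(np.cov(samples, rowvar=False))  # stored constants

def log_score(v, eps=1e-9):
    """Gaussian log-density up to a constant, evaluated in the eigenbasis."""
    z = eigvec.T @ (v - mu)                       # rotate onto principal axes
    return -0.5 * float(np.sum(z**2 / (eigval + eps)) + np.sum(np.log(eigval + eps)))

print(log_score(np.array([0.72, 0.18, 0.10])))    # high: consistent with "a"
print(log_score(np.array([0.10, 0.80, 0.10])))    # much lower: different symbol
```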
[0535] In the case of FIG. 34, from the output probability of the phoneme HMM in the current frame, the output probability in the next frame, and the output probability in the previous frame, during the transition from "silence" to the utterance of "A", a frame with a high probability of silence is labeled "Pau" and a frame where the output probability of "A" is increasing is labeled "A"; arranging these symbols in time series assigns a symbol based on the phoneme transition, "Pau-A-A". Since the first and last frames lack a preceding or following frame, they are padded with the same identifier as the frame itself.
[0536] In this case, considering the second frame with a simple model, "Pau" ranks first in the past frame, so the left part of the phoneme piece is "Pau"; the center frame is "A", which maximizes the average of the preceding and following output probabilities; and "A" ranks first in the right frame, so the right symbol of the phoneme piece is "A". Based on this time-series change, the phoneme piece is formed as "Pau-A-A". In this way, conversion between phoneme pieces and phonemes can also be realized.
[0537] In the case of FIG. 35, from the output probabilities of the phoneme-piece HMM in the current frame and in the next frame, during the transition from "silence" to the utterance of "A", the portions where silence accounts for a high share of the high-probability phoneme-piece symbols are labeled "Pau" and the portions where "A" accounts for a high share are labeled "A", and symbols are assigned accordingly. For example, in the second frame, "Pau-A-A" is 60%, "A-A-A" is 20%, and others are 20%; the others are omitted from the notation.
[0538] In this case, considering the second frame with a simple model, "Pau" occupies one third of the first-ranked phoneme piece, so Pau = (60 ÷ 3)%; "A" occupies two thirds of the first-ranked phoneme piece and all of the second-ranked one, so calculating A = (60 ÷ 3 × 2) + (20 ÷ 3 × 3)% gives Pau = 20% and A = 60%, and the symbol of the second frame becomes "A". In this way, conversion between phoneme pieces and phonemes can also be realized, and the evaluation formulas based on these identifiers may be composed in any combination, for example by also considering the first-ranked phonemes of the preceding and following frames.
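This FIG. 35 arithmetic works out as follows: each three-part phoneme piece contributes one third of its probability to each constituent phoneme, and the frame takes the phoneme with the largest share.

```python
frame = {"Pau-A-A": 0.60, "A-A-A": 0.20}   # remaining 20% omitted, as in the text

share = {}
for piece, prob in frame.items():
    for phoneme in piece.split("-"):       # three sub-symbols per piece
        share[phoneme] = share.get(phoneme, 0.0) + prob / 3

print({p: round(v, 2) for p, v in share.items()})  # {'Pau': 0.2, 'A': 0.6}
print(max(share, key=share.get))                   # 'A', the second frame's symbol
```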
[0539] When the international phoneme symbol conversion table of FIG. 32 is used, several methods of using international phoneme symbols as an intermediate form in the above conversion are conceivable. As shown in FIG. 36, FIG. 37, and FIG. 38, any combination may be used: performing recognition per language, converting to international phoneme symbols, and searching with international phoneme symbols; performing recognition with international phoneme symbols, converting to the phoneme symbols of each language, and searching with language-specific phoneme symbols; performing recognition per speaker language, converting to international phoneme symbols, converting to the content language, and then searching or detecting; or converting an input character string to international phoneme symbols, searching, and then converting to phonemes for each language for presentation.
[0540] When character strings in an arbitrary language such as Japanese, English, French, Spanish, German, Korean, Chinese, Indian, Islamic, Hebrew, Aramaic, Vietnamese, or Greek are the search target, a phoneme string or phoneme-piece string may be constructed based on the pronunciation of the character string, or the string may be converted into phonetic notation in an arbitrary language, such as hiragana or katakana, or into international phonetic notation, before being converted into feature quantities, and the co-occurrence can then be checked to realize phoneme conversion between languages; alternatively, phonemes and phoneme pieces dependent on each language may be converted by the method described above using international phoneme symbols as the intermediate form.
[0541] In this way, by evaluating information based on the same environment with the evaluation functions used for identifiers of different environments, such as language conditions, image conditions, and acoustic conditions, the co-occurrence is observed, and by probabilistically capturing the interchangeability of the identifiers, evaluation functions that adapt to environmental changes can be constructed. In particular, by learning and using the co-occurrence of identifiers for the conversion between international phoneme symbols and regional phoneme symbols and between phoneme and phoneme-piece identifiers, the diversification of the search of the present invention can be realized.
[0542] The examples of identifier-to-identifier conversion given above concern conversion between phonemes and between phoneme pieces and phonemes; however, as described in the embodiments of the present invention, conversion between an environment identifier and a phoneme identifier, such as between "the sound of waves" and an onomatopoeic phoneme string like "z/a/p/p/a/a/a/a/n/n/n/n/n", and conversion between an image identifier and a phoneme identifier string serving as its name, can likewise be derived from the transitions of temporal co-occurrence information, and may be used for the evaluation of similar shapes and the accompanying identifier conversion.
[0543] <Other general matters>
[0544] また、 GUIのクリックやポインティング操作、音声入力による指示操作により検索対 象や検出対象や学習対象となる代表的なサンプル画像や音声範囲や映像部品を指 定して選択したりしてもよいし、それらの組合せにより株の売買、商品の売買、オーク シヨン、予約、アンケート、コンテンツの視聴、コンテンツと利用者の共起状態の伝達 による視聴状況調査などを実施しても良いし、特徴の抽出や識別子の認識や検索や 学習や検出は端末側や基地局側や中継局の 、ずれで行っても良 ヽし、クラスタゃグ リツドなどの分散処理を実施しても良いし、感情識別子を用いて音声認識の感情に 伴う文脈遷移係数を変更したり、感情認識により認識された感情によって分岐する選 択肢を追加したり、利用者の声力 認識される感情識別子で処理の選択範囲や分岐 範囲を与えたり、キーワードに関連付けられた識別子とキーワードに関連付けられた 広告とを関連付けて提示したりしてもよい。  [0544] In addition, by selecting and selecting representative sample images, audio ranges, and video components to be searched, detected, and learned by GUI clicks, pointing operations, and instruction operations by voice input, etc. You may also conduct stock trading, product trading, auctions, reservations, questionnaires, content viewing, viewing status surveys by communicating the co-occurrence status of content and users, etc. Feature extraction, identifier recognition, search, learning, and detection may be performed at the terminal, base station, or relay station, or distributed processing such as clustering may be performed. Use emotion identifiers to change the context transition coefficient associated with emotions in speech recognition, add options that branch according to emotions recognized by emotion recognition, or use emotion identifiers that are recognized by the user's voice. The selection range or branch range may be given, or the identifier associated with the keyword and the advertisement associated with the keyword may be presented in association with each other.
[0545] また、利用者の指示により選択 ·指定する映像などの部品は MPEG4などで用 、ら れる画像オブジェクトの画像輪郭や 3次元画像における座標情報を利用して選択範 囲の境界を特定する方法を用いても良いし、音声などの無音部や周波数の偏りから 検出される境界を利用しても良いし、画像内の表示物体を発音することにより索引付 を行ったり、選択したりしても良いし、番組中の撮影場所に関して緯度、経度といった 位置情報を用いて観光案内などの宣伝を行っても良いし、認識された識別子や抽出 された特徴量に応じて広告や宣伝を実施したり、広告や宣伝を実行するための索引 付を行ったりしても良い。  [0545] In addition, video and other parts to be selected / designated by the user's instructions are used in MPEG4 etc., and the boundary of the selection range is specified using the image outline of the image object used or the coordinate information in the 3D image. You can use the method, you can use the boundaries detected from silent parts such as voice and frequency deviation, indexing by selecting the display object in the image, and selecting it. It is also possible to advertise tourist information using location information such as latitude and longitude regarding the shooting location in the program, and carry out advertisement and promotion according to the recognized identifier and extracted feature quantity Or indexing to run advertisements and promotions.
[0546] 検索結果や本発明を用いて索引付されたコンテンツ情報に対して識別子や特徴量 をマークアップ言語のタグや属性として追加し、配信することにより利用者の操作に 応じた関連するコンテンツの提供や広告の提供や商品の販売を行っても良ぐ利用 者の意図に基づ 、たコンテンツ操作やコンテンツ編集やコンテンツ利用が実施でき る。 [0546] Identifiers and feature quantities for search results and content information indexed using the present invention Is added as a markup language tag or attribute and distributed to provide related content according to the user's operation, provide advertisements, or sell products, Content operations, content editing, and content use.
[0547] また、検索結果を用いてコンテンツに関連した情報を補足したり注釈付けたりする ァノテーシヨン処理を行っても良 ヽし、本発明に用いられる共起情報を利用してコン テンッ検索ば力りではなぐネットワーク上の情報を自立的に収集'検索するボットシ ステムなどを構成してもよ 、。  [0547] Annotation processing that supplements or annotates content-related information using search results is also acceptable. If content search is performed using the co-occurrence information used in the present invention, You can configure a bot system that collects and searches information on the network independently.
[0548] この際、音素片とは時間軸上に音素の中心部や前部、後部と複数に分解された音 素記号であったり、第一の音素と第二の音素といった音素間や音素片間の遷移状態 における第一の音素力 第二の音素に変化する位置に基づく中間特徴を持つ音素 情報であったりしてもよいし、検出された感情や環境音や人物に基づいて認識しや す 、ように音素認識辞書や音素片テンプレートを切り替えたりしてもよ 、。  [0548] In this case, a phoneme segment is a phoneme symbol that is decomposed into a central part, a front part, a rear part, and a plurality of phonemes on the time axis, or between phonemes such as the first phoneme and the second phoneme. The first phoneme force in the transition state between the pieces may be phoneme information with intermediate features based on the position where it changes to the second phoneme, or it may be recognized based on the detected emotion, environmental sound, or person. You can also switch phoneme recognition dictionaries and phoneme templates like this.
[0549] また、本発明に用いられる識別子は前述の音素や音素片を含め感情特徴から抽出 された識別子であったり、画像特徴カゝら抽出された画像識別子であったり、音響特徴 をから抽出された楽器識別子や音階識別子、環境音識別子であるような情報同士を 同時に評価し検索や検出することにより、従来になく利便性の高い任意のサービスを 実現する情報処理装置と考えてもょ 、。  [0549] Further, the identifier used in the present invention is an identifier extracted from emotion features including the above-mentioned phonemes and phoneme pieces, an image identifier extracted from image features, or an acoustic feature. Think of it as an information processing device that realizes an unprecedented and convenient service by simultaneously evaluating, searching and detecting information such as musical instrument identifiers, scale identifiers, and environmental sound identifiers. .
[0550] また、特定の感情識別子や環境音識別子、音階識別子と!/ヽつた音声関連識別子 が発生している音声情報において、音素や音素片、各種識別子の認識のための特 徴量の偏りを検出し、同一音素における感情別の偏りを学習することで、任意の音素 に伴う感情の認識や環境音を伴う音素認識と同時に行えるように特徴量の再学習を 行って音素や音素片の認識率改善を行っても良いし、コンテンツ情報のフレーム内 共起情報に基づ!、てフレーム間確率遷移行列を求めてコンテンッ情報の検索に用 V、たり、コンテンツ情報の評価関数に用いたりしても良!、。  [0550] Also, in voice information in which specific emotion identifiers, environmental sound identifiers, scale identifiers, and! / Consistent speech-related identifiers are generated, the bias of the feature amount for recognition of phonemes, phonemes, and various identifiers , And learning the bias for each emotion in the same phoneme, re-learning the features so that it can be performed simultaneously with the recognition of emotions associated with any phoneme and the recognition of phonemes with environmental sounds. The recognition rate may be improved, based on the intra-frame co-occurrence information of the content information, and the inter-frame probability transition matrix is used to search the content information V, or used as an evaluation function for the content information. OK!
[0551] また、このような主観を伴う定量ィ匕困難な情報は認識のたびに量子化されるため必 ず累積誤差が生じ確率的な再現性を得る必要があり、本発明のように多様な識別子 と特徴量を用いることで、例えば利用回数が多いとか新規に多数の項目での検索登 録をしたといった利用者の肯定的な反応や行動により、 EPGや BML、 RSS、文字放 送や画像特徴及び識別子、音声特徴及び識別子と!/ヽつた各種識別子や各種特徴 量を含めた検出情報の共起状態を評価し、利用者による指定以外の識別子や特徴 量における共起情報を検索や検出や学習に用いることで「気づき」を演出し利用者 が収録し再生する頻度の高い情報を自律的に収集したり、収集した情報の評価を音 声や文字画像により提示し利用者の主観を反映させたりしても良い。 [0551] In addition, since quantitative information with such subjectivity is difficult to quantize every time it is recognized, it is necessary to obtain a cumulative error and to obtain probabilistic reproducibility. By using unique identifiers and feature quantities, for example, the number of times of use is high or search registration with many new items is performed. Detection information including EPG, BML, RSS, text broadcasting, image characteristics and identifiers, voice characteristics and identifiers, and various identifiers and various feature quantities, depending on the user's positive responses and actions such as recording By using co-occurrence information in identifiers and feature quantities other than those specified by the user for search, detection, and learning, information that is frequently recorded and played back by the user is produced. It may be collected autonomously or the evaluation of the collected information may be presented by voice or text image to reflect the user's subjectivity.
[0552] また、音声から得た特徴量に基づき認識された音素や音素片による記号列や感情 や音階、楽器音、環境音などの識別子及び Z又は映像から得た特徴量に基づき認 識された形状や色、文字、動作などの識別子、番組情報識別子を数量化分析 I類か ら IV類を踏まえた多変量解析により分類'多変量解析し本発明に追加的に用いる新 しい識別子として利用してもよぐ平均と分散から 3 σに帰属するか、 2 σに帰属する 力 1 σに帰属するかといった形で 3段階に評価して、検索結果の指標に用いても良 い。 [0552] In addition, recognition is based on feature strings obtained from symbol strings, emotions, scales, musical instrument sounds, environmental sounds, etc., and Z or video, which are recognized based on feature quantities obtained from speech. Classifiers such as shape, color, character, action, etc., and program information identifiers are categorized by multivariate analysis based on quantification analysis from class I to class IV, and used as a new identifier that is additionally used in the present invention. However, it can be used as an index for the search result by evaluating it in three stages, whether it belongs to 3σ from the mean and variance, or belongs to force 1σ, which belongs to 2σ.
[0553] また、これらの処理における特徴量はスカラやベクトル、マトリクス、任意階のテンソ 、つた多次元配列や複素数や四元数、八元数と!/、つた多元数によって構成され ていても良い。  [0553] In addition, the features in these processes may be composed of scalars, vectors, matrices, arbitrary-order tensors, multi-dimensional arrays, complex numbers, quaternions, octal numbers and! /, And multi-numbers. good.
[0554] このような方法により、人間の感覚を記号ィ匕した任意の情報同士を任意の時間幅を 持たせて共起状態の評価が可能となり、映像や音声を伴う情報の索引付けや検索、 検出が可能となるため従来では定量ィ匕による検索や検出が困難であった情報の検 索や検出が実現でき、人にやさしいサービスやそのようなサービスを実現する装置や 情報処理システムや通信基地局や携帯端末を実現することができるため、インターネ ットなどのポータルサイトや検索サイト、販売サイト、 SNS (Social Networking Site)、 知識を共有するエキスパートシステムサイト、オークションサイト、文字放送、情報を整 理するための多変量解析システム、スクリーニングシステム、ネットワーク上の信用情 報や認証情報を取り扱う認証サイト、ァグリゲートサービス、情報処理装置のグラフィ カル'インターフェースやタンジブル 'インターフェース、エージェントインタフェース、 ロボット、仮想現実、拡張現実などにおいて RSS (RDF Site Summary)等を用いて情 報を配信する際に本発明を用いるために、 XML (eXtens¾le Markup Language)や S OA (Service Oriented Architecture)、 RDF (Resource Description Framework)、 B ML (Broadcast Markup Language)、 SMIL (Synchronized Multimedia Integration La nguage)、 MathML (Mathematical Markup Language)、 Xpath (XML Path Language )、 SML (Simple (or Stupid or Software) Markup Language)、 MCF (Meta Contents F ramework)、 DDML (Document Definition Markup Language) ^ DSSSL (Document Style Semantics and Specification Language)、 DSML (Directory Services Markup L anguage)、 DTD (Document Type Definition)、 GML (Geography Markup Language) 、 SMIL (Synchronized Multimedia Integration Language)、 SGML (Standard Genera lized Mark-up Language)、 RDF (Resource Description Framework)等のメタ表現形 式の分類指標に本発明を用いてもよぐ SOAP (Simple Object Access Protocol)や UDDI (Universal Description, Discovery, and Integration)、 WDL (Web Services D escription Language)、 SVG (Scalable Vector Graphics)、 HTML (HyperText Marku p Language)、 URI (Uniform Resource Identifier)、 WAP (The Wireless Application Protocol)、 XQL(XML Query Language)、 VML (Vector Markup Language)、 URL ( Uniform Resource Locator)、 EPG (Electronic Program Guide)、 DLNA (Digital Livi ng Network Alliance)、 BML (Broadcast Markup Language)等の各種プロトコノレゃス タリブト、マークアップ言語、スキーマといった情報処理言語の変数、属性や任意のタ グ、属性、関数といった手段を任意に組合せてサービスを実施してもよい。この際、 修正情報や新規情報は修正や新規を示すタグや変数、属性、命令を用いたりして表 現や表記、実装されても良く前述の『マークアップ言語の解釈 ·変換 ·配信 ·制御装置 の例』を組合せることで利便性を図ることが出来る。 [0554] By such a method, it becomes possible to evaluate co-occurrence states by giving arbitrary time widths to arbitrary information that symbolizes human senses, and to index and search information with video and audio. Therefore, it is possible to detect and detect information that has been difficult to search and detect with quantitative keys, and it is possible to detect human-friendly services, devices that realize such services, information processing systems, and communications. Since base stations and mobile terminals can be realized, portal sites such as the Internet, search sites, sales sites, social networking sites (SNS), expert system sites that share knowledge, auction sites, text broadcasting, information Multivariate analysis system for screening, screening system, authentication site handling credit information and authentication information on network, aggregate server In order to use the present invention when distributing information using RSS (RDF Site Summary) etc. in the graphical 'interface' and tangible 'interface, agent interface, robot, virtual reality, augmented reality, etc. 
, XML (eXtens¾le Markup Language) and S OA (Service Oriented Architecture), RDF (Resource Description Framework), BML (Broadcast Markup Language), SMIL (Synchronized Multimedia Integration Language), MathML (Mathematical Markup Language), Xpath (XML Path Language), SML (Simple (or (Stupid or Software) Markup Language), MCF (Meta Contents Framework), DDML (Document Definition Markup Language) ^ DSSSL (Document Style Semantics and Specification Language), DSML (Directory Services Markup Language), DTD (Document Type Definition), GML (Geography Markup Language), SMIL (Synchronized Multimedia Integration Language), SGML (Standard Generalized Mark-up Language), RDF (Resource Description Framework), etc. SOAP (Simple Object Access Protocol), UDDI (Universal Description, Discovery, and Integration), WDL (Web Services Description Language), SVG (Scalable Vector Graphics), HTML (HyperText Markup Language), URI (Uniform Res ource Identifier), WAP (The Wireless Application Protocol), XQL (XML Query Language), VML (Vector Markup Language), URL (Uniform Resource Locator), EPG (Electronic Program Guide), DLNA (Digital Linking Network Alliance), BML Various protocols such as (Broadcast Markup Language), information processing language variables such as markup language, schema, attributes, arbitrary tags, attributes, functions, etc. may be used in any combination to implement the service. At this time, correction information and new information may be expressed, written, and implemented using tags, variables, attributes, and instructions that indicate correction or new information. Convenience can be achieved by combining “Example of device”.
[0555] The information input from outside is not limited to audio and video. Discriminant functions may be constructed using as feature quantities the inputs from health management instruments such as pulse meters and sphygmomanometers; from environmental observation devices such as taste sensors, odor sensors, human-presence sensors, heat sensors, humidity sensors, temperature sensors, and illuminance sensors; and from various analytical instruments such as Raman spectrometers, ultraviolet, infrared, and visible spectrophotometers, laser-ablation inductively coupled plasma mass spectrometers, qualitative and quantitative analyzers, X-ray fluorescence elemental analyzers, light-scattering laser tomography devices, Fourier transform infrared spectrophotometers, soft X-ray transmission devices, colorimeters, spectrolinos, cape detectors, thermal analysis operation systems, simultaneous differential thermal and thermogravimetric analyzers, differential scanning calorimeters, thermomechanical analyzers, thermal dilatometers, evolved gas analyzers, automatic thermal-analysis sample changers, humidity generators, plasma graft polymerization devices, ultraviolet graft polymerization devices, total organic carbon analyzers, gas chromatographs, liquid chromatographs, osmometers, dynamic viscoelasticity measuring devices, ionization mass spectrometers, ICP (Inductively Coupled Plasma) emission spectrometers, fluorescence spectrometers, automatic biochemical analyzers, automatic blood transfusion testing devices, automatic chemiluminescent enzyme immunoassay devices, photoelectric photometric emission spectrometers, and mass spectrometers. These may be recorded in association with video and audio information and used as criteria, variables, and attributes for indexing and for executing arbitrary processing, or as criteria, variables, and attributes for behavioral indicators of robots and the like, and such detection may also be used to detect and predict dangers arising to the human body.
[0556] In addition, the invention may be embodied in processing systems such as artificial intelligence or so-called artificial non-intelligence (chatbots) that include an information retrieval device; in information terminals and information processing apparatuses such as robots, personal computers, car navigation systems, backbone servers, and communication base stations; or in portable terminals such as mobile phones, wristwatches, accessory-shaped terminals, remote controls, PDAs, IC cards, intelligent RFID devices, and body-implanted terminals. Since the present invention is a practical application of search and detection techniques, it can be implemented on any apparatus that has information processing functions such as an arithmetic unit and a storage unit, including arbitrary information processing apparatuses and information distribution devices on a network.
[0557] In addition, in order to provide video, audio, and text as support information equipment of a city information support system, information support based on position may be implemented by association with position information obtained from a combination of GPS and a geomagnetic position detection system; a co-occurrence matrix or feature quantities based on arbitrary identifiers, or a distance function, may also be used.
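A distance function over position information could be as simple as the following sketch, which picks the nearest support item by great-circle distance; the item list and coordinates are hypothetical.

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance in kilometers between two (lat, lon) points."""
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 6371 * 2 * math.asin(math.sqrt(a))

    # Hypothetical support items tagged with positions (e.g. from GPS).
    items = [
        ("station guide video", 35.681, 139.767),
        ("museum audio tour",   35.717, 139.775),
    ]

    def nearest(lat, lon):
        return min(items, key=lambda it: haversine_km(lat, lon, it[1], it[2]))

    print(nearest(35.690, 139.770))  # returns the closest support item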
[0558] In addition, using the search device according to the present invention, user preference information may be constructed and analyzed on the basis of the search conditions a user frequently uses, or such information may be aggregated and subjected to multivariate analysis to establish new preference categories; here too, a co-occurrence matrix based on arbitrary identifiers or a distance function using feature quantities may be employed.
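A co-occurrence matrix over identifiers appearing together in search conditions might be accumulated as in the sketch below; the identifier names and log contents are invented for illustration.

    from collections import Counter
    from itertools import combinations

    # Hypothetical search logs: each entry is the set of identifiers in one query.
    logs = [
        {"phoneme:a", "emotion:joy"},
        {"emotion:joy", "scene:concert"},
        {"phoneme:a", "emotion:joy", "scene:concert"},
    ]

    # Count co-occurrences over unordered identifier pairs.
    cooc = Counter()
    for query in logs:
        for a, b in combinations(sorted(query), 2):
            cooc[(a, b)] += 1

    # Frequent pairs hint at a preference category (e.g. "joyful concert scenes").
    print(cooc.most_common(2))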
[0559] In addition, using the search device according to the present invention, advertising may be carried out by arbitrary means using a co-occurrence matrix, co-occurrence probability, or distance function based on search conditions that combine the aforementioned arbitrary identifiers and feature quantities; the similarity of preference information with other users may be evaluated and used for preference-based compatibility fortune-telling; and advertisements may be presented not only during a search but also while the device waits for user instructions or keeps the user waiting, for example during learning or while search results are being presented.
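Preference similarity between a user and candidate advertisements (or between two users) could then be scored with an ordinary similarity function, for example cosine similarity over sparse preference vectors, as in this hypothetical sketch:

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse preference vectors (dicts)."""
        dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    user = {"emotion:joy": 5, "scene:concert": 3}
    ads = {
        "live-ticket ad": {"scene:concert": 4, "emotion:joy": 2},
        "cookware ad":    {"scene:kitchen": 5},
    }
    print(max(ads, key=lambda name: cosine(user, ads[name])))  # -> live-ticket ad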
[0560] In addition, using the search device according to the present invention, a user may perform indexing by speaking while looking at the screen; reinforcement learning may be carried out by having users themselves evaluate the extracted preference and subjectivity information, thereby improving the accuracy of the extracted information; search results may be presented as a list with small images or videos such as thumbnails; and the degree of match with the search conditions may be expressed by color depth, brightness, the number of icons, or graph drawing, or by arranging the results in order of rank.
[0561] In addition, if phoneme information, emotion information, environmental sound information, musical scale information, and musical instrument information of information distributed using the identifiers described above are associated, and further image recognition information, face information, color space information, in-image object information, and recognized character string information are associated, and these are provided to an information processing apparatus that performs management such as registering information in a database, searching the database, correcting and changing each content file, and generating attached files associated with content files, then information registration and information retrieval can be realized easily and with high accuracy. At this time, by statistically converging the audio and video information input as the target of registration and search, efficient registration of recorded information and services accompanying the browsing of the registered content can also be provided.
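The kind of index this paragraph describes can be pictured as a mapping from content positions to sets of heterogeneous identifiers, queried by subset matching. The identifiers and content names below are invented, and a real implementation would presumably live in a database rather than in memory.

    # Hypothetical in-memory index: (content_id, position_sec) -> identifiers.
    index = {
        ("news-01", 12.0): {"phoneme:ohayou", "emotion:neutral"},
        ("news-01", 95.5): {"emotion:anger", "face:anchor"},
        ("drama-07", 40.0): {"emotion:joy", "scale:C-major"},
    }

    def search(required):
        """Return (content, position) keys whose identifier sets cover the query."""
        return [key for key, ids in index.items() if required <= ids]

    # A search condition already converted into identifiers.
    print(search({"emotion:joy"}))  # -> [("drama-07", 40.0)]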
[0562] In addition, evaluation functions and HMMs that generate identifiers as described above, or that analyze correlations among identifiers to construct categories, may be constructed, and by distributing and exchanging these evaluation functions and their configuration information among users, phoneme and phoneme piece information based on the associated audio information, emotion information, environmental sound information, musical scale information, musical instrument information, and the like, as well as image recognition information, face information, color space information, in-image object information, motion information, recognized character string information, and recognized symbol information, may be associated; information may be registered in a database and search conditions for database retrieval may be set; and by providing these to other information processing apparatuses, arbitrary information registration and retrieval can be realized easily and with high accuracy.
[0563] In addition, the motion features mentioned above are not limited to video: they may be sound source movement information of audio, reflected wave change information such as echo sounding, feedback or torque information from motors and pressure sensors, or robot operation information and contact information.
[0564] In addition, by statistically converging the audio and video information input as the target of registration and search as described above, efficient registration of the recorded information, as well as exchange and sale among users, can be performed, and more efficient services accompanying the browsing of the registered content can be provided.
[0565] In addition, symbol strings and identifiers based on phonemes or phoneme pieces as described above may be transmitted to other devices to change their processing content, or symbol strings based on phonemes or phoneme pieces may be received from other devices to modify or extend the device's own processing and control means. In this case, international phonetic symbols, phoneme pieces, or the phonemes and phoneme pieces of any language may be used.
[0566] In addition, as an evaluation criterion when constructing a new identifier as described above, since a typical recognition rate is around 60%, one may evaluate to what extent existing identifiers show a match rate exceeding 60%, and on the basis of that evaluation construct probability and likelihood functions such as co-occurrence matrices, co-occurrence probabilities, Bayesian models, and HMMs, or evaluation functions such as distance functions, and use them as criteria for new identifiers; when the average match rate of multiple identifiers is around 60%, a new evaluation function or identifier may be constructed; arbitrary symbol string matching methods such as DP, CDP, and reference-interval-free CDP may be combined to measure the degree of match between symbol strings; and learning efficiency may be improved by combination with neural networks, fuzzy logic, chaos, fractals, or genetic algorithms.
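For the symbol-string matching step, a plain DP match (edit distance normalized to a match rate) is enough to make the 60% criterion concrete; the phoneme strings and the threshold usage below are illustrative.

    def dp_match_rate(a, b):
        """DP (edit-distance) match rate between two symbol strings, in [0, 1]."""
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return 1 - d[m][n] / max(m, n)

    # Two phoneme identifier strings; a rate above the 60% criterion would let
    # the existing identifier stand in for the newly observed one.
    rate = dp_match_rate(list("konnichi"), list("konichi"))
    print(rate, rate > 0.6)  # -> 0.875 True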
[0567] In addition, the information processing apparatus is assumed to be composed of a device capable of information registration and retrieval, having, for example, storage units such as a main storage unit and an auxiliary storage unit, an information processing unit that performs evaluation and arithmetic processing of information, a communication unit that exchanges information with external devices, an input unit that receives user instructions, and an output unit that presents processing results to the user; personal computers, large computers, backbone servers, communication base stations, and the like can be considered. It is more preferable that the apparatus be capable of information analysis using a program that performs statistical analysis of the information recorded in the database.
[0568] In addition, a service using the present invention may be linked with a billing system to realize an information distribution service or agent service that provides added value to users and takes user psychology and users' hobbies and preferences into consideration.

[0569] In addition, by constructing an algorithm such that, when the user responds positively to results presented by a robot or agent, reinforcement learning increases the frequently affirmed content and the evaluation functions used for search, the robot or agent is given a "desire to exist", namely to be affirmed by the user, and a learning model in which the robot or agent learns autonomously may be constructed.
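The affirmation-driven learning model of paragraph [0569] can be reduced to a very small update rule, sketched here with invented function names and learning rate: each affirmation strengthens the evaluation function that produced the presented result.

    # Hypothetical weights over the evaluation functions used for search.
    weights = {"phoneme-match": 1.0, "emotion-match": 1.0}

    def feedback(function_name, affirmed, lr=0.1):
        """Strengthen or weaken an evaluation function after user feedback."""
        weights[function_name] *= (1 + lr) if affirmed else (1 - lr)

    feedback("emotion-match", affirmed=True)
    feedback("phoneme-match", affirmed=False)
    print(weights)  # results scored by "emotion-match" now rank slightly higher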
[0570] In addition, co-occurrence information obtained from learning results that is used infrequently may be automatically deleted depending on user evaluations and the amount of free space, or saved to an external storage device or a storage device at a communication destination while the local copy is deleted; alternatively, an index or discriminant function with simplified conditions may be retained locally and the full information acquired externally over a communication line when needed.
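A pruning pass of the kind described in paragraph [0570] might look like the following sketch; the threshold and store contents are assumptions.

    # Hypothetical co-occurrence store: identifier pair -> use count.
    cooc_store = {
        ("emotion:joy", "scene:concert"): 42,
        ("phoneme:a", "scene:rain"): 1,
    }

    def prune(store, min_uses=5):
        """Evict rarely used co-occurrence entries, returning the evicted ones."""
        evicted = {k: v for k, v in store.items() if v < min_uses}
        for k in evicted:
            del store[k]  # could instead be moved to external or remote storage
        return evicted    # candidates to re-fetch over a communication line later

    print(prune(cooc_store), cooc_store)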
[0571] In addition, the portable information terminal may be any so-called portable or wearable information terminal, for example a mobile phone, PDA (Personal Digital Assistant), notebook computer, wearable computer, wristwatch computer, or in-vehicle computer such as a car navigation system; the methods, forms, and shapes of movement, wearing, and holding are not limited. More specifically, it may be a mobile phone, car navigation system, DVD recorder, HDD recorder, video recording/playback device, music recording/playback device, STB, modem, FAX, telephone, personal computer, information distribution server, information distribution base station, in-store information terminal, cash register, POS (Point Of Sales system) terminal, ATM, projector, television, video recorder, or editing machine.
[0572] In addition, these information processing apparatuses and portable information terminals include, in whatever combination is necessary for execution, a feature extraction unit, a user information input unit, an information search unit, an information storage unit, and a query information transmission/reception unit; information between these processes can be exchanged and mutually searched over communication networks such as the Internet or an intranet via wireless LAN, infrared communication, mobile phone networks, ordinary LAN, wired lines, wireless lines, and the like. If a markup language is used, a markup language transmission/reception unit and a markup language interpretation unit may be added to the information input unit and the information output unit as necessary.
[0573] In addition, advertisement information for advertising may be acquired via a communication line, or advertisements attached to the content may be presented; records of advertisement states may be kept to verify advertising effectiveness; search co-occurrence information associated with a high rate of completed advertisements may be analyzed; advertisements whose co-occurrence information is highly similar to the co-occurrence information obtained at indexing time may be presented; and these capabilities may be provided as services.
[0574] In addition, arbitrary information in the storage unit may reside in the same apparatus, or may be acquired from another apparatus via a communication line, and a content search service may be provided.
[0575] In addition, in a search system based on the present invention, the database and the index search evaluation unit may be built into the information processing apparatus or external to it; in the external case, the system can be realized by making them communicable with the information processing apparatus by some means, whether wireless or wired.
[0576] Note that the present invention is merely an example and is not necessarily bound by the descriptions in this text; performance may be improved in combination with techniques described in any patent or document.

Claims

[1] An information processing apparatus comprising:
content information acquisition means for acquiring content information;
search condition input means for inputting a search condition; and
specifying means for specifying, from the content information acquired by the content information acquisition means, content information matching the search condition input by the search condition input means, or a position within that content information;
the apparatus further comprising:
feature quantity extraction means for extracting a feature quantity from the content information;
identifier generation means for generating an identifier, using an evaluation function, from the feature quantity extracted by the feature quantity extraction means;
index information storage means for storing the feature quantity and/or the identifier as index information in association with the content or a position within the content; and
search condition conversion means for converting the search condition input by the search condition input means into a feature quantity and/or an identifier,
wherein the specifying means includes search specifying means for specifying content or a position within content by detecting a match between the index information and the search condition using the feature quantity and/or identifier converted by the search condition conversion means.
[2] An information processing apparatus comprising:
content information acquisition means for acquiring content information;
search condition input means for inputting a search condition; and
specifying means for specifying, from the content information acquired by the content information acquisition means, content information matching the search condition input by the search condition input means, or a position within that content information;
the apparatus further comprising:
feature quantity extraction means for extracting a plurality of different feature quantities from the content information;
identifier generation means for generating a plurality of different identifiers, using evaluation functions, from the plurality of different feature quantities extracted by the feature quantity extraction means;
index information storage means for storing the plurality of different feature quantities and/or identifiers as index information in association with the content or a position within the content; and
search condition conversion means for converting the search condition input by the search condition input means into a plurality of different feature quantities and/or identifiers,
wherein the specifying means includes search specifying means for specifying content or a position within content by detecting a match between the index information and the search condition using the plurality of different feature quantities and/or identifiers converted by the search condition conversion means.
[3] The information processing apparatus according to claim 1 or 2, wherein:
the index information storage means further stores co-occurrence information constructed on the basis of feature quantities and/or identifiers acquired from the content, in association with the content or a position within the content;
the apparatus further comprises search condition co-occurrence information construction means for constructing, as search condition co-occurrence information, co-occurrence information based on the feature quantities and/or identifiers converted from the search condition by the search condition conversion means; and
the search specifying means includes co-occurrence search specifying means for specifying content or a position within content by detecting a match between the search condition co-occurrence information constructed by the search condition co-occurrence information construction means and the index co-occurrence information.
[4] 前記コンテンツには文字情報が含まれており、 [4] The content includes text information,
前記識別子生成手段は、前記文字情報に基づ!、て識別子を生成することを特徴と する請求項 1から 3のいずれか一項に記載の情報処理装置。  4. The information processing apparatus according to claim 1, wherein the identifier generating unit generates an identifier based on the character information.
[5] The information processing apparatus according to claim 4, further comprising dictionary information storage means for storing the character information and identifiers in association with each other as dictionary information, wherein the identifier generation means generates an identifier from the character information included in the content by using the dictionary information.
[6] The information processing apparatus according to any one of claims 1 to 5, further comprising standard pattern dictionary information storage means for storing, in the dictionary information storage means, the identifier and a standard pattern in association with each other as standard pattern dictionary information, and identifier-to-feature-quantity conversion means for converting the identifier into a feature quantity based on the standard pattern by using the standard pattern dictionary information.
[7] The information processing apparatus according to any one of claims 1 to 6, wherein the index information storage means further stores the feature quantity and/or the identifier in association with the content or a position within the content on the basis of the real time of the content information, and the specifying means is means for detecting a match between the index information and the search condition from content distributed in real time.
[8] The information processing apparatus according to any one of claims 1 to 7, wherein advertisement information associated by co-occurrence information and/or the index information is presented during a search of the content information and/or together with a search result or detection result.
[9] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from phoneme information used in phoneme recognition of the content, or a phoneme identifier generated from the phoneme information.
[10] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from phoneme piece information used in phoneme piece recognition of the content, or a phoneme piece identifier generated from the phoneme piece information.
[11] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from emotion information used in emotion recognition from the content, or an emotion identifier generated from the emotion information.
[12] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from auditory information used in recognition based on auditory information of the content, or an identifier generated from the auditory information.
[13] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means is a feature quantity extracted from visual information used in recognition based on visual information of the content, or an identifier generated from the visual information.
[14] 前記コンテンツには文字情報が含まれており、 [14] The content includes text information,
前記特徴量抽出手段が抽出する複数の異なる特徴量若しくは識別子生成手段が 生成する識別量のうち、少なくとも 1つは文字情報力 抽出される特徴量若しくは文 字情報から生成される識別子であることを特徴とする請求項 2に記載の情報処理装 置。  Among the plurality of different feature quantities extracted by the feature quantity extraction means or the identification quantity generated by the identifier generation means, at least one is an identifier generated from the feature information extracted from the character information power or character information. The information processing apparatus according to claim 2, wherein the information processing apparatus is characterized.
[15] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means or of the plurality of different identifiers generated by the identifier generation means is a feature quantity extracted from program information, or an identifier based on program information.
[16] The information processing apparatus according to claim 2, wherein at least one of the plurality of different feature quantities extracted by the feature quantity extraction means or of the plurality of different identifiers generated by the identifier generation means is a feature quantity extracted from sensor information, or an identifier based on sensor information.
[17] The information processing apparatus according to claim 3, comprising evaluation function reconstruction means for reconstructing the evaluation function from co-occurrence information constructed on the basis of feature quantities and/or identifiers acquired from the content.
[18] The information processing apparatus according to claim 3, comprising evaluation function reconstruction means for reconstructing the evaluation function from co-occurrence information constructed on the basis of the feature quantities and/or identifiers converted from the search condition by the search condition conversion means.
[19] The information processing apparatus according to claim 3, comprising search result co-occurrence information construction means for constructing co-occurrence information based on the result of content or a position within content being specified by the co-occurrence search specifying means, and evaluation function reconstruction means for reconstructing the evaluation function from the co-occurrence information constructed by the search result co-occurrence information construction means.
[20] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, phoneme feature quantities for use in phoneme recognition extracted from the content and/or phoneme identifiers obtained by phoneme recognition, in association with emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[21] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, phoneme piece feature quantities for use in phoneme piece recognition extracted from the content and/or phoneme piece identifiers obtained by phoneme piece recognition, in association with emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[22] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, in association with one another: phoneme feature quantities for use in phoneme recognition extracted from the content and/or phoneme identifiers obtained by phoneme recognition; emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition; and first feature quantities for use in first recognition extracted from the content and/or first identifiers obtained by the first recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[23] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, in association with one another: phoneme piece feature quantities for use in phoneme piece recognition extracted from the content and/or phoneme piece identifiers obtained by phoneme piece recognition; emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition; and first feature quantities for use in first recognition extracted from the content and/or first identifiers obtained by the first recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[24] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, phoneme feature quantities for use in phoneme recognition extracted from the content and/or phoneme identifiers obtained by phoneme recognition, in association with first feature quantities for use in first recognition extracted from the content and/or first identifiers obtained by the first recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[25] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, phoneme piece feature quantities for use in phoneme piece recognition extracted from the content and/or phoneme piece identifiers obtained by phoneme piece recognition, in association with first feature quantities for use in first recognition extracted from the content and/or first identifiers obtained by the first recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[26] An information processing apparatus comprising content acquisition means for acquiring content, search condition input means for inputting a search condition for searching for a predetermined scene from the content, and specifying means for specifying contents matching the search condition from among the content stored in content storage means, the apparatus comprising:
index recording means for recording, as an index, emotion feature quantities for use in emotion recognition extracted from the content and/or emotion identifiers obtained by emotion recognition, in association with first feature quantities for use in first recognition extracted from the content and/or first identifiers obtained by the first recognition,
wherein the specifying means includes index specifying means for specifying, from the content, contents matching the search condition on the basis of the index information recorded by the index recording means.
[27] The information processing apparatus according to any one of claims 22 to 26, wherein the first identifier and/or first feature quantity is an identifier and/or feature quantity based on auditory information and/or visual information and/or character information and/or sensor information and/or program information.
[28] A program for a computer in an information processing apparatus comprising a content information acquisition function for acquiring content information, a search condition input function for inputting a search condition, and a specifying function for specifying, from the content information acquired by the content information acquisition function, content information matching the search condition input by the search condition input function or a position within that content information, the program causing the computer to realize:
a feature quantity extraction function for extracting a feature quantity from the content information;
an identifier generation function for generating an identifier, using an evaluation function, from the feature quantity extracted by the feature quantity extraction function;
an index information storage function for storing the feature quantity and/or the identifier as index information in association with the content or a position within the content; and
a search condition conversion function for converting the search condition input by the search condition input function into a feature quantity and/or an identifier,
wherein the specifying function realizes a search specifying function for specifying content or a position within content by detecting a match between the index information and the search condition using the feature quantity and/or identifier converted by the search condition conversion function.
[29] A program for a computer in an information processing apparatus comprising a content information acquisition function for acquiring content information, a search condition input function for inputting a search condition, and a specifying function for specifying, from the content information acquired by the content information acquisition function, content information matching the search condition input by the search condition input function or a position within that content information, the program causing the computer to realize:
a feature quantity extraction function for extracting a plurality of different feature quantities from the content information;
an identifier generation function for generating a plurality of different identifiers, using evaluation functions, from the plurality of different feature quantities extracted by the feature quantity extraction function;
an index information storage function for storing the plurality of different feature quantities and/or identifiers as index information in association with the content or a position within the content; and
a search condition conversion function for converting the search condition input by the search condition input function into a plurality of different feature quantities and/or identifiers,
wherein the specifying function realizes a search specifying function for specifying content or a position within content by detecting a match between the index information and the search condition using the plurality of different feature quantities and/or identifiers converted by the search condition conversion function.
PCT/JP2006/320557 2005-10-14 2006-10-16 Information processing device, and program WO2007043679A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2007540220A JPWO2007043679A1 (en) 2005-10-14 2006-10-16 Information processing apparatus and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2005-300674 2005-10-14
JP2005300674 2005-10-14

Publications (1)

Publication Number Publication Date
WO2007043679A1 true WO2007043679A1 (en) 2007-04-19

Family

ID=37942896

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2006/320557 WO2007043679A1 (en) 2005-10-14 2006-10-16 Information processing device, and program

Country Status (2)

Country Link
JP (1) JPWO2007043679A1 (en)
WO (1) WO2007043679A1 (en)



Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08286693A (en) * 1995-04-13 1996-11-01 Toshiba Corp Information processing device
JP3607228B2 (en) * 1998-12-17 2005-01-05 松下電器産業株式会社 VIDEO SEARCH DATA GENERATION DEVICE, VIDEO SEARCH DATA GENERATION METHOD, VIDEO SEARCH DEVICE, AND VIDEO SEARCH METHOD

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000090133A (en) * 1998-09-08 2000-03-31 Fujitsu Ltd Three-dimensional pt plate model generating device
JP2000284793A (en) * 1999-03-31 2000-10-13 Sharp Corp Voice summary device, recording medium recording voice summary program
JP2001243185A (en) * 2000-03-01 2001-09-07 Sony Corp Advertisement information display method, advertisement information display system, advertisement information display device, and recording medium
JP2002007432A (en) * 2000-06-23 2002-01-11 Ntt Docomo Inc Information retrieval system
JP2003167914A (en) * 2001-11-30 2003-06-13 Fujitsu Ltd Multimedia information retrieving method, program, recording medium and system therefor
JP2005182703A (en) * 2003-12-24 2005-07-07 Triax Inc Image analysis system, image analysis method and strap of portable communication terminal used therefor

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
HAGITA N. ET AL: "Multimedia computing [VI. Kan]", JOURNAL OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, vol. 80, no. 10, 1997, pages 1050 - 1055, XP003011789 *
IDE I. ET AL: "Gengo Joho o Tomonau Gazo no Gazoteki Tokuchoryo to Gogi no Tokeiteki Taiozuke", INFORMATION PROCESSING SOCIETY OF JAPAN KENKYU HOKOKU, vol. 99, no. 3, 1999, pages 137 - 143, XP003011791 *
OIKAWA S. ET AL: "Onsei Media Data o Taisho to shita Metadata Jido Chushutsu Hoshiki ni Kansuru Kenkyu", IEICE TECHNICAL REPORT, vol. 104, no. 176, 2004, pages 133 - 137, XP003011792 *
ONO A. ET AL: "Jotai Kan'i Model to Scene Kijutsu Gengo ni yoru Jido Keyword Fuyo Kino o Motsu Gazo Database to Sono Hyoka", TRANSACTIONS OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, vol. J79-D-II, no. 4, 1996, pages 476 - 483, XP003011790 *
OTAKE T. ET AL: "Tahenryo Kaiseki o Mochiita Dogazo Tokuchoryo Chushutsu Hoshiki ni yoru Bangumi Shikibetsu Jikken", ITE TECHNICAL REPORT, vol. 28, no. 48, 2004, pages 1 - 6, XP003011793 *

Cited By (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009265774A (en) * 2008-04-22 2009-11-12 Canon Inc Information processor and information processing method
JP2010017274A (en) * 2008-07-09 2010-01-28 Fuji Xerox Co Ltd Image processor and image processing program
JP2010055174A (en) * 2008-08-26 2010-03-11 Nippon Telegr & Teleph Corp <Ntt> Context extraction server, context extraction method, and program
JP2010267012A (en) * 2009-05-13 2010-11-25 Hitachi Ltd System and method for voice retrieving data
US9886709B2 (en) 2011-02-22 2018-02-06 Sony Corporation Display control device, display control method, search device, search method, program and communication system
JP2012174029A (en) * 2011-02-22 2012-09-10 Sony Corp Information processor, information processing method, and program
US9430795B2 (en) 2011-02-22 2016-08-30 Sony Corporation Display control device, display control method, search device, search method, program and communication system
JP2012221035A (en) * 2011-04-05 2012-11-12 Nippon Telegr & Teleph Corp <Ntt> Electronic information access system, method and program
KR101696988B1 (en) * 2012-03-14 2017-01-16 제너럴 인스트루먼트 코포레이션 Sentiment mapping in a media content item
KR20140139549A (en) * 2012-03-14 2014-12-05 제너럴 인스트루먼트 코포레이션 Sentiment mapping in a media content item
JP2014071328A (en) * 2012-09-28 2014-04-21 Xing Inc Karaoke device and computer program
JP2014126946A (en) * 2012-12-25 2014-07-07 Korea Inst Of Industrial Technology Artificial emotion generator and method
JPWO2014171046A1 (en) * 2013-04-17 2017-02-16 パナソニックIpマネジメント株式会社 Video receiving apparatus and information display control method in video receiving apparatus
JP2015177416A (en) * 2014-03-17 2015-10-05 株式会社ニコン Content reaction output device, content reaction output system, and content reaction output program
JP2016038796A (en) * 2014-08-08 2016-03-22 東芝テック株式会社 Information processor and program
JP2016118999A (en) * 2014-12-22 2016-06-30 カシオ計算機株式会社 Speech retrieval device, speech retrieval method, and program
WO2016103651A1 (en) * 2014-12-22 2016-06-30 日本電気株式会社 Information processing system, information processing method and recording medium
JP2016119000A (en) * 2014-12-22 2016-06-30 カシオ計算機株式会社 Speech retrieval device, speech retrieval method, and program
CN105719643A (en) * 2014-12-22 2016-06-29 卡西欧计算机株式会社 VOICE RETRIEVAL APPARATUS and VOICE RETRIEVAL METHOD
WO2017098760A1 (en) * 2015-12-08 2017-06-15 ソニー株式会社 Information processing device, information processing method, and program
US11288723B2 (en) 2015-12-08 2022-03-29 Sony Corporation Information processing device and information processing method
JPWO2017098760A1 (en) * 2015-12-08 2018-09-20 ソニー株式会社 Information processing apparatus, information processing method, and program
JP2017123579A (en) * 2016-01-07 2017-07-13 株式会社見果てぬ夢 Neo medium generation device, neo medium generation method, and neo medium generation program
JP2017182261A (en) * 2016-03-29 2017-10-05 大日本印刷株式会社 Information processing apparatus, information processing method, and program
US11283894B2 (en) 2017-06-06 2022-03-22 International Business Machines Corporation Edge caching for cognitive applications
JP2020522920A (en) * 2017-06-06 2020-07-30 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Edge caching for cognitive applications
JP7067845B2 (en) 2017-06-06 2022-05-16 インターナショナル・ビジネス・マシーンズ・コーポレーション Edge caching for cognitive applications
US11551219B2 (en) * 2017-06-16 2023-01-10 Alibaba Group Holding Limited Payment method, client, electronic device, storage medium, and server
JP2019008607A (en) * 2017-06-26 2019-01-17 Jcc株式会社 Video management server and video management system
JP2019030949A (en) * 2017-08-09 2019-02-28 日本電信電話株式会社 Robot control device, robot control method, and robot control program
CN110096938B (en) * 2018-01-31 2022-10-04 腾讯科技(深圳)有限公司 Method and device for processing action behaviors in video
CN110096938A (en) * 2018-01-31 2019-08-06 腾讯科技(深圳)有限公司 A kind for the treatment of method and apparatus of action behavior in video
JP2018142358A (en) * 2018-05-01 2018-09-13 東芝テック株式会社 Information processor and program
US10824874B2 (en) 2018-06-08 2020-11-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing video
JP2020004378A (en) * 2018-06-29 2020-01-09 バイドゥ オンライン ネットワーク テクノロジー (ベイジン) カンパニー リミテッド Method and device for information push
US10931772B2 (en) 2018-06-29 2021-02-23 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for pushing information
WO2020102005A1 (en) * 2018-11-15 2020-05-22 Sony Interactive Entertainment LLC Dynamic music creation in gaming
CN113038998A (en) * 2018-11-15 2021-06-25 索尼互动娱乐有限责任公司 Dynamic music creation in a game
US11328700B2 (en) 2018-11-15 2022-05-10 Sony Interactive Entertainment LLC Dynamic music modification
US11969656B2 (en) 2018-11-15 2024-04-30 Sony Interactive Entertainment LLC Dynamic music creation in gaming
JP2020135424A (en) * 2019-02-20 2020-08-31 KDDI Corporation Information processing apparatus, information processing method, and program
JP6997733B2 (en) 2019-02-20 2022-01-18 KDDI Corporation Information processing apparatus, information processing method, and program
CN109889891B (en) * 2019-03-05 2023-03-24 Tencent Technology (Shenzhen) Co., Ltd. Method, apparatus and storage medium for acquiring a target media file
CN109889891A (en) * 2019-03-05 2019-06-14 Tencent Technology (Shenzhen) Co., Ltd. Method, apparatus and storage medium for acquiring a target media file
WO2020246075A1 (en) * 2019-06-04 2020-12-10 Sony Corporation Action control device, action control method, and program
CN111126635A (en) * 2019-12-25 2020-05-08 Harbin Xinzhongxin Electronics Co., Ltd. Evaluation method for DIY store POS machine maintenance type selection based on customer satisfaction analysis
CN111126635B (en) * 2019-12-25 2023-06-20 Harbin Xinzhongxin Electronics Co., Ltd. Evaluation method for DIY store POS machine maintenance type selection based on customer satisfaction analysis
JP2022524471A (en) * 2020-02-21 2022-05-06 Google LLC Systems and methods for extracting temporal information from animated media content items using machine learning
JP7192086B2 (en) 2020-02-21 2022-12-19 Google LLC Systems and methods for extracting temporal information from animated media content items using machine learning
JP7464135B2 (en) 2020-09-03 2024-04-09 Nippon Telegraph and Telephone Corporation Movement amount estimation device, movement amount estimation method, and program
WO2022049690A1 (en) * 2020-09-03 2022-03-10 Nippon Telegraph and Telephone Corporation Movement amount estimation device, movement amount estimation method, and program
WO2022180858A1 (en) * 2021-02-26 2022-09-01 I'mbesideyou Inc. Video session evaluation terminal, video session evaluation system, and video session evaluation program
CN116452241B (en) * 2023-04-17 2023-10-20 Guangxi University of Finance and Economics User churn probability calculation method based on multimodal fusion neural network
CN116452241A (en) * 2023-04-17 2023-07-18 Guangxi University of Finance and Economics User churn probability calculation method based on multimodal fusion neural network

Also Published As

Publication number Publication date
JPWO2007043679A1 (en) 2009-04-23

Similar Documents

Publication Publication Date Title
WO2007043679A1 (en) Information processing device, and program
KR102018295B1 (en) Apparatus, method and computer-readable medium for searching and providing sectional video
US11488576B2 (en) Artificial intelligence apparatus for generating text or speech having content-based style and method for the same
JP6876752B2 (en) Response method and equipment
CN113569088B (en) Music recommendation method and device and readable storage medium
CN109086408A (en) Document creation method, device, electronic equipment and computer-readable medium
Buitelaar et al. MixedEmotions: An open-source toolbox for multimodal emotion analysis
US20140289323A1 (en) Knowledge-information-processing server system having image recognition system
WO2020081872A1 (en) Characterizing content for audio-video dubbing and other transformations
CN110517689A (en) Voice data processing method, device and storage medium
WO2005071665A1 (en) Method and system for determining the topic of a conversation and obtaining and presenting related content
CN109920409B (en) Sound retrieval method, device, system and storage medium
CN113010138B (en) Article voice playing method, device and equipment and computer readable storage medium
US9525841B2 (en) Imaging device for associating image data with shooting condition information
WO2007069512A1 (en) Information processing device, and program
WO2021149929A1 (en) System for providing customized video producing service using cloud-based voice combining
Maybury Multimedia information extraction: Advances in video, audio, and imagery analysis for search, data mining, surveillance and authoring
Wang et al. Generating images from spoken descriptions
Zhang Voice keyword retrieval method using attention mechanism and multimodal information fusion
Yang Research on music content recognition and recommendation technology based on deep learning
CN113407766A (en) Visual animation display method and related equipment
US20210337274A1 (en) Artificial intelligence apparatus and method for providing visual information
CN117521603A (en) Short-video text language model construction and training method
KR20230174986A (en) Apparatus and method for predicting favorite perfume
CN116013242A (en) Speech synthesis method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
WWE WIPO information: entry into national phase
    Ref document number: 2007540220
    Country of ref document: JP
NENP Non-entry into the national phase
    Ref country code: DE
122 EP: PCT application non-entry in European phase
    Ref document number: 06821872
    Country of ref document: EP
    Kind code of ref document: A1