WO2007069512A1 - Information processing device, and program - Google Patents

Information processing device, and program

Info

Publication number
WO2007069512A1
WO2007069512A1 (PCT/JP2006/324348)
Authority
WO
WIPO (PCT)
Prior art keywords
information
phoneme
string
phonetic symbol
recognition
Prior art date
Application number
PCT/JP2006/324348
Other languages
French (fr)
Japanese (ja)
Inventor
Masayoshi Ihara
Original Assignee
Sharp Kabushiki Kaisha
Priority date
Filing date
Publication date
Application filed by Sharp Kabushiki Kaisha filed Critical Sharp Kabushiki Kaisha
Priority to JP2007550144A priority Critical patent/JPWO2007069512A1/en
Publication of WO2007069512A1 publication Critical patent/WO2007069512A1/en


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19: Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/193: Formal grammars, e.g. finite state automata, context free grammars or word networks

Definitions

  • The present invention relates to an information processing apparatus that uses phoneme recognition and/or phoneme-piece recognition for speech recognition. Background art
  • In a device such as that of Non-Patent Document 1, a word specified in a phoneme recognition dictionary and a device control method registered in the dictionary in association with that word are selected and executed by phoneme recognition processing. The techniques for recognizing phonemes and phoneme pieces themselves have long been known, as shown in Patent Document 1.
  • Patent Document 3 proposes, for speech recognition in HTML (one of the markup languages), making speech operation easier for the user by changing the display representation of recognizable words.
  • Patent Document 4 proposes a method for dynamically acquiring recognition dictionary data based on an acoustic model with a minimum vocabulary.
  • Patent Document 5 proposes, for speech recognition in HTML, designating a range with a specific symbol in order to identify recognizable words, thereby clearly indicating to the user that speech recognition can be performed; convenience is further provided by writing a recognizable reading for words that are difficult to pronounce.
  • Patent Document 1 Japanese Patent Laid-Open No. 62-220998
  • Patent Document 2 Japanese Patent Laid-Open No. 2005-70312
  • Patent Document 3 Japanese Patent Laid-Open No. 11-25098
  • Patent Document 4 Japanese Patent Laid-Open No. 2002-91858
  • Patent Document 5 Japanese Patent Laid-Open No. 2005-18241
  • Non-Patent Document 1 "Research and Development on Life Support Interface for Aged Society", Key Project Research Report, Aomori Prefectural Industrial Research Center, Vol. 5, Apr. 1998 - Mar. 2001, 031
  • Non-Patent Document 2 Masayuki Nakazawa, Takashi Endo, Kiyoshi Furukawa, Jun Toyoura, Takashi Oka (New Information Processing Development Corporation), "Study of speech summaries and topic summaries using phoneme symbol sequences from speech waveforms", IEICE Technical Report, SP96-28, pp. 61-68, June 1996.
  • Non-Patent Document 3 Takashi Oka, Hironobu Takahashi, Takuichi Nishimura, Nobuhiro Sekimoto, Hidehide Mori, Masanori Ihara, Hiroaki Yabe, Hiroaki Hashiguchi, Hiroshi Matsumura, "Pattern search algorithm supporting 'CrossMediator'", Artificial Intelligence study group, volume 1, pages 1-6, Japanese Society for Artificial Intelligence, 2001.
  • The use of phoneme symbol strings in a markup language employs the Segment and MediaLocator structures as descriptions in MPEG-7, which is used for moving picture streams such as MPEG-2: <MediaLocator> locates positions within video content, and the <Series> description method attaches the same kind of metadata at fixed intervals. MPEG-7 audio has a <SpokenContent> DS that describes word and phone ratings from automatic speech recognition results.
  • VoiceXML, a standardized method for speech recognition, implements recognition that depends on a grammar according to the context.
  • However, no method has been proposed for dynamically constructing dictionary information by assigning attributes, using phonetic-symbol identifiers such as phonemes and phoneme pieces, to the target range of arbitrary tags regardless of context or grammar.
  • Phoneme recognition and phoneme-piece recognition differ from general speech recognition: they do not recognize vocabulary by interpreting meaning and content, and they do not dynamically reconfigure themselves according to changes in the features and acoustic models of language models such as words, grammar, and parts of speech. More specifically, phoneme recognition and phoneme-piece recognition do not use a language model related to grammar.
  • Recognition by phonemes and phoneme pieces analyzes the speaker's utterance using a static acoustic model for each phonetic symbol, and evaluates only matches between the recognized phonetic symbol sequence and the phonetic symbol strings in the recognition dictionary. Because the recognition process and the recognition dictionary structure are simple, it is possible to recognize identifier strings consisting of phonetic symbols such as phonemes and phoneme pieces even for unregistered words and exclamations, evaluating only the match of sounds.
  • A dynamic acoustic model that learns according to the speaker's utterance characteristics and improves performance may be used, as in the past; however, unlike general speech recognition, which depends on words and grammar, phoneme recognition and phoneme-piece recognition do not dynamically switch the acoustic model.
  • In a recognition method using phonemes or phoneme pieces, an unregistered word in the recognition target sentence is converted into hiragana notation, the converted hiragana string is further converted into a phoneme string or phoneme-piece string based on the prosody obtained from it, and the resulting symbol strings are temporarily registered in the recognition dictionary. The user's speech is then recognized as a phoneme string or phoneme-piece string, the degree of coincidence between the symbol sequences is measured, and the recognition result is acquired. Speech recognition with dynamic phonemes and phoneme pieces is thus possible with a higher degree of freedom than conventional speech recognition.
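The degree-of-coincidence measurement described above can be sketched as a dynamic-programming edit distance over phoneme symbols. The function names and the sample dictionary below are illustrative assumptions, not part of the original disclosure.

```python
def phoneme_edit_distance(a, b):
    """DP (Levenshtein) distance between two phoneme sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def best_match(recognized, dictionary):
    """Pick the dictionary word whose phoneme string is closest to the input."""
    return min(dictionary, key=lambda w: phoneme_edit_distance(recognized, dictionary[w]))

# Hypothetical temporary recognition dictionary: word -> phoneme sequence
dictionary = {
    "main": ["m", "e", "i", "n"],
    "menu": ["m", "e", "ny", "u"],
}
print(best_match(["m", "e", "i", "N"], dictionary))  # closest entry: "main"
```

A real implementation would weight substitutions by acoustic confusability rather than using a uniform cost of 1.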
  • re-learning for recognition may be performed by reusing the acoustic information obtained by the user's speech ability as teacher information.
  • Exclamations such as "uun" and coined words vary greatly with the times and the environment. In content information in particular, dynamic proper nouns that depend on trends, such as product names and actor names, are inefficient to register in a recognition dictionary, and repeatedly distributing recognition dictionaries that include long-lived acoustic models and grammar models is relatively difficult because of their size. Registering such huge and varied vocabulary in a recognition dictionary, and hence recognition based on vocabulary, was therefore practically impossible.
  • The present invention aims to provide an information processing apparatus and the like that, when speech recognition is performed on a word or character string included in content information, can realize more appropriate speech recognition by using recognition dictionary information based on phonetic symbols, with phonetic-symbol recognition consisting of phonemes and phoneme pieces, even when no word model, acoustic model, grammar model, or part-of-speech information is registered in the speech recognition dictionary.
  • An information processing apparatus includes: a content information acquisition unit that acquires content information including character information and/or meta information; recognized-phonetic-symbol-string detecting means for detecting a recognized phonetic symbol string consisting of phonetic symbols from the content information acquired by the content information acquisition unit; and recognition dictionary information generating means for generating recognition dictionary information using the recognized phonetic symbol string.
  • An information processing apparatus includes: a content information acquisition unit that acquires content information including character information and/or meta information; development-target character string detecting means for detecting a development target character string based on the character information and/or meta information from the content information acquired by the content information acquisition unit; phonetic symbol storage means for storing character strings and phonetic symbols in association with each other; a phonetic symbol conversion unit that converts the development target character string into a recognized phonetic symbol string using the phonetic symbol storage means; and a recognition dictionary information generation unit that generates recognition dictionary information using the recognized phonetic symbol string.
  • The third invention further includes content information storing means for storing the content information with the phonetic symbols converted by the phonetic symbol conversion means added to it.
  • The fourth invention is the information processing apparatus according to any one of the first to third inventions, further comprising a transmitter that transmits, to another information processing terminal, the content information stored by the content information storage means and the recognition dictionary information generated based on that content information.
  • The apparatus further includes: voice input means for inputting voice; feature amount extracting means for extracting feature amounts of the voice input by the voice input means; feature-amount phonetic symbol converting means for converting the feature amounts extracted by the feature amount extracting means into phonetic symbols; and processing execution means for evaluating the converted phonetic symbols against the phonetic symbols constituting a recognized phonetic symbol string included in the recognition dictionary information, and executing a predetermined process corresponding to the most similar phonetic symbol string.
  • In addition, the content information includes phoneme information and/or phoneme-piece information, and the processing execution means evaluates the phonetic symbols converted by the feature-amount phonetic symbol converting means against the phonetic symbols constituting the recognized phonetic symbol string included in the recognition dictionary information, and presents information corresponding to the most similar phonetic symbol string to the user by voice utterance.
  • The seventh invention is characterized in that, in the information processing apparatus according to any one of the first to sixth inventions, the phonetic symbol is a phoneme or a phoneme piece.
  • The eighth invention is characterized in that, in the information processing apparatus according to any one of the first to sixth inventions, the process to be executed is an authentication process accompanying phoneme recognition.
  • A program causes a computer to realize: a markup language interpretation step for interpreting information described using a markup language; an attribute acquisition step for acquiring an attribute specified by the interpretation; a phonetic symbol extraction step for extracting a phonetic symbol string, and/or a phoneme sequence and/or a phoneme-piece sequence, associated with the attribute acquired in the attribute acquisition step; and a dictionary changing step for changing the phoneme string dictionary used in the phoneme recognition unit according to the phonetic symbol extraction step.
  • Another program causes a computer to realize: a markup language interpretation step for interpreting information described using a markup language; an attribute acquisition step for acquiring an attribute designated by the interpretation; a phonetic symbol extraction step for extracting a phonetic symbol string, and/or a phoneme sequence and/or a phoneme-piece sequence, associated with the attribute acquired by the attribute acquisition step; an information type evaluation step for evaluating the type of information input by the user based on the attribute acquired by the attribute acquisition step; and a dictionary changing step for changing the phoneme string dictionary used in the phoneme recognition unit according to the information type evaluation step.
  • To solve the above problems, a phoneme dictionary necessary for recognizing provided content information is included in, or associated with, the content information. The names of phoneme strings, phoneme-piece strings, and various identifiers are described as tag attributes in the markup language, so that the utterance phoneme dictionary can be specified per scene, spanning frames of content images, pages of sentences, frames in sentence structures, single frames of moving images, and multiple frames of moving images.
  • Variables and attributes can be changed depending on the distribution file format and markup language sent to the user, such as HTML, XML, RSS, EPG, BML, MPEG-7, or CSV.
  • Scene names, actor names, and cast names are classified by attributes, variables, and tags using phoneme symbol strings and phoneme-piece symbol strings. A search by any actor or cast name is then possible using phoneme search technology, and because a phoneme string can be acquired per scene from the markup language information, a device capable of arbitrary instructions and searches can be realized, solving the problem.
  • A variable or attribute including a phoneme symbol string is provided in the target link or CGI notation; alternatively, a range surrounded by a specific tag is converted into a phoneme string and embedded as a tag variable or attribute, making it selectable. A variable or attribute may be provided for each table element of a table tag surrounding a product, giving each element tag a name as a variable or attribute with phoneme string symbols; a form tag may be given a phoneme string as a variable or attribute, and based on the given phoneme string, information is transmitted or a transition is made to the next page.
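As a concrete sketch of embedding a phoneme string as a tag attribute, the snippet below rewrites an anchor tag so that the "rpronounce" attribute (the attribute name used later in this description) carries the phoneme string of the link's reading. The kana-to-phoneme table is a minimal illustrative assumption, not a complete conversion dictionary.

```python
# Minimal sketch: embed a phoneme string as an "rpronounce" attribute.
# The conversion table is an illustrative fragment, not a full dictionary.
KANA_TO_PHONEME = {"め": "m/e", "い": "i", "ん": "N"}

def to_phoneme_string(kana: str) -> str:
    """Convert a hiragana reading into a slash-separated phoneme string."""
    return "/".join(KANA_TO_PHONEME[ch] for ch in kana)

def embed_pronounce(tag: str, text: str, href: str, reading: str) -> str:
    """Emit a link tag whose rpronounce attribute holds the phoneme string."""
    phonemes = to_phoneme_string(reading)
    return f'<{tag} href="{href}" rpronounce="{phonemes}">{text}</{tag}>'

html = embed_pronounce("A", "メイン", "main.html", "めいん")
print(html)  # <A href="main.html" rpronounce="m/e/i/N">メイン</A>
```

A receiving terminal can then match the user's recognized phoneme string directly against the attribute value and follow the link, with no word-level dictionary needed.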
  • With RSS, phoneme strings and phoneme-piece sequences may be distributed, or IDs may be associated with keywords using tags, with the IDs used as recognition dictionary information associated with phoneme strings or phoneme-piece sequences.
  • Further, by associating an image recognition dictionary, such as one for faces or fingerprints, with phonetic symbol strings of phonemes or phoneme pieces, and by associating the recognition dictionary with an acoustic model based on phonemes or phoneme pieces for each speaker, individual recognition using secret words may be performed.
  • The contents of the recognition dictionary based on phonemes and phoneme pieces are acquired externally as markup language attributes, arbitrary tags, and dictionary files, enabling operation of the information processing apparatus and solving the problems.
  • This can be used for device control by the system, for highly versatile personal authentication (since the authentication conditions accompanying images and sounds can be changed depending on the type of information), and for information exchange between information processing apparatuses. An easy-to-use user interface can be realized by adding phoneme and phoneme-piece notation to existing markup language and content, and by using dictionary information with phonemes and phoneme pieces attached to, or associated with, them.
  • FIG. 1 is a block diagram of an information processing apparatus using the present invention.
  • FIG. 2 is a diagram showing an example of the data structure of recognition dictionary information.
  • FIG. 3 is a diagram showing an operation flow of a phonetic symbol assignment process.
  • FIG. 4 is a diagram for explaining the operation of a phonetic symbol assignment process.
  • FIG. 5 is a diagram for explaining the operation of a phonetic symbol assignment process.
  • FIG. 6 is a diagram for explaining the operation of a phonetic symbol assignment process.
  • FIG. 7 is a diagram for explaining the operation of a phonetic symbol assignment process.
  • FIG. 8 is a diagram showing an operation flow of recognition dictionary update processing.
  • FIG. 9 is a diagram showing a different data structure of recognition dictionary information.
  • FIG. 10 is a diagram showing an operation flow of recognition dictionary information update processing.
  • FIG. 11 is a diagram for explaining the operation of recognition dictionary information update processing.
  • FIG. 12 is a diagram for explaining the operation of recognition dictionary information update processing.
  • FIG. 13 is a diagram for explaining the operation of recognition dictionary information update processing.
  • FIG. 14 is a diagram for explaining the operation of recognition dictionary information update processing.
  • FIG. 15 is a diagram showing an operation flow when applied to a server-client model.
  • FIG. 16 is a diagram showing an operation flow when applied to a server-client model.
  • FIG. 17 is a diagram for explaining a modification in the present embodiment.
  • FIG. 18 is a diagram for explaining a modification of the present embodiment.
  • The present invention can be configured as an information processing apparatus that changes the markup language notation used for content information, stores it, and uses the changed information; as a distribution apparatus that distributes the changed information; and as a receiving terminal that receives the information and uses it for recognition, operations, and responses. More specifically, as shown in the XML and HTML examples, information written in an existing markup language is changed, tags are added, variables and attributes are added, saved, changed, and delivered, and the information processing apparatus is operated by receiving such information.
  • The contents may include movies, dramas, photographs, news reports, illustrations, paintings, music, promotional videos, novels, magazines, games, papers, textbooks, dictionaries, books, comics, catalogs, posters, broadcast program information, and other generally well-known forms.
  • Information provided by camera, microphone, or sensor input, the names of such information, states and situations, abstract concepts, and superordinate and subordinate concepts may also be included in the developed information.
  • The information may also be a time-series change of video, a time-series change of speech, a sentence anticipating a time-series change of the reader's reading position, electronic information in HTML markup language notation, a search index generated from them, a read-aloud position, and so on.
  • Punctuation marks, sentences, chapters, and paragraphs may be captured as frames.
  • The content also includes meta information attached to it: documents as text information; EPG and BML as program information; musical scales as scale information; general still and moving images; and polygon data, vector data, texture data, and motion data as 3D information. In short, it covers visual information, auditory information, text information, sensor information, and the like.
  • A character string is expanded into a phoneme symbol string or phoneme-piece symbol string and can be used for recognition with phoneme or phoneme-piece symbols. Whether the user's utterance is consistent with the recognized phoneme symbol string or phoneme-piece symbol string is also evaluated; alternatively, the spoken phonemes may be converted into some phonetic characters and the phonetic characters matched against each other, or the phonetic symbol string based on the recognition result of the user's utterance may be evaluated as the user's operation target or search target.
  • Any character or symbol described in ideograms, such as an at sign or a bracket, may be converted into a phoneme or phoneme-piece symbol string using an appropriate phonetic symbol string; for a character string for which multiple utterances can be estimated, multiple phoneme strings, phoneme-piece strings, and syllable symbol strings may be given, as in conventional speech recognition.
  • The recognized phoneme sequence or phoneme-piece sequence is given to a database as a query and searched by a symbol string matching method such as DP or HMM. Phoneme strings or phoneme-piece sequences are added to the search results, the results are presented as a browsable list, a product is selected based on the phoneme strings included in the search results, and phoneme strings and phoneme-piece sequences are acquired from the recognition dictionary to perform charging and purchase procedures by the acquired control method.
  • By combining a phoneme recognition dictionary composed of uttered speech features with recognition dictionaries composed of image features such as fingerprints, irises, faces, and palm prints, search, browse, sales, authentication, and billing procedures can be realized.
  • As for the phoneme symbol string, in the case of English it can be converted into a phoneme symbol string using English phoneme symbols or phonetic symbols, or converted using international phonetic symbols.
  • In phonetic dictionaries for various languages, it is acceptable to use phoneme and phoneme-piece symbols suitable for each language. By treating phonemes as identifiers that dissect phonetic symbols in time series, and expressing each phonetic symbol as an appropriate character code associated with a number, information can be distributed using a markup language based on any phonetic symbols.
  • A phoneme symbol string may be converted into a phoneme-piece symbol string to improve search convenience. Environmental sound identifiers, scale identifiers, image identifiers, and motion identifiers may also be used: a section for an ambient sound or scale lattice, or for image or motion identifiers, may be provided in the MPEG stream, and a phoneme sequence or phoneme-piece sequence given based on the speech related to the names of these identifiers.
  • the information processing apparatus 1 is an apparatus that is realized by each information processing device such as a general-purpose computer, a dedicated terminal, and a portable mobile terminal.
  • the information processing apparatus 1 includes a control unit 10, a storage unit 20, a communication unit 30, an input / output unit 40, an operation unit 50, and a display unit 60.
  • Each functional unit is connected to the control unit 10 via a bus.
  • the operation unit 50 and the display unit 60 may be arbitrarily removable devices.
  • the communication unit 30 is a functional unit for exchanging information with other devices via a LAN (Local Area Network) or a communication network such as the Internet.
  • The communication unit 30 is generally configured by a device that can transmit and/or receive content information, such as Ethernet (registered trademark), a modem, a wireless LAN, or a cable television device.
  • The input/output unit 40 is a functional unit for inputting and outputting information to and from other devices or the outside, and includes, for example, input devices such as a microphone, scanner, capture board, camera, and sensors, and output devices such as a speaker, printer, modeling device, and display device.
  • the storage unit 20 is a functional unit that acquires and stores information in the information processing apparatus 1 and stores a program executed by the control unit 10.
  • the storage unit 20 includes a ROM and RAM as semiconductor storage elements, a hard disk and magnetic tape as magnetic storage media, a CD (Compact Disk) and a DVD (Digital Versatile Disk) as optical storage media, and the like.
  • The storage unit 20 stores content information 202, a phonetic symbol conversion table 204, recognition dictionary information 206, a phonetic symbol addition program 208, a recognition dictionary information update program 210, and a voice operation program 212.
  • the content information 202 stores content acquired from the outside via the communication unit 30 and content input via the input / output unit 40.
  • The phonetic symbol conversion table 204 is a table referred to when content information is converted into phonetic symbols; for example, it stores character strings in association with phonetic symbols such as phonemes.
  • The recognition dictionary information 206 stores the relationship between a word and its phoneme string, phoneme-piece string, and the like (hereinafter these are indicated as phonetic symbols). For example, as shown in Fig. 9, the item "Title", the target word "discount campaign", and the phoneme string (phonetic symbols) "/t/o/k/u/ky/a ..." expanded from the target word are stored in association with each other.
  • In the recognition dictionary information 206, proper nouns such as product names and product nicknames, and phoneme sequences based on product nicknames, may be registered in addition to the items of a general language dictionary. A recognition dictionary realizing various recognitions is constructed by dynamically exchanging, as phoneme strings or phoneme-piece sequences, words containing exclamations and mispronunciations that are not registered in a general dictionary.
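The recognition dictionary entries described above can be modeled as small records that support dynamic registration of unregistered words. The field names and the sample entry below are illustrative assumptions based on the Fig. 9 example, not the disclosed data format.

```python
from dataclasses import dataclass, field

@dataclass
class DictionaryEntry:
    """One entry of recognition dictionary information 206 (illustrative fields)."""
    item: str            # e.g. "Title"
    target_word: str     # e.g. a product nickname
    phonemes: list       # phoneme string expanded from the target word

@dataclass
class RecognitionDictionary:
    entries: list = field(default_factory=list)

    def register_temporary(self, item, word, phonemes):
        """Dynamically register an unregistered word as a phoneme sequence."""
        self.entries.append(DictionaryEntry(item, word, phonemes))

d = RecognitionDictionary()
d.register_temporary("Title", "tokuten", ["t", "o", "k", "u", "t", "e", "N"])
print(len(d.entries))  # 1
```

Because entries are matched only by phoneme sequence, exchanging or removing entries at runtime does not require touching any word-level language model.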
  • the operation unit 50 is a functional unit that receives operation inputs from the user and includes an input device that inputs information associated with operations such as a keyboard, a mouse, a camera, and a remote controller (including wireless).
  • The display unit 60 is a functional unit that outputs information from the information processing apparatus 1 so that the user can visually recognize it, and is configured using a display device that performs display related to operations, such as a display or a projector.
  • The control unit 10 executes processing for realizing the function corresponding to each program and controls each functional unit of the information processing apparatus 1. The control unit 10 reads the phonetic symbol addition program 208 from the storage unit 20 and executes it, thereby realizing the phonetic symbol addition processing described later. Similarly, the recognition dictionary information update processing described later is realized by reading and executing the recognition dictionary information update program 210, and voice operation processing is realized by reading and executing the voice operation program 212.
  • The control unit 10 executes programs for phoneme and phoneme-piece recognition processing, and acquires tag information, tag identifiers, phoneme sequences, phoneme-piece sequences, and the phonemes of the user's uttered speech. Words may be selected by evaluating the similarity between the phoneme sequence obtained by phoneme recognition and the phoneme sequences associated with dictionary registration information, using speech waveforms input through the input/output unit with a microphone; this may be used for speech recognition, or information may be provided to the user by speech synthesis, through a speaker, using a phoneme sequence or phoneme-piece sequence acquired by the present invention.
  • The control unit 10 is normally configured using a CPU (Central Processing Unit), a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), or the like, and can be realized by arbitrarily combining them.
  • FIG. 3 is an operation flow explaining the phonetic symbol addition processing, which is realized by the control unit 10 reading and executing the phonetic symbol addition program 208 in the storage unit 20.
  • The control unit 10 acquires content information 202 that has been received by the communication unit 30, or input through the input/output unit 40, and stored (step S301).
  • a development target character string is detected from the read content information 202 (step S302).
  • Here, the development target character string is a character string (information) for identifying a change in the display control method.
  • For example, it is an <A> tag indicating a link or a <TITLE> tag indicating a title.
  • A target character string to be expanded into phonetic symbols such as phonemes and phoneme pieces is detected in the range between such tags.
  • Next, the expansion target character string is expanded into a phoneme string or phoneme-piece string (phonetic symbols) accompanying its utterance (step S303).
  • For example, the title and the name of the link destination are converted into phonetic symbols.
  • Alternatively, a character string is acquired by referring to other attributes included in the tag, such as the ALT attribute and the ID attribute, and converted into a phoneme string or phoneme-piece string.
  • As methods of constructing the phonetic symbol strings to be registered in the dictionary, arbitrary approaches are conceivable: constructing a phoneme string or phoneme-piece sequence from the tag attributes and the text sandwiched between tags; using the phoneme string or phoneme-piece sequence between the tags; or, using the link information associated as an attribute, constructing the phoneme string or phoneme-piece sequence from the name of the linked file and the character information contained in that file.
• The character strings are converted into phonetic symbols using the phonetic symbol conversion table 204.
• For example, a character string "main" surrounded by title tags is converted into the phonetic symbol string "m/e/i/n/" by referring to the phonetic symbol conversion table 204.
• A recognition phonetic symbol string may be configured from such symbol strings.
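As a minimal sketch of the table lookup described above (the table contents and function name are illustrative assumptions, not the patent's actual conversion table 204):

```python
# Hypothetical sketch of the phonetic symbol conversion table 204: a mapping
# from words to phoneme lists. The entries below are assumptions for
# illustration only.
PHONETIC_TABLE = {
    "main": ["m", "e", "i", "n"],
    "shousai": ["sh", "o", "u", "s", "a", "i"],
}

def to_phoneme_string(word: str) -> str:
    """Expand a word into a slash-delimited phoneme string such as 'm/e/i/n/'."""
    phonemes = PHONETIC_TABLE[word.lower()]
    return "/".join(phonemes) + "/"
```

For example, `to_phoneme_string("main")` yields `"m/e/i/n/"`, matching the title expansion above.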
• When step S401 for acquiring the content information is performed, the meta information accompanying the markup-language information as shown in FIGS. 4 to 7 is obtained.
• Step S402 then detects the "pronounce" attribute in the meta information, extracts the phonetic symbol string composed of phonemes and phoneme pieces given as the value of the detected "pronounce" attribute, and registers the extracted phonetic symbol string as dictionary information.
• In step S403, a table is constructed that expresses the utterance sounds usable by a voice operation program or recognizable by the information processing device.
• By specifying processing content or a transition destination page with tags or CGI as meta information associated with the symbol string, an arbitrary process, procedure, or operation can be designated, realizing recognition using dynamically defined phonetic symbol strings.
• Next, the phonetic symbol storage process is executed, and the recognition phonetic symbol strings used for phonetic symbol recognition are stored (step S304).
• The phonetic symbol storage process stores the phonetic symbols converted for recognition in step S303. For example, it comprises: processing that extracts phonetic symbols already recorded as attributes in each tag and adds a phonetic symbol string (phoneme string or phoneme piece string), obtained by expanding the character string between the tags, as a new attribute (step S304a); processing that adds tags and attributes indicating that each tag is a speech recognition target (step S304b); and processing that separates out proper nouns to be recognized, converts them into phonetic symbols,
• and configures the recognition phonetic symbol strings, thereby constructing and updating the recognition dictionary information 206 (step S304c). As a result, the content information is associated with the recognition phonetic symbol strings, that is, the phoneme strings and phoneme piece strings of the words to be used for recognition.
• The control unit 10 updates and stores the changed content information 202, and likewise updates and saves the recognition dictionary information for phonetic symbol recognition using phonemes and phoneme pieces, composed of the associated recognition phonetic symbol strings (step S305). This makes it possible to use the changed content information for recognition of user utterances and for distribution via the communication unit.
• Although the processing described above is explained as being executed by the information processing device 1, it may instead be executed on the side of the distribution device (server) that distributes the content information,
• so that the processing burden associated with conversion to phoneme strings on the receiving side is reduced.
• In that case, the distribution device distributes content information accompanied by voice control information in response to a content information request from a user. The information processing device 1 (terminal device) can therefore acquire phoneme information classified by content page and frame, and can use voice operation of arbitrary words with fewer restrictions.
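The detection and expansion of steps S302 to S303 can be sketched as follows; the tag set, regular expression, and converter are assumptions for illustration, not the patent's implementation:

```python
# Sketch of steps S302-S303: find expansion target character strings (here,
# the text inside TITLE and A tags) and map each to a phoneme string produced
# by a caller-supplied converter. Tag choice and regex are assumptions.
import re

def build_recognition_dictionary(markup: str, convert) -> dict:
    """Return {target string: phoneme string} for each detected tag."""
    dictionary = {}
    for match in re.finditer(r"<(title|a)\b[^>]*>([^<]+)</\1>", markup, re.I):
        text = match.group(2).strip()
        dictionary[text] = convert(text)
    return dictionary
```

A real implementation would use a proper markup parser rather than a regular expression, but the flow is the same: detect the target range, expand it, and register the result.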
• FIG. 4 is a diagram showing the state of the content information 202 acquired by the information processing device 1.
• The content information 202 is acquired from the communication unit 30 or the input/output unit 40 and stored in the storage unit 20.
• In step S302, information related to the tags subject to expansion into phonetic symbols (phoneme strings and phoneme piece strings) is detected (step S302).
• The information in FIG. 4 is an example of content information using the item section of RSS; when the extraction process is executed in step S302,
• the target character string is extracted from the item section and converted, as shown in FIG. 5 or FIG. 6.
• A tag to be expanded is detected from the tags included in the acquired content information, and the character string in the range specified by that tag is detected.
• For example, the string "profit campaign" sandwiched between the tags "<title>" and "</title>" denoting the title is detected as the character string to be expanded into a phonetic symbol string.
• By extracting this character string, an arbitrary title character string specified by the distribution side can be acquired; unnecessary parentheses and the like may be deleted.
• In step S304a, a new phonetic symbol string is generated and added as an attribute or variable to the tag described in the original content information 202 (step S304a).
• In step S304c, processing such as the following is executed: newly setting <pronounce>...</pronounce> tags; associating the words and instructions recognized as phonetic symbols and saving them as the recognition dictionary information 206; and recording the acquisition location of the recognition dictionary information 206 in the content information 202 as a URL using a <META> tag or the like (step S304c). This makes it possible to add the phonetic symbol string information used for phonetic symbol recognition to the content information and associate it with that information.
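Step S304a, adding the expanded phoneme string back into the tag as a new attribute, might look like the following sketch (the regex, attribute name, and tag handling are simplified assumptions):

```python
# Sketch of step S304a: expand the character string between a pair of tags
# and record the result as a new "pronounce" attribute on the opening tag.
# Only simple <tag>text</tag> pairs are handled in this illustration.
import re

def add_pronounce_attribute(markup: str, tag: str, convert) -> str:
    pattern = re.compile(rf"<({tag})>([^<]+)</\1>", re.I)

    def repl(m):
        name, text = m.group(1), m.group(2)
        return f'<{name} pronounce="{convert(text)}">{text}</{name}>'

    return pattern.sub(repl, markup)
```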
• The phonetic symbols may be phonemes, phoneme pieces, or other pronunciation symbols.
• For example, a "<PronounceDS>" tag may be added to describe a phoneme string for the content as a whole, or an identifier may describe that "rain sound" occurs as an environmental sound in the background,
• or the phonetic symbol string pronounce="t/o/m/u" may be given as an attribute, placed in a phrase tag, for the notation related to a performer, as the cast name.
• For HTML, examples of configurations associated with buttons and links are presented; treating the range between arbitrary tags as a searchable keyword is useful for browsing and searching content and for pronunciation-based operations.
• The acquired phonemes or phoneme pieces and the resulting phonetic symbol strings may also be stored in the word dictionary of utterance phonemes or utterance phoneme pieces used for speech synthesis in the information processing device 1.
• In this way, by adding the phonetic symbols for performing voice operations based on the acquired content information, content information can be configured that includes the phonetic symbol string information to be incorporated into the dictionary used for phonetic symbol recognition.
  • FIG. 8 is a diagram illustrating an operation flow related to the recognition dictionary information update process, which is a process realized by the control unit 110 executing the recognition dictionary information update program 210 in the storage unit 120.
  • the control unit 10 acquires the content information 202 (step S401).
  • the control unit 10 extracts a phonetic symbol string from the read content information 202 (step S402).
• Specifically, a tag, that is, a portion between "<" and ">" included in the content information 202, is examined; a tag containing a phonetic symbol string is identified and the string is extracted.
• For example, the control unit 10 extracts the "pronounce" attribute of the title tag "<TITLE>",
• and takes the phoneme symbol string given as its argument as the phonetic symbol string.
• The extracted phoneme string is stored as the page title and registered in the recognition dictionary information 206 (step S403).
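The extraction of step S402 can be sketched as a simple attribute scan (the attribute syntax assumed here is `pronounce="..."`; real markup parsing would use a proper parser):

```python
# Sketch of step S402: pull the phonetic symbol string out of a tag's
# "pronounce" attribute. Returns None when the attribute is absent.
import re

def extract_pronounce(tag_text: str):
    m = re.search(r'pronounce\s*=\s*"([^"]+)"', tag_text, re.I)
    return m.group(1) if m else None
```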
• When no phoneme string or phoneme piece string is described in the distributed content information, and no related phoneme dictionary is associated with it, phoneme and phoneme piece symbol strings are constructed from the content according to the embedding procedure described above, and dictionary information is constructed. The constructed dictionary information is checked for whether the same word is already registered, and reused where possible.
• Because the phoneme symbol string of a control command does not change when the control recognition dictionary is configured, a dictionary that associates an ID for the control command, the command word, and the phoneme string, as shown in FIG., can be used to identify the command word for control; the content information is distributed or recorded on a storage medium with this ID used as a command discrimination ID. In the information associated with the content information acquired from the communication unit or storage medium, the instruction word is identified from the command discrimination ID described where the phoneme or phoneme piece information would otherwise be described, and from the identified instruction word
• the phoneme string or phoneme piece string is constructed by a conversion function and used for recognition. Alternatively, a hash value based on the phoneme string or phoneme piece string associated with the control command may be used as the command discrimination ID. This makes it possible to shorten the phoneme string expressions during transmission, which tend to be redundant, and to improve communication efficiency.
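The hash-based shortening mentioned above might be sketched as follows; the hash function and truncation length are assumptions:

```python
# Sketch of deriving a command discrimination ID from a phoneme string via a
# hash, so the redundant full phoneme string need not be transmitted.
import hashlib

def command_id(phoneme_string: str, length: int = 8) -> str:
    """Fixed-length hex ID derived deterministically from a phoneme string."""
    digest = hashlib.sha1(phoneme_string.encode("utf-8")).hexdigest()
    return digest[:length]
```

Both sides can compute the same ID from the same phoneme string, so only the short ID needs to travel over the line.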
• If the content information 202 obtained via a storage medium or the communication means and stored in the storage unit has not been converted to or supplemented with phonetic symbols, it is interpreted and converted by the method described above. If the information of the content information 202 has already been written, converted, or updated with identifier strings supported by the information processing device 1, there is no need to convert or update it again.
• These conversions may be performed on the server side according to the situation of the content information distributor or user, performed by the client converting what it receives as appropriate, performed on information from an external storage medium by the device itself
• so that the acquired information can be used by the device, or performed using a relay means such as a gateway or router.
• The control unit 10 acquires content information obtained by the communication unit 30 or the input/output unit 40, or content information 202 stored in the storage unit 20 (step S501).
  • phonetic symbols composed of phonemes and phoneme pieces are extracted from the acquired content information (step S502).
  • the recognition dictionary information 206 is updated and registered based on the extracted phonetic symbols (step S503).
• In step S504, the device waits until there is a voice input from the input/output unit 40 based on an utterance from the user (step S504; No).
• When voice is input, the control unit 10 extracts feature quantities from the input user's voice (step S505).
• Phonetic symbols such as phonemes and phoneme pieces are recognized from the extracted feature quantities, and the input is converted into phonetic symbols (step S506).
• The phonetic symbols converted in step S506 are then compared with those previously registered in the recognition dictionary.
• A coincidence evaluation is performed to determine how well the phonetic symbols match the utterance (step S507).
• In this coincidence evaluation, the degree of coincidence with the standard models, standard parameters, and standard templates of sounds and speech stored in the storage unit of the device is evaluated by an evaluation function, and the phonetic symbol given as the evaluation result is specified.
• A phonetic symbol string is specified by obtaining, in time series, the plurality of phonetic symbols specified by the coincidence evaluation. The registered phonetic symbol string having the highest similarity to the identified phonetic symbol sequence is then taken as the phonetic symbol recognition result, and device operation or search processing is executed according to the information associated with the recognition result (step S508).
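The coincidence evaluation and best-match selection of steps S507 and S508 can be sketched with edit distance as a stand-in for the DP matching used in such systems (a simplification; real systems score against acoustic models):

```python
# Sketch of steps S507-S508: compare a recognized phoneme sequence against
# each registered phonetic symbol string by dynamic-programming edit distance
# and select the most similar entry as the recognition result.
def edit_distance(a, b):
    """Single-row DP edit distance between two sequences."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def best_match(recognized, dictionary):
    """Return the registered entry most similar to the recognized sequence."""
    return min(dictionary, key=lambda entry: edit_distance(recognized, entry))
```

The entry returned by `best_match` plays the role of the phonetic symbol recognition result whose associated information drives the device operation or search.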
• The processing associated with the recognition result includes, for example, generation of character strings containing proper nouns realized by recognition of phonetic symbol strings using the present invention, execution of searches related to operation commands, information, or products, presentation of information to the user specified in connection with the recognition of phonetic symbol strings, and operations instructed by the user.
• Examples are operations based on the sound, text, and images of web browser pages; TV and video operation; and control of robots, navigation devices, computers, audiovisual equipment, cookers, washing machines, air conditioners, and other connected home appliances;
• as well as a series of processes and operations such as responses, specification of search conditions, storage, change, registration, and deletion of information presented by the information processing device, specification and browsing of advertisements and program content associated with recognition results, and personal authentication based on keywords and speech.
• By associating an image recognition dictionary such as for faces or fingerprints, a recognition dictionary containing proper nouns using phonetic symbol strings based on phonemes or phoneme pieces, and an acoustic model based on phonemes or phoneme pieces for each speaker, authentication, billing, and service selection can be performed.
• Furthermore, the recognizable words can be clearly indicated by uttering, via speech synthesis, words based on the phoneme strings or phoneme piece strings registered in the recognition dictionary in order to answer a user's question; arbitrary operations can be performed according to the recognition result; recognized character strings or word strings can be presented according to the recognition result; or advertisements associated with phoneme strings or phoneme piece strings can be executed. These can be combined with conventional speech recognition technology.
• In step S509, it is determined whether or not the next voice input is to be executed. If voice is to be input again (step S509; Yes), the process returns to step S504 to wait for voice input. If no voice is to be input (step S509; No), it is determined whether or not the next content information is to be acquired (step S510). When acquiring the next content information (step S510; Yes), the processing is repeated from step S501 in order to acquire new content. If new content information is not to be acquired (step S510; No), the series of processing ends, or the device waits for a user utterance.
• A device using the present invention thus allows the user to perform voice operations by using identifiers based on phonetic symbols, such as phonemes and phoneme pieces, obtained from the information in the acquired markup language, together with feature quantities for identifying those identifiers.
• Where possible, such identifiers are acquired from the markup-language information; if necessary, arbitrary identifiers related to images and actions, such as fingerprints, facial expressions, and palm prints, can be acquired and combined for personal authentication and the like, and can also be used for the actions of agents and robots.
• Selection processing conventionally performed by mouse operation can be performed based on identifiers and feature quantities obtained from the user's utterances and inputs: giving focus to any row, column, link, or operation button of a table tag;
• overlapping cursors; issuing events associated with these operations to browsers from the operating system's manager; and controlling other devices using infrared, LAN, telephone lines, and the like.
• For the content information acquired in step S501, the "pronounce" attribute information in the tags is detected (step S502) and registered in the recognition dictionary information 206 (step S503).
• The display position and display items of the screen configuration information are combined in the browser with the preceding and following tags.
• The display position is specified when each tag is processed; a scene position can be specified by association with a tag indicating the scene, title, or time-series position in content information such as MPEG-7; and for map information,
• a physical location can be specified by association with spatial position information given by latitude and longitude, place names, regional information, or store information using XML.
• When Fig. 11 is displayed in an HTML browser, it appears as shown in Fig. 12.
• Here, the utterance "ichigyoume" (first line) is matched against the pronounce attributes.
• The focus B300 is set on the first line according to the phoneme string "i/ch/i/g/y/o/u/m" given as the pronounce attribute, as shown in Fig. 13; likewise, according to the pronounce attribute for the pronunciation of "shousai" (details),
• click processing is performed and the form is sent (step S506).
• When multiple buttons are displayed, it may be unclear which button is targeted; the device can ask the user questions such as "Which line?" by announcement or display, performing interactive processing by voice or screen so as to obtain target phoneme strings or phoneme piece strings that are easy to infer. Utterance symbol strings or VoiceXML may also be used.
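A minimal sketch of this focus selection (element names and phoneme strings are illustrative assumptions):

```python
# Sketch of focus selection by utterance: each element carries a pronounce
# phoneme string; a recognized string selects the matching element.
ELEMENTS = {
    "i/ch/i/g/y/o/u/m/e": "row-1",    # "ichigyoume" (first line), assumed reading
    "sh/o/u/s/a/i": "detail-button",  # "shousai" (details), assumed reading
}

def focus_for(recognized: str):
    """Return the element to focus, or None when nothing matches."""
    return ELEMENTS.get(recognized)
```

When `focus_for` returns None, the interactive clarification described above ("Which line?") would be triggered.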
• Such phonetic symbol recognition may be performed in cooperation with scripts or the like, or an arbitrary extension function may be added by tags such as "<EMBED SRC>", "<OBJECT>", and "<APPLET CODE>"; identifier strings of phonemes, phoneme pieces, or phonetic symbols, serving as instructions or pronunciation symbols for operating them externally, can be given to those programs as variables or attributes, used as data or feature quantities, used for information operated in conjunction with scripts, or used as script control conditions.
• Fig. 11 also shows syllable tags and phoneme tags. These tags can be combined with phonemes, image identifiers of input video, and so on: for example, when it is recognized from the user's voice or facial expression that the user is angry, or when a specific identifier is detected in the image, a script or content to be processed can be presented, a move to a link destination performed, or similar processing carried out.
• Tags, attributes, and elements are, to a general interpretation device, simply character strings. Information based on the tags and attributes is provided, according to those character strings, to the functions, processes, programs, and services recorded in the information processing device that perform the processing evaluated by matching.
• Phoneme strings and phoneme piece strings are provided and registered in the recognition target dictionary; when other identifiers and feature quantities are used, their detection results are evaluated. It is possible to describe attributes holding phonetic symbol strings to change coefficients, to output instruction information to peripheral devices, to prepare allophones and synonyms using the dictionary configuration, to register allophones and synonyms in the dictionary by describing multiple phonetic symbol strings distinguished by boundary symbols, and to correct the results obtained by recognition.
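Registering several readings for one entry, as described above, can be sketched with a boundary symbol (the "|" separator is an assumption):

```python
# Sketch of multiple phonetic symbol strings for one entry, separated by a
# boundary symbol so allophones and synonyms can be registered together.
def split_readings(attribute_value: str):
    return [r for r in attribute_value.split("|") if r]
```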
• In the simplest case, the character string to be displayed is sandwiched between pronunciation tags and thereby specified as the pronunciation target.
• Character strings in kanji or mixed sentences, or in other languages such as English and Chinese, can be converted into symbol strings using phonemes or phoneme pieces for pronunciation and used for recognition, command control, detection, and search; ASCII codes that express phonetic symbols in alphabetic characters may be described in the markup language instead of numerical values.
• By controlling video and dialogue related to feature quantities and identifiers, screen features, and display objects associated with words in this way, the invention can also serve as a tool for generating and producing movies and programs using CG and the like; for recognizing the state of utterances while browsing content; or for content evaluation based on voting and browsing frequency via user voice operation. It is also acceptable to evaluate a movie or program based on the correlation with the obtained feature quantities or identifiers.
• Such a search procedure using a markup language may be implemented with a server-client model.
  • Fig. 15 shows the state transitions in processing in the server client model.
  • a terminal device serving as a client generates a query.
  • the query generation method may be a general character string input method, a voice input method, or a method of displaying an image and using the feature amount as a query.
• Next, the distribution device serving as the server searches for appropriate results and, according to the search results, the distribution base station distributes the search result list, using the present invention, to the terminal device. The terminal device interprets the markup language of the acquired information, converts the character strings in the ranges sandwiched between specific tags into the aforementioned identifiers such as phonemes and phoneme pieces, and, for the voice input information spoken by the user, acquires phonemes and phoneme pieces, constructs phonetic symbol strings such as phoneme strings and phoneme piece strings, and performs matching processing based on them.
• A query based on the specified identifier string is then formed, and those queries are transmitted to the distribution base station for searching.
  • a search with voice control using a markup language is performed.
  • the constructed query may be used for searching by a single device.
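Forming and encoding such a query for transmission to the distribution base station might look like this sketch (the parameter name `pron` and the URL are assumptions):

```python
# Sketch of query formation: URL-encode a recognized phoneme string as a
# query parameter for the distribution base station's search interface.
from urllib.parse import urlencode

def build_query_url(base_url: str, phoneme_string: str) -> str:
    return base_url + "?" + urlencode({"pron": phoneme_string})
```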
• As in FIG. 16, identifier symbol strings such as phonemes and phoneme pieces for speech processing can be inserted and set by interpreting the markup language on the terminal side.
• The present invention can be implemented by inserting such strings on reception, inserting them manually in advance, constructing and inserting the identifiers on the distribution base station side device or a device linked to it, or using a single information processing device. Variables and attributes may be added to implement operations and processes, tags may be added, the contents of the markup-language information may be changed, and the dictionaries related to identifiers and feature quantities may be changed, added to, and deleted.
  • search is performed by giving a word based on an arbitrary name, and a symbol string using phonemes or phoneme pieces is given to an arbitrary name to support voice control. It is also possible to use a phoneme or a phoneme symbol string as a keyword for operation, and to associate such phonetic symbol strings with advertisements or add advertisement attributes to make them related.
  • An advertisement may be associated with an utterance symbol string by displaying the URL of the advertisement in the same tag as the utterance symbol attribute.
• By converting the identifiers related to images displayed in the browser and the phonemes and phoneme pieces of keywords to be controlled into symbol strings and IDs, that is, compressed identifier strings, they become easy for terminals to use; voice can then be used on the device without sending unnecessary information or interpreting the markup language. A highly convenient operating environment can be configured by importing the information via a telephone communication line, acquiring it by e-mail, or downloading it from another device.
• A recognition dictionary based on file names may be configured by describing a file name as a phoneme string; by setting file names using phoneme strings and phoneme piece strings, phoneme and phoneme piece
• recognition can be used to select information in the markup language. With recognition, stock prices can be searched by securities code or company name, and products can be searched by JAN code.
• Searches by personal name or region name can implement various services; the phoneme dictionary can be changed according to the location or device, or changed page by page,
• and the phoneme dictionary may also be changed in accordance with a frame, whether a frame of a sentence or sentence structure, a frame of a moving picture, or a scene unit spanning multiple frames of a moving picture.
• When indexing according to the present invention is performed for an information format having chunk headers, such as the RIFF format shown in Fig. 17, an arbitrary chunk header such as "PRON" is specified,
• and the phoneme string or phoneme piece string is described in it. For an ordinary file this may accompany general metadata such as the file name, production date, and producer. For 2D/3D images, the phonemes and phoneme pieces associated with the names of display objects, persons, or parts are described; for an audio file, the phonemes or phoneme pieces of the recorded voice; for a music file, the lyrics or title as phonemes or phoneme pieces. The phonemes or phoneme pieces may be written in the free description area and used for searching.
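A "PRON" chunk in the RIFF convention (four-byte ID, little-endian 32-bit size, payload) might be read and written as follows; the payload encoding is an assumption:

```python
# Sketch of a RIFF-style "PRON" chunk carrying a phoneme string: a four-byte
# chunk ID, a little-endian uint32 payload size, then the payload itself.
import struct

def make_pron_chunk(phoneme_string: str) -> bytes:
    payload = phoneme_string.encode("ascii")
    return b"PRON" + struct.pack("<I", len(payload)) + payload

def read_chunk(data: bytes):
    """Parse one chunk into (chunk_id, payload)."""
    chunk_id = data[:4]
    (size,) = struct.unpack("<I", data[4:8])
    return chunk_id, data[8:8 + size]
```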
• In the present embodiment, phonemes are mainly used as the phonetic symbols.
• The phoneme part may instead be phoneme pieces, and the phoneme type may be a phoneme set of a different language or notation, such as the international phonetic alphabet, English, or Chinese.
• The selection range and branch content in markup-language processing may also be configured depending on whether a round image or a triangular image is presented to the computer, using identifiers based on the recognized image; searches can be made based on presented photographs, or names associated with photograph features can be expanded into phoneme and phoneme piece strings and sent and received in markup languages or dedicated symbol strings for voice operations.
• A character code other than ASCII, such as JIS code or ISO code, may be used, or a unique character code system with arbitrary numerical IDs based on phonemes or phoneme pieces may be used.
• The identifier strings and identifiers used in the present invention may be specified as attributes, variable names, and identifiers based on the names used for identification by one or a combination of classifications such as scale type, instrument type, mechanical sound type, environmental sound type, image type, face type, facial expression type, person type, action type, landscape type, display position, character symbol, label, shape, graphic symbol, and broadcast program. An identifier string may be regarded as identifiers described continuously along a time-series transition, or may be used after being converted into a phoneme or phoneme piece string based on their names.
  • the search result may be obtained by sending the identifier or identifier string using the GET method or POST method in CGI.
• In this way, by assigning voice-related feature names with their identifiers and discriminating functions, and the features, identifiers, and discriminating functions of still and moving images, to markup languages using attributes and variables, processable markup languages can be configured; the user can realize device control by voice using the phonetic symbol strings provided by information processing devices that process such markup languages. This can be applied to public information, map information, product sales, reservation status, viewing status, questionnaires, surveillance camera images, satellite photographs, blogs, and robot and equipment control. In response to these requests, search and processing results may be returned from the server to the client using any markup language.
  • the present invention can also be applied to a processing system by a server client related to a base station and a terminal.
• The devices and terminals are configured as shown in Fig. 18 and connected via communication lines, acquiring information from other devices and distributing information to them, so that information related to voice operations can be exchanged to improve user convenience.
• The shared line used here is not limited to the Internet; as long as it is a wide-area communication network such as a LAN or telephone line, or an indoor communication network, wired or wireless, the invention can be implemented for any device or service, such as home appliances, remote controllers, web services, telephone services, and EPG distribution.
• A remote controller or robot may be used as one form of terminal or one form of base station.
• A user permitted to speak utters voice to the terminal, and any of the following processing procedures is performed for the recognition process at the terminal or the base station.
• In one procedure, feature quantities are extracted from the speech obtained from utterances or from captured video images, and the feature quantities are transmitted to the target relay location or base station apparatus; the base station apparatus receiving the feature quantities
• generates phoneme symbol strings, phoneme piece symbol strings, and other image identifiers according to the feature quantities, and, based on the generated symbol string, selects and executes the matching control means.
• In another, feature quantities are extracted from speech obtained from utterances or captured video images, and identifiers accompanying recognition, such as phoneme symbol strings, phoneme piece symbol strings, and other image identifiers, are generated in the terminal;
• the generated symbol string is transmitted to the target relay location or base station apparatus, and the base station apparatus to be controlled selects and executes the matching control means based on the received symbol string.
• In another, feature quantities are extracted from voices obtained from utterances and captured video images; the terminal recognizes phoneme symbol strings, phoneme piece symbol strings, and other image identifiers based on the generated feature quantities, selects the control content based on the recognized symbol string, and transmits it to the base station device that performs control or the device that relays information distribution.
• In another, the voice waveform obtained by utterance at the terminal, or the image of the captured video, is transmitted as-is to the controlling base station apparatus; in the controlling apparatus, relay point, or base station device, phoneme symbol strings, phoneme piece symbol strings, and other image identifiers are recognized, and the control means is selected based on the recognized symbol string. The same applies to feature quantities and identifiers of sounds and video such as environmental sounds.
• Thus the terminal may simply transmit only the waveform, transmit the feature quantities, transmit the recognized identifier string, or transmit processing procedures such as commands and messages associated with the identifier string;
• the configuration of the distribution base station is changed according to the transmission information it accepts. The sender and receiver, which can implement a client-server model, may also transmit to and receive from each other, and features such as images, sounds, and actions related to the identifiers described above can be assigned to the attributes of the markup language.
• The degree of coincidence between feature quantities extracted from information provided by the user side and feature quantities extracted from the distribution information may be evaluated, and search and recognition performed so as to involve arbitrary control and responses to the user. Personal authentication using secret words may also be performed by associating image recognition dictionaries such as for faces and fingerprints, recognition dictionaries containing proper nouns using phonetic symbol strings, and acoustic models based on phonemes and phoneme pieces for each speaker.
  • the command dictionary for converting to an associated processing procedure based on the input phoneme sequence or phoneme sequence is a new control command or media that can be used on the terminal side or the distribution base station side. You can send / receive, distribute, and exchange information such as phonetic symbol strings and image identifiers related to type, format type, and device name using markup languages such as XML and HTML, RSS, and CGI. .
  • a user gives a speech waveform to a terminal or device by speaking.
  • the terminal-side device analyzes the given speech and converts it into feature values.
  • the converted feature values are recognized and converted into identifiers using recognition techniques such as HMMs or Bayes discrimination.
  • the converted identifier is information indicating a phoneme, a phoneme piece, or any of various image identifiers. As described elsewhere, for speech it may be a phoneme, an environmental sound, or a musical scale; for video it may be an identifier based on an image or an action. Based on the obtained identifiers, a dictionary based on phoneme and phoneme-piece symbol strings is consulted by DP matching to select an arbitrary processing procedure, and the selected processing procedure is transmitted to the target device for control.
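As an illustration of the dictionary lookup described above, the following sketch selects a processing procedure by DP matching (edit distance) between a recognized phoneme string and dictionary entries. The command dictionary, phoneme notation, and procedure names are illustrative assumptions, not taken from the patent.

```python
# Minimal sketch: select a control command by DP matching (edit distance)
# between a recognized phoneme string and command-dictionary entries.

def edit_distance(a, b):
    """Classic dynamic-programming (DP) edit distance over symbol lists."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

# Hypothetical command dictionary: phoneme string -> processing procedure.
COMMAND_DICT = {
    ("d", "e", "N", "k", "i"): "power_on",        # "denki" (light)
    ("t", "e", "r", "e", "b", "i"): "tv_on",      # "terebi" (TV)
    ("o", "N", "r", "y", "o", "o"): "volume_up",  # "onryoo" (volume)
}

def select_command(recognized):
    """Return the procedure whose phoneme entry is closest to the input."""
    return min(COMMAND_DICT.items(),
               key=lambda kv: edit_distance(recognized, list(kv[0])))[1]

print(select_command(["t", "e", "r", "e", "b", "i"]))  # exact match
print(select_command(["d", "e", "k", "i"]))            # noisy input, one phoneme dropped
```

Because only symbol-string distance is evaluated, a noisy or partially dropped phoneme sequence still resolves to the nearest registered entry, which is the behavior the bullet above relies on.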
  • a dialogue device for handicapped users may be configured by providing a display of the phonetic notation of utterances and a braille output unit.
  • information processed by such a procedure can be transmitted at an arbitrarily selected conversion level: natural information such as video and audio may be transmitted as-is without conversion to feature values, transmitted after conversion to feature values, transmitted after conversion to identifiers, or transmitted after selection of control information. The receiving side is configured to process information received in any of these states, and based on the acquired information it may forward it to a distribution station or control device, or perform arbitrary processing such as search, recording, mail delivery, machine control, or device control.
  • an identifier string, character string, or feature value suitable for use as a query is acquired by recognition and transmitted to the distribution-side base station, and information matching the query is obtained.
  • a control dictionary is constructed so that control items can be selected by communication when control is performed by voice; advertisements may be displayed during communication and search wait times. Dictionary information may be exchanged between devices, the procedure may be performed using P2P technology, and the information may be sold or distributed.
  • this control command dictionary can be freely renewed and reused because it comprises phonemes, phoneme pieces, or other arbitrary identifiers, feature values, and device control information as described above. It is also possible to keep trendy search keywords up to date by replacing or reconfiguring search dictionary information that associates arbitrary identifiers with feature values.
  • the recognition dictionary information that is changed according to the position and configuration of the content information may include a dictionary for face recognition, a dictionary for fingerprint recognition, a dictionary for character recognition, or a dictionary for figure recognition.
  • infrared control information to be transmitted to products controllable by a conventional infrared remote controller is selected as device control information, or a series of operations is batch-processed by combining such control information.
  • depending on the CPU performance of the apparatus, the feature information may be transmitted to the information processing apparatus for voice-based control without recognizing identifiers locally.
  • in this way, even conventional devices that cannot perform voice control by themselves can be controlled: for such devices, infrared remote control signals are provided through a conversion dictionary from voice information to control modules, while for devices capable of voice control, commands are recognized and executed based on feature values and speech waveforms. The control dictionary can be changed to improve performance, and the version information of the control dictionary and the status of the device can be checked.
  • the server and the client may be divided at arbitrary processing steps and connected by communication, exchanging the same services, such as infrastructure provision, search, and indexing, between them.
  • phoneme and phoneme-piece recognition may use recognition dictionary information with acoustic models, standard parameters, and standard templates tailored to an individual's voice characteristics.
  • since the recognition dictionaries for images and sounds can be changed according to the user, highly versatile personal authentication can be realized; this allows charging, locking and unlocking keys, selecting services, granting usage rights, and using copyrighted works.
  • various kinds of operations, and services using those operations, can be realized with an information terminal that performs recognition according to the present invention.
  • a client such as a DVD recorder, network TV, STB, HDD recorder, music recording/playback device, or video recording/playback device can be operated from a core server at the communication destination using a terminal that performs recognition according to the present invention.
  • information acquired by the terminal is transmitted via infrared communication, FM or VHF frequency-band communication, or wireless communication such as 802.11b, Bluetooth (registered trademark), ZigBee, WiFi, WiMAX, UWB (Ultra Wide Band), or WUSB (Wireless USB).
  • EPG (electronic program guide)
  • BML (electronic program guide by data broadcasting)
  • RSS, text broadcasting, and data broadcasting
  • TV video and text broadcasting on mobile terminals and mobile phones
  • voice input
  • character string input by voice
  • character string input for text broadcasting and data broadcasting
  • the operation and control procedures of information terminals, home appliances, information devices, and robots may be instructed by swinging the mobile terminal or mobile phone, or the mobile terminal or mobile phone may be used as a general-purpose remote control for the client terminal, so that home appliances, information devices, and robots are operated and controlled remotely.
  • the recognition priority may be changed using a pre-registered phoneme dictionary, or the recognition target may be limited using a pre-registered dictionary.
  • for a phonetic symbol string such as a phoneme string or phoneme-piece string, a plurality of phoneme strings or phoneme-piece strings that may be recognized simultaneously can be written to configure a plurality of recognition dictionary information 206 entries, and the same recognition dictionary information 206 may be used for input items having the same attribute variable.
  • a plurality of words that may be recognized may be represented as attribute variables using a plurality of phoneme strings, phoneme-piece strings, and phonetic symbol strings.
  • a classifier such as an arbitrary counting unit may be represented as a phoneme string or phoneme-piece string.
  • the recognition dictionary information 206 may be switched to a dedicated number dictionary, a dedicated dictionary according to the menu item, or a limited proper-noun dictionary such as place names or station names; the same method may be used elsewhere.
  • in the step (S506) of converting a speech waveform into identifiers based on phonemes, phoneme pieces, and phonetic symbols, multiple sets of values may be prepared for each language, such as Bayes discriminant function parameters, standard patterns used for HMMs, standard templates and standard values obtained as learning results, eigenvector values, and covariance matrices, according to the character code selected for display based on the markup language. Multiple languages can be supported by switching to a Russian standard template if the display is Russian or a Chinese standard template if the display is Chinese, or by acquiring information about the language environment of the user's information processing device, operating system, or browser.
  • the standard template used for recognition may thus be selected from multiple languages.
  • the standard template can also be selected to correspond to the user's utterances and dialects, and the user's utterance characteristics and dialects can be learned to compose the templates.
  • a method may be used in which the recognition dictionary information 206 is switched by converting the content of a cookie or session into a phonetic symbol string of phonemes or phoneme pieces according to the attribute variable.
  • a phonetic symbol string such as a phoneme string or phoneme-piece string recognized from speech, or feature values extracted from speech, may be transmitted to the base station by any variable transmission means, such as a script method like AJAX, transmission of status and environment variables as CGI (Common Gateway Interface) parameters, or socket communication by a program. Phoneme strings and phoneme-piece strings recognized at the base station from received phonetic symbol strings or received speech feature values can then be used to distinguish the user's utterance and perform arbitrary processing, or to configure search conditions for searching content information, advertising information, and regional information.
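The CGI-parameter transmission mentioned above can be sketched as follows: a recognized phonetic symbol string and environment variables are encoded as GET parameters for a base station. The parameter names and the base-station URL are hypothetical assumptions.

```python
# Minimal sketch: encode a recognized phonetic symbol string and environment
# variables as CGI (GET) parameters for transmission to a base station.
from urllib.parse import urlencode

def build_query_url(base_url, phoneme_string, env):
    # "phonemes" and the env keys are illustrative parameter names
    params = {"phonemes": "/".join(phoneme_string)}  # e.g. "t/e/r/e/b/i"
    params.update(env)
    return base_url + "?" + urlencode(params)

url = build_query_url(
    "http://basestation.example/cgi-bin/recognize",
    ["t", "e", "r", "e", "b", "i"],
    {"lang": "ja", "session": "abc123"},
)
print(url)
```

The same query string could equally be sent over socket communication or posted by a script such as AJAX, as the bullet above notes; only the encoding step is shown here.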
  • for the display information of the terminal device that changes with the recognition of these phonetic symbol strings, such as pictures, characters, icons, and CG (Computer Graphics), the output sound information such as music and warning sounds, the motion control information for devices such as robots, mechanical devices, communication devices, electronic devices, and electronic musical instruments, and the recognition dictionary information 206 for recognizing voice, still images, moving images, and so on, the base station can send information to update them, combined with arbitrary information such as processing procedure information (programs, scripts, and function expressions for feature extraction), or the terminal device can perform the processing autonomously.
  • a phonetic symbol such as a phoneme or phoneme piece obtained as a recognition result is divided into a plurality of frames, and a continuous recognition result is obtained in time series.
  • a plurality of phonemes over a plurality of frames are obtained.
  • the parameters of a Bayes discriminant function may be configured using, as feature values, the distance information between the input speech and phonetic symbols such as phonemes and phoneme pieces acquired as recognition results; or HMM parameters may be configured using distance information acquired as recognition results for multiple phonemes and phoneme pieces across multiple frames, degenerated in time series; and the identifier ranked first by the recognition results in multiple frames may be used.
  • dynamic speech recognition may be configured in combination with techniques used in conventional speech recognition by evaluating these results with DP matching or the like.
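The frame-wise recognition above can be sketched as follows: the top-ranked identifier of each frame is collapsed in time series, with very short runs dropped as noise. The frame labels and the minimum run length are illustrative assumptions.

```python
# Minimal sketch: collapse per-frame top-ranked phonetic identifiers into a
# time-series phoneme string, dropping very short runs as noise.
from itertools import groupby

def collapse_frames(frame_labels, min_run=2):
    """Run-length collapse of frame-wise recognition results."""
    out = []
    for label, run in groupby(frame_labels):
        if len(list(run)) >= min_run:       # keep only stable runs
            if not out or out[-1] != label:  # merge runs split by dropped noise
                out.append(label)
    return out

# /cl/ here denotes the pre-plosive closure, as in the notation used elsewhere.
frames = ["cl", "cl", "cl", "k", "k", "a", "a", "a",
          "s", "s", "a", "a", "t", "a", "a"]
print(collapse_frames(frames))
```

The resulting symbol string can then be scored against dictionary entries by DP matching, as described above; a single-frame glitch (like the lone "t" in the example) is discarded before matching.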
  • markup language information is acquired through the step of acquiring content information (S401, S501), tags are detected from the markup language information, and tag attributes are detected from the tags by tag attribute detection means.
  • a phonetic symbol string extraction step (S402, S502) extracts the phonetic symbol string associated with the detected attribute, and a step (S403, S503) registers it in the recognition dictionary information 206 as a phonetic symbol string used for recognition.
  • these steps (S401 to S403, S501 to S503) can be implemented by character string evaluation processing and detection processing, and are used for operation, search, and browsing of content information.
  • the step of waiting for the speaker's voice input (S504) is performed, followed by the step of extracting feature values, performed by the arithmetic unit when voice input starts (S505).
  • a step (S506) of converting the feature values into identifiers by phonetic symbol recognition is performed. It is generally known that this step (S506) uses a distance evaluation function, a statistical test method, a learning result using multivariate analysis, or an algorithm such as an HMM.
  • a phonetic symbol string is formed as a time-series sequence of the recognized phonetic symbols.
  • the configured phonetic symbol string is compared with the recognition dictionary information 206 built from the phonetic symbol strings extracted from the attributes of the markup language tags, and a search is performed in the recognition dictionary information 206.
  • the step (S507) of evaluating the degree of coincidence between the phonetic symbol strings is performed to evaluate whether the recognition target is valid.
  • any algorithm usable for symbol string comparison and evaluation, such as DP matching, HMMs, or automata, can be used, and these can be multiplexed and layered; a variety of such methods have been devised.
  • by performing step (S508), it is possible to realize information processing using speech that differs from conventional grammar-dependent and statically registered-word-dependent recognition.
  • the recognition dictionary information 206 comprises a plurality of recognition dictionaries based on phonetic symbol strings, and input items are recognized while switching the recognition dictionary information 206 used in the step (S507) of evaluating the match with the phonetic symbol string, based on the type information for discriminating input items detected by the tag attribute detection means.
  • the recognition efficiency can be improved by limiting the phonetic symbol strings included in the recognition dictionary information 206.
  • when switching the recognition dictionary information (206) that evaluates speech input according to the item to be input by the information processing device, the dictionary is chosen from the attribute name and the words associated with the attribute: if the information acquired from the attribute is "book", a recognition dictionary using the phonetic symbol strings of classifiers such as "book (s/a/ts/u | v/o/l/u/m/e)"
  • and of the "number" corresponding to the classifier can be selected as the search target of the recognized phonetic symbol string; or, if the information acquired from the attribute is a table associating the suffix "station (e/k/i | s/u/t/e/i/sh/o/N)" with nouns used as station names,
  • a recognition dictionary using those phonetic symbol strings is selected as the search target for recognized phonetic symbol strings; and if the information acquired from the attribute is a zip code or telephone number, a number dictionary is used.
  • in this way the recognition target can be identified using a group of nouns included in a specific framework.
  • by switching among the plurality of recognition dictionary information 206 according to the attribute associated with the item the user is to input, the recognition dictionary information searched for the recognized phonetic symbol string can be limited.
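The attribute-based dictionary switching described above can be sketched as follows: the recognized phoneme string is searched only in the dictionary selected by the input item's attribute. The attribute names and dictionary contents are illustrative assumptions.

```python
# Minimal sketch: switch among multiple recognition dictionaries (the
# "recognition dictionary information 206") according to the attribute of
# the input item detected from a markup tag.

DICTIONARIES = {
    "number":  {("i", "ch", "i"): "1", ("n", "i"): "2", ("s", "a", "N"): "3"},
    "station": {("t", "o", "o", "ky", "o", "o"): "Tokyo",
                ("o", "o", "s", "a", "k", "a"): "Osaka"},
}

def recognize_with_attribute(attr, phoneme_string):
    """Look up the phoneme string only in the dictionary for this input item."""
    dic = DICTIONARIES.get(attr, {})
    return dic.get(tuple(phoneme_string))  # None if not in this dictionary

print(recognize_with_attribute("station", ["o", "o", "s", "a", "k", "a"]))
print(recognize_with_attribute("number", ["o", "o", "s", "a", "k", "a"]))
```

Restricting the search to one small dictionary per input item is exactly the efficiency gain the bullets above claim: the same phoneme string matches in the "station" dictionary but is rejected for a "number" field.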
  • a character string displayed by designation in a markup language using the method of the present invention, a character string associated with a displayed image or image feature, or a character string associated with output voice, music, or acoustic features can be converted into a phonetic symbol string and registered in the dictionary. Such a string is detected by searching for the phonetic symbol string, and content-related information and the location of arbitrary information such as advertisements, videos, and links can be provided in relation to the detected information through user operations based on music, voice, and the like. These inputs need not be based on voice or text input: a character string selected from a list such as a menu, or the label of a button in a button operation, may also be used.
  • when reading phonetic symbols according to attributes related to a phonetic symbol dictionary included in a tag, an attribute such as "recog_dic.uri" may indicate the location of the dictionary by information such as a URL, URI, IP address, or directory path, and an attribute such as "recog_dic.type" may indicate the type of information recognized by the dictionary,
  • such as the type of phoneme. This makes it possible to distinguish phonetic symbol dictionaries that are frequently reused, so that dictionary information based on phonetic symbol strings and acoustic characteristic templates for recognition can be acquired from the markup language; such information may be provided in association with an attribute.
  • dictionary information read in the past may be saved for some time by a method generally called caching; when the above-mentioned attribute indicates a specific word range, the priority of that dictionary is raised, saving the trouble of reading it again.
  • the phonetic symbol dictionary may be read for each page as a separate file and related by ID, like a style sheet; a phonetic symbol dictionary described in the header block and related by ID may be incorporated; a phonetic symbol dictionary given as an attribute for each tag may be incorporated; or a phonetic symbol dictionary may be included in header information when reading via a file or communication line. Any of these can be used as a phonetic symbol string template dictionary.
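The dictionary-location attributes described above might be read as in the following sketch, which parses hypothetical "recog_dic.uri" / "recog_dic.type" style attributes from a tag and caches dictionaries already fetched. The markup schema and the stub fetcher are assumptions; only the attribute names follow the examples in the text.

```python
# Minimal sketch: read "recog_dic.uri" / "recog_dic.type" style attributes
# from markup and cache dictionaries so they are not re-read each time.
import xml.etree.ElementTree as ET

markup = """<menu recog_dic.uri="http://example.com/station.dic"
                  recog_dic.type="phoneme">Station name</menu>"""

_cache = {}  # uri -> dictionary (the "caching" described in the text)

def load_dictionary(uri, fetch):
    if uri not in _cache:
        _cache[uri] = fetch(uri)  # fetch only on a cache miss
    return _cache[uri]

root = ET.fromstring(markup)          # '.' is legal in XML attribute names
uri = root.get("recog_dic.uri")
dic_type = root.get("recog_dic.type")
dic = load_dictionary(uri, fetch=lambda u: {"source": u})  # stub fetcher
print(dic_type, dic["source"])
```

A second page referring to the same `recog_dic.uri` would hit the cache and skip the fetch, which is the re-read-avoidance behavior described above.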
  • a phonetic symbol string may be embedded in speech waveform information using "acoustic OFDM", which can embed text data in speech, and the embedded phonetic symbol string and related markup language information may be recovered. This can be used to search for phonetic symbols in audio data and to display related information as subtitles, so phonetic symbol strings demodulated from very common audio sources such as radio and television can also be used for searching.
  • the database searched using the phonetic symbol string acquired by phonetic symbol recognition may be indexed by phonetic symbol strings; queries may be composed as logical combinations of phonetic symbol strings based on a plurality of keywords, for example in a configuration whose logic can be expressed by a Boolean model, and provided to the database with those combinations to obtain search results.
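The Boolean-model query above can be sketched with a small inverted index keyed by phonetic symbol strings; the index contents and the key notation are illustrative assumptions.

```python
# Minimal sketch: a database indexed by phonetic symbol strings, queried
# with Boolean (AND / OR) combinations of recognized keywords.

INDEX = {
    "t/e/r/e/b/i": {1, 2, 5},   # documents tagged with "terebi" (TV)
    "n/y/u/u/s/u": {2, 3},      # "nyuusu" (news)
    "e/e/g/a":     {4, 5},      # "eega" (movie)
}

def query_and(*keys):
    """Documents matching ALL of the phonetic-symbol-string keys."""
    sets = [INDEX.get(k, set()) for k in keys]
    return set.intersection(*sets) if sets else set()

def query_or(*keys):
    """Documents matching ANY of the phonetic-symbol-string keys."""
    out = set()
    for k in keys:
        out |= INDEX.get(k, set())
    return out

print(sorted(query_and("t/e/r/e/b/i", "n/y/u/u/s/u")))  # both keywords
print(sorted(query_or("n/y/u/u/s/u", "e/e/g/a")))       # either keyword
```

Nested AND/OR combinations of these set operations express the Boolean model the text refers to, without ever converting the phonetic symbol strings into words.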
  • the present invention relates phonetic symbols and speech features by a distance or probability based on a Bayes discriminant function or the like.
  • the phonetic symbol string is acquired by the above method, and by directly relating the acquired phonetic symbol string and the word string via the markup language, the words to be recognized are restricted compared with conventional general recognition,
  • and the dynamic provision of dictionary information enabling efficient recognition can be realized in a markup language; phonetic symbol strings can be used in queries without directly using words.
  • a database searched by HMMs or DP matching using phoneme and phoneme-piece symbol strings may also be constructed.
  • an attribute based on image recognition, rather than on phonetic symbols such as utterance phonemes and utterance phoneme transitions, may also be used.

Abstract

Provided is an information processing device which changes and stores the notation of a markup language used in content information, so that the changed information can be used by a delivery device and a receiving terminal. The information processing device operates with speech recognition and phoneme recognition techniques by changing phoneme dictionary information described in the markup language, adding tags, variables, or attributes, and storing, changing, and delivering them. Even if no word model, acoustic model, grammar model, or part-of-speech information is registered in a recognition dictionary when a word or character string contained in the content information is to be recognized by speech, the information processing device can realize proper speech recognition by dynamically constructing, from the content information, recognition dictionary information with phonetic symbols composed of phonemes and phoneme pieces, which are used for phonetic symbol recognition.

Description

Specification

Information processing apparatus and program

Technical field

[0001] The present invention relates to an information processing apparatus that uses phoneme recognition and/or phoneme-piece recognition for speech recognition.

Background art
[0002] Conventionally, technologies relating to speech recognition have been known for information processing apparatuses and operation methods that use speech. As a speech recognition method, it is known to extract phonemes and phoneme pieces in time series by phoneme recognition or phoneme-piece recognition using acoustic models, standard parameters, and templates of phonemes and phoneme pieces statistically generated from the speech of user utterances, thereby acquiring phoneme strings and phoneme-piece symbol strings.

[0003] Then, using a speech recognition dictionary in which words composed of phoneme strings and phoneme-piece strings are recorded, the match between the recognized phoneme or phoneme-piece string and the strings registered in the dictionary is evaluated, and speech recognition and its accompanying processing are realized by acquiring the word associated with the string having the highest degree of coincidence as a result of the evaluation, or by executing a device control command.

[0004] Here, as a user interface for controlling a device, there is a method, as in Non-Patent Document 1, of selecting and executing by phoneme recognition processing a word specified by a phoneme recognition dictionary and a device control method registered in the dictionary in association with that word; recognition techniques for phonemes and phoneme pieces have long been known, as shown in Patent Document 1.

[0005] In speech recognition, spoken words in human dialogue vary widely, with abbreviations, exclamations such as "oh" and "hmm", and coined words; content information in particular contains many proper nouns, such as product names and actor names, that are difficult to register in a dictionary, so not all words could necessarily be registered. Techniques for searching content using phoneme recognition have therefore been proposed in Patent Document 2, Non-Patent Document 2, Non-Patent Document 3, and elsewhere.

[0006] Here, Patent Document 3 proposes, for use in speech recognition in HTML, one of the markup languages, changing the display representation of recognizable words so that voice operation by the user becomes easier.

[0007] Patent Document 4 proposes a method for dynamically acquiring recognition dictionary data based on acoustic models for the minimum necessary vocabulary.

[0008] According to Patent Document 5, for use in speech recognition in HTML, a range is designated with specific symbols to identify recognizable words, clearly indicating to the user that voice recognition is available; for words whose pronunciation is difficult, a recognizable reading is written for convenience.
Patent Document 1: Japanese Patent Laid-Open No. 62-220998
Patent Document 2: Japanese Patent Laid-Open No. 2005-70312
Patent Document 3: Japanese Patent Laid-Open No. 11-25098
Patent Document 4: Japanese Patent Laid-Open No. 2002-91858
Patent Document 5: Japanese Patent Laid-Open No. 2005-18241
Non-Patent Document 1: "Research and Development on a Life Support Interface for an Aging Society", Key Project Research Report of the Aomori Prefecture Industrial Research Center, Vol. 5, Apr. 1998 - Mar. 2001.
Non-Patent Document 2: Masayuki Nakazawa, Takashi Endo, Kiyoshi Furukawa, Jun Toyoura, Takashi Oka (New Information Processing Development Organization), "Study of speech summarization and topic summarization using phoneme-piece symbol sequences from speech waveforms", IEICE Technical Report, SP96-28, pp. 61-68, June 1996.
Non-Patent Document 3: Takashi Oka, Hironobu Takahashi, Takuichi Nishimura, Nobuhiro Sekimoto, Hidehide Mori, Masanori Ihara, Hiroaki Yabe, Hiroaki Hashiguchi, Hiroshi Matsumura, "Pattern search algorithm 'map': what supports 'CrossMediator'", JSAI SIG workshop, volume 1, pages 1-6, Japanese Society for Artificial Intelligence, 2001.
[0009] As for using phoneme symbol strings in a markup language, MPEG-7 descriptions used within video streams such as MPEG-2 employ the Segment and MediaLocator constructs: a Segment or Frame within video content can be designated directly with <MediaLocator>, or content is pointed to with <MediaLocator> and a time position within it is designated with <MediaTime>, combined with tags designating appropriate proper nouns. When assigning content with the aforementioned Segments, the <Series> description method can be used to attach the same kind of metadata at fixed intervals as low-level Visual and Audio metadata. For audio, this can be designated as a ScalableSeries, and MPEG-7 Audio has a SpokenContent DS that describes the word lattices and phone lattices resulting from automatic speech recognition.

[0010] In VoiceXML, a standardization scheme for speech recognition, a unified way of describing user interfaces, previously inconsistent between products, has been proposed in order to perform grammar-dependent recognition according to context; however, no method has been proposed for dynamically constructing dictionary information by giving attributes to the target range of arbitrary tags using phonetic symbol identifiers such as phonemes and phoneme pieces, independent of context and grammar.
[0011] Many conventional applications and documents confuse phonemes and syllables. Taking the Japanese pronunciation "akasatana" as an example in the present invention, a syllable notation would be written as the single sounds "a/ka/sa/ta/na"; a phoneme notation would be "a/k/a/s/a/t/a/n/a" or "a/cl/k/a/s/a/cl/t/a/n/a"; a phoneme-piece (bigram) notation would be "a/a-k/k/k-a/a/a-s/s/s-a/a/a-t/t/t-a/a/a-n/n/n-a/a" or "a/a-cl/cl/cl-k/k/k-a/a/a-s/s/s-a/a/a-cl/cl/cl-t/t/t-a/a/a-n/n/n-a/a"; and a trigram example would be "a-a-a/a-cl-cl/cl-cl-cl/cl-cl-k/cl-k-k/k-k-a/a-a-a/a-a-s/s-s-s/s-a-a/.../t-a-a/a-a-n/n-n-n/n-a-a/a-a-a". A phoneme piece may also be obtained by decomposing a phoneme in time series at arbitrary positions such as its first, middle, and last parts. Here /cl/ denotes the silent or unvoiced portion before pronunciation. These phonemes and phoneme pieces may both be changed, by any improvement, to phonemes, notation symbols, phonetic symbols, or pronunciation symbols representing arbitrary sounds, or to pieces of such notation, phonetic, or pronunciation symbols obtained by decomposing them in time series.
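The bigram phoneme-piece notation illustrated above can be generated mechanically from a phoneme string; a minimal sketch (without the /cl/ closure symbols) follows.

```python
# Minimal sketch of the bigram phoneme-piece notation described above:
# a phoneme string such as a/k/a/s/a/t/a/n/a is expanded by interleaving
# each phoneme with the transition piece between it and the next phoneme.

def to_phoneme_pieces(phonemes):
    pieces = []
    for i, p in enumerate(phonemes):
        pieces.append(p)                          # the phoneme itself
        if i + 1 < len(phonemes):
            pieces.append(p + "-" + phonemes[i + 1])  # transition piece
    return pieces

print("/".join(to_phoneme_pieces(["a", "k", "a", "s", "a", "t", "a", "n", "a"])))
```

Running this reproduces the bigram sequence a/a-k/k/k-a/... given in the paragraph above; a trigram variant would slide a window of three symbols instead of two.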
[0012] To explain the difference between phonetic symbol recognition using phonemes and phoneme pieces and ordinary speech recognition: unlike general speech recognition, phoneme recognition and phoneme-piece recognition do not perform vocabulary recognition that interprets meaning or content, and do not dynamically configure features and acoustic models in response to changes in language models such as words, grammar, and parts of speech. More specifically, because phoneme and phoneme-piece recognition use no language model related to grammar, the recognition result does not capture meaning; it is not converted into symbols carrying meaning such as kanji; homonyms and homophones with different written forms are not discriminated; parts of speech such as nouns and verbs are not discriminated according to context; and no morphological or syntactic analysis is performed. In this document, recognition using phonemes, phoneme pieces, pronunciation symbols, pronunciation-symbol pieces (as pronunciation symbols divided in time series), and phonetic symbols based on those symbol strings, including phoneme recognition and phoneme-piece recognition, is collectively referred to as phonetic symbol recognition.
[0013] Thus, recognition by phonemes and phoneme pieces analyzes the speaker's utterance using a static acoustic model for each phonetic symbol and evaluates only the match between the phonetic symbol string derived from the utterance and the phonetic symbol strings in the recognition dictionary. This keeps the recognition processing and the structure of the recognition dictionary simple, and because only the acoustic match is evaluated, identifier strings made up of phonetic or pronunciation symbols such as phonemes and phoneme pieces can be recognized even for words not registered in the dictionary and for exclamations.
[0014] A dynamic acoustic model that, as in the prior art, learns the speaker's utterance characteristics to improve performance may be used here; however, phoneme recognition and phoneme-piece recognition are characterized by not performing the kind of processing found in general speech recognition in which the acoustic model is switched dynamically depending on words or grammar.
[0015] Consequently, comparing phoneme strings or phoneme-piece strings against the registered dictionary contents makes it easy to detect unregistered phoneme strings or phoneme-piece strings, and a method is also conceivable in which efficient speech recognition is achieved by narrowing down candidate words on the basis of the phonetic symbol recognition result and then performing speech recognition again with general grammar taken into account.
[0016] Further, even when a word is not registered in the dictionary, the recognition method based on phonemes and phoneme pieces converts the unregistered word in the recognition target sentence into hiragana notation, converts the hiragana string, according to its transition states, into a phoneme string or phoneme-piece string based on prosody obtained from known information, and temporarily registers the resulting symbol string in the recognition dictionary. After the user's utterance has been recognized as a phoneme string or phoneme-piece string, the acquired string is compared with the phoneme strings or phoneme-piece strings in the recognition dictionary to measure the degree of match between the symbol strings and obtain a recognition result, and entries whose frequency of use as recognition results falls are deleted. This method enables speech recognition with a dynamic phoneme- or phoneme-piece-based dictionary structure that has a higher degree of freedom than conventional speech recognition.
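The temporary-registration cycle described in paragraph [0016] — expand an unregistered word's hiragana reading into a phoneme string, register it, match utterances against it, and delete entries whose usage frequency falls — can be sketched as below. The kana-to-phoneme table, the exact-match comparison, and the pruning threshold are illustrative assumptions:

```python
# Illustrative hiragana-to-phoneme fragment; a real table would cover all kana.
KANA_TO_PHONEMES = {"さ": "s a", "く": "k u", "ら": "r a"}

class TemporaryDictionary:
    """Sketch of the dynamic dictionary cycle in paragraph [0016]."""

    def __init__(self):
        self.entries = {}  # word -> {"phonemes": str, "hits": int}

    def register(self, word, kana_reading):
        # Expand the hiragana reading into a phoneme string and register it.
        phonemes = " ".join(KANA_TO_PHONEMES[k] for k in kana_reading)
        self.entries[word] = {"phonemes": phonemes, "hits": 0}

    def recognize(self, uttered_phonemes):
        # Measure the match (exact match here; DP matching in practice).
        for word, entry in self.entries.items():
            if entry["phonemes"] == uttered_phonemes:
                entry["hits"] += 1
                return word
        return None

    def prune(self, min_hits=1):
        # Delete entries whose frequency of use has fallen below the threshold.
        self.entries = {w: e for w, e in self.entries.items()
                        if e["hits"] >= min_hits}

d = TemporaryDictionary()
d.register("桜", "さくら")
print(d.recognize("s a k u r a"))  # → 桜
```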
[0017] At this time, the acoustic model in phoneme or phoneme-piece units may also be retrained to match the user's utterances; that is, acoustic information obtained from the user's speech may be reused as teacher information and relearning for recognition may be performed so that recognition accuracy is improved with a dynamic acoustic model dictionary that does not depend on grammar or words.
Disclosure of the Invention
Problems to Be Solved by the Invention
[0018] In conventional speech recognition technology, abbreviations of words uttered in human dialogue, exclamations (interjections) such as "oh" (おー) and "hmm" (うーん), and coined words vary greatly with the times and the environment; in content information in particular, dynamic proper nouns that depend on trends, such as product names and actor names, were inefficient to register in a recognition dictionary. Although this has long been recognized as a problem in putting large-scale and highly variable speech recognition to practical use, repeatedly distributing a recognition dictionary including acoustic models and grammar models is relatively difficult because of the volume of information involved, so recognition that depends on vocabulary not registered in the recognition dictionary was virtually impossible.
[0019] Moreover, conventional speech recognition generally requires learning of prosodic models and grammar models, and such processing procedures pose a problem for recognizing dictionary-unregistered words such as the aforementioned coined words, buzzwords, and proper nouns: learning the prosody associated with such unregistered words and learning grammar models from co-occurrence relations between words is difficult.
[0020] Furthermore, with conventional markup languages, speech information other than audio synchronized with the video and audio of the content itself could not be made a target of search or operation. And since providing users with information including phoneme symbol strings required recognizing the speech information in advance and storing phonemes in association with word IDs (word identifiers), there was no easy way to provide or operate on phoneme strings and phoneme-piece strings for unspecified words.
[0021] In addition, the technology disclosed in Patent Document 3 mentioned above presents no method for recognizing words that are not registered in the dictionary. The technology disclosed in Patent Document 4 cannot perform vocabulary-independent speech recognition and must take measures such as learning a prosodic model for each unknown word, so it could not realize speech recognition with a high degree of freedom. Further, the technology disclosed in Patent Document 5 has the problem that its speech recognition method does not differ from conventional speech recognition methods and cannot perform recognition using phonemes or phoneme pieces.
[0022] In view of these problems, an object of the present invention is to provide an information processing apparatus and the like that, when performing speech recognition on words and character strings contained in content information, can realize more appropriate speech recognition by using recognition dictionary information based on phonetic symbols, employing phonetic symbol recognition with phonemes and phoneme pieces, even when no word model, acoustic model, grammar model, or part-of-speech information is registered in the speech recognition dictionary.
Means for Solving the Problem
[0023] To solve the above problems, an information processing apparatus according to a first aspect of the invention comprises: content information acquisition means for acquiring content information including character information and/or meta information; recognized phonetic symbol string detection means for detecting, from the content information acquired by the content information acquisition means, a recognized phonetic symbol string composed of phonetic symbols; and recognition dictionary information generation means for generating recognition dictionary information using the recognized phonetic symbol string.
[0024] An information processing apparatus according to a second aspect comprises: content information acquisition means for acquiring content information including character information and/or meta information; expansion-target character string detection means for detecting an expansion-target character string from the content information acquired by the content information acquisition means on the basis of the character information and/or meta information; phonetic symbol storage means for storing character strings and phonetic symbols in association with each other; phonetic symbol conversion means for converting the expansion-target character string into a recognized phonetic symbol string by referring to the phonetic symbol storage means; and recognition dictionary information generation means for generating recognition dictionary information using the recognized phonetic symbol string.
[0025] According to a third aspect, the information processing apparatus of the second aspect further comprises content information storage means for storing the content information with the phonetic symbols converted by the phonetic symbol conversion means added to it.
[0026] According to a fourth aspect, the information processing apparatus of any one of the first to third aspects further comprises transmission means for transmitting the content information stored by the content information storage means, together with the recognition dictionary information generated on the basis of that content information, to another information processing terminal.
[0027] According to a fifth aspect, the information processing apparatus of any one of the first to fourth aspects further comprises: voice input means for inputting speech; feature quantity extraction means for extracting feature quantities of the speech input by the voice input means; feature quantity phonetic symbol conversion means for converting the feature quantities extracted by the feature quantity extraction means into phonetic symbols; and processing execution means for evaluating the phonetic symbols converted by the feature quantity phonetic symbol conversion means against the phonetic symbols constituting the recognized phonetic symbol strings included in the recognition dictionary information, and executing a predetermined process corresponding to the most similar phonetic symbols.
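The fifth aspect's recognize-then-execute flow might be dispatched as in the sketch below. The similarity measure (`difflib` ratio), the phoneme strings, and the actions are all illustrative assumptions, and the actual feature extraction from audio is outside the scope of this sketch:

```python
import difflib

# Illustrative recognition dictionary: phoneme string -> predetermined process.
ACTIONS = {
    "s a i s e i": lambda: "play",  # 再生 (playback)
    "t e i s i": lambda: "stop",    # 停止 (stop)
}

def execute_best_match(recognized_phonemes):
    """Evaluate the recognized phoneme string against every dictionary
    entry and execute the process tied to the most similar one."""
    best = max(
        ACTIONS,
        key=lambda entry: difflib.SequenceMatcher(
            None, entry, recognized_phonemes).ratio(),
    )
    return ACTIONS[best]()

print(execute_best_match("s a i s e e i"))  # noisy input, closest to 再生 → play
```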
[0028] According to a sixth aspect, in the information processing apparatus of the fifth aspect, the content information includes phoneme information and/or phoneme-piece information, and the processing execution means evaluates the phonetic symbols converted by the feature quantity phonetic symbol conversion means against the phonetic symbols constituting the recognized phonetic symbol strings included in the recognition dictionary information, and presents information to the user by speech utterance corresponding to the most similar phonetic symbols.
[0029] According to a seventh aspect, in the information processing apparatus of any one of the first to sixth aspects, the phonetic symbols are phonemes or phoneme pieces.
[0030] According to an eighth aspect, in the information processing apparatus of any one of the first to sixth aspects, the process to be executed is an authentication process accompanying phoneme recognition.
[0031] A program according to a ninth aspect causes a computer to realize: a markup language interpretation step of interpreting information described using a markup language; an attribute acquisition step of acquiring an attribute specified by the interpretation; a phonetic symbol extraction step of extracting a phonetic symbol string and/or phoneme string and/or phoneme-piece string associated with the attribute acquired in the attribute acquisition step; and a dictionary change step of changing, through the phonetic symbol extraction step, the phoneme string dictionary used in a phoneme recognition unit.
[0032] A program according to a tenth aspect causes a computer to realize: a markup language interpretation step of interpreting information described using a markup language; an attribute acquisition step of acquiring an attribute specified by the interpretation; a phonetic symbol extraction step of extracting a phonetic symbol string and/or phoneme string and/or phoneme-piece string associated with the attribute acquired in the attribute acquisition step; an information type evaluation step of evaluating, on the basis of the attribute acquired in the attribute acquisition step, the type of information the user inputs; and a dictionary change step of changing, through the information type evaluation step, the phoneme string dictionary used in the phoneme recognition unit.
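The ninth aspect's chain of steps — interpret the markup, acquire attributes, extract the associated phoneme strings, and change the recognizer's dictionary — can be sketched roughly as follows. The `<item>` tag and the `label`/`phoneme` attribute names are assumptions for illustration, not notation fixed by the patent:

```python
import xml.etree.ElementTree as ET

# Illustrative markup carrying phoneme strings as tag attributes.
markup = """<scene title="flowers">
  <item label="桜" phoneme="s a k u r a"/>
  <item label="梅" phoneme="u m e"/>
</scene>"""

def build_dictionary(xml_text):
    """Interpret the markup, acquire each element's phoneme attribute,
    and return the phoneme-string dictionary handed to the recognizer."""
    root = ET.fromstring(xml_text)
    return {el.get("phoneme"): el.get("label") for el in root.iter("item")}

print(build_dictionary(markup))
# {'s a k u r a': '桜', 'u m e': '梅'}
```

Switching scenes would simply mean calling `build_dictionary` on the markup of the new scene, which is the dictionary change step.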
Effects of the Invention
[0033] According to the present invention, in order to use an information processing apparatus employing phoneme recognition, the phoneme dictionary needed to recognize the provided content information is acquired from a markup language associated with, or included in, the content information, so that unspecified words relating to the displayed content can be handled. Accordingly, for situations where unspecified words are likely to occur frequently, such as product sales, and whether the processing is performed on a standalone device or in a server-client environment, the invention seeks to solve the problem by describing phoneme strings, phoneme-piece strings, and the names of various identifiers as tag attributes in the markup language, and by making it possible to specify an utterance phoneme dictionary per content image, per page of text, per frame within a document structure, per frame as a single frame of a moving image, or per scene spanning multiple frames of a moving image.
[0034] Further, by expanding the keywords used for these operations into phonemes, identifiers using phoneme strings or phoneme-piece strings for voice operation can be embedded, in association with any markup language or script, as variables, attributes, or specific tags in distribution file formats and markup languages transmitted to the user such as HTML, XML, RSS, EPG, BML, MPEG7, and CSV. This easily enables indexing by voice, the distribution and sharing of voice control information with which users acquire, browse, and operate on information, and the acquisition of voice control information on the terminal side, thereby solving the problem.
[0035] In implementing the conventional technique of recognizing unspecified words with phonemes and phoneme pieces, the present invention exploits the fact that, even for the diverse and changing content of the Internet environment, the words appearing in any one scene of content information are limited, and provides a dynamic dictionary construction method based on phoneme strings and phoneme-piece strings, thereby realizing speech recognition for unspecified words that uses no prosodic or grammar models and improving convenience.
[0036] In the case of MPEG7, within the tags representing scenes of video information, the scene name, actor names, and cast names are described with phoneme symbol strings or phoneme-piece symbol strings through attributes, variables, and tag-based range specification, so that portions other than the recognized portions of the audio stream are described in the markup language. By performing a search for an arbitrary actor name or cast name with phoneme search technology, the phoneme strings appropriate to each scene can be obtained from the markup language information, so an apparatus capable of arbitrary instructions and searches can be realized and the problem solved.
[0037] In the case of HTML, the problem is solved by methods such as: providing variables and attributes containing phoneme symbol strings in target links or CGI notation; converting the range enclosed by a specific tag into a phoneme string and embedding it as a tag variable or attribute; providing variables and attributes for each table element of the table tags surrounding selected products and giving each element tag a name as a phoneme string symbol in a variable or attribute; and giving phoneme strings as variables or attributes of form tags and input tags and then, on the basis of the given phoneme string, transmitting information or transitioning to the next page.
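As one possible reading of the HTML case in paragraph [0037], a phoneme string placed in a link's attribute lets an utterance select the page to transition to. In this sketch the `data-phoneme` attribute name and the page URLs are illustrative assumptions:

```python
from html.parser import HTMLParser

# Illustrative page: each link carries its name as a phoneme string.
html_doc = """
<a href="/sakura" data-phoneme="s a k u r a">桜の商品</a>
<a href="/ume" data-phoneme="u m e">梅の商品</a>
"""

class PhonemeLinkParser(HTMLParser):
    """Collect href targets keyed by the phoneme string on each link."""

    def __init__(self):
        super().__init__()
        self.links = {}  # phoneme string -> href

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "data-phoneme" in attributes:
            self.links[attributes["data-phoneme"]] = attributes["href"]

parser = PhonemeLinkParser()
parser.feed(html_doc)
print(parser.links["u m e"])  # → /ume
```

A recognized utterance matching "u m e" would then drive the transition to `/ume`.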
[0038] Phoneme strings and phoneme-piece strings may also be distributed via RSS; a method may be used in which IDs are associated with keywords using tags and a CSV file associating each ID with a phoneme string or phoneme-piece string is provided as recognition dictionary information to specify the keywords to be recognized; and personal authentication by passphrase may be performed by associating image recognition dictionaries for faces, fingerprints, and the like with recognition dictionaries carrying proper nouns as phoneme- or phoneme-piece-based phonetic symbol strings and with per-speaker acoustic models based on phonemes or phoneme pieces.
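A CSV recognition-dictionary file of the kind mentioned in paragraph [0038] might associate a keyword ID, the keyword, and its phoneme string per row; the column layout below is an illustrative assumption:

```python
import csv
import io

# Illustrative CSV dictionary: keyword ID, keyword, phoneme string.
csv_text = """id,keyword,phonemes
1,桜,s a k u r a
2,梅,u m e
"""

def load_dictionary(text):
    """Parse a CSV recognition dictionary into id -> (keyword, phonemes)."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["id"]: (row["keyword"], row["phonemes"]) for row in reader}

print(load_dictionary(csv_text)["1"])  # → ('桜', 's a k u r a')
```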
[0039] In this way, by acquiring the contents of a phoneme- or phoneme-piece-based recognition dictionary externally, as markup language attributes, arbitrary tags, or dictionary files, operation of the information processing apparatus becomes possible and the problem can be solved.
[0040] That is, since words not explicitly contained in the content information are not included in the recognition dictionary unless specified, the probability of misrecognition is reduced. At the same time, a highly convenient user interface can be realized by extending existing markup languages with phoneme and phoneme-piece notation, and by using dictionary information of phonemes and phoneme pieces attached to or associated with the markup language or content, for purposes such as voice operation, device control by phoneme strings and phoneme-piece strings, highly versatile personal authentication in which the authentication conditions involving images and audio can be varied according to the type of information, and information exchange between information processing apparatuses.
Brief Description of Drawings
[0041] FIG. 1 is a block diagram of an information processing apparatus using the present invention.
FIG. 2 is a diagram showing an example of the data structure of recognition dictionary information.
FIG. 3 is a diagram showing the operation flow of the phonetic symbol assignment process.
FIG. 4 is a diagram for explaining the operation of the phonetic symbol assignment process.
FIG. 5 is a diagram for explaining the operation of the phonetic symbol assignment process.
FIG. 6 is a diagram for explaining the operation of the phonetic symbol assignment process.
FIG. 7 is a diagram for explaining the operation of the phonetic symbol assignment process.
FIG. 8 is a diagram showing the operation flow of the recognition dictionary update process.
FIG. 9 is a diagram showing a different data structure of recognition dictionary information.
FIG. 10 is a diagram showing the operation flow of the recognition dictionary information update process.
FIG. 11 is a diagram for explaining the operation of the recognition dictionary information update process.
FIG. 12 is a diagram for explaining the operation of the recognition dictionary information update process.
FIG. 13 is a diagram for explaining the operation of the recognition dictionary information update process.
FIG. 14 is a diagram for explaining the operation of the recognition dictionary information update process.
FIG. 15 is a diagram showing the operation flow when applied to a server-client model.
FIG. 16 is a diagram showing the operation flow when applied to a server-client model.
FIG. 17 is a diagram for explaining a modification of the present embodiment.
FIG. 18 is a diagram for explaining a modification of the present embodiment.
Explanation of Reference Numerals
[0042]
1 Information processing apparatus
10 Control unit
20 Storage unit
202 Content information
204 Phonetic symbol conversion table
206 Recognition dictionary information
208 Phonetic symbol assignment program
210 Recognition dictionary information update program
212 Voice operation program
30 Communication unit
40 Input/output unit
50 Operation unit
60 Display unit
Best Mode for Carrying Out the Invention
[0043] The present invention can constitute information processing apparatuses that modify and save the markup language notation used for content information, save it for later use, or use the modified information as it is, as well as a distribution apparatus that distributes the modified information and a receiving terminal that receives it and uses it for recognition and for the operations and responses accompanying recognition. More specifically, as in the XML and HTML examples, these are methods of modifying information already described in a markup language by adding tags, variables, and attributes, then saving, modifying, and distributing it, and methods of receiving such information and operating the information processing apparatus.
[0044] <Example of Content Information>
Describing first the contents and content information that are the targets of the search and indexing performed using the present invention: it is generally well known that "content" refers to movies, dramas, photographs, news, animation, illustrations, paintings, music, promotional videos, novels, magazines, games, papers, textbooks, dictionaries, books, comics, catalogs, posters, broadcast program information, and the like; in the present invention, however, it may also be public information, map information, product information, sales information, advertising information, reservation status, viewing status, road conditions, questionnaires, surveillance camera video, satellite photographs, blogs, models, dolls, or robots, and it may further include information obtained from the cameras, microphones, and sensor inputs provided in those devices, as well as the names of such information, states, and situations, and the names of their abstract concepts, superordinate concepts, and subordinate concepts, expanded into symbol strings of phonemes or phoneme pieces.
[0045] The content may also be text that anticipates time-series changes of video, time-series changes of audio, or time-series changes of a reader's reading-aloud position, electronic information in HTML markup language notation, or search index information generated from these; the reading-aloud position may be interpreted as the time axis, with clauses, sentences, chapters, and passages captured as frames.
[0046] The content may also include meta information attached to the content, documents of character information, EPG and BML as program information, musical scales as score information, general still and moving images, polygon data, vector data, texture data, and motion data as three-dimensional information, still and moving images from visualized numerical data, content information intended for promotion and advertising, and so on, and it is composed of natural information including visual information, auditory information, character information, and sensor information.
[0047] A conventional method has been proposed in which the audio content of a piece of content is recognized and assigned a phoneme string using the <SpokenContent DS> tag of MPEG7 and the like, which describes phone lattices. However, because this method indexes by symbol strings based on recognition of the speech occurring within the content, it does not provide information in phoneme notation, phoneme-piece notation, or any other notation using pronunciation or phonetic symbols that would allow the user to search for content titles or performers by voice operation.
[0048] For this reason, unspecified words and proper nouns relating to names and expressions such as content titles and performers cannot necessarily be used in speech recognition. By expanding into phonemes the character information pertaining to the content information, such as scene descriptions, titles, and performer names, writing the result alongside the original, and embedding identifiers based on arbitrary pronunciation or phonetic symbols, including phoneme symbol strings, phoneme-piece symbol strings, and syllable symbol strings, into the MPEG information as tag variables and attributes as in the present invention, phoneme recognition technology can be applied to speech recognition.
[0049] That is, if the portion enclosed by the tags subject to speech processing in the markup language is an arbitrary character string, that character string is expanded into a phoneme symbol string or phoneme-piece symbol string so that it can be used for recognition with phoneme symbols or phoneme-piece symbols. The user's utterance may then be evaluated for a match against the recognized phoneme or phoneme-piece symbol strings, or the uttered phonemes may be converted into arbitrary phonetic characters so that phonetic characters are matched against one another; by evaluating the match with the phoneme symbol string based on the recognition result of the user's utterance, the string becomes the target of the user's operation or search. Characters and symbols written ideographically, such as at-marks and corner brackets, may be converted into phoneme symbols or phoneme-piece symbols via an appropriate phonetic symbol string, and for character strings from which multiple utterances can be inferred, multiple phoneme strings, phoneme-piece strings, or syllable symbol strings may be given, as in conventional speech recognition.
[0050] Then, the recognized phoneme string or phoneme-piece string is given to a database as a query and searched using a symbol-string matching method such as DP or HMM. The phoneme strings or phoneme-piece strings are added to the search results, which are presented as a list for the user to browse; a product is selected on the basis of the phoneme strings contained in the search results, and by detecting from the recognition dictionary, in the course of recognition, the phoneme strings or phoneme-piece strings for carrying out charging and purchase procedures according to the acquired control method, a series of sales-related processes can be performed. Authentication and charging may also be performed by combining a password with a phoneme recognition dictionary built from the user's utterance speech features and with recognition dictionaries built from image features such as fingerprints, irises, faces, or palm prints. In this way, search, browsing, sales, authentication, and charging procedures for goods, rights, products, and content information can be realized.
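The DP-based symbol-string matching step above can be sketched as follows. This is a minimal illustration, not the patent's actual implementation: it uses plain edit distance over slash-delimited phoneme strings, and the dictionary entries and scoring formula are hypothetical.

```python
def edit_distance(a, b):
    """Dynamic-programming (DP) edit distance between two phoneme lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(a)][len(b)]

def search(query, database):
    """Rank database entries by similarity to a recognized phoneme string."""
    scored = []
    for word, phonemes in database:
        d = edit_distance(query, phonemes)
        similarity = 1.0 - d / max(len(query), len(phonemes))
        scored.append((similarity, word, phonemes))
    scored.sort(reverse=True)
    return scored

# Hypothetical dictionary: word -> phoneme string (slash-delimited as in the text).
db = [("otoku campaign", "o/t/o/k/u/ky/a/m/p/e/e/N".split("/")),
      ("main", "m/e/i/n".split("/"))]
# A slightly misrecognized utterance still ranks the intended entry first.
results = search("o/t/o/k/u/ky/a/m/p/e/N".split("/"), db)
```

A real system would replace the uniform edit costs with phoneme confusion probabilities (or an HMM score), but the ranking structure is the same.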
[0051] In this way, by switching the recognition dictionary of identifiers or identifier strings based on phonetic symbols, which is needed to evaluate the arbitrary words that should be obtained as recognition results according to the position within the information, such as the content playback point, page, or display location, a highly versatile dictionary configuration for unspecified words in diverse usage environments becomes possible. By presenting words recognized on the basis of a phonetic-symbol recognition dictionary using dynamically configured phonemes and phoneme pieces, performing arbitrary processing, acquiring advertisement URLs, presenting advertisements, or operating devices, highly convenient information presentation can be realized for users in the distribution of content and advertisements. Furthermore, in an Internet environment such as the Web, search conditions can be specified and transmitted by using phoneme strings or phoneme-piece strings as the variables for CGI POST and GET processing, and Web page switching and operations can be performed.
[0052] The procedure for expanding Japanese into phonemes is well known: a "wakachigaki" (word segmentation) program converts the mixed kanji-kana notation obtained from ideograms into phonetic characters ("kana notation"), after which pronunciation symbols such as romaji corresponding to the kana notation are used to perform phoneme symbol conversion or phoneme-piece symbol conversion, constructing the symbol string used for recognition. A description in syllable symbols can be produced by a similar procedure.
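The kana-to-phoneme stage of that procedure can be sketched as below. This is a toy illustration under stated assumptions: the segmentation ("wakachigaki") step is assumed to have already produced kana, and the mapping table covers only the few kana needed for the example; a real system would use a full pronunciation dictionary.

```python
# Hypothetical kana -> phoneme mapping (a real system would use a complete
# pronunciation dictionary covering all kana and context rules).
KANA_TO_PHONEMES = {
    "メ": ["m", "e"], "イ": ["i"], "ン": ["n"],
    "カ": ["k", "a"], "ナ": ["n", "a"],
}

def kana_to_phoneme_string(kana_text):
    """Expand kana notation into a slash-delimited phoneme symbol string."""
    phonemes = []
    for ch in kana_text:
        phonemes.extend(KANA_TO_PHONEMES[ch])
    return "/".join(phonemes)

# "メイン" ("main") expands to the phoneme string used later in the text.
main_phonemes = kana_to_phoneme_string("メイン")
```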
[0053] For English, text can be converted into a phoneme symbol string using English phoneme symbols or pronunciation symbols, or using international phonetic symbols. Any language may be used, together with phoneme symbols or phoneme-piece symbols suited to that language; since pronunciation dictionaries exist for various languages, phonemes can serve as identifiers based on the phonetic symbols corresponding to each language's pronunciation symbols, phoneme pieces can serve as identifiers obtained by decomposing those phonetic symbols along the time axis, and these phonetic symbols can be associated with numbers and written in an appropriate character code, making it possible to distribute information using a markup language based on arbitrary phonetic symbols.
[0054] At this time, if necessary, the phoneme symbol string may be converted into a phoneme-piece symbol string to improve convenience in searching. Environmental-sound identifiers, musical-scale identifiers, image identifiers, and motion identifiers may also be handled: environmental sounds and scales may be represented as environmental-sound lattices or scale lattices, sections for image identifiers and motion identifiers may be provided in the MPEG stream, and phoneme strings or phoneme-piece strings may be given on the basis of the pronunciations of the names of these identifiers.
[0055] 次に、より具体的な手順について図を用いて説明する。  Next, a more specific procedure will be described with reference to the drawings.
〔Device configuration〕
First, the device configuration of the information processing device 1 to which the present invention is applied will be described with reference to FIG. 1. Here, the information processing device 1 is realized by an information processing apparatus such as a general-purpose computer, a dedicated terminal, or a portable mobile terminal.
[0056] As shown in FIG. 1, the information processing device 1 comprises a control unit 10, a storage unit 20, a communication unit 30, an input/output unit 40, an operation unit 50, and a display unit 60. Each functional unit is connected to the control unit 10 via a bus. The operation unit 50 and the display unit 60 may be freely detachable devices.
[0057] First, the communication unit 30 is a functional unit for exchanging information with other devices via a LAN (Local Area Network) or a communication network such as the Internet. The communication unit 30 is generally configured by a device capable of transmitting and/or receiving content information, such as an Ethernet (registered trademark) interface, a modem, a wireless LAN, or a cable television device.
[0058] Next, the input/output unit 40 is a functional unit for inputting and outputting information to and from other devices or the outside, and is composed of, for example, input devices such as a microphone, a scanner, a capture board, a camera, and sensors, and output devices such as a speaker, a printer, a modeling device, and a display device.
[0059] The storage unit 20 is a functional unit that acquires and stores information within the information processing device 1 and stores the programs executed by the control unit 10. The storage unit 20 is composed of ROM and RAM as semiconductor storage elements, a hard disk and magnetic tape as magnetic storage media, a CD (Compact Disk) and DVD (Digital Versatile Disk) as optical storage media, and the like.

[0060] Specifically, the storage unit 20 stores content information 202, a phonetic symbol conversion table 204, and recognition dictionary information 206, and holds a phonetic symbol addition program 208, a recognition dictionary information update program 210, and a voice operation program 212.
[0061] The content information 202 stores content acquired from outside via the communication unit 30 and content input via the input/output unit 40. The phonetic symbol conversion table 204 is a table referred to when content information is converted into phonetic symbols; for example, it is a table in which character strings are stored in association with phonetic symbols such as phonemes.
[0062] The recognition dictionary information 206 stores the relationship between words and phoneme strings, phoneme-piece strings, and the like (hereinafter these phoneme strings and the like are referred to as phonetic symbols). For example, as shown in FIG. 9, the item "Title", the target word "otoku campaign" (discount campaign), and the phoneme string (phonetic symbols) "o/t/o/k/u/ky/a..." expanded from the target word are stored in association with one another. In addition to such items, the recognition dictionary information 206 may register proper nouns such as a "product name", a "product nickname", and a "phoneme string based on the product nickname", and constitutes a recognition dictionary that realizes diverse recognition by dynamically replacing, by means of phoneme strings or phoneme-piece strings, words not registered in a general language dictionary, including exclamations and invectives.
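The dictionary structure described above can be sketched as a simple in-memory table. The "Title" entry follows the example given for FIG. 9; the other entries, and the field names, are hypothetical illustrations, not the patent's actual data layout.

```python
# Minimal sketch of the recognition dictionary information 206:
# item / target word / phoneme string, as in the FIG. 9 example.
recognition_dictionary = [
    {"item": "Title",
     "word": "otoku campaign",
     "phonemes": "o/t/o/k/u/ky/a/m/p/e/e/N"},   # hypothetical full expansion
    {"item": "Product nickname",
     "word": "choco-pan",                        # hypothetical proper noun
     "phonemes": "ch/o/k/o/p/a/N"},
]

def register(dictionary, item, word, phonemes):
    """Dynamically add or replace an entry, e.g. for words absent from a
    general language dictionary (exclamations, slang, proper nouns)."""
    dictionary[:] = [e for e in dictionary if e["word"] != word]
    dictionary.append({"item": item, "word": word, "phonemes": phonemes})

register(recognition_dictionary, "Exclamation", "wow", "w/a/o")
```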
[0063] The operation unit 50 is a functional unit that receives operation inputs from the user, and is composed of input devices that input information accompanying operations, such as a keyboard, a mouse, a camera, and a remote controller (including wireless). The display unit 60 is a functional unit that outputs information from the information processing device 1 so that the user can view it, and is configured using a display device, including a display or a projector, that performs displays relating to operations.
[0064] The control unit 10 calls the various programs stored in the storage unit 20 to execute the processing that realizes the functions corresponding to each program and to control the functional units of the information processing device 1.
[0065] The control unit 10 reads and executes the phonetic symbol addition program 208 from the storage unit 20, thereby realizing the phonetic symbol addition processing described later. Likewise, by reading and executing the recognition dictionary information update program 210 from the storage unit 20, it realizes the recognition dictionary information update processing described later, and by reading and executing the voice operation program 212, it realizes the voice operation processing.
[0066] By executing these programs, the control unit 10 can perform phoneme and phoneme-piece recognition processing, acquire tag information, tag identifiers, and phoneme strings and phoneme-piece strings, and select words by evaluating the similarity between the phoneme string or phoneme-piece string obtained by phoneme/phoneme-piece recognition of the user's uttered speech and the phoneme strings or phoneme-piece strings associated with dictionary registration information. It may also acquire a speech waveform from a microphone via the input/output unit and use it for speech recognition, or provide information to the user through a speaker by speech synthesis using the phoneme strings or phoneme-piece strings acquired by the present invention.
[0067] The control unit 10 is usually configured using a CPU (Central Processing Unit), a DSP, an ASIC, or the like, and can also be realized by combining these arbitrarily.
[0068] <Operation>
Next, the operation processes executed by the information processing device 1 will be described.
[0069] <Phonetic symbol addition processing>
First, the phonetic symbol addition processing will be described with reference to FIG. 3. FIG. 3 is an operation flow explaining the phonetic symbol addition processing, which is realized by the control unit 10 reading and executing the phonetic symbol addition program 208 in the storage unit 20.
[0070] First, the control unit 10 acquires the content information 202 received by the communication unit 30, or input via the input/output unit 40 and stored (step S301).
[0071] Next, an expansion target character string is detected from the read content information 202 (step S302). Here, the expansion target character string is a character string (information) for identifying a change in the display control method; taking a markup language as an example, it is delimited by a tag such as <A>, indicating a link, or <TITLE>, indicating a title. The character string to be expanded into phonetic symbols such as phonemes or phoneme pieces is detected in the range enclosed by these tags.
[0072] Next, the expansion target character string is expanded into a phoneme string or phoneme-piece string (phonetic symbols) consisting of the pronunciation symbols corresponding to its utterance (step S303). Thereby, for example, a title or the name of a link destination is converted into phonetic symbols. When converting the expansion target string into phonetic symbols, various methods of constructing the phonetic symbol string to be registered in the dictionary are conceivable: acquiring a character string by referring to other attributes contained in the tag, such as the ALT attribute or ID attribute, converting it into a phoneme string or phoneme-piece string, and constructing recognition dictionary information from the resulting phonetic symbol string; constructing phoneme or phoneme-piece strings from image file names, music file names, video file names, or document file names; constructing them from the tag attributes or tag-enclosed character strings described within image, music, video, or document files; constructing them from the character string enclosed by the tags themselves; or, using the link information associated with a tag as an attribute, constructing them from the name of the file at the link destination or the character information contained in that file.
[0073] Specifically, the conversion into phonetic symbols is performed using the phonetic symbol conversion table 204. For example, the character string "main" enclosed by title tags is converted into the phonetic symbols "m/e/i/n/" by referring to the phonetic symbol conversion table 204.
[0074] In some cases, an attribute already expanded into phonetic symbols is given in a tag of the content information itself without performing such expansion, so that a recognition phonetic symbol string for use in recognition is already configured. For example, after performing step S401 of acquiring content information, step S402 of detecting a "pronounce attribute" in the meta information accompanying the markup language information as shown in FIGS. 4 to 7 is performed; the phonetic symbol string of phonemes or phoneme pieces described as the variable of the detected pronounce attribute is extracted; and step S403 of registering the extracted phonetic symbol string as dictionary information, in association with the meta information in which the pronounce attribute was detected, is performed. By using the voice operation program, or by specifying the processing content or transition destination page through tags, CGI, or other meta information associated with a phonetic symbol string representing utterance sounds recognizable by the information processing device, arbitrary processing, procedures, and operations can be designated, realizing recognition using dynamic phonetic symbol strings.
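Steps S402 and S403 above can be sketched as a small attribute-extraction pass. The sample markup below is hypothetical, modeled on the pronounce-attribute examples the text attributes to FIGS. 5 to 7; a real implementation would use a proper markup parser rather than a regular expression.

```python
import re

# Hypothetical content information with a pre-expanded pronounce attribute.
content = '<title pronounce="o/t/o/k/u/ky/a/m/p/e/e/N">お得キャンペーン</title>'

def extract_pronounce(markup):
    """Step S402 sketch: return (tag name, phonetic symbol string) pairs for
    every pronounce attribute found in the markup."""
    pattern = r'<(\w+)[^>]*\bpronounce="([^"]+)"[^>]*>'
    return re.findall(pattern, markup)

# Step S403 sketch: register each extracted symbol string in association
# with the meta information (here, simply the tag name) it came from.
dictionary = {}
for tag, phonemes in extract_pronounce(content):
    dictionary[phonemes] = tag
```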
[0075] As a result, phonetic symbol storage processing is executed, and the recognition phonetic symbol string used for phonetic symbol recognition is stored (step S304). The phonetic symbol storage processing stores the phonetic symbols for recognition converted in step S303. For example, the following are executed: processing that extracts phonetic symbols already recorded as attributes in each tag, or expands the character string enclosed by the tags and adds phonetic symbols (phoneme strings or phoneme-piece strings) as a new attribute (step S304a); processing that adds to the content information tags or attributes indicating that each tag is a speech recognition target (step S304b); and processing that separates the proper nouns to be recognized, converts them into phonetic symbols, and constructs recognition phonetic symbol strings, thereby constructing and updating the recognition dictionary information 206 (step S304c). Thereby, processing that makes explicit both the content information and the recognition phonetic symbol strings, that is, the phoneme strings or phoneme-piece strings of the words to be used for recognition, is carried out.
[0076] Then, the control unit 10 updates and stores the changed content information 202, and updates and stores the recognition dictionary information for phonetic symbol recognition using the phonemes and phoneme pieces of the associated recognition phonetic symbol strings (step S305). This makes the changed content information available for recognizing the user's utterances and for distribution via the communication unit.
[0077] Although the above processing has been described as being executed by the information processing device 1, it may instead be executed on the side of the distribution device (server) that distributes the content information, reducing the processing burden of conversion into phoneme strings on the receiving side. When executed on the distribution device side, the distribution device distributes content information accompanied by voice control information in response to a content information request from the user. Accordingly, the information processing device 1 (terminal device) can acquire phoneme information classified according to the content's pages and frames, and arbitrary words can be used by voice with few restrictions.
[0078] Here, an operation example of the phonetic symbol addition processing will be described with reference to the drawings. First, FIG. 4 shows the content information 202 acquired by the information processing device 1. By executing step S301, the content information 202 is acquired from the communication unit 30 or the input/output unit 40 and stored in the storage unit 20.
[0079] Then, information related to the target tag is detected as the object of evaluation by phonetic symbols (phoneme strings/phoneme-piece strings) (step S302). The information in FIG. 4 is an example of content information using an RSS item section on which the extraction processing of step S302 is executed; the result of extracting the target character string from the item section and applying the conversion processing is shown in FIG. 5 or FIG. 6.
[0080] When a tag subject to expansion-target detection is found among the tags contained in the acquired content information, the character string in the range specified by that tag is detected. For example, in FIG. 4, "otoku campaign" (discount campaign), enclosed between the tags <title> and </title> indicating a title, is detected as the character string to be expanded into a phonetic symbol string. At this time, unnecessary bracket symbols may be deleted; by extracting this character string, an arbitrary title character string specified by the distribution side can be acquired.
[0081] The acquired character string is then confirmed and converted into a phonetic symbol string according to its pronunciation using the phonetic symbol conversion table 204. Then, as shown in FIG. 5, processing that newly appends, for example, a pronounce attribute as an attribute or variable of a tag described in the original content information 202 and writes the phonetic symbol string there (step S304a); processing that newly sets <pronounce>...</pronounce> tags as shown in FIG. 6; and processing that associates the phonetic symbols with the words or commands to be recognized, saves them as the recognition dictionary information 206, and describes and associates in the content information 202 the URL from which the recognition dictionary information 206 can be acquired, by means of a <META> tag or the like (step S304c), are executed. It thus becomes possible to append to, or associate with, the content information the phonetic symbol string information used for phonetic symbol recognition.
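The step S304a variant above (appending a pronounce attribute to an existing tag) can be sketched as follows. The RSS fragment is hypothetical, modeled on the FIG. 4 example, and the string-rewriting approach is an illustration only; a production system would modify a parsed document tree instead.

```python
import re

def add_pronounce(markup, tag, phonemes):
    """Step S304a sketch: append a pronounce attribute to the first opening
    <tag> in the markup. The text-to-phoneme conversion itself is assumed
    to have been done already (via the conversion table 204)."""
    return re.sub(r"<%s>" % tag,
                  '<%s pronounce="%s">' % (tag, phonemes),
                  markup, count=1)

# Hypothetical RSS item, modeled on FIG. 4.
rss_item = "<item><title>お得キャンペーン</title></item>"
annotated = add_pronounce(rss_item, "title", "o/t/o/k/u/ky/a/m/p/e/e/N")
```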
[0082] Then, by distributing the content information 202 modified as described above directly to other terminals, or by using it within the device, operations based on phonetic symbols (phonemes, phoneme pieces, pronunciation symbols, and pronunciation-symbol pieces) can be performed.
[0083] In MPEG-7, for example, as shown in FIG. 7, a <Pronounce DS> tag is added to describe the phoneme string of the content type, to note that "Rainsound" occurs as a background environmental sound, and to add the phoneme symbol string of a cast name, pronounce="t/o/m/u", as an attribute of the notation concerning a performer placed in a phrase tag. For HTML, an embodiment associated with buttons and links is presented; the range enclosed by arbitrary tags may be searched as a keyword and used for content browsing and search, pronunciations for operations may be provided as phoneme symbol strings, and the acquired phonemes, phoneme pieces, and phonetic symbol strings may be used in the word dictionary of utterance phonemes and utterance phoneme pieces for speech-synthesis utterance in the information processing device 1.
[0084] Thus, according to the phonetic symbol addition processing, by adding phonetic symbols for performing voice operations on the basis of the acquired content information, content information containing the phonetic symbol string information to be incorporated into the phonetic symbol dictionary used for phonetic symbol recognition can be configured.

[0085] <Recognition dictionary information update processing>
Next, the recognition dictionary information update processing for the case where phonetic symbols have already been added to the content information 202 will be described with reference to FIG. 8. FIG. 8 shows the operation flow of the recognition dictionary information update processing, which is realized by the control unit 10 executing the recognition dictionary information update program 210 in the storage unit 20.
[0086] First, the control unit 10 acquires the content information 202 (step S401). Next, the control unit 10 extracts phonetic symbol strings from the read content information 202 (step S402). In the present embodiment, the tags contained in the content information 202 (the portions enclosed by "<" and ">") are extracted, thereby identifying and extracting the tags that contain phonetic symbol strings.
[0087] For example, the control unit 10 extracts the pronounce attribute of the title tag <TITLE>, thereby extracting the phoneme symbol string "o/t/o/k/u..." given as its argument as the phonetic symbols. The extracted phoneme string is then stored as the page title and registered in the recognition dictionary information 206 (step S403).
[0088] By switching these dictionaries according to the display content, which changes as pages are switched, erroneous recognition of words not present in the display content can be avoided, improving the speech recognition rate and hence operability. The dictionary information that associates recognized words with control methods may also be updated by acquiring the necessary phoneme strings from, for example, the URL of dictionary information associated with the content information through arbitrary tags or character strings.
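The page-dependent dictionary switching described above can be sketched as follows. The page names and dictionary entries are hypothetical; the point is only that words absent from the currently displayed page are excluded from matching.

```python
# Hypothetical per-page recognition dictionaries: phoneme string -> word.
page_dictionaries = {
    "top":    {"o/t/o/k/u/ky/a/m/p/e/e/N": "otoku campaign"},
    "detail": {"k/o/u/ny/u/u": "purchase"},
}

class Recognizer:
    def __init__(self, dictionaries):
        self.dictionaries = dictionaries
        self.active = {}

    def switch_page(self, page):
        """Activate only the dictionary for the currently displayed page,
        so off-screen words cannot be misrecognized."""
        self.active = self.dictionaries.get(page, {})

    def lookup(self, phonemes):
        return self.active.get(phonemes)

r = Recognizer(page_dictionaries)
r.switch_page("top")
assert r.lookup("k/o/u/ny/u/u") is None   # not on this page
r.switch_page("detail")
```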
[0089] If the distributed content information contains no phoneme strings or phoneme-piece strings, or no related phoneme dictionary is associated with it, symbol strings of phonemes and phoneme pieces may be constructed from the content according to the above-described procedure for embedding phoneme strings and phoneme-piece strings, and dictionary information may be built from them. Dictionary information constructed in this way may be reused where possible, by detecting whether the same words are used.
[0090] When configuring the control recognition dictionary, for control commands whose phoneme symbol strings do not change, a dictionary that associates the ID related to each control command with the command word and its phoneme string, as shown in FIG. 9, may be used. Content information describing the ID that identifies the control command word as a command discrimination ID is distributed or recorded on a storage medium. Thereafter, in the information associated with the content information received via the communication unit or acquired from the storage medium, the command word is identified from the command discrimination ID described where the phoneme information or phoneme-piece information would otherwise appear, and the identified command word is converted into phonemes or phoneme pieces to construct the phoneme string or phoneme-piece string used for recognition. Alternatively, a hash value based on the phoneme string or phoneme-piece string associated with the control command may be used as the command discrimination ID. In this way, the phoneme string representation at transmission time, which tends to become redundant, can be shortened and communication efficiency improved.
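The hash-based command discrimination ID above can be sketched as follows. The choice of SHA-1 truncated to eight hex digits, and the command table entries, are hypothetical illustrations; the text only requires that the ID be derived from the phoneme string and be shorter than it.

```python
import hashlib

def command_id(phoneme_string):
    """Derive a short command-discrimination ID from a hash of the phoneme
    string (hash scheme and length are hypothetical choices)."""
    digest = hashlib.sha1(phoneme_string.encode("utf-8")).hexdigest()
    return digest[:8]   # short ID transmitted instead of the full string

# Command table in the spirit of FIG. 9: phoneme string <-> command word.
commands = {"p/l/e/i": "play", "s/t/o/q/p/u": "stop"}  # hypothetical entries
id_table = {command_id(p): (word, p) for p, word in commands.items()}

# Receiver side: recover the command word and phonemes from the short ID.
received = command_id("p/l/e/i")
word, phonemes = id_table[received]
```

Because the sender and receiver share the command dictionary, only the fixed-length ID needs to travel in the content information.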
[0091] Regarding content information 202 obtained via a storage medium or communication means and stored in the storage unit, if no conversion to or addition of phonetic symbols has been performed, the content is interpreted by the method described above and converted into phonetic symbols expressed as identifier strings suited to the information processing apparatus 1. If phonetic symbols have already been written, converted, or updated for the content information 202, no further conversion or update of the content information 202 is necessary.
[0092] Depending on the circumstances of the content distributor and the user, these conversions may be performed on the server side before distribution, performed as appropriate by the client on received data, performed by a stand-alone apparatus converting information obtained from an external storage medium into a form usable by the apparatus itself, or performed by relay means such as a gateway or router.
[0093] <Voice operation processing>
Next, the voice operation processing is described with reference to FIG. 10. First, the control unit 10 obtains content information received by the communication unit 30 or the input/output unit 40, or content information 202 stored in the storage unit 20 (step S501).
[0094] Next, phonetic symbols composed of phonemes and phoneme pieces are extracted from the obtained content information (step S502). The recognition dictionary information 206 is then updated and registered based on the extracted phonetic symbols (step S503).
[0095] The apparatus then waits until a voice input based on the user's utterance arrives from the input/output unit 40 (step S504; No). When the user makes a voice input (step S504; Yes), the control unit 10 extracts feature quantities from the input voice (step S505). Phonetic symbols such as phonemes and phoneme pieces are then recognized from the extracted feature quantities, and the input is converted into phonetic symbols (step S506).
[0096] A match evaluation is then performed to determine how closely the phonetic symbols converted in step S506 match the phonetic symbols previously registered in the recognition dictionary (step S507). This match evaluation uses an evaluation function to score the degree of match against the standard acoustic and speech models, standard parameters, and standard templates stored in the storage unit of the apparatus, and specifies a phonetic symbol as the evaluation result. A phonetic symbol string is then specified by obtaining, in time series, a plurality of phonetic symbols specified by the match evaluation. The registered phonetic symbol string with the highest similarity to the specified string is taken as the phonetic-symbol recognition result, and device operations and search processing are executed according to the information associated with that result (step S508).
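The text does not fix a particular evaluation function for steps S507 and S508; one simple illustrative choice is normalized edit distance over phoneme symbols, sketched below with sample phoneme strings taken from the embodiment:

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def similarity(recognized: str, candidate: str) -> float:
    """Score in [0, 1]: 1.0 means the phoneme strings match exactly."""
    a, b = recognized.split("/"), candidate.split("/")
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# Registered phoneme strings from the embodiment of FIGS. 11-14.
dictionary = ["i/ch/i/gy/o/u/m/e", "k/i/a/n/sh/a", "ts/u/g/i"]
recognized = "i/ch/i/gy/o/u/m/o"  # last phoneme misrecognized

best = max(dictionary, key=lambda c: similarity(recognized, c))
assert best == "i/ch/i/gy/o/u/m/e"
```

Because the score is computed per phoneme rather than per word, a single misrecognized phoneme still yields the correct dictionary entry.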
[0097] Here, the processing associated with the recognition result includes, for example, generating character strings containing proper nouns through recognition of phonetic symbol strings according to the present invention, executing operation commands and searches for information or products, presenting information to the user as determined by recognition of a phonetic symbol string, and performing operations directed by the user. Concrete examples include switching pages in a web browser; operating a television or video recorder; responses by voice, text, image, or video from robots, navigation devices, computers, audiovisual equipment, or home appliances such as cookers, washing machines, and air conditioners; specifying search conditions; saving, changing, registering, and deleting information presented by the information processing apparatus; specifying and viewing advertisements and program content according to recognition results; and personal authentication based on keywords or utterance features. Further, personal authentication by password phrase may be performed by associating an image recognition dictionary for faces, fingerprints, and the like with a recognition dictionary of proper nouns expressed as phonetic symbol strings of phonemes or phoneme pieces and with per-speaker acoustic models based on phonemes or phoneme pieces; billing and service selection may then accompany such authentication.
[0098] Specifically, by having the speech synthesizer that answers the user's questions utter words formed from the phoneme strings and phoneme-piece strings registered in the recognition dictionary, the apparatus can make explicit which words are recognizable, carry out arbitrary operations according to the recognition result, present the recognized character strings or word strings, or display advertisements associated with phoneme strings or phoneme-piece strings, in combination with conventional speech recognition technology.
[0099] It is then determined whether to accept the next voice input (step S509). If voice input is to continue (step S509; Yes), processing returns to step S504 to wait for further input. If no voice input is made (step S509; No), it is determined whether to obtain the next content information (step S510). If the next content information is to be obtained (step S510; Yes), processing repeats from step S501 to acquire new content. If no new content information is obtained (step S510; No), processing ends and the apparatus waits for the user's next utterance, completing this series of processes.
[0100] That is, an apparatus using the present invention obtains from the acquired markup language information the locations where the user can perform voice operations, using identifiers expressed as phonetic symbols such as phonemes and phoneme pieces and the feature quantities for specifying those identifiers. If necessary, arbitrary identifiers related to images and motions, such as fingerprints, facial expressions, and palm prints, may be obtained and combined for use in personal authentication and the like, or for the responsive actions of agents and robots driven by recognition.
[0101] Then, based on the identifiers and feature quantities obtained from the user's utterances and inputs, selection processing conventionally performed by mouse operation is carried out: giving focus to an arbitrary row or column of a table tag, to a link, or to an operation button; overlapping the cursor; issuing the events accompanying these operations from the operating system to the browser; controlling other devices via communication means such as infrared, LAN, or telephone line; or changing the responsive actions of an agent or robot according to the recognized word. In this way, a series of processes accompanying recognition of phonetic symbol strings of phonemes and phoneme pieces can be carried out.
[0102] For the content information obtained in step S501, the "pronounce" attribute information within tags is detected (step S502) and registered in the recognition dictionary information 206 (step S503). By simultaneously registering which tag each phonetic symbol string for recognition is associated with, the display position of each tag processed by the browser can be specified from its combination with preceding and following tags in the screen-layout information; a scene position can be specified by association with tags indicating scenes, titles, and time-series positions in content information such as MPEG-7; and a physical location can be specified by association with spatial position information by latitude and longitude, place names, regional information, and store information in XML that expresses map information.
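Steps S502 and S503 can be sketched with the standard library alone: scan the markup for a "pronounce" attribute and register each phonetic symbol string in a recognition dictionary together with the tag it was found on. The sample HTML is illustrative, modeled loosely on the FIG. 11 embodiment:

```python
from html.parser import HTMLParser

class PronounceExtractor(HTMLParser):
    """Collect (phoneme_string, tag) pairs from "pronounce" attributes."""
    def __init__(self):
        super().__init__()
        self.recognition_dictionary = []  # list of (phoneme string, tag name)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "pronounce" and value:
                self.recognition_dictionary.append((value, tag))

content = '''
<table>
  <tr pronounce="i/ch/i/gy/o/u/m/e"><td pronounce="k/i/a/n/sh/a">Drafter</td></tr>
</table>
<input type="submit" pronounce="sh/o/u/s/a/i" value="Details">
'''

extractor = PronounceExtractor()
extractor.feed(content)
assert ("i/ch/i/gy/o/u/m/e", "tr") in extractor.recognition_dictionary
assert ("sh/o/u/s/a/i", "input") in extractor.recognition_dictionary
```

Keeping the tag name next to each phoneme string is what later lets a match drive tag-specific behavior (row focus versus button press).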
[0103] Next, an operation example of the voice operation processing is described with reference to FIGS. 11 to 14. When the user pronounces "ichigyoume, kiansha" ("first row, drafter"), column selection of the table tag uses the pronounce attributes of the table tags written in the top row: the match between the phoneme strings i/ch/i/gy/o/u/m/e and k/i/a/n/sh/a and the phonemes of the user's utterance is confirmed against the recognition dictionary. As a result, the "drafter" column matching the uttered phoneme string is selected, and by selecting "first row" in the row-designating tag, the "drafter" cell of the "first row" is selected (step S506).
[0104] When a phoneme string provided as an attribute of one of the HTML submit buttons in FIG. 11 is detected, transmission is performed according to the form tag's specification; or, if the recognized phoneme string of the user's utterance closely matches "ts/u/g/i" ("next"), web browsing processing can be performed by moving to the link destination. When moving between pages, a question such as "Do you want to move?" may be presented to the user by voice, character string, image, or video (step S506), and interactive processing by an agent, robot, or the like that responds to the user may be performed. Personal authentication by password phrase may also be performed by associating an image recognition dictionary for faces, fingerprints, and the like with a recognition dictionary of proper nouns expressed as phonetic symbol strings of phonemes or phoneme pieces and with per-speaker acoustic models based on phonemes or phoneme pieces.
[0105] From the user's point of view, when the HTML of FIG. 11 is displayed in a browser, the result appears as in FIG. 12. When the user utters "ichigyoume", focus B300 is set on the first row as in FIG. 13, according to the phoneme string of the pronounce attribute "i/ch/i/g/y/o/u/m". When the user then utters "shousai" ("details"), the button B302 whose pronounce-attribute phoneme string is "sh/o/u/s/a/i" is selected as in FIG. 14, after which a click is performed and the form is submitted (step S506).
[0106] If many "details" buttons are displayed at this point, it becomes unclear which button is meant. The apparatus may therefore carry out interactive processing by voice or display, announcing, for example, "Which row number?" or asking "What is the draft number?" so as to obtain a target phoneme string or phoneme-piece string that can easily be inferred from the content presented to the user. The content of such utterances may be provided as words, phonetic symbol strings, or VoiceXML.
[0107] A browser that receives such events carries out the processing configured in advance for them (step S506). For example, for an <a> tag in HTML, it accesses the specified link destination and obtains an arbitrary web page, image, video, music, or product information. For operation-input tags such as <INPUT TYPE="button">, <INPUT TYPE="submit">, <INPUT TYPE="image">, and <BUTTON TYPE=...>, it transitions the HTML processing to the state in which the corresponding button or image has been pressed. For a <FRAME> tag, it selects the frame according to the frame's name. For a <SELECT> tag, it moves focus to the select tag bearing the phonetic symbol string of the phonemes or phoneme pieces uttered by the user, builds selection candidates from the option tags, and selects an arbitrary item. For <HR> and <A NAME=""> tags, it scrolls to the target line bearing the relevant tag, using the phonetic symbol string of phonemes or phoneme pieces associated with that tag as a variable or attribute. For a <TITLE> tag, it expands the range enclosed by the tag into a phonetic symbol string and stores it in a bookmark in association with its own URL.
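The tag-specific behaviors enumerated above amount to a dispatch table keyed by tag type. The sketch below is a hypothetical stand-in for those browser internals: the handler names, targets, and return strings are invented for illustration only.

```python
# Hypothetical per-tag handlers standing in for browser-internal processing.
def follow_link(target):
    return f"navigate:{target}"

def press_button(target):
    return f"click:{target}"

def select_frame(target):
    return f"frame:{target}"

# Tag type -> configured processing, per the enumeration in the text.
handlers = {"a": follow_link, "input": press_button,
            "button": press_button, "frame": select_frame}

# Recognition dictionary: phoneme string -> (tag type, tag-specific target).
registered = {
    "ts/u/g/i": ("a", "next.html"),
    "sh/o/u/s/a/i": ("input", "B302"),
}

def dispatch(recognized_phonemes: str) -> str:
    """Route a recognized phoneme string to the handler for its tag type."""
    tag, target = registered[recognized_phonemes]
    return handlers[tag](target)

assert dispatch("ts/u/g/i") == "navigate:next.html"
assert dispatch("sh/o/u/s/a/i") == "click:B302"
```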
[0108] Of course, recognition by these phonetic symbols may carry out such processing in cooperation with scripts. Arbitrary extension functions may be added through tags such as <EMBED SRC="...">, <OBJECT>, and <APPLET CODE="...">, and the identifier strings and feature quantities of phonemes, phoneme pieces, and phonetic symbols may be given to those programs as variables or attributes, used as commands for operating them externally, used as information for cooperation with scripts, or used as script control conditions.
[0109] For RSS or MPEG-7 using XML or RDF, for example, variables and attributes may be added as in FIG. 5, or changes made by adding tags as in FIG. 6, so that an item section such as that of FIG. 4 becomes selectable. A "pronounce" element may be added to the element types based on RDF's Dublin Core to write the phonetic names of scenes, actors, and cast roles as phonemes or phoneme pieces; "img-type" and "img-position" elements may be added to describe the display position and feature quantities of images; a "motion" element may be added to describe motion within the screen; and an "env-sound" element may be added to describe environmental sound identifiers and feature quantities.
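A small sketch of extending an RSS item with such a "pronounce" element. The element name comes from the text above, but the namespace URI and the sample scene name are placeholder assumptions:

```python
import xml.etree.ElementTree as ET

PRON_NS = "http://example.org/ns/pronounce"  # hypothetical namespace URI

item_xml = f'''
<item xmlns:pron="{PRON_NS}">
  <title>Scene 1</title>
  <pron:pronounce>sh/i/i/n/i/ch/i</pron:pronounce>
</item>
'''

item = ET.fromstring(item_xml)
pron = item.find(f"{{{PRON_NS}}}pronounce")
assert pron is not None
assert pron.text == "sh/i/i/n/i/ch/i"
```

An interpreter that does not know the namespace simply ignores the element, so the extension stays backward compatible with plain RSS readers.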
[0110] FIG. 11 shows tags for syllable notation of kana pronunciation, a phonetic script, and tags for phonemes, but these tags may instead carry phoneme pieces, image identifiers derived from input video, and the like. For example, a script may be processed, content presented, or a move to a link destination made when it is recognized from the user's voice or facial expression that the user is angry, or when a specific identifier is detected within an image.
[0111] These tags, attributes, and elements are generally evaluated by character-string matching in the interpreting apparatus, and information based on those tags and attributes is provided to the functions, processes, programs, and services stored in the information processing apparatus that carry out the corresponding processing.
[0112] For phoneme and phoneme-piece recognition functions, phoneme strings and phoneme-piece strings may be supplied and registered in the recognition target dictionary. For other identifiers and feature quantities, the evaluation coefficients of detection results may be changed, instruction information output to peripheral devices, differently pronounced synonyms prepared through the dictionary structure, multiple phonetic symbol strings written in the attribute that carries them and separated by boundary symbols so that such synonyms can be registered in the dictionary, or processing performed to correct the results obtained by recognition.
[0113] Alternatively, a character string to be displayed may simply be enclosed in pronunciation tags to designate it as a pronunciation target, and character strings in other languages, such as kanji-kana mixed text, English, or Chinese, may be converted via phonetic symbols for pronunciation into symbol strings of phonemes or phoneme pieces for use in recognition, command control, detection, and search. Rather than writing phonetic symbols in the alphabet, they may be written in the markup language as numeric values such as ASCII codes or EBCDIC codes.
[0114] Further, by controlling the video, dialogue, screen features, and displayed objects related to the feature quantities and identifiers associated with words in this way, the method may be used in tools that generate and produce movies and programs by CG and the like. Movies and programs may also be evaluated based on the correlation between them (and their scenes) and the obtained feature quantities and identifiers, by recognizing the state of utterances during content viewing or by using content evaluations such as votes cast through the user's voice operations and viewing counts.
[0115] <Server-client model>
The mechanism described above may also implement the markup-language-based search procedure as a server-client model. FIG. 15 shows the state transitions of the processing in the server-client model.
[0116] First, the terminal device acting as the client generates a query. The query may be generated by ordinary character-string input, by voice input, or by presenting an image and using its feature quantities as the query.
[0117] The distribution apparatus acting as the server then searches for appropriate results based on the generated query, and in accordance with the search results the distribution base station distributes to the terminal device the search-result list information using the present invention. The terminal device interprets the markup language of the obtained information, converts character strings in ranges enclosed by specific tags into the aforementioned identifiers such as phonemes and phoneme pieces, obtains phonemes and phoneme pieces from the voice input information uttered by the user, constructs phonetic symbol strings such as phoneme strings and phoneme-piece strings, and carries out matching processing based on those strings.
[0118] After performing the processing corresponding to the operations, keywords, and identifier strings with a high degree of match, the terminal constructs a query from the designated identifier strings and transmits those queries to the distribution base station to carry out the search, thereby realizing search with voice control using the markup language. The constructed query may also be used for search on the device alone.
[0119] In FIG. 16, the insertion and setting of identifier symbol strings such as phonemes and phoneme pieces for voice processing are performed through markup-language interpretation on the terminal side; however, the identifier symbol strings may instead be inserted by the distribution-side server, inserted manually in advance, constructed and inserted on the distribution-base-station-side apparatus or an apparatus cooperating with it, or handled by a stand-alone information processing apparatus. In any of these configurations, variables and attributes for realizing the operations and processing of the present invention may be added, tags added, the content of the markup language information changed, and dictionaries related to identifiers and feature quantities changed, added, or deleted.
[0120] For new combinations of identifiers generated by the present invention, a word based on an arbitrary name may be given to carry out a search; a symbol string of phonemes or phoneme pieces may be given to an arbitrary name to support voice control; such symbol strings may be given as keywords for operation; such pronunciation symbol strings may be associated with advertisements; or an advertise attribute may be added and the URL of the related advertisement written within the same tag as the phonetic symbol attribute, thereby associating the advertisement with the phonetic symbol string.
[0121] As for the identifiers of images displayed in the browser and the phonemes and phoneme pieces of keywords to be controlled, the identifier strings, or symbol strings and IDs obtained by compressing them, may be converted into a form easy for the terminal to use and transmitted along with the necessary information, making it straightforward to use voice on a device without interpreting the markup language. A highly convenient operation environment may then be configured by importing such dictionary information into a remote control or mobile phone via a communication line, obtaining it by e-mail, or downloading it from another device.
[0122] A recognition dictionary based on file names may also be configured by writing file names as phoneme strings, and file names may be set by phoneme strings and phoneme-piece strings so that information within the markup language can be selected by phoneme and phoneme-piece recognition. With recognition, a variety of services may be provided: searching not only for stock prices by securities code or company name and for products by JAN code, but also by product name, performer name, company name, or region name. The phoneme dictionary may be changed according to location or apparatus, changed per page, or changed per unit such as a content image, a page of text, a frame in a document structure, a frame as one picture of a moving image, or a scene spanning multiple frames of a moving image.
[0123] Further, if indexing according to the present invention is applied to an information format with chunk headers, such as the RIFF format shown in FIG. 17, a tag such as "PRON" may be arbitrarily provided as a chunk header to carry phoneme strings and phoneme-piece strings. For an ordinary file, its content may hold general metadata such as the file name, production date, and producer; for a 2D or 3D image, the phonemes and phoneme pieces for the names of displayed objects, persons, and parts; for an audio file, the phonemes and phoneme pieces of the speech that appears; and for a music file, the lyrics and title written as phonemes and phoneme pieces. Phonemes and phoneme pieces may also be written in a free-description area and used for search.
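A minimal sketch of carrying a phoneme string in such a "PRON" chunk. The "PRON" fourcc is the tag the text proposes; the chunk layout follows the standard RIFF convention (4-byte id, 4-byte little-endian size, payload, padding to an even byte boundary), and the sample payloads are illustrative:

```python
import io
import struct

def write_chunk(stream, fourcc: bytes, payload: bytes):
    """Append one RIFF-style chunk: id, little-endian size, payload, pad byte."""
    stream.write(fourcc)
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)
    if len(payload) % 2:          # RIFF chunks are word-aligned
        stream.write(b"\x00")

def find_chunk(stream, fourcc: bytes):
    """Scan a flat sequence of chunks for the first one matching fourcc."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return None
        cid, size = header[:4], struct.unpack("<I", header[4:])[0]
        payload = stream.read(size + (size % 2))[:size]
        if cid == fourcc:
            return payload

buf = io.BytesIO()
write_chunk(buf, b"INFO", b"producer=example")
write_chunk(buf, b"PRON", b"o/n/g/a/k/u")  # phoneme string for the title
buf.seek(0)
assert find_chunk(buf, b"PRON") == b"o/n/g/a/k/u"
```

Because unknown chunks are skipped by size, existing RIFF readers that do not understand "PRON" remain unaffected.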
[0124] <Modifications>
Although this embodiment describes examples centered on phonemes as phonetic symbols, the phoneme portions may be replaced with phoneme pieces, and the phoneme type may be changed to international phonetic symbols or to phoneme strings of other languages such as English or Chinese. The selection ranges and branches in markup-language processing may be configured according to whether a round or a triangular image was presented to the computer, based on identifiers derived from recognized images; a search may be carried out based on a presented photograph; and names associated with a photograph's feature quantities may be expanded into phoneme strings or phoneme-piece strings and transmitted and received in a markup language or a dedicated symbol string so that operation by voice is possible. Character-encoding methods other than ASCII, such as Unicode, JIS codes, and ISO codes, may be used, or a proprietary character code system assigning arbitrary numeric IDs based on phonemes and phoneme pieces may be used.
[0125] The identifier strings and identifiers used in the present invention may specify attributes, variable names, and identifiers based on the names used for identification, using one or a combination of identifier types such as musical scale, instrument, mechanical sound, environmental sound, image, face, facial expression, person, motion, landscape, display position, character symbol, sign, shape, graphic symbol, and broadcast program. An identifier string may be treated as identifiers written consecutively in accordance with time-series transitions, and identifiers may be converted into phoneme and phoneme-piece strings based on their names for use. These identifiers and identifier strings may be transmitted using the GET or POST methods in CGI to obtain search results.
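The CGI GET transmission mentioned above can be sketched as encoding the identifier string into a query parameter appended to the search URL. The endpoint and parameter name ("pron") are illustrative assumptions:

```python
from urllib.parse import urlencode, parse_qs, urlparse

def build_search_url(base: str, identifiers: list) -> str:
    """Encode a phoneme identifier string as a GET query parameter."""
    query = urlencode({"pron": "/".join(identifiers)})
    return f"{base}?{query}"

url = build_search_url("http://example.org/cgi-bin/search",
                       ["k", "i", "a", "n", "sh", "a"])

# The server side recovers the identifier string with parse_qs:
params = parse_qs(urlparse(url).query)
assert params["pron"] == ["k/i/a/n/sh/a"]
```

`urlencode` percent-escapes the "/" separators, so the identifier string survives the round trip through the URL intact.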
[0126] In this way, by giving a markup language attributes and variables through the names of voice-related feature quantities together with their identifiers and identification functions, and the names of features of still and moving images together with their identifiers and identification functions, a markup language operable by voice can be constructed. Because the user can realize voice control of a device through the phonetic symbol strings provided by an information processing apparatus that processes such a markup language, the invention can be applied not only to content search but also to public information, map information, product sales, reservation status, viewing status, questionnaires, surveillance camera video, satellite photographs, blogs, and the control of robots and equipment. Search and processing results for these requests may be returned from the server to the client using an arbitrary markup language.
[0127] <Example procedure for an information processing apparatus used in terminals and base stations>
The present invention is also applicable to a server-client processing system involving base stations and terminals. The apparatus and terminal are configured as shown in Fig. 18 and are connected via a communication line; by acquiring information from other devices and distributing information to them, information related to voice operation can be exchanged, improving user convenience. The shared line used here is not limited to the Internet: any wide-area or indoor network such as a LAN or telephone line may be used, whether wired or wireless. The target devices may be home appliances, remote controls, robots, mobile phones, or communication base stations, and the invention can be practiced with any device or service, including web services, telephone services, and EPG distribution.
[0128] The system may also be composed of a user terminal, a distribution base station, devices such as robots controlled by the terminal or base station, and a controlling remote control; the remote control or robot may serve as one form of terminal or one form of base station. The user speaks to the terminal, and the terminal or base station carries out any of the following processing procedures for recognition.
[0129] In the first method, feature quantities are extracted from the uttered speech or the captured video, and the feature quantities are transmitted to the target relay point or base station apparatus. The base station apparatus that receives them generates phoneme symbol strings, phoneme-piece symbol strings, or other image identifiers according to the feature quantities, and then selects and executes the matching control means based on the generated symbol string.
[0130] In the second method, feature quantities are extracted from the speech or captured video; identifiers accompanying recognition, such as phoneme symbol strings, phoneme-piece symbol strings, and other image identifiers, are generated within the terminal; and the generated symbol string is transmitted to the target relay point or base station apparatus. The controlled base station apparatus then selects and executes the matching control means based on the received symbol string.
[0131] In the third method, feature quantities are extracted from the speech or captured video; phoneme strings, phoneme-piece symbol strings, or other image identifiers are recognized in the terminal from the generated feature quantities; the control content is selected based on the recognized symbol string; and the control method is transmitted to the base station apparatus to be controlled or to an apparatus relaying information distribution.
[0132] In the fourth method, the speech waveform or image obtained through the terminal is transmitted as-is to the controlling base station apparatus; phoneme symbol strings, phoneme-piece symbol strings, or other image identifiers are recognized within the controlling apparatus; the control means is selected based on the recognized symbol string; and the selected control is executed by the controlled relay point or base station apparatus. The same applies to the features and identifiers of other sounds and video, such as environmental sounds.
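The four split points above can be sketched as one pipeline in which the terminal stops at a chosen stage and the base station completes the remaining stages. Every function body here is a trivial stand-in: real feature extraction, recognition, and dictionary lookup would replace these placeholders.

```python
def extract_features(waveform):
    # Stand-in for real acoustic feature extraction.
    return [round(sum(waveform) / len(waveform), 3)]

def recognize(features):
    # Stand-in for phoneme/identifier recognition (HMM, Bayes, etc.).
    return ["a", "k", "a"] if features else []

def select_control(symbols):
    # Dictionary lookup from a symbol string to a control action.
    return {"aka": "power_on"}.get("".join(symbols), "noop")

def terminal_send(waveform, method):
    if method == 4:                      # fourth method: raw waveform as-is
        return ("waveform", waveform)
    feats = extract_features(waveform)
    if method == 1:                      # first method: feature quantities
        return ("features", feats)
    syms = recognize(feats)
    if method == 2:                      # second method: recognized symbols
        return ("symbols", syms)
    return ("control", select_control(syms))  # third method: chosen control

def base_station_receive(kind, payload):
    # Complete whatever stages the terminal left undone.
    if kind == "waveform":
        kind, payload = "features", extract_features(payload)
    if kind == "features":
        kind, payload = "symbols", recognize(payload)
    if kind == "symbols":
        payload = select_control(payload)
    return payload
```

Whichever split is chosen, the base station arrives at the same control action, which matches the text's point that the conversion level can be selected to suit terminal CPU performance.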
[0133] At this point the terminal may simply transmit the waveform alone, transmit feature quantities, transmit recognized identifier strings, or transmit processing procedures such as commands and messages associated with an identifier string; the configuration of the distribution base station may be changed to match the transmitted information so as to implement a client-server model, and the transmitting and receiving sides may also exchange data in both directions. Feature quantities of images, sounds, actions, and the like related to the identifiers described above may be assigned to markup-language attributes so that the degree of match between features extracted from user-provided information and features extracted from distributed information can be evaluated; through such search and recognition, arbitrary control and information processing involving responses to the user may be realized. Personal authentication by password phrase may also be performed by associating image recognition dictionaries for faces, fingerprints, and the like with recognition dictionaries for proper nouns using phonetic symbol strings of phonemes or phoneme pieces, and with per-speaker acoustic models based on phonemes or phoneme pieces.
[0134] The command dictionary that converts an input phoneme string or phoneme-piece string into its associated processing procedure may reside on the terminal side or on the distribution base station side, and symbol strings such as phonetic symbol strings and image identifiers for new control commands, media types, format types, and device names may be transmitted, distributed, and exchanged using markup languages described later, such as XML and HTML, or using RSS and CGI.
[0135] A more specific procedure for distributing and exchanging dictionary information is described below. First, by extracting feature quantities and identifiers and constructing evaluation functions, information is exchanged with other terminals and devices in an environment connected to any communication line, whether infrared, wireless LAN, telephone line, or wired LAN.
[0136] Next, taking the use of phoneme pieces in terminal-side processing as an example: the user utters speech, giving a speech waveform to the terminal and apparatus. The terminal-side apparatus analyzes the given speech and converts it into feature quantities, which are then recognized and converted into identifiers by various recognition techniques such as HMMs and Bayesian methods.
[0137] The converted identifiers are information indicating phonemes, phoneme pieces, or various image identifiers; as described elsewhere, for audio they may be phonemes, environmental sounds, or musical scales, and for images they may be identifiers based on the image or on motion. Based on the obtained identifiers, a dictionary of phoneme and phoneme-piece symbol strings is consulted by DP matching to select an arbitrary processing procedure, and the selected procedure is transmitted to the target device for control. The present invention thus makes it possible to use a mobile terminal as a remote control or to control home appliances through a robot, and a dialogue apparatus for people with disabilities may also be configured by providing a display of spoken-sound notation or a Braille output unit for smooth communication with a remote party.
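As a concrete sketch of the DP-matching lookup just described, the classic Levenshtein edit distance (one standard form of DP matching) can pick the dictionary entry closest to a noisy recognition result; the phoneme strings and action names below are made up for illustration.

```python
def edit_distance(a, b):
    # Single-row DP computation of the Levenshtein distance
    # between two phoneme sequences.
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (ca != cb))  # substitution
    return d[len(b)]

def lookup(recognized, dictionary):
    # Select the entry whose phoneme string is closest to the input.
    return min(dictionary, key=lambda e: edit_distance(recognized, e[0]))[1]

commands = [("t/e/r/e/b/i".split("/"), "tv_on"),
            ("d/e/N/k/i".split("/"), "light_on")]
# One phoneme misrecognized ("b" -> "p") still matches the right command.
action = lookup("t/e/r/e/p/i".split("/"), commands)
```

A real implementation would weight insertions, deletions, and substitutions by confusion statistics rather than treating all edits equally.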
[0138] Depending on the CPU performance of the terminal, information processed by this procedure may be transmitted as the original natural information, such as video and audio, without conversion into feature quantities; transmitted after conversion only as far as feature quantities; transmitted after conversion only as far as identifiers; or transmitted after the control information has been selected. Any conversion level can be chosen. The receiving side is configured as an apparatus capable of processing the information from any of these states, and based on the acquired information it may transmit to a distribution station or control apparatus, or perform arbitrary processing such as search, recording, mail delivery, machine control, and device control.
[0139] For search processing, identifier strings, character strings, and feature quantities serving as queries are obtained through recognition as appropriate and transmitted to the distribution-side base station, and information matching the query is obtained. Advertisements may be displayed during communication or search wait times. When control is performed by voice, a control dictionary is constructed so that control items can be selected over the communication link, and dictionary information is exchanged and acquired mutually; this procedure may use P2P technology, and such information may also be sold or distributed.
[0140] Because this control command dictionary is composed of phonemes, phoneme pieces, any of the identifiers and feature quantities described above, and device control information, its contents can be freely updated and reused. Trending search keywords can be kept current by replacing or reconstructing the dictionary information for search that associates arbitrary identifiers with feature quantities, and the recognition dictionary information changed according to the location and composition of the content information may be a dictionary for face recognition, fingerprint recognition, character recognition, or figure recognition.
[0141] In the control command dictionary, infrared control information for transmission to products controllable by a conventional infrared remote control may be selected as the device control information; a series of operations may be executed continuously, like batch processing, by combining such control information; and, depending on the CPU performance of the apparatus, only the feature quantity information may be sent to the voice-controlled information processing apparatus without recognizing identifiers.
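A toy sketch of the batch idea above: a control command dictionary could map one recognized phonetic command to a sequence of infrared codes executed in order. The command string, the code names, and the `send` hook are all illustrative assumptions.

```python
# Hypothetical dictionary pairing a phonetic command string with a series of
# conventional infrared control codes, executed in order like a batch job.
IR_DICT = {
    "r/o/k/u/g/a": ["POWER_ON", "INPUT_TUNER", "REC_START"],  # "rokuga" (record)
}

def run_batch(command, send=lambda code: code):
    # 'send' stands in for the actual infrared transmitter driver;
    # unknown commands produce an empty batch.
    return [send(code) for code in IR_DICT.get(command, [])]
```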
[0142] By combining control via infrared remote control in this way, even conventional devices incapable of voice control can be supplied with infrared remote control signals derived from voice information via the conversion dictionary; for devices capable of voice control, commands can be recognized and executed based on feature quantities or speech waveforms. It is also possible to update the control dictionary as performance improves, to check the version information of the control dictionary, and to check the current state of the device.
[0143] By introducing a server-client model in this way, dividing the processing between server and client at any processing step, linking them by communication, and exchanging arbitrary information between them, equivalent services, infrastructure, search, and indexing can be realized.
[0144] Furthermore, to additionally perform personal authentication by recognizing faces, fingerprints, and voice characteristics, a phoneme recognition dictionary using acoustic models, standard parameters, and standard templates matched to the individual's voice characteristics can be used as the phoneme recognition dictionary information; recognition dictionaries involving images and sounds can then be changed per user, realizing highly versatile personal authentication. Accordingly, services involving various operations, such as billing, locking and unlocking keys, selecting services, granting usage permission, and using copyrighted works, can be realized using an information terminal that performs recognition according to the present invention.
[0145] Moreover, using a terminal that performs recognition according to the present invention, information acquired from a backbone server at the communication destination by client terminals such as DVD recorders, network TVs, STBs, HDD recorders, and music or video recording/playback devices can be provided to mobile terminals and mobile phones via infrared communication, FM or VHF band communication, or wireless communication such as 802.11b, Bluetooth (registered trademark), ZigBee, WiFi, WiMAX, UWB (Ultra Wide Band), and WUSB (Wireless USB), making EPG, BML, RSS, teletext data broadcasting, television video, and teletext available on mobile terminals and phones. Remote operation may also be performed by instructing the operation and control procedures of information terminals, home appliances, information equipment, and robots through voice input, character string input, or gestures such as shaking the mobile terminal or phone, or by using the mobile terminal or phone as an ordinary remote control that directs the client terminal to operate home appliances, information equipment, and robots.
[0146] When a phoneme dictionary based on attributes extracted in association with input items, such as HTML FORM tags in information composed in a markup language, has been registered in advance in the recognition dictionary information 206, the recognition priority may be shifted to the pre-registered phoneme dictionary, or the recognition targets may be limited using the pre-registered dictionary.
[0147] Also, for phonetic symbol strings such as phoneme strings and phoneme-piece strings based on the attribute variables of information composed in a markup language, the recognition dictionary information 206 may comprise multiple entries by listing side by side several phoneme strings or phoneme-piece phonetic symbol strings that could be recognized at the same time, and the same recognition dictionary information 206 may be used for input items having the same attribute variable.
[0148] Multiple words that might be recognized may likewise be written into an attribute variable using multiple phoneme strings, phoneme-piece strings, or phonetic symbol strings. For example, when a counter word for some unit is written as a phoneme string, phoneme-piece string, or phonetic symbol string, methods such as switching the recognition dictionary information 206 to a numerals-only dictionary, to a dictionary dedicated to the current menu items, or to a restricted proper-noun dictionary of place names or station names may be used.
[0149] Also, according to the character code selected for display based on the markup language, multiple per-language sets of values obtained as learning results, such as Bayesian discriminant functions, standard patterns and standard templates used for HMMs, eigenvalues and eigenvectors, and covariance matrices, may be prepared for the step (S506) of converting the speech waveform into identifiers of phonetic symbols such as phonemes and phoneme pieces. Multiple languages can then be supported by switching, for example, to a Russian standard template if the display is Russian or to a Chinese standard template if the display is Chinese, and the standard template used for recognition may also be selected from among the languages by acquiring information on the language environment specific to the user's information processing apparatus, operating system, or browser.
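The template switch just described reduces to a lookup keyed on the display language; the locale strings and template-set names here are assumptions for illustration.

```python
# Illustrative mapping from a display language/locale hint to the
# standard-template set used in step S506 (all names are assumptions).
TEMPLATES = {"ru": "russian_standard",
             "zh": "chinese_standard",
             "ja": "japanese_standard"}

def select_template(locale, default="japanese_standard"):
    # e.g. "ru_RU" -> "ru"; unknown languages fall back to a default set.
    return TEMPLATES.get(locale.split("_")[0].lower(), default)
```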
[0150] The system may also be configured so that, at the user's direction, a standard template can be selected to cope with accents and dialects arising from differences between the user's native language and the language recognized in the device environment in use, for example a standard template for a Russian speaker uttering Chinese, and templates may also be constructed by learning the user's accent or dialect from the user's utterances.
[0151] Methods such as converting the contents of cookies or sessions into phonetic symbol strings of phonemes or phoneme pieces according to attribute variables and switching the recognition dictionary information 206 accordingly may also be used. Phonetic symbol strings such as phoneme strings and phoneme-piece strings recognized from speech, or feature quantities extracted from speech, may be transmitted to the base station by any variable transmission means, including techniques using scripts such as AJAX, techniques passing status and environment variables as CGI (Common Gateway Interface) parameters, and socket communication by a program. Using the phonetic symbol strings of phonemes and phoneme pieces received on the base station side, or phoneme strings and phoneme-piece strings recognized from speech feature quantities received on the base station side, the user's utterance may be discriminated to perform arbitrary processing, or search conditions may be constructed to search content information, advertising information, and regional information.
[0152] Then, any combination of information that changes with the recognition processing of these phonetic symbol strings may be transmitted from the base station to update it, or arbitrary processing may be carried out autonomously within the terminal device: display information such as pictures, characters, icons, and CG (Computer Graphics) on the terminal device; output sound information such as music and warning tones; operation control information for devices such as robots, machinery, communication devices, electronic equipment, and electronic musical instruments; recognition dictionary information 206 for recognizing speech, still images, moving images, and the like; and processing procedure information such as programs, scripts, and function expressions for extracting features from video, audio, and images.
[0153] Also, when phonetic symbols such as phonemes and phoneme pieces obtained as recognition results are divided into multiple frames to obtain recognition results that are continuous in time series, dynamic speech recognition may be configured in combination with techniques used in conventional speech recognition: the distance information between the input speech and the phonetic symbols, obtained as the recognition result for multiple phonemes or phoneme pieces spanning multiple frames, may be used as feature quantities to construct the parameters of a Bayesian discriminant function; distance information obtained as recognition results spanning multiple frames may be used to construct HMM parameters for time-series reduction; or the identifiers ranked first by the recognition results over multiple frames may be evaluated by DP or the like.
[0154] More specifically, first, markup language information is acquired in the content information acquisition step (S401, S501); together with the tag attribute detection means, which detects tags from the markup language information and tag attributes from the tags, a phonetic symbol string extraction step (S402, S502) is performed to extract the phonetic symbol strings associated with the detected attributes, followed by a step (S403, S503) of registering them in the recognition dictionary information 206 as the phonetic symbol strings used for recognition. These steps (S401 to S403, S501 to S503) can be implemented with character string evaluation and detection processing, but they have not been used in conventional speech recognition systems, in search by phoneme recognition, or in the operations, searches, and browsing of content information performed with phoneme string recognition in web browsers and Internet environments.
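The attribute-extraction steps (S401–S403 / S501–S503) can be sketched with Python's standard HTML parser. The `data-phoneme` attribute name is an assumption standing in for whatever attribute actually carries the phonetic symbol string in the markup.

```python
from html.parser import HTMLParser

class PhoneticAttrExtractor(HTMLParser):
    """Detects tags, reads a hypothetical data-phoneme attribute, and
    registers (phonetic string, item name) pairs as dictionary entries."""

    def __init__(self):
        super().__init__()
        self.dictionary = []  # plays the role of recognition dictionary 206

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "data-phoneme" in a:
            self.dictionary.append((a["data-phoneme"], a.get("name", "")))

page = ('<form><input name="station" data-phoneme="e/k/i">'
        '<input name="qty" data-phoneme="s/a/ts/u"></form>')
extractor = PhoneticAttrExtractor()
extractor.feed(page)
```

After `feed`, `extractor.dictionary` holds the phonetic strings keyed to their input items, ready to serve as recognition targets in the matching step.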
[0155] Next, a step of waiting for the speaker's voice input (S504) is performed; upon the start of voice input, a step (S505) of extracting feature quantities is carried out by the arithmetic unit; and a step (S506) of converting the feature quantities into identifiers by recognizing phonetic symbols is carried out based on a phonetic symbol recognition program including phoneme recognition and/or phoneme-piece recognition. It is generally known that this step (S506) may use distance evaluation functions, statistical test methods, learning results from multivariate analysis, or algorithms such as HMMs. A phonetic symbol string is then formed as a time-series sequence of the recognized phonetic symbols.
[0156] Next, a step (S507) of evaluating the degree of match between phonetic symbol strings is performed by comparing the constructed phonetic symbol string with the recognition dictionary information 206 of phonetic symbol strings extracted from the attributes attached to the markup language tags, searching within the recognition dictionary information 206, and evaluating whether the input is valid as a recognition target. The comparison used to judge whether something is a recognition target may use any algorithm available for symbol string comparison and evaluation, such as DP, HMMs, or automata; these may also be multiplexed to realize hierarchical recognition, and a variety of such methods have been invented and devised.
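One way to realize the match-evaluation step (S507) with a standard-library tool is shown below; `SequenceMatcher` stands in for the DP/HMM/automaton comparison named in the text, and the 0.7 threshold is an arbitrary assumption for the validity test.

```python
from difflib import SequenceMatcher

def best_match(recognized, dictionary, threshold=0.7):
    # Score each registered phonetic string against the recognized string
    # and accept the best candidate only if it clears the validity threshold.
    score, cand = max((SequenceMatcher(None, recognized, c).ratio(), c)
                      for c in dictionary)
    return cand if score >= threshold else None
```

Returning `None` models the case where the utterance is rejected as not being a valid recognition target.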
[0157] As a result, based on identification information such as the character string or ID associated with the phonetic symbol string identified from the recognition dictionary information 206, a character string may be displayed, arbitrary processing executed, information exchanged, events generated, statuses changed, or arbitrary operations performed by a machine; recognition processing using phonetic symbols is thereby realized, and by carrying out the step (S508) of executing arbitrary processing, information processing using speech that differs from conventional grammar-dependent or statically registered-word-dependent approaches becomes feasible.
[0158] At this time, by holding multiple sets of the recognition dictionary information 206 of phonetic symbol strings and switching the recognition dictionary information 206 used in the match evaluation step (S507) based on the type information for discriminating the input items detected by the tag attribute detection means, the phonetic symbol strings that become recognition targets can be limited. When the degree of match is evaluated by symbol string comparison between the phonetic symbol strings registered in the recognition dictionary information 206 selected according to the attribute of the input item being recognized and the phonetic symbol recognition result obtained from the speech waveform, this limitation improves recognition efficiency.
[0159] When the information processing apparatus switches the recognition dictionary information 206 that evaluates the voice input according to the item to be entered, the recognition dictionary information 206 for the attribute name and for the words associated with the attribute is selected appropriately. If the information obtained from the attribute is "book", a recognition dictionary using phonetic symbol strings of counter words such as the unit "satsu" (冊; s/a/ts/u | v/o/ly/u/m), together with the numerals that accompany such counters, is selected as the search target for the recognized phonetic symbol string. If the information obtained from the attribute is "station name", a recognition dictionary using phonetic symbol strings associated with the suffix "eki" (駅; e/k/i | s/u/t/e/i/sh/o/N) and with the group of nouns used as station names is selected as the search target. If the information obtained from the attribute is a postal code or telephone number, a recognition dictionary simply using the phonetic symbol strings of numerals is selected as the search target. By restricting the recognition targets to the groups of nouns contained in a particular framework, the apparatus switches among the multiple sets of recognition dictionary information 206 according to the attribute associated with the item the user is to enter, and recognition performance can be improved by classifying, based on attributes, the recognition dictionary information 206 searched for the recognized phonetic symbol strings.
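The attribute-driven dictionary switching described above reduces to a small routing table; the attribute keys and dictionary names below are illustrative assumptions, not taken from the patent.

```python
# Hypothetical routing from an extracted attribute to the recognition
# dictionaries (206) searched for that input item.
ATTRIBUTE_DICTS = {
    "book":    ["counter_satsu", "numerals"],    # counter word + numbers
    "station": ["suffix_eki", "station_nouns"],  # "eki" suffix + station names
    "zipcode": ["numerals"],                     # digits only
}

def dictionaries_for(attribute):
    # Unknown attributes fall back to an unrestricted general dictionary.
    return ATTRIBUTE_DICTS.get(attribute, ["general"])
```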
[0160] Alternatively, the information processing device may output by voice the attribute names or the words associated with the attributes, following the order in which voice input is performed for the items to be entered or the selection of items not yet entered, thereby switching among the multiple sets of recognition dictionary information 206 while prompting the user for the item to be entered, and improving recognition performance on the basis of the classifying attributes.
[0161] Then, if speech continues, the step of repeating the processing so far (S509), the processing step accompanying recognition (S508), and the step of evaluating whether to acquire the next content or markup language in response to status changes within the device caused by other external operations (S510) are carried out, and the processing ends depending on the situation.
[0162] Note that status changes within the device are easiest to understand if one assumes use in a multi-threaded or event-driven program: when the device functions as part of a multi-threaded program or of another program, the status values are changed by other programs and processes. Likewise, the optional processing of the present invention may rewrite the status or raise events on behalf of other processes and programs.
[0163] Furthermore, using the method of the present invention, various character strings — character strings displayed by markup-language designation, character strings associated with displayed images or image features, and character strings associated with output audio such as speech or music, or with their acoustic features — can be converted into phonetic symbol strings and registered in a dictionary. Any displayed information can then be detected through a phonetic symbol string search against the user's input speech or text, and information can be provided, for example through user operations, on the basis of content-related information associated with the detected information, of information about the locations of arbitrary information such as advertisements, videos, and links, or of music and speech. These inputs need not be made by speech or text entry alone; they may also be made by selecting a character string from a list such as a menu, or by using the character string of a button label in a button operation.
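One way to read paragraph [0163] is as a reverse index from phonetic symbol strings to displayed information and its associated content. A minimal sketch follows; the conversion table, the content records, and the helper names are invented for illustration.

```python
# Hypothetical grapheme-to-phonetic conversion table.
PRONUNCIATIONS = {"lily": "l/i/l/i/y", "rose": "r/o/u/z"}

# Content records: displayed string -> associated information (link, ad, ...).
CONTENT = {"lily": {"image": "./flower_lily1.jpg", "link": "/flowers/lily"}}

def register(display_strings):
    """Convert each displayed string to a phonetic string and index it."""
    index = {}
    for text in display_strings:
        phonetic = PRONUNCIATIONS.get(text)
        if phonetic is not None:
            index[phonetic] = text
    return index

def lookup(index, recognized_phonetic):
    """Map a recognized phonetic string back to its content record."""
    text = index.get(recognized_phonetic)
    return CONTENT.get(text) if text else None

index = register(CONTENT.keys())
print(lookup(index, "l/i/l/i/y"))  # the lily content record
```

The same index serves both speech and non-speech entry: a menu selection or button label yields a character string, which is converted through the same pronunciation table before the lookup.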
[0164] Also, as in the example below:

(Example)
<img href="./flower_lily1.jpg"
     recog_dic_type="flower_name" recog_dic_url="./flower.prono"
     name="lily" prono="l/i/l/i/y">

When reading phonetic symbols according to the attributes of a tag that relate to a phonetic symbol dictionary, an attribute such as "recog_dic_url", which gives the position or location of the phonetic symbol recognition dictionary or phonetic symbol string dictionary through information such as a URL, URI, IP address, or directory path, may be used, and an attribute such as "recog_dic_type", which indicates the kind of object the dictionary recognizes, may be used to distinguish phonetic symbol dictionaries that are reused frequently. In this way, dictionary information consisting of phonetic symbol strings, and acoustic characteristic template dictionary information for recognition, can be provided in association with attributes acquired from the markup language. [0165] In addition, dictionary information read in the past may be retained to some extent by the method generally called caching, so that when the above attributes indicate a particular word range the priority of that dictionary is raised and rereading it is avoided. A phonetic symbol dictionary may be read in per page as a separate file, like a style sheet, and 
incorporated by association through an ID; it may be described in a header block and associated by ID; it may be incorporated as an attribute given to each tag; or it may be included in the header information when reading via a file or a communication line. It can also be used as a phonetic symbol string template dictionary.
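The tag attributes in the example above could be consumed roughly as follows. This is a sketch only: the attribute parsing is simplified to a regular expression, and the cache policy is an assumption rather than something the patent specifies.

```python
import re

# Very simplified attribute extraction for tags like the <img ...> example.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

_dictionary_cache = {}  # url -> loaded dictionary (the caching of [0165])

def load_dictionary(url: str) -> dict:
    """Load a phonetic symbol dictionary, reusing a cached copy if present."""
    if url not in _dictionary_cache:
        # A real implementation would fetch the file at `url` here.
        _dictionary_cache[url] = {"source": url, "entries": {}}
    return _dictionary_cache[url]

def process_tag(tag: str):
    """Read the dictionary-related attributes of one tag."""
    attrs = dict(ATTR_RE.findall(tag))
    dictionary = load_dictionary(attrs["recog_dic_url"])
    # Register the per-tag pronunciation attribute into the dictionary.
    dictionary["entries"][attrs["prono"]] = attrs["name"]
    return attrs["recog_dic_type"], dictionary

tag = ('<img href="./flower_lily1.jpg" recog_dic_type="flower_name" '
       'recog_dic_url="./flower.prono" name="lily" prono="l/i/l/i/y">')
kind, d = process_tag(tag)
print(kind, d["entries"])  # flower_name {'l/i/l/i/y': 'lily'}
```

Because `load_dictionary` keys the cache by URL, every tag that names the same `recog_dic_url` shares one dictionary object, which is the reuse the caching paragraph describes.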
[0166] Furthermore, a phonetic symbol string may be embedded in speech waveform information using "acoustic OFDM", which can embed text data in audio, and the embedded phonetic symbol string or related markup-language information may be restored and used, for example, to search for phonetic symbols within the audio data or to display related information as subtitles. Phonetic symbol strings demodulated from extremely common audio data such as radio and television broadcasts can therefore also be used for searching.
[0167] In addition, for a database indexed by the phonetic symbol strings searched using a phonetic symbol string obtained by phonetic symbol recognition, the query may be a combination based on the logical relations among phonetic symbol strings derived from multiple keywords, or a configuration whose logic can be expressed by a Boolean model; queries composed of such combinations can be submitted to the database to obtain search results.
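The Boolean-model retrieval of [0167] can be sketched with set operations over an inverted index keyed by phonetic symbol strings. The index contents here are invented for illustration.

```python
# Hypothetical inverted index: phonetic keyword string -> document ids.
INDEX = {
    "e/k/i":        {1, 2, 5},
    "t/o/u/ky/o/u": {2, 5, 7},
    "o/o/s/a/k/a":  {3, 5},
}

def query_and(*terms):
    """Documents containing every phonetic keyword (Boolean AND)."""
    sets = [INDEX.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def query_or(*terms):
    """Documents containing at least one phonetic keyword (Boolean OR)."""
    result = set()
    for t in terms:
        result |= INDEX.get(t, set())
    return result

print(query_and("e/k/i", "t/o/u/ky/o/u"))       # {2, 5}
print(query_or("t/o/u/ky/o/u", "o/o/s/a/k/a"))  # {2, 3, 5, 7}
```

Because the index keys are phonetic symbol strings rather than words, the same query machinery works whether the keywords came from recognized speech or from text converted through a pronunciation dictionary.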
[0168] In this way, unlike the conventional method of probabilistically linking words to groups of speech features with an HMM or the like, the present invention obtains a phonetic symbol string by associating phonetic symbols with speech features through a probability-based distance such as a Bayes discriminant function, and directly associates the obtained phonetic symbol string with word character strings via a markup language. This makes it possible to provide, in a markup language, the dynamic dictionary information needed to constrain the words to be recognized and thereby achieve more efficient recognition than conventional general-purpose recognition. Alternatively, instead of using words directly in queries, phonetic symbol strings, or phoneme and phoneme piece symbol strings, may be used to construct and search a database that uses HMM or DP matching.
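The two stages of [0168] — assigning each speech feature to the nearest phonetic symbol by a probability-based distance, then matching the resulting symbol string against dictionary entries with DP matching — can be sketched as below. The feature templates and dictionary are invented, and a squared Euclidean distance stands in for the Bayes discriminant function the patent names.

```python
# Hypothetical per-phoneme feature templates (e.g. cepstral means).
TEMPLATES = {"a": (1.0, 0.0), "i": (0.0, 1.0), "u": (1.0, 1.0)}

def nearest_phoneme(feature):
    """Assign a feature vector to the closest phoneme template."""
    return min(TEMPLATES, key=lambda p: sum(
        (f - t) ** 2 for f, t in zip(feature, TEMPLATES[p])))

def edit_distance(a, b):
    """DP matching between two phoneme strings (Levenshtein distance)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def recognize(features, dictionary):
    """Phoneme string from features, then closest dictionary word by DP."""
    phonemes = "".join(nearest_phoneme(f) for f in features)
    return min(dictionary, key=lambda w: edit_distance(phonemes, dictionary[w]))

DICTIONARY = {"ai": "ai", "ui": "ui"}  # word -> phoneme string
print(recognize([(0.9, 0.1), (0.1, 0.9)], DICTIONARY))  # ai
```

The DP matching step is what tolerates insertions, deletions, and substitutions in the recognized phonetic string, so the dictionary supplied through the markup language still constrains which words can be returned.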
[0169] Furthermore, not only attributes based on phonetic symbols such as uttered phonemes and uttered phoneme pieces, but also attributes based on image recognition may be used.

Claims

[1] An information processing apparatus comprising:
content information acquisition means for acquiring content information including character information and/or meta information;
recognized phonetic symbol string detection means for detecting, from the content information acquired by the content information acquisition means, a recognized phonetic symbol string consisting of phonetic symbols; and
recognition dictionary information generation means for generating recognition dictionary information using the recognized phonetic symbol string.
[2] An information processing apparatus comprising:
content information acquisition means for acquiring content information including character information and/or meta information;
expansion target character string detection means for detecting an expansion target character string from the content information acquired by the content information acquisition means on the basis of the character information and/or meta information;
phonetic symbol storage means for storing character strings and phonetic symbols in association with each other;
phonetic symbol conversion means for converting the expansion target character string into a recognized phonetic symbol string by referring to the phonetic symbol storage means; and
recognition dictionary information generation means for generating recognition dictionary information using the recognized phonetic symbol string.
[3] The information processing apparatus according to claim 2, further comprising content information storage means for storing the content information by appending thereto the phonetic symbols converted by the phonetic symbol conversion means.
[4] The information processing apparatus according to any one of claims 1 to 3, further comprising transmission means for transmitting, to another information processing terminal, the content information stored by the content information storage means and the recognition dictionary information generated based on that content information.
[5] The information processing apparatus according to any one of claims 1 to 4, further comprising:
voice input means for inputting voice;
feature quantity extraction means for extracting a feature quantity of the voice input by the voice input means;
feature quantity phonetic symbol conversion means for converting the feature quantity extracted by the feature quantity extraction means into phonetic symbols; and
processing execution means for evaluating the phonetic symbols converted by the feature quantity phonetic symbol conversion means against the phonetic symbols constituting the recognized phonetic symbol strings included in the recognition dictionary information, and executing predetermined processing corresponding to the most similar phonetic symbols.
[6] The information processing apparatus according to claim 5, wherein the content information includes phoneme information and/or phoneme piece information, and the processing execution means evaluates the phonetic symbols converted by the feature quantity phonetic symbol conversion means against the phonetic symbols constituting the recognized phonetic symbol strings included in the recognition dictionary information, and presents information to the user by voice utterance corresponding to the most similar phonetic symbols.
[7] The information processing apparatus according to any one of claims 1 to 6, wherein the phonetic symbol is a phoneme or a phoneme piece.
[8] The information processing apparatus according to any one of claims 5 to 7, wherein the executed processing is authentication processing accompanying phoneme recognition.
[9] A program for causing a computer to implement:
a markup language interpretation step of interpreting information described using a markup language, and an attribute acquisition step of acquiring an attribute designated by the interpretation;
a phonetic symbol extraction step of extracting a phonetic symbol string and/or a phoneme string and/or a phoneme piece string associated with the attribute acquired in the attribute acquisition step; and
a dictionary change step of changing, through the phonetic symbol extraction step, the phoneme string dictionary used by the phoneme recognition unit.
[10] A program for causing a computer to implement:
a markup language interpretation step of interpreting information described using a markup language, and an attribute acquisition step of acquiring an attribute designated by the interpretation;
a phonetic symbol extraction step of extracting a phonetic symbol string and/or a phoneme string and/or a phoneme piece string associated with the attribute acquired in the attribute acquisition step;
an information type evaluation step of evaluating the type of information entered by the user based on the attribute acquired in the attribute acquisition step; and
a dictionary change step of changing, through the information type evaluation step, the phoneme string dictionary used by the phoneme recognition unit.
PCT/JP2006/324348 2005-12-15 2006-12-06 Information processing device, and program WO2007069512A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2007550144A JPWO2007069512A1 (en) 2005-12-15 2006-12-06 Information processing apparatus and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2005361670 2005-12-15
JP2005-361670 2005-12-15

Publications (1)

Publication Number Publication Date
WO2007069512A1 true WO2007069512A1 (en) 2007-06-21

Family

ID=38162820

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2006/324348 WO2007069512A1 (en) 2005-12-15 2006-12-06 Information processing device, and program

Country Status (2)

Country Link
JP (1) JPWO2007069512A1 (en)
WO (1) WO2007069512A1 (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10222342A (en) * 1997-02-06 1998-08-21 Nippon Telegr & Teleph Corp <Ntt> Hypertext speech control method and device therefor
JP2001034151A (en) * 1999-07-23 2001-02-09 Matsushita Electric Ind Co Ltd Language learning teaching material preparing device and language learning system
JP2003202890A (en) * 2001-12-28 2003-07-18 Canon Inc Speech recognition device, and method and program thereof
JP2005258198A (en) * 2004-03-12 2005-09-22 Internatl Business Mach Corp <Ibm> Setting device, program, recording medium, and setting method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KATSUURA ET AL.: "Onsei Keyword ni yoru WWW no Browsing", TRANSACTIONS OF INFORMATION PROCESSING SOCIETY OF JAPAN, vol. 40, no. 2, 15 February 1999 (1999-02-15), pages 443 - 452, XP003014472 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009223172A (en) * 2008-03-18 2009-10-01 Advanced Telecommunication Research Institute International Article estimation system
JP2009244432A (en) * 2008-03-29 2009-10-22 Kddi Corp Voice recognition device, method and program for portable terminal
WO2016088241A1 (en) * 2014-12-05 2016-06-09 三菱電機株式会社 Speech processing system and speech processing method
JP2018156060A (en) * 2017-03-17 2018-10-04 株式会社リコー Information processing device, program, and information processing method
JP7035526B2 (en) 2017-03-17 2022-03-15 株式会社リコー Information processing equipment, programs and information processing methods
US11138506B2 (en) 2017-10-10 2021-10-05 International Business Machines Corporation Abstraction and portability to intent recognition
GB2581705A (en) * 2017-10-10 2020-08-26 Ibm Abstraction and portablity to intent recognition
CN111194401B (en) * 2017-10-10 2021-09-28 国际商业机器公司 Abstraction and portability of intent recognition
CN111194401A (en) * 2017-10-10 2020-05-22 国际商业机器公司 Abstraction and portability of intent recognition
WO2019073350A1 (en) * 2017-10-10 2019-04-18 International Business Machines Corporation Abstraction and portability to intent recognition
CN112236816A (en) * 2018-09-20 2021-01-15 海信视像科技股份有限公司 Information processing device, information processing system, and imaging device
CN112236816B (en) * 2018-09-20 2023-04-28 海信视像科技股份有限公司 Information processing apparatus, information processing system, and image apparatus
CN111489735A (en) * 2020-04-22 2020-08-04 北京声智科技有限公司 Speech recognition model training method and device
CN111489735B (en) * 2020-04-22 2023-05-16 北京声智科技有限公司 Voice recognition model training method and device
CN111639219A (en) * 2020-05-12 2020-09-08 广东小天才科技有限公司 Method for acquiring spoken language evaluation sticker, terminal device and storage medium
CN112201238A (en) * 2020-09-25 2021-01-08 平安科技(深圳)有限公司 Method and device for processing voice data in intelligent question answering and related equipment

Also Published As

Publication number Publication date
JPWO2007069512A1 (en) 2009-05-21

Similar Documents

Publication Publication Date Title
CN109086408B (en) Text generation method and device, electronic equipment and computer readable medium
WO2007043679A1 (en) Information processing device, and program
WO2007069512A1 (en) Information processing device, and program
KR102018295B1 (en) Apparatus, method and computer-readable medium for searching and providing sectional video
JP4689670B2 (en) Interactive manuals, systems and methods for vehicles and other complex devices
CN101309327B (en) Sound chat system, information processing device, speech recognition and key words detection
Freitas et al. Speech technologies for blind and low vision persons
CN105224581B (en) The method and apparatus of picture are presented when playing music
US20090083029A1 (en) Retrieving apparatus, retrieving method, and computer program product
US20160328206A1 (en) Speech retrieval device, speech retrieval method, and display device
US20200058288A1 (en) Timbre-selectable human voice playback system, playback method thereof and computer-readable recording medium
JP6122792B2 (en) Robot control apparatus, robot control method, and robot control program
CN103348338A (en) File format, server, view device for digital comic, digital comic generation device
WO2022184055A1 (en) Speech playing method and apparatus for article, and device, storage medium and program product
JP2008234431A (en) Comment accumulation device, comment creation browsing device, comment browsing system, and program
CN109920409B (en) Sound retrieval method, device, system and storage medium
CN112449253A (en) Interactive video generation
KR101410601B1 (en) Spoken dialogue system using humor utterance and method thereof
CN114946193A (en) Customized video production service providing system using cloud-based voice integration
CN112182196A (en) Service equipment applied to multi-turn conversation and multi-turn conversation method
CN116682411A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
US20210337274A1 (en) Artificial intelligence apparatus and method for providing visual information
US20200013409A1 (en) Speaker retrieval device, speaker retrieval method, and computer program product
CN209625781U (en) Bilingual switching device for child-parent education
Beskow et al. A model for multimodal dialogue system output applied to an animated talking head

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2007550144

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06834103

Country of ref document: EP

Kind code of ref document: A1