WO2007069512A1 - Information processing device, and program - Google Patents

Information processing device, and program

Info

Publication number
WO2007069512A1
WO2007069512A1 (PCT/JP2006/324348)
Authority
WO
WIPO (PCT)
Prior art keywords
information
phoneme
string
phonetic symbol
recognition
Prior art date
Application number
PCT/JP2006/324348
Other languages
French (fr)
Japanese (ja)
Inventor
Masayoshi Ihara
Original Assignee
Sharp Kabushiki Kaisha
Priority date
Filing date
Publication date
Application filed by Sharp Kabushiki Kaisha filed Critical Sharp Kabushiki Kaisha
Priority to JP2007550144A priority Critical patent/JPWO2007069512A1/en
Publication of WO2007069512A1 publication Critical patent/WO2007069512A1/en


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19: Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/193: Formal grammars, e.g. finite state automata, context free grammars or word networks

Definitions

  • The present invention relates to an information processing apparatus that uses phoneme recognition and/or phoneme-piece recognition for speech recognition. Background art
  • In a device such as that of Non-Patent Document 1, a word specified in a phoneme recognition dictionary and a device control method registered in the dictionary in association with that word are selected and executed by phoneme recognition processing. The techniques for recognizing phonemes and phoneme pieces themselves have long been known, as shown in Patent Document 1.
  • Patent Document 3 proposes, for speech recognition in HTML (one of the markup languages), making speech operation easier for the user by changing the display representation of recognizable words.
  • Patent Document 4 proposes a method for dynamically acquiring recognition dictionary data based on an acoustic model with a minimum vocabulary.
  • Patent Document 5 proposes, for speech recognition in HTML, designating a range with a specific symbol in order to identify recognizable words, thereby clearly indicating to the user that speech recognition can be performed; convenience is further provided by writing a recognizable reading for words that are difficult to pronounce.
  • Patent Document 1 Japanese Patent Laid-Open No. 62-220998
  • Patent Document 2 Japanese Patent Laid-Open No. 2005-70312
  • Patent Document 3 Japanese Patent Laid-Open No. 11-25098
  • Patent Document 4 Japanese Patent Laid-Open No. 2002-91858
  • Patent Document 5 Japanese Patent Laid-Open No. 2005-18241
  • Non-Patent Document 1 "Research and Development on Life Support Interface for Aged Society", Key Project Research Report, Aomori Prefectural Industrial Research Center, Vol. 5, Apr. 1998 - Mar. 2001, 031
  • Non-Patent Document 2 Masayuki Nakazawa, Takashi Endo, Kiyoshi Furukawa, Jun Toyoura, Takashi Oka (New Information Processing Development Corporation), "Study of speech summaries and topic summaries using phoneme symbol sequences from speech waveforms", IEICE Technical Report, SP96-28, pp. 61-68, June 1996.
  • Non-Patent Document 3 Takashi Oka, Hironobu Takahashi, Takuichi Nishimura, Nobuhiro Sekimoto, Hidehide Mori, Masanori Ihara, Hiroaki Yabe, Hiroaki Hashiguchi, Hiroshi Matsumura, "Pattern search algorithm supporting 'CrossMediator'", Artificial Intelligence study group, volume 1, pages 1-6, Japanese Society for Artificial Intelligence, 2001.
  • The use of phoneme symbol strings in a markup language employs the Segment and MediaLocator structures as descriptions in MPEG-7, which is used for moving picture streams such as MPEG-2: <MediaLocator> locates positions within video content, and the <Series> description method attaches the same kind of metadata at fixed intervals. MPEG-7 audio has a <SpokenContent> DS that describes word and phone ratings from automatic speech recognition results.
  • VoiceXML, a standardized method for speech recognition, implements recognition that depends on a grammar according to the context.
  • However, no method has been proposed for dynamically constructing dictionary information by assigning attributes, using phonetic-symbol identifiers such as phonemes and phoneme pieces, to the target range of arbitrary tags regardless of context or grammar.
  • Phoneme recognition and phoneme-piece recognition differ from general speech recognition: they do not recognize vocabulary by interpreting meaning and content, and they do not dynamically reconfigure themselves according to changes in the features and acoustic models of language models such as words, grammar, and parts of speech. More specifically, phoneme recognition and phoneme-piece recognition do not use a language model related to grammar.
  • Recognition by phonemes and phoneme pieces analyzes the speaker's utterance using a static acoustic model for each phonetic symbol, and evaluates only matches between the recognized phonetic symbol sequence and the phonetic symbol strings in the recognition dictionary. Because the recognition process and the recognition dictionary structure are simple, it is possible to recognize identifier strings consisting of phonetic symbols such as phonemes and phoneme pieces even for unregistered words and exclamations, evaluating only the match of sounds.
  • A dynamic acoustic model that learns according to the speaker's utterance characteristics and improves performance may be used, as in the past; however, unlike general speech recognition, which depends on words and grammar, phoneme recognition and phoneme-piece recognition do not dynamically switch the acoustic model.
  • In a recognition method using phonemes or phoneme pieces, an unregistered word in the recognition target sentence is converted into hiragana notation, the converted hiragana string is further converted into a phoneme string or phoneme-piece string based on the prosody obtained from it, and the resulting symbol strings are temporarily registered in the recognition dictionary. The user's speech is then recognized as a phoneme string or phoneme-piece string, the degree of coincidence between the symbol sequences is measured, and the recognition result is acquired. Speech recognition with dynamic phonemes and phoneme pieces is thus possible with a higher degree of freedom than conventional speech recognition.
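The degree-of-coincidence measurement described above can be sketched as a dynamic-programming edit distance over phoneme symbols. The function names and the sample dictionary below are illustrative assumptions, not part of the original disclosure.

```python
def phoneme_edit_distance(a, b):
    """DP (Levenshtein) distance between two phoneme sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def best_match(recognized, dictionary):
    """Pick the dictionary word whose phoneme string is closest to the input."""
    return min(dictionary, key=lambda w: phoneme_edit_distance(recognized, dictionary[w]))

# Hypothetical temporary recognition dictionary: word -> phoneme sequence
dictionary = {
    "main": ["m", "e", "i", "n"],
    "menu": ["m", "e", "ny", "u"],
}
print(best_match(["m", "e", "i", "N"], dictionary))  # closest entry: "main"
```

A real implementation would weight substitutions by acoustic confusability rather than using a uniform cost of 1.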
  • re-learning for recognition may be performed by reusing the acoustic information obtained by the user's speech ability as teacher information.
  • Exclamations such as "uun" and coined words vary greatly with the times and the environment. In content information in particular, dynamic proper nouns that depend on trends, such as product names and actor names, are inefficient to register in a recognition dictionary, and repeatedly distributing recognition dictionaries that include long-lived acoustic models and grammar models is relatively difficult because of their size. Registering such huge and varied vocabulary in a recognition dictionary, and hence recognition based on vocabulary, was therefore practically impossible.
  • The present invention aims to provide an information processing apparatus and the like that, when speech recognition is performed on a word or character string included in content information, can realize more appropriate speech recognition by using recognition dictionary information based on phonetic symbols, with phonetic-symbol recognition consisting of phonemes and phoneme pieces, even when no word model, acoustic model, grammar model, or part-of-speech information is registered in the speech recognition dictionary.
  • An information processing apparatus includes: a content information acquisition unit that acquires content information including character information and/or meta information; recognized-phonetic-symbol-string detecting means for detecting a recognized phonetic symbol string consisting of phonetic symbols from the content information acquired by the content information acquisition unit; and recognition dictionary information generating means for generating recognition dictionary information using the recognized phonetic symbol string.
  • An information processing apparatus includes: a content information acquisition unit that acquires content information including character information and/or meta information; development-target character string detecting means for detecting a development target character string based on the character information and/or meta information from the content information acquired by the content information acquisition unit; phonetic symbol storage means for storing character strings and phonetic symbols in association with each other; a phonetic symbol conversion unit that converts the development target character string into a recognized phonetic symbol string using the phonetic symbol storage means; and a recognition dictionary information generation unit that generates recognition dictionary information using the recognized phonetic symbol string.
  • The third invention further includes content information storing means for storing the content information with the phonetic symbols converted by the phonetic symbol conversion means added to it.
  • The fourth invention is the information processing apparatus according to any one of the first to third inventions, further comprising a transmitter that transmits, to another information processing terminal, the content information stored by the content information storage means and the recognition dictionary information generated based on that content information.
  • The apparatus further includes: voice input means for inputting voice; feature amount extracting means for extracting feature amounts of the voice input by the voice input means; feature-amount phonetic symbol converting means for converting the feature amounts extracted by the feature amount extracting means into phonetic symbols; and processing execution means for evaluating the converted phonetic symbols against the phonetic symbols constituting a recognized phonetic symbol string included in the recognition dictionary information, and executing a predetermined process corresponding to the most similar phonetic symbol string.
  • In addition, the content information includes phoneme information and/or phoneme-piece information, and the processing execution means evaluates the phonetic symbols converted by the feature-amount phonetic symbol converting means against the phonetic symbols constituting the recognized phonetic symbol string included in the recognition dictionary information, and presents information corresponding to the most similar phonetic symbol string to the user by voice utterance.
  • The seventh invention is characterized in that, in the information processing apparatus according to any one of the first to sixth inventions, the phonetic symbol is a phoneme or a phoneme piece.
  • The eighth invention is characterized in that, in the information processing apparatus according to any one of the first to sixth inventions, the process to be executed is an authentication process accompanying phoneme recognition.
  • A program causes a computer to realize: a markup language interpretation step for interpreting information described using a markup language; an attribute acquisition step for acquiring an attribute specified by the interpretation; a phonetic symbol extraction step for extracting a phonetic symbol string, and/or a phoneme sequence and/or a phoneme-piece sequence, associated with the attribute acquired in the attribute acquisition step; and a dictionary changing step for changing the phoneme string dictionary used in the phoneme recognition unit according to the phonetic symbol extraction step.
  • Another program causes a computer to realize: a markup language interpretation step for interpreting information described using a markup language; an attribute acquisition step for acquiring an attribute designated by the interpretation; a phonetic symbol extraction step for extracting a phonetic symbol string, and/or a phoneme sequence and/or a phoneme-piece sequence, associated with the attribute acquired by the attribute acquisition step; an information type evaluation step for evaluating the type of information input by the user based on the attribute acquired by the attribute acquisition step; and a dictionary changing step for changing the phoneme string dictionary used in the phoneme recognition unit according to the information type evaluation step.
  • To solve the above problems, a phoneme dictionary necessary for recognizing provided content information is included in, or associated with, the content information. The names of phoneme strings, phoneme-piece strings, and various identifiers are described as tag attributes in the markup language, so that the utterance phoneme dictionary can be specified per scene, spanning frames of content images, pages of sentences, frames in sentence structures, single frames of moving images, and multiple frames of moving images.
  • Variables and attributes can be changed depending on the distribution file format and markup language sent to the user, such as HTML, XML, RSS, EPG, BML, MPEG-7, or CSV.
  • Scene names, actor names, and cast names are classified by attributes, variables, and tags using phoneme symbol strings and phoneme-piece symbol strings. A search by any actor or cast name is then possible using phoneme search technology, and because a phoneme string can be acquired per scene from the markup language information, a device capable of arbitrary instructions and searches can be realized, solving the problem.
  • A variable or attribute including a phoneme symbol string is provided in the target link or CGI notation; alternatively, a range surrounded by a specific tag is converted into a phoneme string and embedded as a tag variable or attribute, making it selectable. A variable or attribute may be provided for each table element of a table tag surrounding a product, giving each element tag a name as a variable or attribute with phoneme string symbols; a form tag may be given a phoneme string as a variable or attribute, and based on the given phoneme string, information is transmitted or a transition is made to the next page.
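As a concrete sketch of embedding a phoneme string as a tag attribute, the snippet below rewrites an anchor tag so that the "rpronounce" attribute (the attribute name used later in this description) carries the phoneme string of the link's reading. The kana-to-phoneme table is a minimal illustrative assumption, not a complete conversion dictionary.

```python
# Minimal sketch: embed a phoneme string as an "rpronounce" attribute.
# The conversion table is an illustrative fragment, not a full dictionary.
KANA_TO_PHONEME = {"め": "m/e", "い": "i", "ん": "N"}

def to_phoneme_string(kana: str) -> str:
    """Convert a hiragana reading into a slash-separated phoneme string."""
    return "/".join(KANA_TO_PHONEME[ch] for ch in kana)

def embed_pronounce(tag: str, text: str, href: str, reading: str) -> str:
    """Emit a link tag whose rpronounce attribute holds the phoneme string."""
    phonemes = to_phoneme_string(reading)
    return f'<{tag} href="{href}" rpronounce="{phonemes}">{text}</{tag}>'

html = embed_pronounce("A", "メイン", "main.html", "めいん")
print(html)  # <A href="main.html" rpronounce="m/e/i/N">メイン</A>
```

A receiving terminal can then match the user's recognized phoneme string directly against the attribute value and follow the link, with no word-level dictionary needed.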
  • With RSS, phoneme strings and phoneme-piece sequences may be distributed, or IDs may be associated with keywords using tags, with the IDs used as recognition dictionary information associated with phoneme strings or phoneme-piece sequences.
  • Further, by associating an image recognition dictionary, such as one for faces or fingerprints, with phonetic symbol strings of phonemes or phoneme pieces, and by associating the recognition dictionary with an acoustic model based on phonemes or phoneme pieces for each speaker, individual recognition using secret words may be performed.
  • The contents of the recognition dictionary based on phonemes and phoneme pieces are acquired externally as markup language attributes, arbitrary tags, and dictionary files, enabling operation of the information processing apparatus and solving the problems.
  • This can be used for device control by the system, for highly versatile personal authentication (since the authentication conditions accompanying images and sounds can be changed depending on the type of information), and for information exchange between information processing apparatuses. An easy-to-use user interface can be realized by adding phoneme and phoneme-piece notation to existing markup language and content, and by using dictionary information with phonemes and phoneme pieces attached to, or associated with, them.
  • FIG. 1 is a block diagram of an information processing apparatus using the present invention.
  • FIG. 2 is a diagram showing an example of the data structure of recognition dictionary information.
  • FIG. 3 is a diagram showing an operation flow of a phonetic symbol assignment process.
  • FIG. 4 is a diagram for explaining the operation of a phonetic symbol assignment process.
  • FIG. 5 is a diagram for explaining the operation of a phonetic symbol assignment process.
  • FIG. 6 is a diagram for explaining the operation of a phonetic symbol assignment process.
  • FIG. 7 is a diagram for explaining the operation of a phonetic symbol assignment process.
  • FIG. 8 is a diagram showing an operation flow of recognition dictionary update processing.
  • FIG. 9 is a diagram showing a different data structure of recognition dictionary information.
  • FIG. 10 is a diagram showing an operation flow of recognition dictionary information update processing.
  • FIG. 11 is a diagram for explaining the operation of recognition dictionary information update processing.
  • FIG. 12 is a diagram for explaining the operation of recognition dictionary information update processing.
  • FIG. 13 is a diagram for explaining the operation of recognition dictionary information update processing.
  • FIG. 14 is a diagram for explaining the operation of recognition dictionary information update processing.
  • FIG. 15 is a diagram showing an operation flow when applied to a server-client model.
  • FIG. 16 is a diagram showing an operation flow when applied to a server-client model.
  • FIG. 17 is a diagram for explaining a modification in the present embodiment.
  • FIG. 18 is a diagram for explaining a modification of the present embodiment.
  • The present invention can be configured as an information processing apparatus that changes the markup language notation used for content information, stores it, and uses the changed information; as a distribution apparatus that distributes the changed information; and as a receiving terminal that receives the information and uses it for recognition, operations, and responses. More specifically, as shown in the XML and HTML examples, information written in an existing markup language is changed, tags are added, variables and attributes are added, saved, changed, and delivered, and the information processing apparatus is operated by receiving such information.
  • The contents may include movies, dramas, photographs, news reports, illustrations, paintings, music, promotional videos, novels, magazines, games, papers, textbooks, dictionaries, books, comics, catalogs, posters, broadcast program information, and other generally well-known forms.
  • Information provided by camera, microphone, or sensor input, the names of such information, states and situations, abstract concepts, and superordinate and subordinate concepts may also be included in the developed information.
  • The information may also be a time-series change of video, a time-series change of speech, a sentence anticipating a time-series change of the reader's reading position, electronic information in HTML markup language notation, a search index generated from them, a read-aloud position, and so on.
  • Punctuation marks, sentences, chapters, and paragraphs may be captured as frames.
  • The content also includes meta information attached to it: documents as text information; EPG and BML as program information; musical scales as scale information; general still and moving images; and polygon data, vector data, texture data, and motion data as 3D information. In short, it covers visual information, auditory information, text information, sensor information, and the like.
  • A character string is expanded into a phoneme symbol string or phoneme-piece symbol string and can be used for recognition with phoneme or phoneme-piece symbols. Whether the user's utterance is consistent with the recognized phoneme symbol string or phoneme-piece symbol string is also evaluated; alternatively, the spoken phonemes may be converted into some phonetic characters and the phonetic characters matched against each other, or the phonetic symbol string based on the recognition result of the user's utterance may be evaluated as the user's operation target or search target.
  • Any character or symbol described in ideograms, such as an at sign or a bracket, may be converted into a phoneme or phoneme-piece symbol string using an appropriate phonetic symbol string; for a character string for which multiple utterances can be estimated, multiple phoneme strings, phoneme-piece strings, and syllable symbol strings may be given, as in conventional speech recognition.
  • The recognized phoneme sequence or phoneme-piece sequence is given to a database as a query and searched by a symbol string matching method such as DP or HMM. Phoneme strings or phoneme-piece sequences are added to the search results, the results are presented as a browsable list, a product is selected based on the phoneme strings included in the search results, and phoneme strings and phoneme-piece sequences are acquired from the recognition dictionary to perform charging and purchase procedures by the acquired control method.
  • By combining a phoneme recognition dictionary composed of uttered speech features with recognition dictionaries composed of image features such as fingerprints, irises, faces, and palm prints, search, browse, sales, authentication, and billing procedures can be realized.
  • As for the phoneme symbol string, in the case of English it can be converted into a phoneme symbol string using English phoneme symbols or phonetic symbols, or converted using international phonetic symbols.
  • In phonetic dictionaries for various languages, it is acceptable to use phoneme and phoneme-piece symbols suitable for each language. By treating phonemes as identifiers that dissect phonetic symbols in time series, and expressing each phonetic symbol as an appropriate character code associated with a number, information can be distributed using a markup language based on any phonetic symbols.
  • A phoneme symbol string may be converted into a phoneme-piece symbol string to improve search convenience. Environmental sound identifiers, scale identifiers, image identifiers, and motion identifiers may also be used: a section for an ambient sound or scale lattice, or for image or motion identifiers, may be provided in the MPEG stream, and a phoneme sequence or phoneme-piece sequence given based on the speech related to the names of these identifiers.
  • the information processing apparatus 1 is an apparatus that is realized by each information processing device such as a general-purpose computer, a dedicated terminal, and a portable mobile terminal.
  • the information processing apparatus 1 includes a control unit 10, a storage unit 20, a communication unit 30, an input / output unit 40, an operation unit 50, and a display unit 60.
  • Each functional unit is connected to the control unit 10 via a bus.
  • the operation unit 50 and the display unit 60 may be arbitrarily removable devices.
  • the communication unit 30 is a functional unit for exchanging information with other devices via a LAN (Local Area Network) or a communication network such as the Internet.
  • The communication unit 30 is generally configured by a device that can transmit and/or receive content information, such as Ethernet (registered trademark), a modem, a wireless LAN, or a cable television device.
  • The input/output unit 40 is a functional unit for inputting and outputting information to and from other devices or the outside, and includes, for example, input devices such as a microphone, scanner, capture board, camera, and sensors, and output devices such as a speaker, printer, modeling device, and display device.
  • the storage unit 20 is a functional unit that acquires and stores information in the information processing apparatus 1 and stores a program executed by the control unit 10.
  • the storage unit 20 includes a ROM and RAM as semiconductor storage elements, a hard disk and magnetic tape as magnetic storage media, a CD (Compact Disk) and a DVD (Digital Versatile Disk) as optical storage media, and the like.
  • The storage unit 20 stores content information 202, a phonetic symbol conversion table 204, recognition dictionary information 206, a phonetic symbol addition program 208, a recognition dictionary information update program 210, and a voice operation program 212.
  • the content information 202 stores content acquired from the outside via the communication unit 30 and content input via the input / output unit 40.
  • The phonetic symbol conversion table 204 is a table referred to when content information is converted into phonetic symbols; for example, it stores character strings in association with phonetic symbols such as phonemes.
  • The recognition dictionary information 206 stores the relationship between a word and its phoneme string, phoneme-piece string, and the like (hereinafter these are indicated as phonetic symbols). For example, as shown in Fig. 9, the item "Title", the target word "discount campaign", and the phoneme string (phonetic symbols) "/t/o/k/u/ky/a ..." expanded from the target word are stored in association with each other.
  • In the recognition dictionary information 206, proper nouns such as product names and product nicknames, and phoneme sequences based on product nicknames, may be registered in addition to the items of a general language dictionary. A recognition dictionary realizing various recognitions is constructed by dynamically exchanging, as phoneme strings or phoneme-piece sequences, words containing exclamations and mispronunciations that are not registered in a general dictionary.
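The recognition dictionary entries described above can be modeled as small records that support dynamic registration of unregistered words. The field names and the sample entry below are illustrative assumptions based on the Fig. 9 example, not the disclosed data format.

```python
from dataclasses import dataclass, field

@dataclass
class DictionaryEntry:
    """One entry of recognition dictionary information 206 (illustrative fields)."""
    item: str            # e.g. "Title"
    target_word: str     # e.g. a product nickname
    phonemes: list       # phoneme string expanded from the target word

@dataclass
class RecognitionDictionary:
    entries: list = field(default_factory=list)

    def register_temporary(self, item, word, phonemes):
        """Dynamically register an unregistered word as a phoneme sequence."""
        self.entries.append(DictionaryEntry(item, word, phonemes))

d = RecognitionDictionary()
d.register_temporary("Title", "tokuten", ["t", "o", "k", "u", "t", "e", "N"])
print(len(d.entries))  # 1
```

Because entries are matched only by phoneme sequence, exchanging or removing entries at runtime does not require touching any word-level language model.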
  • the operation unit 50 is a functional unit that receives operation inputs from the user and includes an input device that inputs information associated with operations such as a keyboard, a mouse, a camera, and a remote controller (including wireless).
  • The display unit 60 is a functional unit that outputs information from the information processing apparatus 1 so that the user can visually recognize it, and is configured using a display device that performs display related to operations, such as a display or a projector.
  • The control unit 10 executes processing for realizing the function corresponding to each program and controls each functional unit of the information processing apparatus 1. The control unit 10 reads the phonetic symbol addition program 208 from the storage unit 20 and executes it, thereby realizing the phonetic symbol addition processing described later. Similarly, the recognition dictionary information update processing described later is realized by reading and executing the recognition dictionary information update program 210, and voice operation processing is realized by reading and executing the voice operation program 212.
  • The control unit 10 executes programs for phoneme and phoneme-piece recognition processing, and acquires tag information, tag identifiers, phoneme sequences, phoneme-piece sequences, and the phonemes of the user's uttered speech. Words may be selected by evaluating the similarity between the phoneme sequence obtained by phoneme recognition and the phoneme sequences associated with dictionary registration information, using speech waveforms input through the input/output unit with a microphone; this may be used for speech recognition, or information may be provided to the user by speech synthesis, through a speaker, using a phoneme sequence or phoneme-piece sequence acquired by the present invention.
  • The control unit 10 is normally configured using a CPU (Central Processing Unit), a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), or the like, and can be realized by arbitrarily combining them.
  • FIG. 3 is an operation flow explaining the phonetic symbol addition processing, which is realized by the control unit 10 reading and executing the phonetic symbol addition program 208 in the storage unit 20.
  • The control unit 10 acquires content information 202 that has been received by the communication unit 30, or input through the input/output unit 40, and stored (step S301).
  • a development target character string is detected from the read content information 202 (step S302).
  • Here, the development target character string is a character string (information) for identifying a change in the display control method.
  • For example, it is an <A> tag indicating a link or a <TITLE> tag indicating a title.
  • A target character string to be expanded into phonetic symbols such as phonemes and phoneme pieces is detected in the range between such tags.
  • Next, the expansion target character string is expanded into a phoneme string or phoneme-piece string (phonetic symbols) accompanying its utterance (step S303).
  • For example, the title and the name of the link destination are converted into phonetic symbols.
  • Alternatively, a character string is acquired by referring to other attributes included in the tag, such as the ALT attribute and the ID attribute, and converted into a phoneme string or phoneme-piece string.
  • As methods of constructing the phonetic symbol strings to be registered in the dictionary, arbitrary approaches are conceivable: constructing a phoneme string or phoneme-piece sequence from the tag attributes and the text sandwiched between tags; using the phoneme string or phoneme-piece sequence between the tags; or, using the link information associated as an attribute, constructing the phoneme string or phoneme-piece sequence from the name of the linked file and the character information contained in that file.
• The character strings are converted into phonetic symbols using the phonetic symbol conversion table 204.
• For example, a character string "main" surrounded by title tags is converted into the phonetic symbol string "m/e/i/n/" by referring to the phonetic symbol conversion table 204.
• A recognition phonetic symbol string may be configured from such symbol strings.
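As a minimal sketch of the table lookup described above (the table contents and function name are illustrative assumptions, not the patent's actual conversion table 204):

```python
# Hypothetical sketch of the phonetic symbol conversion table 204: a mapping
# from words to phoneme lists. The entries below are assumptions for
# illustration only.
PHONETIC_TABLE = {
    "main": ["m", "e", "i", "n"],
    "shousai": ["sh", "o", "u", "s", "a", "i"],
}

def to_phoneme_string(word: str) -> str:
    """Expand a word into a slash-delimited phoneme string such as 'm/e/i/n/'."""
    phonemes = PHONETIC_TABLE[word.lower()]
    return "/".join(phonemes) + "/"
```

For example, `to_phoneme_string("main")` yields `"m/e/i/n/"`, matching the title expansion above.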
• When step S401 for acquiring the content information is performed, the meta information accompanying the markup-language information as shown in FIGS. 4 to 7 is obtained.
• Step S402 then detects the "pronounce" attribute in the meta information, extracts the phonetic symbol string composed of phonemes and phoneme pieces given as the value of the detected "pronounce" attribute, and registers the extracted phonetic symbol string as dictionary information.
• In step S403, a table is constructed that expresses the utterance sounds usable by a voice operation program or recognizable by the information processing device.
• By specifying processing content or a transition destination page with tags or CGI as meta information associated with the symbol string, an arbitrary process, procedure, or operation can be designated, realizing recognition using dynamically defined phonetic symbol strings.
• Next, the phonetic symbol storage process is executed, and the recognition phonetic symbol strings used for phonetic symbol recognition are stored (step S304).
• The phonetic symbol storage process stores the phonetic symbols converted for recognition in step S303. For example, it comprises: processing that extracts phonetic symbols already recorded as attributes in each tag and adds a phonetic symbol string (phoneme string or phoneme piece string), obtained by expanding the character string between the tags, as a new attribute (step S304a); processing that adds tags and attributes indicating that each tag is a speech recognition target (step S304b); and processing that separates out proper nouns to be recognized, converts them into phonetic symbols,
• and configures the recognition phonetic symbol strings, thereby constructing and updating the recognition dictionary information 206 (step S304c). As a result, the content information is associated with the recognition phonetic symbol strings, that is, the phoneme strings and phoneme piece strings of the words to be used for recognition.
• The control unit 10 updates and stores the changed content information 202, and likewise updates and saves the recognition dictionary information for phonetic symbol recognition using phonemes and phoneme pieces, composed of the associated recognition phonetic symbol strings (step S305). This makes it possible to use the changed content information for recognition of user utterances and for distribution via the communication unit.
• Although the processing described above is explained as being executed by the information processing device 1, it may instead be executed on the side of the distribution device (server) that distributes the content information,
• so that the processing burden associated with conversion to phoneme strings on the receiving side is reduced.
• In that case, the distribution device distributes content information accompanied by voice control information in response to a content information request from a user. The information processing device 1 (terminal device) can therefore acquire phoneme information classified by content page and frame, and can use voice operation of arbitrary words with fewer restrictions.
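The detection and expansion of steps S302 to S303 can be sketched as follows; the tag set, regular expression, and converter are assumptions for illustration, not the patent's implementation:

```python
# Sketch of steps S302-S303: find expansion target character strings (here,
# the text inside TITLE and A tags) and map each to a phoneme string produced
# by a caller-supplied converter. Tag choice and regex are assumptions.
import re

def build_recognition_dictionary(markup: str, convert) -> dict:
    """Return {target string: phoneme string} for each detected tag."""
    dictionary = {}
    for match in re.finditer(r"<(title|a)\b[^>]*>([^<]+)</\1>", markup, re.I):
        text = match.group(2).strip()
        dictionary[text] = convert(text)
    return dictionary
```

A real implementation would use a proper markup parser rather than a regular expression, but the flow is the same: detect the target range, expand it, and register the result.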
• FIG. 4 is a diagram showing the state of the content information 202 acquired by the information processing device 1.
• The content information 202 is acquired from the communication unit 30 or the input/output unit 40 and stored in the storage unit 20.
• In step S302, information related to the tags subject to expansion into phonetic symbols (phoneme strings and phoneme piece strings) is detected (step S302).
• The information in FIG. 4 is an example of content information using the item section of RSS; when the extraction process is executed in step S302,
• the target character string is extracted from the item section and converted, as shown in FIG. 5 or FIG. 6.
• A tag to be expanded is detected from the tags included in the acquired content information, and the character string in the range specified by that tag is detected.
• For example, the string "profit campaign" sandwiched between the tags "<title>" and "</title>" denoting the title is detected as the character string to be expanded into a phonetic symbol string.
• By extracting this character string, an arbitrary title character string specified by the distribution side can be acquired; unnecessary parentheses and the like may be deleted.
• In step S304a, a new phonetic symbol string is generated and added as an attribute or variable to the tag described in the original content information 202 (step S304a).
• In step S304c, processing such as the following is executed: newly setting <pronounce>...</pronounce> tags; associating the words and instructions recognized as phonetic symbols and saving them as the recognition dictionary information 206; and recording the acquisition location of the recognition dictionary information 206 in the content information 202 as a URL using a <META> tag or the like (step S304c). This makes it possible to add the phonetic symbol string information used for phonetic symbol recognition to the content information and associate it with that information.
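Step S304a, adding the expanded phoneme string back into the tag as a new attribute, might look like the following sketch (the regex, attribute name, and tag handling are simplified assumptions):

```python
# Sketch of step S304a: expand the character string between a pair of tags
# and record the result as a new "pronounce" attribute on the opening tag.
# Only simple <tag>text</tag> pairs are handled in this illustration.
import re

def add_pronounce_attribute(markup: str, tag: str, convert) -> str:
    pattern = re.compile(rf"<({tag})>([^<]+)</\1>", re.I)

    def repl(m):
        name, text = m.group(1), m.group(2)
        return f'<{name} pronounce="{convert(text)}">{text}</{name}>'

    return pattern.sub(repl, markup)
```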
• The phonetic symbols may be phonemes, phoneme pieces, or other pronunciation symbols.
• For example, a "<PronounceDS>" tag may be added to describe a phoneme string for the content as a whole, or an identifier may describe that "rain sound" occurs as an environmental sound in the background,
• or the phonetic symbol string pronounce="t/o/m/u" may be given as an attribute, placed in a phrase tag, for the notation related to a performer, as the cast name.
• For HTML, examples of configurations associated with buttons and links are presented; treating the range between arbitrary tags as a searchable keyword is useful for browsing and searching content and for pronunciation-based operations.
• The acquired phonemes or phoneme pieces and the resulting phonetic symbol strings may also be stored in the word dictionary of utterance phonemes or utterance phoneme pieces used for speech synthesis in the information processing device 1.
• In this way, by adding the phonetic symbols for performing voice operations based on the acquired content information, content information can be configured that includes the phonetic symbol string information to be incorporated into the dictionary used for phonetic symbol recognition.
  • FIG. 8 is a diagram illustrating an operation flow related to the recognition dictionary information update process, which is a process realized by the control unit 110 executing the recognition dictionary information update program 210 in the storage unit 120.
  • the control unit 10 acquires the content information 202 (step S401).
  • the control unit 10 extracts a phonetic symbol string from the read content information 202 (step S402).
• Specifically, a tag, that is, a portion between "<" and ">" included in the content information 202, is examined; a tag containing a phonetic symbol string is identified and the string is extracted.
• For example, the control unit 10 extracts the "pronounce" attribute of the title tag "<TITLE>",
• and takes the phoneme symbol string given as its argument as the phonetic symbol string.
• The extracted phoneme string is stored as the page title and registered in the recognition dictionary information 206 (step S403).
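The extraction of step S402 can be sketched as a simple attribute scan (the attribute syntax assumed here is `pronounce="..."`; real markup parsing would use a proper parser):

```python
# Sketch of step S402: pull the phonetic symbol string out of a tag's
# "pronounce" attribute. Returns None when the attribute is absent.
import re

def extract_pronounce(tag_text: str):
    m = re.search(r'pronounce\s*=\s*"([^"]+)"', tag_text, re.I)
    return m.group(1) if m else None
```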
• When no phoneme string or phoneme piece string is described in the distributed content information, and no related phoneme dictionary is associated with it, phoneme and phoneme piece symbol strings are constructed from the content according to the embedding procedure described above, and dictionary information is constructed. The constructed dictionary information is checked for whether the same word is already registered, and reused where possible.
• Because the phoneme symbol string of a control command does not change when the control recognition dictionary is configured, a dictionary that associates an ID for the control command, the command word, and the phoneme string, as shown in FIG., can be used to identify the command word for control; the content information is distributed or recorded on a storage medium with this ID used as a command discrimination ID. In the information associated with the content information acquired from the communication unit or storage medium, the instruction word is identified from the command discrimination ID described where the phoneme or phoneme piece information would otherwise be described, and from the identified instruction word
• the phoneme string or phoneme piece string is constructed by a conversion function and used for recognition. Alternatively, a hash value based on the phoneme string or phoneme piece string associated with the control command may be used as the command discrimination ID. This makes it possible to shorten the phoneme string expressions during transmission, which tend to be redundant, and to improve communication efficiency.
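The hash-based shortening mentioned above might be sketched as follows; the hash function and truncation length are assumptions:

```python
# Sketch of deriving a command discrimination ID from a phoneme string via a
# hash, so the redundant full phoneme string need not be transmitted.
import hashlib

def command_id(phoneme_string: str, length: int = 8) -> str:
    """Fixed-length hex ID derived deterministically from a phoneme string."""
    digest = hashlib.sha1(phoneme_string.encode("utf-8")).hexdigest()
    return digest[:length]
```

Both sides can compute the same ID from the same phoneme string, so only the short ID needs to travel over the line.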
• If the content information 202 obtained via a storage medium or the communication means and stored in the storage unit has not been converted to or supplemented with phonetic symbols, it is interpreted and converted by the method described above. If the information of the content information 202 has already been written, converted, or updated with identifier strings supported by the information processing device 1, there is no need to convert or update it again.
• These conversions may be performed on the server side according to the situation of the content information distributor or user, performed by the client converting what it receives as appropriate, performed on information from an external storage medium by the device itself
• so that the acquired information can be used by the device, or performed using a relay means such as a gateway or router.
• The control unit 10 acquires content information obtained by the communication unit 30 or the input/output unit 40, or content information 202 stored in the storage unit 20 (step S501).
  • phonetic symbols composed of phonemes and phoneme pieces are extracted from the acquired content information (step S502).
  • the recognition dictionary information 206 is updated and registered based on the extracted phonetic symbols (step S503).
• In step S504, the device waits until there is a voice input from the input/output unit 40 based on an utterance from the user (step S504; No).
• When voice is input, the control unit 10 extracts feature quantities from the input user's voice (step S505).
• Phonetic symbols such as phonemes and phoneme pieces are recognized from the extracted feature quantities, and the input is converted into phonetic symbols (step S506).
• The phonetic symbols converted in step S506 are then compared with those previously registered in the recognition dictionary.
• A coincidence evaluation is performed to determine how well the phonetic symbols match the utterance (step S507).
• In this coincidence evaluation, the degree of coincidence with the standard models, standard parameters, and standard templates of sounds and speech stored in the storage unit of the device is evaluated by an evaluation function, and the phonetic symbol given as the evaluation result is specified.
• A phonetic symbol string is specified by obtaining, in time series, the plurality of phonetic symbols specified by the coincidence evaluation. The registered phonetic symbol string having the highest similarity to the identified phonetic symbol sequence is then taken as the phonetic symbol recognition result, and device operation or search processing is executed according to the information associated with the recognition result (step S508).
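The coincidence evaluation and best-match selection of steps S507 and S508 can be sketched with edit distance as a stand-in for the DP matching used in such systems (a simplification; real systems score against acoustic models):

```python
# Sketch of steps S507-S508: compare a recognized phoneme sequence against
# each registered phonetic symbol string by dynamic-programming edit distance
# and select the most similar entry as the recognition result.
def edit_distance(a, b):
    """Single-row DP edit distance between two sequences."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def best_match(recognized, dictionary):
    """Return the registered entry most similar to the recognized sequence."""
    return min(dictionary, key=lambda entry: edit_distance(recognized, entry))
```

The entry returned by `best_match` plays the role of the phonetic symbol recognition result whose associated information drives the device operation or search.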
• The processing associated with the recognition result includes, for example, generation of character strings containing proper nouns realized by recognition of phonetic symbol strings using the present invention, execution of searches related to operation commands, information, or products, presentation of information to the user specified in connection with the recognition of phonetic symbol strings, and operations instructed by the user.
• Examples are operations based on the sound, text, and images of web browser pages; TV and video operation; and control of robots, navigation devices, computers, audiovisual equipment, cookers, washing machines, air conditioners, and other connected home appliances;
• as well as a series of processes and operations such as responses, specification of search conditions, storage, change, registration, and deletion of information presented by the information processing device, specification and browsing of advertisements and program content associated with recognition results, and personal authentication based on keywords and speech.
• By associating an image recognition dictionary such as for faces or fingerprints, a recognition dictionary containing proper nouns using phonetic symbol strings based on phonemes or phoneme pieces, and an acoustic model based on phonemes or phoneme pieces for each speaker, authentication, billing, and service selection can be performed.
• Furthermore, the recognizable words can be clearly indicated by uttering, via speech synthesis, words based on the phoneme strings or phoneme piece strings registered in the recognition dictionary in order to answer a user's question; arbitrary operations can be performed according to the recognition result; recognized character strings or word strings can be presented according to the recognition result; or advertisements associated with phoneme strings or phoneme piece strings can be executed. These can be combined with conventional speech recognition technology.
• In step S509, it is determined whether or not the next voice input is to be executed. If voice is to be input again (step S509; Yes), the process returns to step S504 to wait for voice input. If no voice is to be input (step S509; No), it is determined whether or not the next content information is to be acquired (step S510). When acquiring the next content information (step S510; Yes), the processing is repeated from step S501 in order to acquire new content. If new content information is not to be acquired (step S510; No), the series of processing ends, or the device waits for a user utterance.
• A device using the present invention thus allows the user to perform voice operations by using identifiers based on phonetic symbols, such as phonemes and phoneme pieces, obtained from the information in the acquired markup language, together with feature quantities for identifying those identifiers.
• Where possible, such identifiers are acquired from the markup-language information; if necessary, arbitrary identifiers related to images and actions, such as fingerprints, facial expressions, and palm prints, can be acquired and combined for personal authentication and the like, and can also be used for the actions of agents and robots.
• Selection processing conventionally performed by mouse operation can be performed based on identifiers and feature quantities obtained from the user's utterances and inputs: giving focus to any row, column, link, or operation button of a table tag;
• overlapping cursors; issuing events associated with these operations to browsers from the operating system's manager; and controlling other devices using infrared, LAN, telephone lines, and the like.
• For the content information acquired in step S501, the "pronounce" attribute information in the tags is detected (step S502) and registered in the recognition dictionary information 206 (step S503).
• The display position and display items of the screen configuration information are combined in the browser with the preceding and following tags.
• The display position is specified when each tag is processed; a scene position can be specified by association with a tag indicating the scene, title, or time-series position in content information such as MPEG-7; and for map information,
• a physical location can be specified by association with spatial position information given by latitude and longitude, place names, regional information, or store information using XML.
• When Fig. 11 is displayed in an HTML browser, it appears as shown in Fig. 12.
• Here, the utterance "ichigyoume" (first line) is matched against the pronounce attributes.
• The focus B300 is set on the first line according to the phoneme string "i/ch/i/g/y/o/u/m" given as the pronounce attribute, as shown in Fig. 13; likewise, according to the pronounce attribute for the pronunciation of "shousai" (details),
• click processing is performed and the form is sent (step S506).
• When multiple buttons are displayed, it may be unclear which button is targeted; the device can ask the user questions such as "Which line?" by announcement or display, performing interactive processing by voice or screen so as to obtain target phoneme strings or phoneme piece strings that are easy to infer. Utterance symbol strings or VoiceXML may also be used.
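A minimal sketch of this focus selection (element names and phoneme strings are illustrative assumptions):

```python
# Sketch of focus selection by utterance: each element carries a pronounce
# phoneme string; a recognized string selects the matching element.
ELEMENTS = {
    "i/ch/i/g/y/o/u/m/e": "row-1",    # "ichigyoume" (first line), assumed reading
    "sh/o/u/s/a/i": "detail-button",  # "shousai" (details), assumed reading
}

def focus_for(recognized: str):
    """Return the element to focus, or None when nothing matches."""
    return ELEMENTS.get(recognized)
```

When `focus_for` returns None, the interactive clarification described above ("Which line?") would be triggered.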
• Such phonetic symbol recognition may be performed in cooperation with scripts or the like, or an arbitrary extension function may be added by tags such as "<EMBED SRC>", "<OBJECT>", and "<APPLET CODE>"; identifier strings of phonemes, phoneme pieces, or phonetic symbols, serving as instructions or pronunciation symbols for operating them externally, can be given to those programs as variables or attributes, used as data or feature quantities, used for information operated in conjunction with scripts, or used as script control conditions.
• Fig. 11 also shows syllable tags and phoneme tags. These tags can be combined with phonemes, image identifiers of input video, and so on: for example, when it is recognized from the user's voice or facial expression that the user is angry, or when a specific identifier is detected in the image, a script or content to be processed can be presented, a move to a link destination performed, or similar processing carried out.
• Tags, attributes, and elements are, to a general interpretation device, simply character strings. Information based on the tags and attributes is provided, according to those character strings, to the functions, processes, programs, and services recorded in the information processing device that perform the processing evaluated by matching.
• Phoneme strings and phoneme piece strings are provided and registered in the recognition target dictionary; when other identifiers and feature quantities are used, their detection results are evaluated. It is possible to describe attributes holding phonetic symbol strings to change coefficients, to output instruction information to peripheral devices, to prepare allophones and synonyms using the dictionary configuration, to register allophones and synonyms in the dictionary by describing multiple phonetic symbol strings distinguished by boundary symbols, and to correct the results obtained by recognition.
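Registering several readings for one entry, as described above, can be sketched with a boundary symbol (the "|" separator is an assumption):

```python
# Sketch of multiple phonetic symbol strings for one entry, separated by a
# boundary symbol so allophones and synonyms can be registered together.
def split_readings(attribute_value: str):
    return [r for r in attribute_value.split("|") if r]
```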
• In the simplest case, the character string to be displayed is sandwiched between pronunciation tags and thereby specified as the pronunciation target.
• Character strings in kanji or mixed sentences, or in other languages such as English and Chinese, can be converted into symbol strings using phonemes or phoneme pieces for pronunciation and used for recognition, command control, detection, and search; ASCII codes that express phonetic symbols in alphabetic characters may be described in the markup language instead of numerical values.
• By controlling video and dialogue related to feature quantities and identifiers, screen features, and display objects associated with words in this way, the invention can also serve as a tool for generating and producing movies and programs using CG and the like; for recognizing the state of utterances while browsing content; or for content evaluation based on voting and browsing frequency via user voice operation. It is also acceptable to evaluate a movie or program based on the correlation with the obtained feature quantities or identifiers.
• Such a search procedure using a markup language may be implemented with a server-client model.
  • Fig. 15 shows the state transitions in processing in the server client model.
  • a terminal device serving as a client generates a query.
  • the query generation method may be a general character string input method, a voice input method, or a method of displaying an image and using the feature amount as a query.
• Next, the distribution device serving as the server searches for appropriate results and, according to the search results, the distribution base station distributes the search result list, using the present invention, to the terminal device. The terminal device interprets the markup language of the acquired information, converts the character strings in the ranges sandwiched between specific tags into the aforementioned identifiers such as phonemes and phoneme pieces, and, for the voice input information spoken by the user, acquires phonemes and phoneme pieces, constructs phonetic symbol strings such as phoneme strings and phoneme piece strings, and performs matching processing based on them.
• A query based on the specified identifier string is then formed, and those queries are transmitted to the distribution base station for searching.
  • a search with voice control using a markup language is performed.
  • the constructed query may be used for searching by a single device.
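Forming and encoding such a query for transmission to the distribution base station might look like this sketch (the parameter name `pron` and the URL are assumptions):

```python
# Sketch of query formation: URL-encode a recognized phoneme string as a
# query parameter for the distribution base station's search interface.
from urllib.parse import urlencode

def build_query_url(base_url: str, phoneme_string: str) -> str:
    return base_url + "?" + urlencode({"pron": phoneme_string})
```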
• As in FIG. 16, identifier symbol strings such as phonemes and phoneme pieces for speech processing can be inserted and set by interpreting the markup language on the terminal side.
• The present invention can be implemented by inserting such strings on reception, inserting them manually in advance, constructing and inserting the identifiers on the distribution base station side device or a device linked to it, or using a single information processing device. Variables and attributes may be added to implement operations and processes, tags may be added, the contents of the markup-language information may be changed, and the dictionaries related to identifiers and feature quantities may be changed, added to, and deleted.
  • search is performed by giving a word based on an arbitrary name, and a symbol string using phonemes or phoneme pieces is given to an arbitrary name to support voice control. It is also possible to use a phoneme or a phoneme symbol string as a keyword for operation, and to associate such phonetic symbol strings with advertisements or add advertisement attributes to make them related.
  • An advertisement may be associated with an utterance symbol string by displaying the URL of the advertisement in the same tag as the utterance symbol attribute.
• By converting the identifiers related to images displayed in the browser and the phonemes and phoneme pieces of keywords to be controlled into symbol strings and IDs, that is, compressed identifier strings, they become easy for terminals to use; voice can then be used on the device without sending unnecessary information or interpreting the markup language. A highly convenient operating environment can be configured by importing the information via a telephone communication line, acquiring it by e-mail, or downloading it from another device.
• A recognition dictionary based on file names may be configured by describing a file name as a phoneme string; by setting file names using phoneme strings and phoneme piece strings, phoneme and phoneme piece
• recognition can be used to select information in the markup language. With recognition, stock prices can be searched by securities code or company name, and products can be searched by JAN code.
• Searches by personal name or region name can implement various services; the phoneme dictionary can be changed according to the location or device, or changed page by page,
• and the phoneme dictionary may also be changed in accordance with a frame, whether a frame of a sentence or sentence structure, a frame of a moving picture, or a scene unit spanning multiple frames of a moving picture.
• When indexing according to the present invention is performed for an information format having chunk headers, such as the RIFF format shown in Fig. 17, an arbitrary chunk header such as "PRON" is specified,
• and the phoneme string or phoneme piece string is described in it. For an ordinary file this may accompany general metadata such as the file name, production date, and producer. For 2D/3D images, the phonemes and phoneme pieces associated with the names of display objects, persons, or parts are described; for an audio file, the phonemes or phoneme pieces of the recorded voice; for a music file, the lyrics or title as phonemes or phoneme pieces. The phonemes or phoneme pieces may be written in the free description area and used for searching.
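A "PRON" chunk in the RIFF convention (four-byte ID, little-endian 32-bit size, payload) might be read and written as follows; the payload encoding is an assumption:

```python
# Sketch of a RIFF-style "PRON" chunk carrying a phoneme string: a four-byte
# chunk ID, a little-endian uint32 payload size, then the payload itself.
import struct

def make_pron_chunk(phoneme_string: str) -> bytes:
    payload = phoneme_string.encode("ascii")
    return b"PRON" + struct.pack("<I", len(payload)) + payload

def read_chunk(data: bytes):
    """Parse one chunk into (chunk_id, payload)."""
    chunk_id = data[:4]
    (size,) = struct.unpack("<I", data[4:8])
    return chunk_id, data[8:8 + size]
```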
• In the present embodiment, phonemes are mainly used as the phonetic symbols.
• The phoneme part may instead be phoneme pieces, and the phoneme type may be a phoneme set of a different language or notation, such as the international phonetic alphabet, English, or Chinese.
• The selection range and branch content in markup-language processing may also be configured depending on whether a round image or a triangular image is presented to the computer, using identifiers based on the recognized image; searches can be made based on presented photographs, or names associated with photograph features can be expanded into phoneme and phoneme piece strings and sent and received in markup languages or dedicated symbol strings for voice operations.
• A character code other than ASCII, such as JIS code or ISO code, may be used, or a unique character code system with arbitrary numerical IDs based on phonemes or phoneme pieces may be used.
• The identifier strings and identifiers used in the present invention may be specified as attributes, variable names, and identifiers based on the names used for identification by one or a combination of classifications such as scale type, instrument type, mechanical sound type, environmental sound type, image type, face type, facial expression type, person type, action type, landscape type, display position, character symbol, label, shape, graphic symbol, and broadcast program. An identifier string may be regarded as identifiers described continuously along a time-series transition, or may be used after being converted into a phoneme or phoneme piece string based on their names.
  • the search result may be obtained by sending the identifier or identifier string using the GET method or POST method in CGI.
• In this way, by assigning voice-related feature names with their identifiers and discriminating functions, and the features, identifiers, and discriminating functions of still and moving images, to markup languages using attributes and variables, processable markup languages can be configured; the user can realize device control by voice using the phonetic symbol strings provided by information processing devices that process such markup languages. This can be applied to public information, map information, product sales, reservation status, viewing status, questionnaires, surveillance camera images, satellite photographs, blogs, and robot and equipment control. In response to these requests, search and processing results may be returned from the server to the client using any markup language.
  • the present invention can also be applied to a processing system by a server client related to a base station and a terminal.
• The devices and terminals are configured as shown in Fig. 18 and connected via communication lines, acquiring information from other devices and distributing information to them, so that information related to voice operations can be exchanged to improve user convenience.
• The shared line used here is not limited to the Internet; as long as it is a wide-area communication network such as a LAN or telephone line, or an indoor communication network, wired or wireless, the invention can be implemented for any device or service, such as home appliances, remote controllers, web services, telephone services, and EPG distribution.
• A remote controller or robot may be used as one form of terminal or one form of base station.
• A user permitted to speak utters voice to the terminal, and any of the following processing procedures is performed for the recognition process at the terminal or the base station.
• In one procedure, feature quantities are extracted from the speech obtained from utterances or from captured video images, and the feature quantities are transmitted to the target relay location or base station apparatus; the base station apparatus receiving the feature quantities
• generates phoneme symbol strings, phoneme piece symbol strings, and other image identifiers according to the feature quantities, and, based on the generated symbol string, selects and executes the matching control means.
• In another, feature quantities are extracted from speech obtained from utterances or captured video images, and identifiers accompanying recognition, such as phoneme symbol strings, phoneme piece symbol strings, and other image identifiers, are generated in the terminal;
• the generated symbol string is transmitted to the target relay location or base station apparatus, and the base station apparatus to be controlled selects and executes the matching control means based on the received symbol string.
• In another, feature quantities are extracted from voices obtained from utterances and captured video images; the terminal recognizes phoneme symbol strings, phoneme piece symbol strings, and other image identifiers based on the generated feature quantities, selects the control content based on the recognized symbol string, and transmits it to the base station device that performs control or the device that relays information distribution.
• In another, the voice waveform obtained by utterance at the terminal, or the image of the captured video, is transmitted as-is to the controlling base station apparatus; in the controlling apparatus, relay point, or base station device, phoneme symbol strings, phoneme piece symbol strings, and other image identifiers are recognized, and the control means is selected based on the recognized symbol string. The same applies to feature quantities and identifiers of sounds and video such as environmental sounds.
• Thus the terminal may simply transmit only the waveform, transmit the feature quantities, transmit the recognized identifier string, or transmit processing procedures such as commands and messages associated with the identifier string;
• the configuration of the distribution base station is changed according to the transmission information it accepts. The sender and receiver, which can implement a client-server model, may also transmit to and receive from each other, and features such as images, sounds, and actions related to the identifiers described above can be assigned to the attributes of the markup language.
• The degree of coincidence between feature quantities extracted from information provided by the user side and feature quantities extracted from the distribution information may be evaluated, and search and recognition performed so as to involve arbitrary control and responses to the user. Personal authentication using secret words may also be performed by associating image recognition dictionaries such as for faces and fingerprints, recognition dictionaries containing proper nouns using phonetic symbol strings, and acoustic models based on phonemes and phoneme pieces for each speaker.
  • the command dictionary for converting to an associated processing procedure based on the input phoneme sequence or phoneme sequence is a new control command or media that can be used on the terminal side or the distribution base station side. You can send / receive, distribute, and exchange information such as phonetic symbol strings and image identifiers related to type, format type, and device name using markup languages such as XML and HTML, RSS, and CGI. .
  • a user gives a speech waveform to a terminal or device by speaking.
  • the terminal-side device analyzes the given speech and converts it into feature values.
  • the converted feature values are recognized and converted into identifiers using recognition techniques such as HMMs or Bayes discrimination.
  • the converted identifier is information indicating a phoneme, a phoneme piece, or any of various image identifiers. As described elsewhere, for speech it may be a phoneme, an environmental sound, or a musical scale; for video it may be an identifier based on an image or an action. Based on the obtained identifiers, a dictionary based on phoneme and phoneme-piece symbol strings is consulted by DP matching to select an arbitrary processing procedure, and the selected processing procedure is transmitted to the target device for control.
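As an illustration of the dictionary lookup described above, the following sketch selects a processing procedure by DP matching (edit distance) between a recognized phoneme string and dictionary entries. The command dictionary, phoneme notation, and procedure names are illustrative assumptions, not taken from the patent.

```python
# Minimal sketch: select a control command by DP matching (edit distance)
# between a recognized phoneme string and command-dictionary entries.

def edit_distance(a, b):
    """Classic dynamic-programming (DP) edit distance over symbol lists."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

# Hypothetical command dictionary: phoneme string -> processing procedure.
COMMAND_DICT = {
    ("d", "e", "N", "k", "i"): "power_on",        # "denki" (light)
    ("t", "e", "r", "e", "b", "i"): "tv_on",      # "terebi" (TV)
    ("o", "N", "r", "y", "o", "o"): "volume_up",  # "onryoo" (volume)
}

def select_command(recognized):
    """Return the procedure whose phoneme entry is closest to the input."""
    return min(COMMAND_DICT.items(),
               key=lambda kv: edit_distance(recognized, list(kv[0])))[1]

print(select_command(["t", "e", "r", "e", "b", "i"]))  # exact match
print(select_command(["d", "e", "k", "i"]))            # noisy input, one phoneme dropped
```

Because only symbol-string distance is evaluated, a noisy or partially dropped phoneme sequence still resolves to the nearest registered entry, which is the behavior the bullet above relies on.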
  • a dialogue device for handicapped users may be configured by providing a display of the phonetic notation of utterances and a braille output unit.
  • information processed by such a procedure can be transmitted at an arbitrarily selected conversion level: natural information such as video and audio may be transmitted as-is without conversion to feature values, transmitted after conversion to feature values, transmitted after conversion to identifiers, or transmitted after selection of control information. The receiving side is configured to process information received in any of these states, and based on the acquired information it may forward it to a distribution station or control device, or perform arbitrary processing such as search, recording, mail delivery, machine control, or device control.
  • an identifier string, character string, or feature value suitable for use as a query is acquired by recognition and transmitted to the distribution-side base station, and information matching the query is obtained.
  • a control dictionary is constructed so that control items can be selected by communication when control is performed by voice; advertisements may be displayed during communication and search wait times. Dictionary information may be exchanged between devices, the procedure may be performed using P2P technology, and the information may be sold or distributed.
  • this control command dictionary can be freely renewed and reused because it comprises phonemes, phoneme pieces, or other arbitrary identifiers, feature values, and device control information as described above. It is also possible to keep trendy search keywords up to date by replacing or reconfiguring search dictionary information that associates arbitrary identifiers with feature values.
  • the recognition dictionary information that is changed according to the position and configuration of the content information may include a dictionary for face recognition, a dictionary for fingerprint recognition, a dictionary for character recognition, or a dictionary for figure recognition.
  • infrared control information to be transmitted to products controllable by a conventional infrared remote controller is selected as device control information, or a series of operations is batch-processed by combining such control information.
  • depending on the CPU performance of the apparatus, the feature information may be transmitted to the information processing apparatus for voice-based control without recognizing identifiers locally.
  • in this way, even conventional devices that cannot perform voice control by themselves can be controlled: for such devices, infrared remote control signals are provided through a conversion dictionary from voice information to control modules, while for devices capable of voice control, commands are recognized and executed based on feature values and speech waveforms. The control dictionary can be changed to improve performance, and the version information of the control dictionary and the status of the device can be checked.
  • the server and the client may be divided at arbitrary processing steps and connected by communication, exchanging the same services, such as infrastructure provision, search, and indexing, between them.
  • phoneme and phoneme-piece recognition may use recognition dictionary information with acoustic models, standard parameters, and standard templates tailored to an individual's voice characteristics.
  • since the recognition dictionaries for images and sounds can be changed according to the user, highly versatile personal authentication can be realized; this allows charging, locking and unlocking keys, selecting services, granting usage rights, and using copyrighted works.
  • various kinds of operations, and services using those operations, can be realized with an information terminal that performs recognition according to the present invention.
  • a client such as a DVD recorder, network TV, STB, HDD recorder, music recording/playback device, or video recording/playback device can be operated from a core server at the communication destination using a terminal that performs recognition according to the present invention.
  • information acquired by the terminal is transmitted via infrared communication, FM or VHF frequency-band communication, or wireless communication such as 802.11b, Bluetooth (registered trademark), ZigBee, WiFi, WiMAX, UWB (Ultra Wide Band), or WUSB (Wireless USB).
  • EPG (electronic program guide)
  • BML (electronic program guide by data broadcasting)
  • RSS, text broadcasting, and data broadcasting
  • TV video and text broadcasting on mobile terminals and mobile phones
  • voice input
  • character string input by voice
  • character string input for text broadcasting and data broadcasting
  • the operation and control procedures of information terminals, home appliances, information devices, and robots may be instructed by swinging the mobile terminal or mobile phone, or the mobile terminal or mobile phone may be used as a general-purpose remote control for the client terminal, so that home appliances, information devices, and robots are operated and controlled remotely.
  • the recognition priority may be changed using a pre-registered phoneme dictionary, or the recognition target may be limited using a pre-registered dictionary.
  • for a phonetic symbol string such as a phoneme string or phoneme-piece string, a plurality of phoneme strings or phoneme-piece strings that may be recognized simultaneously can be written to configure a plurality of recognition dictionary information 206 entries, and the same recognition dictionary information 206 may be used for input items having the same attribute variable.
  • a plurality of words that may be recognized may be represented as attribute variables using a plurality of phoneme strings, phoneme-piece strings, and phonetic symbol strings.
  • a classifier such as an arbitrary counting unit may be represented as a phoneme string or phoneme-piece string.
  • the recognition dictionary information 206 may be switched to a dedicated number dictionary, a dedicated dictionary according to the menu item, or a limited proper-noun dictionary such as place names or station names; the same method may be used elsewhere.
  • in the step (S506) of converting a speech waveform into identifiers based on phonemes, phoneme pieces, and phonetic symbols, multiple sets of values may be prepared for each language, such as Bayes discriminant function parameters, standard patterns used for HMMs, standard templates and standard values obtained as learning results, eigenvector values, and covariance matrices, according to the character code selected for display based on the markup language. Multiple languages can be supported by switching to a Russian standard template if the display is Russian or a Chinese standard template if the display is Chinese, or by acquiring information about the language environment of the user's information processing device, operating system, or browser.
  • the standard template used for recognition may thus be selected from multiple languages.
  • the standard template can also be selected to correspond to the user's utterances and dialects, and the user's utterance characteristics and dialects can be learned to compose the templates.
  • a method may be used in which the recognition dictionary information 206 is switched by converting the content of a cookie or session into a phonetic symbol string of phonemes or phoneme pieces according to the attribute variable.
  • a phonetic symbol string such as a phoneme string or phoneme-piece string recognized from speech, or feature values extracted from speech, may be transmitted to the base station by any variable transmission means, such as a script method like AJAX, transmission of status and environment variables as CGI (Common Gateway Interface) parameters, or socket communication by a program. Phoneme strings and phoneme-piece strings recognized at the base station from received phonetic symbol strings or received speech feature values can then be used to distinguish the user's utterance and perform arbitrary processing, or to configure search conditions for searching content information, advertising information, and regional information.
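The CGI-parameter transmission mentioned above can be sketched as follows: a recognized phonetic symbol string and environment variables are encoded as GET parameters for a base station. The parameter names and the base-station URL are hypothetical assumptions.

```python
# Minimal sketch: encode a recognized phonetic symbol string and environment
# variables as CGI (GET) parameters for transmission to a base station.
from urllib.parse import urlencode

def build_query_url(base_url, phoneme_string, env):
    # "phonemes" and the env keys are illustrative parameter names
    params = {"phonemes": "/".join(phoneme_string)}  # e.g. "t/e/r/e/b/i"
    params.update(env)
    return base_url + "?" + urlencode(params)

url = build_query_url(
    "http://basestation.example/cgi-bin/recognize",
    ["t", "e", "r", "e", "b", "i"],
    {"lang": "ja", "session": "abc123"},
)
print(url)
```

The same query string could equally be sent over socket communication or posted by a script such as AJAX, as the bullet above notes; only the encoding step is shown here.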
  • for the display information of the terminal device that changes with the recognition of these phonetic symbol strings, such as pictures, characters, icons, and CG (Computer Graphics), the output sound information such as music and warning sounds, the motion control information for devices such as robots, mechanical devices, communication devices, electronic devices, and electronic musical instruments, and the recognition dictionary information 206 for recognizing voice, still images, moving images, and so on, the base station can send information to update them, combined with arbitrary information such as processing procedure information (programs, scripts, and function expressions for feature extraction), or the terminal device can perform the processing autonomously.
  • a phonetic symbol such as a phoneme or phoneme piece obtained as a recognition result is divided into a plurality of frames, and a continuous recognition result is obtained in time series.
  • a plurality of phonemes over a plurality of frames are obtained.
  • the parameters of a Bayes discriminant function may be configured using, as feature values, the distance information between the input speech and phonetic symbols such as phonemes and phoneme pieces acquired as recognition results; or HMM parameters may be configured using distance information acquired as recognition results for multiple phonemes and phoneme pieces across multiple frames, degenerated in time series; and the identifier ranked first by the recognition results in multiple frames may be used.
  • dynamic speech recognition may be configured in combination with techniques used in conventional speech recognition by evaluating these results with DP matching or the like.
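The frame-wise recognition above can be sketched as follows: the top-ranked identifier of each frame is collapsed in time series, with very short runs dropped as noise. The frame labels and the minimum run length are illustrative assumptions.

```python
# Minimal sketch: collapse per-frame top-ranked phonetic identifiers into a
# time-series phoneme string, dropping very short runs as noise.
from itertools import groupby

def collapse_frames(frame_labels, min_run=2):
    """Run-length collapse of frame-wise recognition results."""
    out = []
    for label, run in groupby(frame_labels):
        if len(list(run)) >= min_run:       # keep only stable runs
            if not out or out[-1] != label:  # merge runs split by dropped noise
                out.append(label)
    return out

# /cl/ here denotes the pre-plosive closure, as in the notation used elsewhere.
frames = ["cl", "cl", "cl", "k", "k", "a", "a", "a",
          "s", "s", "a", "a", "t", "a", "a"]
print(collapse_frames(frames))
```

The resulting symbol string can then be scored against dictionary entries by DP matching, as described above; a single-frame glitch (like the lone "t" in the example) is discarded before matching.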
  • markup language information is acquired through the step of acquiring content information (S401, S501), tags are detected from the markup language information, and tag attributes are detected from the tags by tag attribute detection means.
  • a phonetic symbol string extraction step (S402, S502) extracts the phonetic symbol string associated with the detected attribute, and a step (S403, S503) registers it in the recognition dictionary information 206 as a phonetic symbol string used for recognition.
  • these steps (S401 to S403, S501 to S503) can be implemented by character string evaluation processing and detection processing, and are used for operation, search, and browsing of content information.
  • the step of waiting for the speaker's voice input (S504) is performed, followed by the step of extracting feature values, performed by the arithmetic unit when voice input starts (S505).
  • a step (S506) of converting the feature values into identifiers by phonetic symbol recognition is performed. It is generally known that this step (S506) uses a distance evaluation function, a statistical test method, a learning result using multivariate analysis, or an algorithm such as an HMM.
  • a phonetic symbol string is formed as a time-series sequence of the recognized phonetic symbols.
  • the configured phonetic symbol string is compared with the recognition dictionary information 206 built from the phonetic symbol strings extracted from the attributes of the markup language tags, and a search is performed in the recognition dictionary information 206.
  • the step (S507) of evaluating the degree of coincidence between the phonetic symbol strings is performed to evaluate whether the recognition target is valid.
  • any algorithm usable for symbol string comparison and evaluation, such as DP matching, HMMs, or automata, can be used, and these can be multiplexed and layered; a variety of such methods have been devised.
  • by performing step (S508), it is possible to realize information processing using speech that differs from conventional grammar-dependent and statically registered-word-dependent recognition.
  • the recognition dictionary information 206 comprises a plurality of recognition dictionaries based on phonetic symbol strings, and input items are recognized while switching the recognition dictionary information 206 used in the step (S507) of evaluating the match with the phonetic symbol string, based on the type information for discriminating input items detected by the tag attribute detection means.
  • the recognition efficiency can be improved by limiting the phonetic symbol strings included in the recognition dictionary information 206.
  • when switching the recognition dictionary information (206) that evaluates speech input according to the item to be input by the information processing device, the dictionary is chosen from the attribute name and the words associated with the attribute: if the information acquired from the attribute is "book", a recognition dictionary using the phonetic symbol strings of classifiers such as "book (s/a/ts/u | v/o/l/u/m/e)"
  • and of the "number" corresponding to the classifier can be selected as the search target of the recognized phonetic symbol string; or, if the information acquired from the attribute is a table associating the suffix "station (e/k/i | s/u/t/e/i/sh/o/N)" with nouns used as station names,
  • a recognition dictionary using those phonetic symbol strings is selected as the search target for recognized phonetic symbol strings; and if the information acquired from the attribute is a zip code or telephone number, a number dictionary is used.
  • in this way the recognition target can be identified using a group of nouns included in a specific framework.
  • by switching among the plurality of recognition dictionary information 206 according to the attribute associated with the item the user is to input, the recognition dictionary information searched for the recognized phonetic symbol string can be limited.
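The attribute-based dictionary switching described above can be sketched as follows: the recognized phoneme string is searched only in the dictionary selected by the input item's attribute. The attribute names and dictionary contents are illustrative assumptions.

```python
# Minimal sketch: switch among multiple recognition dictionaries (the
# "recognition dictionary information 206") according to the attribute of
# the input item detected from a markup tag.

DICTIONARIES = {
    "number":  {("i", "ch", "i"): "1", ("n", "i"): "2", ("s", "a", "N"): "3"},
    "station": {("t", "o", "o", "ky", "o", "o"): "Tokyo",
                ("o", "o", "s", "a", "k", "a"): "Osaka"},
}

def recognize_with_attribute(attr, phoneme_string):
    """Look up the phoneme string only in the dictionary for this input item."""
    dic = DICTIONARIES.get(attr, {})
    return dic.get(tuple(phoneme_string))  # None if not in this dictionary

print(recognize_with_attribute("station", ["o", "o", "s", "a", "k", "a"]))
print(recognize_with_attribute("number", ["o", "o", "s", "a", "k", "a"]))
```

Restricting the search to one small dictionary per input item is exactly the efficiency gain the bullets above claim: the same phoneme string matches in the "station" dictionary but is rejected for a "number" field.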
  • a character string displayed by designation in a markup language using the method of the present invention, a character string associated with a displayed image or image feature, or a character string associated with output voice, music, or acoustic features can be converted into a phonetic symbol string and registered in the dictionary. Such a string is detected by searching for the phonetic symbol string, and content-related information and the location of arbitrary information such as advertisements, videos, and links can be provided in relation to the detected information through user operations based on music, voice, and the like. These inputs need not be based on voice or text input: a character string selected from a list such as a menu, or the label of a button in a button operation, may also be used.
  • when reading phonetic symbols according to attributes related to a phonetic symbol dictionary included in a tag, an attribute such as "recog_dic.uri" may indicate the location of the dictionary by information such as a URL, URI, IP address, or directory path, and an attribute such as "recog_dic.type" may indicate the type of information recognized by the dictionary,
  • such as the type of phoneme. This makes it possible to distinguish phonetic symbol dictionaries that are frequently reused, so that dictionary information based on phonetic symbol strings and acoustic characteristic templates for recognition can be acquired from the markup language; such information may be provided in association with an attribute.
  • dictionary information read in the past may be saved for some time by a method generally called caching; when the above-mentioned attribute indicates a specific word range, the priority of that dictionary is raised, saving the trouble of reading it again.
  • the phonetic symbol dictionary may be read for each page as a separate file and related by ID, like a style sheet; a phonetic symbol dictionary described in the header block and related by ID may be incorporated; a phonetic symbol dictionary given as an attribute for each tag may be incorporated; or a phonetic symbol dictionary may be included in header information when reading via a file or communication line. Any of these can be used as a phonetic symbol string template dictionary.
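The dictionary-location attributes described above might be read as in the following sketch, which parses hypothetical "recog_dic.uri" / "recog_dic.type" style attributes from a tag and caches dictionaries already fetched. The markup schema and the stub fetcher are assumptions; only the attribute names follow the examples in the text.

```python
# Minimal sketch: read "recog_dic.uri" / "recog_dic.type" style attributes
# from markup and cache dictionaries so they are not re-read each time.
import xml.etree.ElementTree as ET

markup = """<menu recog_dic.uri="http://example.com/station.dic"
                  recog_dic.type="phoneme">Station name</menu>"""

_cache = {}  # uri -> dictionary (the "caching" described in the text)

def load_dictionary(uri, fetch):
    if uri not in _cache:
        _cache[uri] = fetch(uri)  # fetch only on a cache miss
    return _cache[uri]

root = ET.fromstring(markup)          # '.' is legal in XML attribute names
uri = root.get("recog_dic.uri")
dic_type = root.get("recog_dic.type")
dic = load_dictionary(uri, fetch=lambda u: {"source": u})  # stub fetcher
print(dic_type, dic["source"])
```

A second page referring to the same `recog_dic.uri` would hit the cache and skip the fetch, which is the re-read-avoidance behavior described above.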
  • a phonetic symbol string may be embedded in speech waveform information using "acoustic OFDM", which can embed text data in speech, and the embedded phonetic symbol string and related markup language information may be recovered. This can be used to search for phonetic symbols in audio data and to display related information as subtitles, so phonetic symbol strings demodulated from very common audio sources such as radio and television can also be used for searching.
  • the database searched using the phonetic symbol string acquired by phonetic symbol recognition may be indexed by phonetic symbol strings; queries may be composed as logical combinations of phonetic symbol strings based on a plurality of keywords, for example in a configuration whose logic can be expressed by a Boolean model, and provided to the database with those combinations to obtain search results.
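The Boolean-model query above can be sketched with a small inverted index keyed by phonetic symbol strings; the index contents and the key notation are illustrative assumptions.

```python
# Minimal sketch: a database indexed by phonetic symbol strings, queried
# with Boolean (AND / OR) combinations of recognized keywords.

INDEX = {
    "t/e/r/e/b/i": {1, 2, 5},   # documents tagged with "terebi" (TV)
    "n/y/u/u/s/u": {2, 3},      # "nyuusu" (news)
    "e/e/g/a":     {4, 5},      # "eega" (movie)
}

def query_and(*keys):
    """Documents matching ALL of the phonetic-symbol-string keys."""
    sets = [INDEX.get(k, set()) for k in keys]
    return set.intersection(*sets) if sets else set()

def query_or(*keys):
    """Documents matching ANY of the phonetic-symbol-string keys."""
    out = set()
    for k in keys:
        out |= INDEX.get(k, set())
    return out

print(sorted(query_and("t/e/r/e/b/i", "n/y/u/u/s/u")))  # both keywords
print(sorted(query_or("n/y/u/u/s/u", "e/e/g/a")))       # either keyword
```

Nested AND/OR combinations of these set operations express the Boolean model the text refers to, without ever converting the phonetic symbol strings into words.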
  • the present invention relates phonetic symbols and speech features by a distance or probability based on a Bayes discriminant function or the like.
  • the phonetic symbol string is acquired by the above method, and by directly relating the acquired phonetic symbol string and the word string via the markup language, the words to be recognized are restricted compared with conventional general recognition,
  • and the dynamic provision of dictionary information enabling efficient recognition can be realized in a markup language; phonetic symbol strings can be used in queries without directly using words.
  • a database searched by HMMs or DP matching using phoneme and phoneme-piece symbol strings may also be constructed.
  • an attribute based on image recognition, rather than on phonetic symbols such as utterance phonemes and utterance phoneme transitions, may also be used.

Abstract

Provided is an information processing device which changes and stores the notation of a markup language used in content information, so that the changed information can be used by a delivery device and a receiving terminal. The information processing device operates with speech recognition and phoneme recognition techniques by changing phoneme dictionary information described in the markup language, adding tags, variables, or attributes, and storing, changing, and delivering them. Even if no word model, acoustic model, grammar model, or part-of-speech information is registered in a recognition dictionary when a word or character string contained in the content information is to be recognized by speech, the information processing device can realize proper speech recognition by dynamically constructing, from the content information, recognition dictionary information with phonetic symbols composed of phonemes and phoneme pieces, which are used for phonetic symbol recognition.

Description

Specification

Information processing apparatus and program

Technical field

[0001] The present invention relates to an information processing apparatus that uses phoneme recognition and/or phoneme-piece recognition for speech recognition.

Background art
[0002] Conventionally, technologies relating to speech recognition have been known for information processing apparatuses and operation methods that use speech. As a speech recognition method, it is known to extract phonemes and phoneme pieces in time series by phoneme recognition or phoneme-piece recognition using acoustic models, standard parameters, and templates of phonemes and phoneme pieces statistically generated from the speech of user utterances, thereby acquiring phoneme strings and phoneme-piece symbol strings.

[0003] Then, using a speech recognition dictionary in which words composed of phoneme strings and phoneme-piece strings are recorded, the match between the recognized phoneme or phoneme-piece string and the strings registered in the dictionary is evaluated, and speech recognition and its accompanying processing are realized by acquiring the word associated with the string having the highest degree of coincidence as a result of the evaluation, or by executing a device control command.

[0004] Here, as a user interface for controlling a device, there is a method, as in Non-Patent Document 1, of selecting and executing by phoneme recognition processing a word specified by a phoneme recognition dictionary and a device control method registered in the dictionary in association with that word; recognition techniques for phonemes and phoneme pieces have long been known, as shown in Patent Document 1.

[0005] In speech recognition, spoken words in human dialogue vary widely, with abbreviations, exclamations such as "oh" and "hmm", and coined words; content information in particular contains many proper nouns, such as product names and actor names, that are difficult to register in a dictionary, so not all words could necessarily be registered. Techniques for searching content using phoneme recognition have therefore been proposed in Patent Document 2, Non-Patent Document 2, Non-Patent Document 3, and elsewhere.

[0006] Here, Patent Document 3 proposes, for use in speech recognition in HTML, one of the markup languages, changing the display representation of recognizable words so that voice operation by the user becomes easier.

[0007] Patent Document 4 proposes a method for dynamically acquiring recognition dictionary data based on acoustic models for the minimum necessary vocabulary.

[0008] According to Patent Document 5, for use in speech recognition in HTML, a range is designated with specific symbols to identify recognizable words, clearly indicating to the user that voice recognition is available; for words whose pronunciation is difficult, a recognizable reading is written for convenience.
Patent Document 1: Japanese Patent Laid-Open No. 62-220998
Patent Document 2: Japanese Patent Laid-Open No. 2005-70312
Patent Document 3: Japanese Patent Laid-Open No. 11-25098
Patent Document 4: Japanese Patent Laid-Open No. 2002-91858
Patent Document 5: Japanese Patent Laid-Open No. 2005-18241
Non-Patent Document 1: "Research and Development on a Life Support Interface for an Aging Society", Key Project Research Report of the Aomori Prefecture Industrial Research Center, Vol. 5, Apr. 1998 - Mar. 2001.
Non-Patent Document 2: Masayuki Nakazawa, Takashi Endo, Kiyoshi Furukawa, Jun Toyoura, Takashi Oka (New Information Processing Development Organization), "Study of speech summarization and topic summarization using phoneme-piece symbol sequences from speech waveforms", IEICE Technical Report, SP96-28, pp. 61-68, June 1996.
Non-Patent Document 3: Takashi Oka, Hironobu Takahashi, Takuichi Nishimura, Nobuhiro Sekimoto, Hidehide Mori, Masanori Ihara, Hiroaki Yabe, Hiroaki Hashiguchi, Hiroshi Matsumura, "Pattern search algorithm 'map': what supports 'CrossMediator'", JSAI SIG workshop, volume 1, pages 1-6, Japanese Society for Artificial Intelligence, 2001.
[0009] As for using phoneme symbol strings in a markup language, MPEG-7 descriptions used within video streams such as MPEG-2 employ the Segment and MediaLocator constructs: a Segment or Frame within video content can be designated directly with <MediaLocator>, or content is pointed to with <MediaLocator> and a time position within it is designated with <MediaTime>, combined with tags designating appropriate proper nouns. When assigning content with the aforementioned Segments, the <Series> description method can be used to attach the same kind of metadata at fixed intervals as low-level Visual and Audio metadata. For audio, this can be designated as a ScalableSeries, and MPEG-7 Audio has a SpokenContent DS that describes the word lattices and phone lattices resulting from automatic speech recognition.

[0010] In VoiceXML, a standardization scheme for speech recognition, a unified way of describing user interfaces, previously inconsistent between products, has been proposed in order to perform grammar-dependent recognition according to context; however, no method has been proposed for dynamically constructing dictionary information by giving attributes to the target range of arbitrary tags using phonetic symbol identifiers such as phonemes and phoneme pieces, independent of context and grammar.
[0011] Many conventional applications and documents confuse phonemes and syllables. Taking the Japanese pronunciation "akasatana" as an example in the present invention, a syllable notation would be written as the single sounds "a/ka/sa/ta/na"; a phoneme notation would be "a/k/a/s/a/t/a/n/a" or "a/cl/k/a/s/a/cl/t/a/n/a"; a phoneme-piece (bigram) notation would be "a/a-k/k/k-a/a/a-s/s/s-a/a/a-t/t/t-a/a/a-n/n/n-a/a" or "a/a-cl/cl/cl-k/k/k-a/a/a-s/s/s-a/a/a-cl/cl/cl-t/t/t-a/a/a-n/n/n-a/a"; and a trigram example would be "a-a-a/a-cl-cl/cl-cl-cl/cl-cl-k/cl-k-k/k-k-a/a-a-a/a-a-s/s-s-s/s-a-a/.../t-a-a/a-a-n/n-n-n/n-a-a/a-a-a". A phoneme piece may also be obtained by decomposing a phoneme in time series at arbitrary positions such as its first, middle, and last parts. Here /cl/ denotes the silent or unvoiced portion before pronunciation. These phonemes and phoneme pieces may both be changed, by any improvement, to phonemes, notation symbols, phonetic symbols, or pronunciation symbols representing arbitrary sounds, or to pieces of such notation, phonetic, or pronunciation symbols obtained by decomposing them in time series.
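The bigram phoneme-piece notation illustrated above can be generated mechanically from a phoneme string; a minimal sketch (without the /cl/ closure symbols) follows.

```python
# Minimal sketch of the bigram phoneme-piece notation described above:
# a phoneme string such as a/k/a/s/a/t/a/n/a is expanded by interleaving
# each phoneme with the transition piece between it and the next phoneme.

def to_phoneme_pieces(phonemes):
    pieces = []
    for i, p in enumerate(phonemes):
        pieces.append(p)                          # the phoneme itself
        if i + 1 < len(phonemes):
            pieces.append(p + "-" + phonemes[i + 1])  # transition piece
    return pieces

print("/".join(to_phoneme_pieces(["a", "k", "a", "s", "a", "t", "a", "n", "a"])))
```

Running this reproduces the bigram sequence a/a-k/k/k-a/... given in the paragraph above; a trigram variant would slide a window of three symbols instead of two.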
[0012] To explain the difference between phonetic symbol recognition using phonemes and phoneme pieces and ordinary speech recognition: unlike general speech recognition, phoneme recognition and phoneme-piece recognition do not perform vocabulary recognition that interprets meaning or content, and do not dynamically configure features and acoustic models in response to changes in language models such as words, grammar, and parts of speech. More specifically, because phoneme and phoneme-piece recognition use no language model related to grammar, the recognition result does not capture meaning; it is not converted into symbols carrying meaning such as kanji; homonyms and homophones with different written forms are not discriminated; parts of speech such as nouns and verbs are not discriminated according to context; and no morphological or syntactic analysis is performed. In this document, recognition using phonemes, phoneme pieces, pronunciation symbols, pronunciation-symbol pieces (as pronunciation symbols divided in time series), and phonetic symbols based on those symbol strings, including phoneme recognition and phoneme-piece recognition, is collectively referred to as phonetic symbol recognition.
[0013] Thus, recognition by phonemes and phoneme pieces analyzes the speaker's utterance using a static acoustic model for each phonetic symbol and evaluates only the match between the phonetic symbol string derived from the utterance and the phonetic symbol strings in the recognition dictionary. This keeps the recognition processing and the structure of the recognition dictionary simple, and because only the acoustic match is evaluated, identifier strings made up of phonetic or pronunciation symbols such as phonemes and phoneme pieces can be recognized even for words not registered in the dictionary and for exclamations.
[0014] A dynamic acoustic model that, as in the prior art, learns the speaker's utterance characteristics to improve performance may be used here; however, phoneme recognition and phoneme-piece recognition are characterized by not performing the kind of processing found in general speech recognition in which the acoustic model is switched dynamically depending on words or grammar.
[0015] Consequently, comparing phoneme strings or phoneme-piece strings against the registered dictionary contents makes it easy to detect unregistered phoneme strings or phoneme-piece strings, and a method is also conceivable in which efficient speech recognition is achieved by narrowing down candidate words on the basis of the phonetic symbol recognition result and then performing speech recognition again with general grammar taken into account.
[0016] Further, even when a word is not registered in the dictionary, the recognition method based on phonemes and phoneme pieces converts the unregistered word in the recognition target sentence into hiragana notation, converts the hiragana string, according to its transition states, into a phoneme string or phoneme-piece string based on prosody obtained from known information, and temporarily registers the resulting symbol string in the recognition dictionary. After the user's utterance has been recognized as a phoneme string or phoneme-piece string, the acquired string is compared with the phoneme strings or phoneme-piece strings in the recognition dictionary to measure the degree of match between the symbol strings and obtain a recognition result, and entries whose frequency of use as recognition results falls are deleted. This method enables speech recognition with a dynamic phoneme- or phoneme-piece-based dictionary structure that has a higher degree of freedom than conventional speech recognition.
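The temporary-registration cycle described in paragraph [0016] — expand an unregistered word's hiragana reading into a phoneme string, register it, match utterances against it, and delete entries whose usage frequency falls — can be sketched as below. The kana-to-phoneme table, the exact-match comparison, and the pruning threshold are illustrative assumptions:

```python
# Illustrative hiragana-to-phoneme fragment; a real table would cover all kana.
KANA_TO_PHONEMES = {"さ": "s a", "く": "k u", "ら": "r a"}

class TemporaryDictionary:
    """Sketch of the dynamic dictionary cycle in paragraph [0016]."""

    def __init__(self):
        self.entries = {}  # word -> {"phonemes": str, "hits": int}

    def register(self, word, kana_reading):
        # Expand the hiragana reading into a phoneme string and register it.
        phonemes = " ".join(KANA_TO_PHONEMES[k] for k in kana_reading)
        self.entries[word] = {"phonemes": phonemes, "hits": 0}

    def recognize(self, uttered_phonemes):
        # Measure the match (exact match here; DP matching in practice).
        for word, entry in self.entries.items():
            if entry["phonemes"] == uttered_phonemes:
                entry["hits"] += 1
                return word
        return None

    def prune(self, min_hits=1):
        # Delete entries whose frequency of use has fallen below the threshold.
        self.entries = {w: e for w, e in self.entries.items()
                        if e["hits"] >= min_hits}

d = TemporaryDictionary()
d.register("桜", "さくら")
print(d.recognize("s a k u r a"))  # → 桜
```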
[0017] At this time, the acoustic model in phoneme or phoneme-piece units may also be retrained to match the user's utterances; that is, acoustic information obtained from the user's speech may be reused as teacher information and relearning for recognition may be performed so that recognition accuracy is improved with a dynamic acoustic model dictionary that does not depend on grammar or words.
Disclosure of the Invention
Problems to Be Solved by the Invention
[0018] In conventional speech recognition technology, abbreviations of words uttered in human dialogue, exclamations (interjections) such as "oh" (おー) and "hmm" (うーん), and coined words vary greatly with the times and the environment; in content information in particular, dynamic proper nouns that depend on trends, such as product names and actor names, were inefficient to register in a recognition dictionary. Although this has long been recognized as a problem in putting large-scale and highly variable speech recognition to practical use, repeatedly distributing a recognition dictionary including acoustic models and grammar models is relatively difficult because of the volume of information involved, so recognition that depends on vocabulary not registered in the recognition dictionary was virtually impossible.
[0019] Moreover, conventional speech recognition generally requires learning of prosodic models and grammar models, and such processing procedures pose a problem for recognizing dictionary-unregistered words such as the aforementioned coined words, buzzwords, and proper nouns: learning the prosody associated with such unregistered words and learning grammar models from co-occurrence relations between words is difficult.
[0020] Furthermore, with conventional markup languages, speech information other than audio synchronized with the video and audio of the content itself could not be made a target of search or operation. And since providing users with information including phoneme symbol strings required recognizing the speech information in advance and storing phonemes in association with word IDs (word identifiers), there was no easy way to provide or operate on phoneme strings and phoneme-piece strings for unspecified words.
[0021] In addition, the technology disclosed in Patent Document 3 mentioned above presents no method for recognizing words that are not registered in the dictionary. The technology disclosed in Patent Document 4 cannot perform vocabulary-independent speech recognition and must take measures such as learning a prosodic model for each unknown word, so it could not realize speech recognition with a high degree of freedom. Further, the technology disclosed in Patent Document 5 has the problem that its speech recognition method does not differ from conventional speech recognition methods and cannot perform recognition using phonemes or phoneme pieces.
[0022] In view of these problems, an object of the present invention is to provide an information processing apparatus and the like that, when performing speech recognition on words and character strings contained in content information, can realize more appropriate speech recognition by using recognition dictionary information based on phonetic symbols, employing phonetic symbol recognition with phonemes and phoneme pieces, even when no word model, acoustic model, grammar model, or part-of-speech information is registered in the speech recognition dictionary.
Means for Solving the Problem
[0023] To solve the above problems, an information processing apparatus according to a first aspect of the invention comprises: content information acquisition means for acquiring content information including character information and/or meta information; recognized phonetic symbol string detection means for detecting, from the content information acquired by the content information acquisition means, a recognized phonetic symbol string composed of phonetic symbols; and recognition dictionary information generation means for generating recognition dictionary information using the recognized phonetic symbol string.
[0024] An information processing apparatus according to a second aspect comprises: content information acquisition means for acquiring content information including character information and/or meta information; expansion-target character string detection means for detecting an expansion-target character string from the content information acquired by the content information acquisition means on the basis of the character information and/or meta information; phonetic symbol storage means for storing character strings and phonetic symbols in association with each other; phonetic symbol conversion means for converting the expansion-target character string into a recognized phonetic symbol string by referring to the phonetic symbol storage means; and recognition dictionary information generation means for generating recognition dictionary information using the recognized phonetic symbol string.
[0025] According to a third aspect, the information processing apparatus of the second aspect further comprises content information storage means for storing the content information with the phonetic symbols converted by the phonetic symbol conversion means added to it.
[0026] According to a fourth aspect, the information processing apparatus of any one of the first to third aspects further comprises transmission means for transmitting the content information stored by the content information storage means, together with the recognition dictionary information generated on the basis of that content information, to another information processing terminal.
[0027] According to a fifth aspect, the information processing apparatus of any one of the first to fourth aspects further comprises: voice input means for inputting speech; feature quantity extraction means for extracting feature quantities of the speech input by the voice input means; feature quantity phonetic symbol conversion means for converting the feature quantities extracted by the feature quantity extraction means into phonetic symbols; and processing execution means for evaluating the phonetic symbols converted by the feature quantity phonetic symbol conversion means against the phonetic symbols constituting the recognized phonetic symbol strings included in the recognition dictionary information, and executing a predetermined process corresponding to the most similar phonetic symbols.
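The fifth aspect's recognize-then-execute flow might be dispatched as in the sketch below. The similarity measure (`difflib` ratio), the phoneme strings, and the actions are all illustrative assumptions, and the actual feature extraction from audio is outside the scope of this sketch:

```python
import difflib

# Illustrative recognition dictionary: phoneme string -> predetermined process.
ACTIONS = {
    "s a i s e i": lambda: "play",  # 再生 (playback)
    "t e i s i": lambda: "stop",    # 停止 (stop)
}

def execute_best_match(recognized_phonemes):
    """Evaluate the recognized phoneme string against every dictionary
    entry and execute the process tied to the most similar one."""
    best = max(
        ACTIONS,
        key=lambda entry: difflib.SequenceMatcher(
            None, entry, recognized_phonemes).ratio(),
    )
    return ACTIONS[best]()

print(execute_best_match("s a i s e e i"))  # noisy input, closest to 再生 → play
```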
[0028] According to a sixth aspect, in the information processing apparatus of the fifth aspect, the content information includes phoneme information and/or phoneme-piece information, and the processing execution means evaluates the phonetic symbols converted by the feature quantity phonetic symbol conversion means against the phonetic symbols constituting the recognized phonetic symbol strings included in the recognition dictionary information, and presents information to the user by speech utterance corresponding to the most similar phonetic symbols.
[0029] According to a seventh aspect, in the information processing apparatus of any one of the first to sixth aspects, the phonetic symbols are phonemes or phoneme pieces.
[0030] According to an eighth aspect, in the information processing apparatus of any one of the first to sixth aspects, the process to be executed is an authentication process accompanying phoneme recognition.
[0031] A program according to a ninth aspect causes a computer to realize: a markup language interpretation step of interpreting information described using a markup language; an attribute acquisition step of acquiring an attribute specified by the interpretation; a phonetic symbol extraction step of extracting a phonetic symbol string and/or phoneme string and/or phoneme-piece string associated with the attribute acquired in the attribute acquisition step; and a dictionary change step of changing, through the phonetic symbol extraction step, the phoneme string dictionary used in a phoneme recognition unit.
[0032] A program according to a tenth aspect causes a computer to realize: a markup language interpretation step of interpreting information described using a markup language; an attribute acquisition step of acquiring an attribute specified by the interpretation; a phonetic symbol extraction step of extracting a phonetic symbol string and/or phoneme string and/or phoneme-piece string associated with the attribute acquired in the attribute acquisition step; an information type evaluation step of evaluating, on the basis of the attribute acquired in the attribute acquisition step, the type of information the user inputs; and a dictionary change step of changing, through the information type evaluation step, the phoneme string dictionary used in the phoneme recognition unit.
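The ninth aspect's chain of steps — interpret the markup, acquire attributes, extract the associated phoneme strings, and change the recognizer's dictionary — can be sketched roughly as follows. The `<item>` tag and the `label`/`phoneme` attribute names are assumptions for illustration, not notation fixed by the patent:

```python
import xml.etree.ElementTree as ET

# Illustrative markup carrying phoneme strings as tag attributes.
markup = """<scene title="flowers">
  <item label="桜" phoneme="s a k u r a"/>
  <item label="梅" phoneme="u m e"/>
</scene>"""

def build_dictionary(xml_text):
    """Interpret the markup, acquire each element's phoneme attribute,
    and return the phoneme-string dictionary handed to the recognizer."""
    root = ET.fromstring(xml_text)
    return {el.get("phoneme"): el.get("label") for el in root.iter("item")}

print(build_dictionary(markup))
# {'s a k u r a': '桜', 'u m e': '梅'}
```

Switching scenes would simply mean calling `build_dictionary` on the markup of the new scene, which is the dictionary change step.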
Effects of the Invention
[0033] According to the present invention, in order to use an information processing apparatus employing phoneme recognition, the phoneme dictionary needed to recognize the provided content information is acquired from a markup language associated with, or included in, the content information, so that unspecified words relating to the displayed content can be handled. Accordingly, for situations where unspecified words are likely to occur frequently, such as product sales, and whether the processing is performed on a standalone device or in a server-client environment, the invention seeks to solve the problem by describing phoneme strings, phoneme-piece strings, and the names of various identifiers as tag attributes in the markup language, and by making it possible to specify an utterance phoneme dictionary per content image, per page of text, per frame within a document structure, per frame as a single frame of a moving image, or per scene spanning multiple frames of a moving image.
[0034] Further, by expanding the keywords used for these operations into phonemes, identifiers using phoneme strings or phoneme-piece strings for voice operation can be embedded, in association with any markup language or script, as variables, attributes, or specific tags in distribution file formats and markup languages transmitted to the user such as HTML, XML, RSS, EPG, BML, MPEG7, and CSV. This easily enables indexing by voice, the distribution and sharing of voice control information with which users acquire, browse, and operate on information, and the acquisition of voice control information on the terminal side, thereby solving the problem.
[0035] In implementing the conventional technique of recognizing unspecified words with phonemes and phoneme pieces, the present invention exploits the fact that, even for the diverse and changing content of the Internet environment, the words appearing in any one scene of content information are limited, and provides a dynamic dictionary construction method based on phoneme strings and phoneme-piece strings, thereby realizing speech recognition for unspecified words that uses no prosodic or grammar models and improving convenience.
[0036] In the case of MPEG7, within the tags representing scenes of video information, the scene name, actor names, and cast names are described with phoneme symbol strings or phoneme-piece symbol strings through attributes, variables, and tag-based range specification, so that portions other than the recognized portions of the audio stream are described in the markup language. By performing a search for an arbitrary actor name or cast name with phoneme search technology, the phoneme strings appropriate to each scene can be obtained from the markup language information, so an apparatus capable of arbitrary instructions and searches can be realized and the problem solved.
[0037] In the case of HTML, the problem is solved by methods such as: providing variables and attributes containing phoneme symbol strings in target links or CGI notation; converting the range enclosed by a specific tag into a phoneme string and embedding it as a tag variable or attribute; providing variables and attributes for each table element of the table tags surrounding selected products and giving each element tag a name as a phoneme string symbol in a variable or attribute; and giving phoneme strings as variables or attributes of form tags and input tags and then, on the basis of the given phoneme string, transmitting information or transitioning to the next page.
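As one possible reading of the HTML case in paragraph [0037], a phoneme string placed in a link's attribute lets an utterance select the page to transition to. In this sketch the `data-phoneme` attribute name and the page URLs are illustrative assumptions:

```python
from html.parser import HTMLParser

# Illustrative page: each link carries its name as a phoneme string.
html_doc = """
<a href="/sakura" data-phoneme="s a k u r a">桜の商品</a>
<a href="/ume" data-phoneme="u m e">梅の商品</a>
"""

class PhonemeLinkParser(HTMLParser):
    """Collect href targets keyed by the phoneme string on each link."""

    def __init__(self):
        super().__init__()
        self.links = {}  # phoneme string -> href

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "data-phoneme" in attributes:
            self.links[attributes["data-phoneme"]] = attributes["href"]

parser = PhonemeLinkParser()
parser.feed(html_doc)
print(parser.links["u m e"])  # → /ume
```

A recognized utterance matching "u m e" would then drive the transition to `/ume`.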
[0038] Phoneme strings and phoneme-piece strings may also be distributed via RSS; a method may be used in which IDs are associated with keywords using tags and a CSV file associating each ID with a phoneme string or phoneme-piece string is provided as recognition dictionary information to specify the keywords to be recognized; and personal authentication by passphrase may be performed by associating image recognition dictionaries for faces, fingerprints, and the like with recognition dictionaries carrying proper nouns as phoneme- or phoneme-piece-based phonetic symbol strings and with per-speaker acoustic models based on phonemes or phoneme pieces.
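A CSV recognition-dictionary file of the kind mentioned in paragraph [0038] might associate a keyword ID, the keyword, and its phoneme string per row; the column layout below is an illustrative assumption:

```python
import csv
import io

# Illustrative CSV dictionary: keyword ID, keyword, phoneme string.
csv_text = """id,keyword,phonemes
1,桜,s a k u r a
2,梅,u m e
"""

def load_dictionary(text):
    """Parse a CSV recognition dictionary into id -> (keyword, phonemes)."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["id"]: (row["keyword"], row["phonemes"]) for row in reader}

print(load_dictionary(csv_text)["1"])  # → ('桜', 's a k u r a')
```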
[0039] In this way, by acquiring the contents of a phoneme- or phoneme-piece-based recognition dictionary externally, as markup language attributes, arbitrary tags, or dictionary files, operation of the information processing apparatus becomes possible and the problem can be solved.
[0040] That is, since words not explicitly contained in the content information are not included in the recognition dictionary unless specified, the probability of misrecognition is reduced. At the same time, a highly convenient user interface can be realized by extending existing markup languages with phoneme and phoneme-piece notation, and by using dictionary information of phonemes and phoneme pieces attached to or associated with the markup language or content, for purposes such as voice operation, device control by phoneme strings and phoneme-piece strings, highly versatile personal authentication in which the authentication conditions involving images and audio can be varied according to the type of information, and information exchange between information processing apparatuses.
Brief Description of Drawings
[0041] FIG. 1 is a block diagram of an information processing apparatus using the present invention.
FIG. 2 is a diagram showing an example of the data structure of recognition dictionary information.
FIG. 3 is a diagram showing the operation flow of the phonetic symbol assignment process.
FIG. 4 is a diagram for explaining the operation of the phonetic symbol assignment process.
FIG. 5 is a diagram for explaining the operation of the phonetic symbol assignment process.
FIG. 6 is a diagram for explaining the operation of the phonetic symbol assignment process.
FIG. 7 is a diagram for explaining the operation of the phonetic symbol assignment process.
FIG. 8 is a diagram showing the operation flow of the recognition dictionary update process.
FIG. 9 is a diagram showing a different data structure of recognition dictionary information.
FIG. 10 is a diagram showing the operation flow of the recognition dictionary information update process.
FIG. 11 is a diagram for explaining the operation of the recognition dictionary information update process.
FIG. 12 is a diagram for explaining the operation of the recognition dictionary information update process.
FIG. 13 is a diagram for explaining the operation of the recognition dictionary information update process.
FIG. 14 is a diagram for explaining the operation of the recognition dictionary information update process.
FIG. 15 is a diagram showing the operation flow when applied to a server-client model.
FIG. 16 is a diagram showing the operation flow when applied to a server-client model.
FIG. 17 is a diagram for explaining a modification of the present embodiment.
FIG. 18 is a diagram for explaining a modification of the present embodiment.
Explanation of Reference Numerals
[0042]
1 Information processing apparatus
10 Control unit
20 Storage unit
202 Content information
204 Phonetic symbol conversion table
206 Recognition dictionary information
208 Phonetic symbol assignment program
210 Recognition dictionary information update program
212 Voice operation program
30 Communication unit
40 Input/output unit
50 Operation unit
60 Display unit
Best Mode for Carrying Out the Invention
[0043] The present invention can constitute information processing apparatuses that modify and save the markup language notation used for content information, save it for later use, or use the modified information as it is, as well as a distribution apparatus that distributes the modified information and a receiving terminal that receives it and uses it for recognition and for the operations and responses accompanying recognition. More specifically, as in the XML and HTML examples, these are methods of modifying information already described in a markup language by adding tags, variables, and attributes, then saving, modifying, and distributing it, and methods of receiving such information and operating the information processing apparatus.
[0044] <Example of Content Information>
Describing first the contents and content information that are the targets of the search and indexing performed using the present invention: it is generally well known that "content" refers to movies, dramas, photographs, news, animation, illustrations, paintings, music, promotional videos, novels, magazines, games, papers, textbooks, dictionaries, books, comics, catalogs, posters, broadcast program information, and the like; in the present invention, however, it may also be public information, map information, product information, sales information, advertising information, reservation status, viewing status, road conditions, questionnaires, surveillance camera video, satellite photographs, blogs, models, dolls, or robots, and it may further include information obtained from the cameras, microphones, and sensor inputs provided in those devices, as well as the names of such information, states, and situations, and the names of their abstract concepts, superordinate concepts, and subordinate concepts, expanded into symbol strings of phonemes or phoneme pieces.
[0045] The content may also be text that anticipates time-series changes of video, time-series changes of audio, or time-series changes of a reader's reading-aloud position, electronic information in HTML markup language notation, or search index information generated from these; the reading-aloud position may be interpreted as the time axis, with clauses, sentences, chapters, and passages captured as frames.
[0046] The content may also include meta information attached to the content, documents of character information, EPG and BML as program information, musical scales as score information, general still and moving images, polygon data, vector data, texture data, and motion data as three-dimensional information, still and moving images from visualized numerical data, content information intended for promotion and advertising, and so on, and it is composed of natural information including visual information, auditory information, character information, and sensor information.
[0047] A conventional method has been proposed in which the audio content of a piece of content is recognized and assigned a phoneme string using the <SpokenContent DS> tag of MPEG7 and the like, which describes phone lattices. However, because this method indexes by symbol strings based on recognition of the speech occurring within the content, it does not provide information in phoneme notation, phoneme-piece notation, or any other notation using pronunciation or phonetic symbols that would allow the user to search for content titles or performers by voice operation.
[0048] For this reason, unspecified words and proper nouns relating to names and expressions such as content titles and performers cannot necessarily be used in speech recognition. By expanding into phonemes the character information pertaining to the content information, such as scene descriptions, titles, and performer names, writing the result alongside the original, and embedding identifiers based on arbitrary pronunciation or phonetic symbols, including phoneme symbol strings, phoneme-piece symbol strings, and syllable symbol strings, into the MPEG information as tag variables and attributes as in the present invention, phoneme recognition technology can be applied to speech recognition.
[0049] That is, if the portion enclosed by the tags subject to speech processing in the markup language is an arbitrary character string, that character string is expanded into a phoneme symbol string or phoneme-piece symbol string so that it can be used for recognition with phoneme symbols or phoneme-piece symbols. The user's utterance may then be evaluated for a match against the recognized phoneme or phoneme-piece symbol strings, or the uttered phonemes may be converted into arbitrary phonetic characters so that phonetic characters are matched against one another; by evaluating the match with the phoneme symbol string based on the recognition result of the user's utterance, the string becomes the target of the user's operation or search. Characters and symbols written ideographically, such as at-marks and corner brackets, may be converted into phoneme symbols or phoneme-piece symbols via an appropriate phonetic symbol string, and for character strings from which multiple utterances can be inferred, multiple phoneme strings, phoneme-piece strings, or syllable symbol strings may be given, as in conventional speech recognition.
[0050] Then, the recognized phoneme string or phoneme-piece string is given to a database as a query and searched using a symbol-string matching method such as DP or HMM. The phoneme strings or phoneme-piece strings are added to the search results, which are presented as a list for the user to browse; a product is selected on the basis of the phoneme strings contained in the search results, and by detecting from the recognition dictionary, in the course of recognition, the phoneme strings or phoneme-piece strings for carrying out charging and purchase procedures according to the acquired control method, a series of sales-related processes can be performed. Authentication and charging may also be performed by combining a password with a phoneme recognition dictionary built from the user's utterance speech features and with recognition dictionaries built from image features such as fingerprints, irises, faces, or palm prints. In this way, search, browsing, sales, authentication, and charging procedures for goods, rights, products, and content information can be realized.
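The DP-based symbol-string matching step above can be sketched as follows. This is a minimal illustration, not the patent's actual implementation: it uses plain edit distance over slash-delimited phoneme strings, and the dictionary entries and scoring formula are hypothetical.

```python
def edit_distance(a, b):
    """Dynamic-programming (DP) edit distance between two phoneme lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(a)][len(b)]

def search(query, database):
    """Rank database entries by similarity to a recognized phoneme string."""
    scored = []
    for word, phonemes in database:
        d = edit_distance(query, phonemes)
        similarity = 1.0 - d / max(len(query), len(phonemes))
        scored.append((similarity, word, phonemes))
    scored.sort(reverse=True)
    return scored

# Hypothetical dictionary: word -> phoneme string (slash-delimited as in the text).
db = [("otoku campaign", "o/t/o/k/u/ky/a/m/p/e/e/N".split("/")),
      ("main", "m/e/i/n".split("/"))]
# A slightly misrecognized utterance still ranks the intended entry first.
results = search("o/t/o/k/u/ky/a/m/p/e/N".split("/"), db)
```

A real system would replace the uniform edit costs with phoneme confusion probabilities (or an HMM score), but the ranking structure is the same.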
[0051] In this way, by switching the recognition dictionary of identifiers or identifier strings based on phonetic symbols, which is needed to evaluate the arbitrary words that should be obtained as recognition results according to the position within the information, such as the content playback point, page, or display location, a highly versatile dictionary configuration for unspecified words in diverse usage environments becomes possible. By presenting words recognized on the basis of a phonetic-symbol recognition dictionary using dynamically configured phonemes and phoneme pieces, performing arbitrary processing, acquiring advertisement URLs, presenting advertisements, or operating devices, highly convenient information presentation can be realized for users in the distribution of content and advertisements. Furthermore, in an Internet environment such as the Web, search conditions can be specified and transmitted by using phoneme strings or phoneme-piece strings as the variables for CGI POST and GET processing, and Web page switching and operations can be performed.
[0052] The procedure for expanding Japanese into phonemes is well known: a "wakachigaki" (word segmentation) program converts the mixed kanji-kana notation obtained from ideograms into phonetic characters ("kana notation"), after which pronunciation symbols such as romaji corresponding to the kana notation are used to perform phoneme symbol conversion or phoneme-piece symbol conversion, constructing the symbol string used for recognition. A description in syllable symbols can be produced by a similar procedure.
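The kana-to-phoneme stage of that procedure can be sketched as below. This is a toy illustration under stated assumptions: the segmentation ("wakachigaki") step is assumed to have already produced kana, and the mapping table covers only the few kana needed for the example; a real system would use a full pronunciation dictionary.

```python
# Hypothetical kana -> phoneme mapping (a real system would use a complete
# pronunciation dictionary covering all kana and context rules).
KANA_TO_PHONEMES = {
    "メ": ["m", "e"], "イ": ["i"], "ン": ["n"],
    "カ": ["k", "a"], "ナ": ["n", "a"],
}

def kana_to_phoneme_string(kana_text):
    """Expand kana notation into a slash-delimited phoneme symbol string."""
    phonemes = []
    for ch in kana_text:
        phonemes.extend(KANA_TO_PHONEMES[ch])
    return "/".join(phonemes)

# "メイン" ("main") expands to the phoneme string used later in the text.
main_phonemes = kana_to_phoneme_string("メイン")
```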
[0053] For English, text can be converted into a phoneme symbol string using English phoneme symbols or pronunciation symbols, or using international phonetic symbols. Any language may be used, together with phoneme symbols or phoneme-piece symbols suited to that language; since pronunciation dictionaries exist for various languages, phonemes can serve as identifiers based on the phonetic symbols corresponding to each language's pronunciation symbols, phoneme pieces can serve as identifiers obtained by decomposing those phonetic symbols along the time axis, and these phonetic symbols can be associated with numbers and written in an appropriate character code, making it possible to distribute information using a markup language based on arbitrary phonetic symbols.
[0054] At this time, if necessary, the phoneme symbol string may be converted into a phoneme-piece symbol string to improve convenience in searching. Environmental-sound identifiers, musical-scale identifiers, image identifiers, and motion identifiers may also be handled: environmental sounds and scales may be represented as environmental-sound lattices or scale lattices, sections for image identifiers and motion identifiers may be provided in the MPEG stream, and phoneme strings or phoneme-piece strings may be given on the basis of the pronunciations of the names of these identifiers.
[0055] 次に、より具体的な手順について図を用いて説明する。  Next, a more specific procedure will be described with reference to the drawings.
〔Device configuration〕
First, the device configuration of the information processing device 1 to which the present invention is applied will be described with reference to FIG. 1. Here, the information processing device 1 is realized by an information processing apparatus such as a general-purpose computer, a dedicated terminal, or a portable mobile terminal.
[0056] As shown in FIG. 1, the information processing device 1 comprises a control unit 10, a storage unit 20, a communication unit 30, an input/output unit 40, an operation unit 50, and a display unit 60. Each functional unit is connected to the control unit 10 via a bus. The operation unit 50 and the display unit 60 may be freely detachable devices.
[0057] First, the communication unit 30 is a functional unit for exchanging information with other devices via a LAN (Local Area Network) or a communication network such as the Internet. The communication unit 30 is generally configured by a device capable of transmitting and/or receiving content information, such as an Ethernet (registered trademark) interface, a modem, a wireless LAN, or a cable television device.
[0058] Next, the input/output unit 40 is a functional unit for inputting and outputting information to and from other devices or the outside, and is composed of, for example, input devices such as a microphone, a scanner, a capture board, a camera, and sensors, and output devices such as a speaker, a printer, a modeling device, and a display device.
[0059] The storage unit 20 is a functional unit that acquires and stores information within the information processing device 1 and stores the programs executed by the control unit 10. The storage unit 20 is composed of ROM and RAM as semiconductor storage elements, a hard disk and magnetic tape as magnetic storage media, a CD (Compact Disk) and DVD (Digital Versatile Disk) as optical storage media, and the like.

[0060] Specifically, the storage unit 20 stores content information 202, a phonetic symbol conversion table 204, and recognition dictionary information 206, and holds a phonetic symbol addition program 208, a recognition dictionary information update program 210, and a voice operation program 212.
[0061] The content information 202 stores content acquired from outside via the communication unit 30 and content input via the input/output unit 40. The phonetic symbol conversion table 204 is a table referred to when content information is converted into phonetic symbols; for example, it is a table in which character strings are stored in association with phonetic symbols such as phonemes.
[0062] The recognition dictionary information 206 stores the relationship between words and phoneme strings, phoneme-piece strings, and the like (hereinafter these phoneme strings and the like are referred to as phonetic symbols). For example, as shown in FIG. 9, the item "Title", the target word "otoku campaign" (discount campaign), and the phoneme string (phonetic symbols) "o/t/o/k/u/ky/a..." expanded from the target word are stored in association with one another. In addition to such items, the recognition dictionary information 206 may register proper nouns such as a "product name", a "product nickname", and a "phoneme string based on the product nickname", and constitutes a recognition dictionary that realizes diverse recognition by dynamically replacing, by means of phoneme strings or phoneme-piece strings, words not registered in a general language dictionary, including exclamations and invectives.
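The dictionary structure described above can be sketched as a simple in-memory table. The "Title" entry follows the example given for FIG. 9; the other entries, and the field names, are hypothetical illustrations, not the patent's actual data layout.

```python
# Minimal sketch of the recognition dictionary information 206:
# item / target word / phoneme string, as in the FIG. 9 example.
recognition_dictionary = [
    {"item": "Title",
     "word": "otoku campaign",
     "phonemes": "o/t/o/k/u/ky/a/m/p/e/e/N"},   # hypothetical full expansion
    {"item": "Product nickname",
     "word": "choco-pan",                        # hypothetical proper noun
     "phonemes": "ch/o/k/o/p/a/N"},
]

def register(dictionary, item, word, phonemes):
    """Dynamically add or replace an entry, e.g. for words absent from a
    general language dictionary (exclamations, slang, proper nouns)."""
    dictionary[:] = [e for e in dictionary if e["word"] != word]
    dictionary.append({"item": item, "word": word, "phonemes": phonemes})

register(recognition_dictionary, "Exclamation", "wow", "w/a/o")
```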
[0063] The operation unit 50 is a functional unit that receives operation inputs from the user, and is composed of input devices that input information accompanying operations, such as a keyboard, a mouse, a camera, and a remote controller (including wireless). The display unit 60 is a functional unit that outputs information from the information processing device 1 so that the user can view it, and is configured using a display device, including a display or a projector, that performs displays relating to operations.
[0064] The control unit 10 calls the various programs stored in the storage unit 20 to execute the processing that realizes the functions corresponding to each program and to control the functional units of the information processing device 1.
[0065] The control unit 10 reads and executes the phonetic symbol addition program 208 from the storage unit 20, thereby realizing the phonetic symbol addition processing described later. Likewise, by reading and executing the recognition dictionary information update program 210 from the storage unit 20, it realizes the recognition dictionary information update processing described later, and by reading and executing the voice operation program 212, it realizes the voice operation processing.
[0066] By executing these programs, the control unit 10 can perform phoneme and phoneme-piece recognition processing, acquire tag information, tag identifiers, and phoneme strings and phoneme-piece strings, and select words by evaluating the similarity between the phoneme string or phoneme-piece string obtained by phoneme/phoneme-piece recognition of the user's uttered speech and the phoneme strings or phoneme-piece strings associated with dictionary registration information. It may also acquire a speech waveform from a microphone via the input/output unit and use it for speech recognition, or provide information to the user through a speaker by speech synthesis using the phoneme strings or phoneme-piece strings acquired by the present invention.
[0067] The control unit 10 is usually configured using a CPU (Central Processing Unit), a DSP, an ASIC, or the like, and can also be realized by combining these arbitrarily.
[0068] <Operation>
Next, the operation processes executed by the information processing device 1 will be described.
[0069] <Phonetic symbol addition processing>
First, the phonetic symbol addition processing will be described with reference to FIG. 3. FIG. 3 is an operation flow explaining the phonetic symbol addition processing, which is realized by the control unit 10 reading and executing the phonetic symbol addition program 208 in the storage unit 20.
[0070] First, the control unit 10 acquires the content information 202 received by the communication unit 30, or input via the input/output unit 40 and stored (step S301).
[0071] Next, an expansion target character string is detected from the read content information 202 (step S302). Here, the expansion target character string is a character string (information) for identifying a change in the display control method; taking a markup language as an example, it is delimited by a tag such as <A>, indicating a link, or <TITLE>, indicating a title. The character string to be expanded into phonetic symbols such as phonemes or phoneme pieces is detected in the range enclosed by these tags.
[0072] Next, the expansion target character string is expanded into a phoneme string or phoneme-piece string (phonetic symbols) consisting of the pronunciation symbols corresponding to its utterance (step S303). Thereby, for example, a title or the name of a link destination is converted into phonetic symbols. When converting the expansion target string into phonetic symbols, various methods of constructing the phonetic symbol string to be registered in the dictionary are conceivable: acquiring a character string by referring to other attributes contained in the tag, such as the ALT attribute or ID attribute, converting it into a phoneme string or phoneme-piece string, and constructing recognition dictionary information from the resulting phonetic symbol string; constructing phoneme or phoneme-piece strings from image file names, music file names, video file names, or document file names; constructing them from the tag attributes or tag-enclosed character strings described within image, music, video, or document files; constructing them from the character string enclosed by the tags themselves; or, using the link information associated with a tag as an attribute, constructing them from the name of the file at the link destination or the character information contained in that file.
[0073] Specifically, the conversion into phonetic symbols is performed using the phonetic symbol conversion table 204. For example, the character string "main" enclosed by title tags is converted into the phonetic symbols "m/e/i/n/" by referring to the phonetic symbol conversion table 204.
[0074] In some cases, an attribute already expanded into phonetic symbols is given in a tag of the content information itself without performing such expansion, so that a recognition phonetic symbol string for use in recognition is already configured. For example, after performing step S401 of acquiring content information, step S402 of detecting a "pronounce attribute" in the meta information accompanying the markup language information as shown in FIGS. 4 to 7 is performed; the phonetic symbol string of phonemes or phoneme pieces described as the variable of the detected pronounce attribute is extracted; and step S403 of registering the extracted phonetic symbol string as dictionary information, in association with the meta information in which the pronounce attribute was detected, is performed. By using the voice operation program, or by specifying the processing content or transition destination page through tags, CGI, or other meta information associated with a phonetic symbol string representing utterance sounds recognizable by the information processing device, arbitrary processing, procedures, and operations can be designated, realizing recognition using dynamic phonetic symbol strings.
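Steps S402 and S403 above can be sketched as a small attribute-extraction pass. The sample markup below is hypothetical, modeled on the pronounce-attribute examples the text attributes to FIGS. 5 to 7; a real implementation would use a proper markup parser rather than a regular expression.

```python
import re

# Hypothetical content information with a pre-expanded pronounce attribute.
content = '<title pronounce="o/t/o/k/u/ky/a/m/p/e/e/N">お得キャンペーン</title>'

def extract_pronounce(markup):
    """Step S402 sketch: return (tag name, phonetic symbol string) pairs for
    every pronounce attribute found in the markup."""
    pattern = r'<(\w+)[^>]*\bpronounce="([^"]+)"[^>]*>'
    return re.findall(pattern, markup)

# Step S403 sketch: register each extracted symbol string in association
# with the meta information (here, simply the tag name) it came from.
dictionary = {}
for tag, phonemes in extract_pronounce(content):
    dictionary[phonemes] = tag
```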
[0075] As a result, phonetic symbol storage processing is executed, and the recognition phonetic symbol string used for phonetic symbol recognition is stored (step S304). The phonetic symbol storage processing stores the phonetic symbols for recognition converted in step S303. For example, the following are executed: processing that extracts phonetic symbols already recorded as attributes in each tag, or expands the character string enclosed by the tags and adds phonetic symbols (phoneme strings or phoneme-piece strings) as a new attribute (step S304a); processing that adds to the content information tags or attributes indicating that each tag is a speech recognition target (step S304b); and processing that separates the proper nouns to be recognized, converts them into phonetic symbols, and constructs recognition phonetic symbol strings, thereby constructing and updating the recognition dictionary information 206 (step S304c). Thereby, processing that makes explicit both the content information and the recognition phonetic symbol strings, that is, the phoneme strings or phoneme-piece strings of the words to be used for recognition, is carried out.
[0076] Then, the control unit 10 updates and stores the changed content information 202, and updates and stores the recognition dictionary information for phonetic symbol recognition using the phonemes and phoneme pieces of the associated recognition phonetic symbol strings (step S305). This makes the changed content information available for recognizing the user's utterances and for distribution via the communication unit.
[0077] Although the above processing has been described as being executed by the information processing device 1, it may instead be executed on the side of the distribution device (server) that distributes the content information, reducing the processing burden of conversion into phoneme strings on the receiving side. When executed on the distribution device side, the distribution device distributes content information accompanied by voice control information in response to a content information request from the user. Accordingly, the information processing device 1 (terminal device) can acquire phoneme information classified according to the content's pages and frames, and arbitrary words can be used by voice with few restrictions.
[0078] Here, an operation example of the phonetic symbol addition processing will be described with reference to the drawings. First, FIG. 4 shows the content information 202 acquired by the information processing device 1. By executing step S301, the content information 202 is acquired from the communication unit 30 or the input/output unit 40 and stored in the storage unit 20.
[0079] Then, information related to the target tag is detected as the object of evaluation by phonetic symbols (phoneme strings/phoneme-piece strings) (step S302). The information in FIG. 4 is an example of content information using an RSS item section on which the extraction processing of step S302 is executed; the result of extracting the target character string from the item section and applying the conversion processing is shown in FIG. 5 or FIG. 6.
[0080] When a tag subject to expansion-target detection is found among the tags contained in the acquired content information, the character string in the range specified by that tag is detected. For example, in FIG. 4, "otoku campaign" (discount campaign), enclosed between the tags <title> and </title> indicating a title, is detected as the character string to be expanded into a phonetic symbol string. At this time, unnecessary bracket symbols may be deleted; by extracting this character string, an arbitrary title character string specified by the distribution side can be acquired.
[0081] The acquired character string is then confirmed and converted into a phonetic symbol string according to its pronunciation using the phonetic symbol conversion table 204. Then, as shown in FIG. 5, processing that newly appends, for example, a pronounce attribute as an attribute or variable of a tag described in the original content information 202 and writes the phonetic symbol string there (step S304a); processing that newly sets <pronounce>...</pronounce> tags as shown in FIG. 6; and processing that associates the phonetic symbols with the words or commands to be recognized, saves them as the recognition dictionary information 206, and describes and associates in the content information 202 the URL from which the recognition dictionary information 206 can be acquired, by means of a <META> tag or the like (step S304c), are executed. It thus becomes possible to append to, or associate with, the content information the phonetic symbol string information used for phonetic symbol recognition.
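The step S304a variant above (appending a pronounce attribute to an existing tag) can be sketched as follows. The RSS fragment is hypothetical, modeled on the FIG. 4 example, and the string-rewriting approach is an illustration only; a production system would modify a parsed document tree instead.

```python
import re

def add_pronounce(markup, tag, phonemes):
    """Step S304a sketch: append a pronounce attribute to the first opening
    <tag> in the markup. The text-to-phoneme conversion itself is assumed
    to have been done already (via the conversion table 204)."""
    return re.sub(r"<%s>" % tag,
                  '<%s pronounce="%s">' % (tag, phonemes),
                  markup, count=1)

# Hypothetical RSS item, modeled on FIG. 4.
rss_item = "<item><title>お得キャンペーン</title></item>"
annotated = add_pronounce(rss_item, "title", "o/t/o/k/u/ky/a/m/p/e/e/N")
```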
[0082] Then, by distributing the content information 202 modified as described above directly to other terminals, or by using it within the device, operations based on phonetic symbols (phonemes, phoneme pieces, pronunciation symbols, and pronunciation-symbol pieces) can be performed.
[0083] In MPEG-7, for example, as shown in FIG. 7, a <Pronounce DS> tag is added to describe the phoneme string of the content type, to note that "Rainsound" occurs as a background environmental sound, and to add the phoneme symbol string of a cast name, pronounce="t/o/m/u", as an attribute of the notation concerning a performer placed in a phrase tag. For HTML, an embodiment associated with buttons and links is presented; the range enclosed by arbitrary tags may be searched as a keyword and used for content browsing and search, pronunciations for operations may be provided as phoneme symbol strings, and the acquired phonemes, phoneme pieces, and phonetic symbol strings may be used in the word dictionary of utterance phonemes and utterance phoneme pieces for speech-synthesis utterance in the information processing device 1.
[0084] Thus, according to the phonetic symbol addition processing, by adding phonetic symbols for performing voice operations on the basis of the acquired content information, content information containing the phonetic symbol string information to be incorporated into the phonetic symbol dictionary used for phonetic symbol recognition can be configured.

[0085] <Recognition dictionary information update processing>
Next, the recognition dictionary information update processing for the case where phonetic symbols have already been added to the content information 202 will be described with reference to FIG. 8. FIG. 8 shows the operation flow of the recognition dictionary information update processing, which is realized by the control unit 10 executing the recognition dictionary information update program 210 in the storage unit 20.
[0086] First, the control unit 10 acquires the content information 202 (step S401). Next, the control unit 10 extracts phonetic symbol strings from the read content information 202 (step S402). In the present embodiment, the tags contained in the content information 202 (the portions enclosed by "<" and ">") are extracted, thereby identifying and extracting the tags that contain phonetic symbol strings.
[0087] For example, the control unit 10 extracts the pronounce attribute of the title tag <TITLE>, thereby extracting the phoneme symbol string "o/t/o/k/u..." given as its argument as the phonetic symbols. The extracted phoneme string is then stored as the page title and registered in the recognition dictionary information 206 (step S403).
[0088] By switching these dictionaries according to the display content, which changes as pages are switched, erroneous recognition of words not present in the display content can be avoided, improving the speech recognition rate and hence operability. The dictionary information that associates recognized words with control methods may also be updated by acquiring the necessary phoneme strings from, for example, the URL of dictionary information associated with the content information through arbitrary tags or character strings.
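The page-dependent dictionary switching described above can be sketched as follows. The page names and dictionary entries are hypothetical; the point is only that words absent from the currently displayed page are excluded from matching.

```python
# Hypothetical per-page recognition dictionaries: phoneme string -> word.
page_dictionaries = {
    "top":    {"o/t/o/k/u/ky/a/m/p/e/e/N": "otoku campaign"},
    "detail": {"k/o/u/ny/u/u": "purchase"},
}

class Recognizer:
    def __init__(self, dictionaries):
        self.dictionaries = dictionaries
        self.active = {}

    def switch_page(self, page):
        """Activate only the dictionary for the currently displayed page,
        so off-screen words cannot be misrecognized."""
        self.active = self.dictionaries.get(page, {})

    def lookup(self, phonemes):
        return self.active.get(phonemes)

r = Recognizer(page_dictionaries)
r.switch_page("top")
assert r.lookup("k/o/u/ny/u/u") is None   # not on this page
r.switch_page("detail")
```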
[0089] If the distributed content information contains no phoneme strings or phoneme-piece strings, or no related phoneme dictionary is associated with it, symbol strings of phonemes and phoneme pieces may be constructed from the content according to the above-described procedure for embedding phoneme strings and phoneme-piece strings, and dictionary information may be built from them. Dictionary information constructed in this way may be reused where possible, by detecting whether the same words are used.
[0090] When configuring the control recognition dictionary, for control commands whose phoneme symbol strings do not change, a dictionary that associates the ID related to each control command with the command word and its phoneme string, as shown in FIG. 9, may be used. Content information describing the ID that identifies the control command word as a command discrimination ID is distributed or recorded on a storage medium. Thereafter, in the information associated with the content information received via the communication unit or acquired from the storage medium, the command word is identified from the command discrimination ID described where the phoneme information or phoneme-piece information would otherwise appear, and the identified command word is converted into phonemes or phoneme pieces to construct the phoneme string or phoneme-piece string used for recognition. Alternatively, a hash value based on the phoneme string or phoneme-piece string associated with the control command may be used as the command discrimination ID. In this way, the phoneme string representation at transmission time, which tends to become redundant, can be shortened and communication efficiency improved.
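The hash-based command discrimination ID above can be sketched as follows. The choice of SHA-1 truncated to eight hex digits, and the command table entries, are hypothetical illustrations; the text only requires that the ID be derived from the phoneme string and be shorter than it.

```python
import hashlib

def command_id(phoneme_string):
    """Derive a short command-discrimination ID from a hash of the phoneme
    string (hash scheme and length are hypothetical choices)."""
    digest = hashlib.sha1(phoneme_string.encode("utf-8")).hexdigest()
    return digest[:8]   # short ID transmitted instead of the full string

# Command table in the spirit of FIG. 9: phoneme string <-> command word.
commands = {"p/l/e/i": "play", "s/t/o/q/p/u": "stop"}  # hypothetical entries
id_table = {command_id(p): (word, p) for p, word in commands.items()}

# Receiver side: recover the command word and phonemes from the short ID.
received = command_id("p/l/e/i")
word, phonemes = id_table[received]
```

Because the sender and receiver share the command dictionary, only the fixed-length ID needs to travel in the content information.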
[0091] Regarding content information 202 obtained via a storage medium or communication means and stored in the storage unit, if no conversion to or addition of phonetic symbols has been performed, the content is interpreted by the method described above and converted into phonetic symbols expressed as identifier strings suited to the information processing apparatus 1. If phonetic symbols have already been written, converted, or updated for the content information 202, no further conversion or update of the content information 202 is necessary.
[0092] Depending on the circumstances of the content distributor and the user, these conversions may be performed on the server side before distribution, performed as appropriate by the client on received data, performed by a stand-alone apparatus converting information obtained from an external storage medium into a form usable by the apparatus itself, or performed by relay means such as a gateway or router.
[0093] <Voice operation processing>
Next, the voice operation processing is described with reference to FIG. 10. First, the control unit 10 obtains content information received by the communication unit 30 or the input/output unit 40, or content information 202 stored in the storage unit 20 (step S501).
[0094] Next, phonetic symbols composed of phonemes and phoneme pieces are extracted from the obtained content information (step S502). The recognition dictionary information 206 is then updated and registered based on the extracted phonetic symbols (step S503).
[0095] The apparatus then waits until a voice input based on the user's utterance arrives from the input/output unit 40 (step S504; No). When the user makes a voice input (step S504; Yes), the control unit 10 extracts feature quantities from the input voice (step S505). Phonetic symbols such as phonemes and phoneme pieces are then recognized from the extracted feature quantities, and the input is converted into phonetic symbols (step S506).
[0096] A match evaluation is then performed to determine how closely the phonetic symbols converted in step S506 match the phonetic symbols previously registered in the recognition dictionary (step S507). This match evaluation uses an evaluation function to score the degree of match against the standard acoustic and speech models, standard parameters, and standard templates stored in the storage unit of the apparatus, and specifies a phonetic symbol as the evaluation result. A phonetic symbol string is then specified by obtaining, in time series, a plurality of phonetic symbols specified by the match evaluation. The registered phonetic symbol string with the highest similarity to the specified string is taken as the phonetic-symbol recognition result, and device operations and search processing are executed according to the information associated with that result (step S508).
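The text does not fix a particular evaluation function for steps S507 and S508; one simple illustrative choice is normalized edit distance over phoneme symbols, sketched below with sample phoneme strings taken from the embodiment:

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def similarity(recognized: str, candidate: str) -> float:
    """Score in [0, 1]: 1.0 means the phoneme strings match exactly."""
    a, b = recognized.split("/"), candidate.split("/")
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# Registered phoneme strings from the embodiment of FIGS. 11-14.
dictionary = ["i/ch/i/gy/o/u/m/e", "k/i/a/n/sh/a", "ts/u/g/i"]
recognized = "i/ch/i/gy/o/u/m/o"  # last phoneme misrecognized

best = max(dictionary, key=lambda c: similarity(recognized, c))
assert best == "i/ch/i/gy/o/u/m/e"
```

Because the score is computed per phoneme rather than per word, a single misrecognized phoneme still yields the correct dictionary entry.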
[0097] Here, the processing associated with the recognition result includes, for example, generating character strings containing proper nouns through recognition of phonetic symbol strings according to the present invention, executing operation commands and searches for information or products, presenting information to the user as determined by recognition of a phonetic symbol string, and performing operations directed by the user. Concrete examples include switching pages in a web browser; operating a television or video recorder; responses by voice, text, image, or video from robots, navigation devices, computers, audiovisual equipment, or home appliances such as cookers, washing machines, and air conditioners; specifying search conditions; saving, changing, registering, and deleting information presented by the information processing apparatus; specifying and viewing advertisements and program content according to recognition results; and personal authentication based on keywords or utterance features. Further, personal authentication by password phrase may be performed by associating an image recognition dictionary for faces, fingerprints, and the like with a recognition dictionary of proper nouns expressed as phonetic symbol strings of phonemes or phoneme pieces and with per-speaker acoustic models based on phonemes or phoneme pieces; billing and service selection may then accompany such authentication.
[0098] Specifically, by having the speech synthesizer that answers the user's questions utter words formed from the phoneme strings and phoneme-piece strings registered in the recognition dictionary, the apparatus can make explicit which words are recognizable, carry out arbitrary operations according to the recognition result, present the recognized character strings or word strings, or display advertisements associated with phoneme strings or phoneme-piece strings, in combination with conventional speech recognition technology.
[0099] It is then determined whether to accept the next voice input (step S509). If voice input is to continue (step S509; Yes), processing returns to step S504 to wait for further input. If no voice input is made (step S509; No), it is determined whether to obtain the next content information (step S510). If the next content information is to be obtained (step S510; Yes), processing repeats from step S501 to acquire new content. If no new content information is obtained (step S510; No), processing ends and the apparatus waits for the user's next utterance, completing this series of processes.
[0100] That is, an apparatus using the present invention obtains from the acquired markup language information the locations where the user can perform voice operations, using identifiers expressed as phonetic symbols such as phonemes and phoneme pieces and the feature quantities for specifying those identifiers. If necessary, arbitrary identifiers related to images and motions, such as fingerprints, facial expressions, and palm prints, may be obtained and combined for use in personal authentication and the like, or for the responsive actions of agents and robots driven by recognition.
[0101] Then, based on the identifiers and feature quantities obtained from the user's utterances and inputs, selection processing conventionally performed by mouse operation is carried out: giving focus to an arbitrary row or column of a table tag, to a link, or to an operation button; overlapping the cursor; issuing the events accompanying these operations from the operating system to the browser; controlling other devices via communication means such as infrared, LAN, or telephone line; or changing the responsive actions of an agent or robot according to the recognized word. In this way, a series of processes accompanying recognition of phonetic symbol strings of phonemes and phoneme pieces can be carried out.
[0102] For the content information obtained in step S501, the "pronounce" attribute information within tags is detected (step S502) and registered in the recognition dictionary information 206 (step S503). By simultaneously registering which tag each phonetic symbol string for recognition is associated with, the display position of each tag processed by the browser can be specified from its combination with preceding and following tags in the screen-layout information; a scene position can be specified by association with tags indicating scenes, titles, and time-series positions in content information such as MPEG-7; and a physical location can be specified by association with spatial position information by latitude and longitude, place names, regional information, and store information in XML that expresses map information.
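Steps S502 and S503 can be sketched with the standard library alone: scan the markup for a "pronounce" attribute and register each phonetic symbol string in a recognition dictionary together with the tag it was found on. The sample HTML is illustrative, modeled loosely on the FIG. 11 embodiment:

```python
from html.parser import HTMLParser

class PronounceExtractor(HTMLParser):
    """Collect (phoneme_string, tag) pairs from "pronounce" attributes."""
    def __init__(self):
        super().__init__()
        self.recognition_dictionary = []  # list of (phoneme string, tag name)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "pronounce" and value:
                self.recognition_dictionary.append((value, tag))

content = '''
<table>
  <tr pronounce="i/ch/i/gy/o/u/m/e"><td pronounce="k/i/a/n/sh/a">Drafter</td></tr>
</table>
<input type="submit" pronounce="sh/o/u/s/a/i" value="Details">
'''

extractor = PronounceExtractor()
extractor.feed(content)
assert ("i/ch/i/gy/o/u/m/e", "tr") in extractor.recognition_dictionary
assert ("sh/o/u/s/a/i", "input") in extractor.recognition_dictionary
```

Keeping the tag name next to each phoneme string is what later lets a match drive tag-specific behavior (row focus versus button press).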
[0103] Next, an operation example of the voice operation processing is described with reference to FIGS. 11 to 14. When the user pronounces "ichigyoume, kiansha" ("first row, drafter"), column selection of the table tag uses the pronounce attributes of the table tags written in the top row: the match between the phoneme strings i/ch/i/gy/o/u/m/e and k/i/a/n/sh/a and the phonemes of the user's utterance is confirmed against the recognition dictionary. As a result, the "drafter" column matching the uttered phoneme string is selected, and by selecting "first row" in the row-designating tag, the "drafter" cell of the "first row" is selected (step S506).
[0104] When a phoneme string provided as an attribute of one of the HTML submit buttons in FIG. 11 is detected, transmission is performed according to the form tag's specification; or, if the recognized phoneme string of the user's utterance closely matches "ts/u/g/i" ("next"), web browsing processing can be performed by moving to the link destination. When moving between pages, a question such as "Do you want to move?" may be presented to the user by voice, character string, image, or video (step S506), and interactive processing by an agent, robot, or the like that responds to the user may be performed. Personal authentication by password phrase may also be performed by associating an image recognition dictionary for faces, fingerprints, and the like with a recognition dictionary of proper nouns expressed as phonetic symbol strings of phonemes or phoneme pieces and with per-speaker acoustic models based on phonemes or phoneme pieces.
[0105] From the user's point of view, when the HTML of FIG. 11 is displayed in a browser, the result appears as in FIG. 12. When the user utters "ichigyoume", focus B300 is set on the first row as in FIG. 13, according to the phoneme string of the pronounce attribute "i/ch/i/g/y/o/u/m". When the user then utters "shousai" ("details"), the button B302 whose pronounce-attribute phoneme string is "sh/o/u/s/a/i" is selected as in FIG. 14, after which a click is performed and the form is submitted (step S506).
[0106] If many "details" buttons are displayed at this point, it becomes unclear which button is meant. The apparatus may therefore carry out interactive processing by voice or display, announcing, for example, "Which row number?" or asking "What is the draft number?" so as to obtain a target phoneme string or phoneme-piece string that can easily be inferred from the content presented to the user. The content of such utterances may be provided as words, phonetic symbol strings, or VoiceXML.
[0107] A browser that receives such events carries out the processing configured in advance for them (step S506). For example, for an <a> tag in HTML, it accesses the specified link destination and obtains an arbitrary web page, image, video, music, or product information. For operation-input tags such as <INPUT TYPE="button">, <INPUT TYPE="submit">, <INPUT TYPE="image">, and <BUTTON TYPE=...>, it transitions the HTML processing to the state in which the corresponding button or image has been pressed. For a <FRAME> tag, it selects the frame according to the frame's name. For a <SELECT> tag, it moves focus to the select tag bearing the phonetic symbol string of the phonemes or phoneme pieces uttered by the user, builds selection candidates from the option tags, and selects an arbitrary item. For <HR> and <A NAME=""> tags, it scrolls to the target line bearing the relevant tag, using the phonetic symbol string of phonemes or phoneme pieces associated with that tag as a variable or attribute. For a <TITLE> tag, it expands the range enclosed by the tag into a phonetic symbol string and stores it in a bookmark in association with its own URL.
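The tag-specific behaviors enumerated above amount to a dispatch table keyed by tag type. The sketch below is a hypothetical stand-in for those browser internals: the handler names, targets, and return strings are invented for illustration only.

```python
# Hypothetical per-tag handlers standing in for browser-internal processing.
def follow_link(target):
    return f"navigate:{target}"

def press_button(target):
    return f"click:{target}"

def select_frame(target):
    return f"frame:{target}"

# Tag type -> configured processing, per the enumeration in the text.
handlers = {"a": follow_link, "input": press_button,
            "button": press_button, "frame": select_frame}

# Recognition dictionary: phoneme string -> (tag type, tag-specific target).
registered = {
    "ts/u/g/i": ("a", "next.html"),
    "sh/o/u/s/a/i": ("input", "B302"),
}

def dispatch(recognized_phonemes: str) -> str:
    """Route a recognized phoneme string to the handler for its tag type."""
    tag, target = registered[recognized_phonemes]
    return handlers[tag](target)

assert dispatch("ts/u/g/i") == "navigate:next.html"
assert dispatch("sh/o/u/s/a/i") == "click:B302"
```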
[0108] Of course, recognition by these phonetic symbols may carry out such processing in cooperation with scripts. Arbitrary extension functions may be added through tags such as <EMBED SRC="...">, <OBJECT>, and <APPLET CODE="...">, and the identifier strings and feature quantities of phonemes, phoneme pieces, and phonetic symbols may be given to those programs as variables or attributes, used as commands for operating them externally, used as information for cooperation with scripts, or used as script control conditions.
[0109] For RSS or MPEG-7 using XML or RDF, for example, variables and attributes may be added as in FIG. 5, or changes made by adding tags as in FIG. 6, so that an item section such as that of FIG. 4 becomes selectable. A "pronounce" element may be added to the element types based on RDF's Dublin Core to write the phonetic names of scenes, actors, and cast roles as phonemes or phoneme pieces; "img-type" and "img-position" elements may be added to describe the display position and feature quantities of images; a "motion" element may be added to describe motion within the screen; and an "env-sound" element may be added to describe environmental sound identifiers and feature quantities.
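A small sketch of extending an RSS item with such a "pronounce" element. The element name comes from the text above, but the namespace URI and the sample scene name are placeholder assumptions:

```python
import xml.etree.ElementTree as ET

PRON_NS = "http://example.org/ns/pronounce"  # hypothetical namespace URI

item_xml = f'''
<item xmlns:pron="{PRON_NS}">
  <title>Scene 1</title>
  <pron:pronounce>sh/i/i/n/i/ch/i</pron:pronounce>
</item>
'''

item = ET.fromstring(item_xml)
pron = item.find(f"{{{PRON_NS}}}pronounce")
assert pron is not None
assert pron.text == "sh/i/i/n/i/ch/i"
```

An interpreter that does not know the namespace simply ignores the element, so the extension stays backward compatible with plain RSS readers.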
[0110] FIG. 11 shows tags for syllable notation of kana pronunciation, a phonetic script, and tags for phonemes, but these tags may instead carry phoneme pieces, image identifiers derived from input video, and the like. For example, a script may be processed, content presented, or a move to a link destination made when it is recognized from the user's voice or facial expression that the user is angry, or when a specific identifier is detected within an image.
[0111] These tags, attributes, and elements are generally evaluated by character-string matching in the interpreting apparatus, and information based on those tags and attributes is provided to the functions, processes, programs, and services stored in the information processing apparatus that carry out the corresponding processing.
[0112] For phoneme and phoneme-piece recognition functions, phoneme strings and phoneme-piece strings may be supplied and registered in the recognition target dictionary. For other identifiers and feature quantities, the evaluation coefficients of detection results may be changed, instruction information output to peripheral devices, differently pronounced synonyms prepared through the dictionary structure, multiple phonetic symbol strings written in the attribute that carries them and separated by boundary symbols so that such synonyms can be registered in the dictionary, or processing performed to correct the results obtained by recognition.
[0113] Alternatively, a character string to be displayed may simply be enclosed in pronunciation tags to designate it as a pronunciation target, and character strings in other languages, such as kanji-kana mixed text, English, or Chinese, may be converted via phonetic symbols for pronunciation into symbol strings of phonemes or phoneme pieces for use in recognition, command control, detection, and search. Rather than writing phonetic symbols in the alphabet, they may be written in the markup language as numeric values such as ASCII codes or EBCDIC codes.
[0114] Further, by controlling the video, dialogue, screen features, and displayed objects related to the feature quantities and identifiers associated with words in this way, the method may be used in tools that generate and produce movies and programs by CG and the like. Movies and programs may also be evaluated based on the correlation between them (and their scenes) and the obtained feature quantities and identifiers, by recognizing the state of utterances during content viewing or by using content evaluations such as votes cast through the user's voice operations and viewing counts.
[0115] <Server-client model>
The mechanism described above may also implement the markup-language-based search procedure as a server-client model. FIG. 15 shows the state transitions of the processing in the server-client model.
[0116] First, the terminal device acting as the client generates a query. The query may be generated by ordinary character-string input, by voice input, or by presenting an image and using its feature quantities as the query.
[0117] The distribution apparatus acting as the server then searches for appropriate results based on the generated query, and in accordance with the search results the distribution base station distributes to the terminal device the search-result list information using the present invention. The terminal device interprets the markup language of the obtained information, converts character strings in ranges enclosed by specific tags into the aforementioned identifiers such as phonemes and phoneme pieces, obtains phonemes and phoneme pieces from the voice input information uttered by the user, constructs phonetic symbol strings such as phoneme strings and phoneme-piece strings, and carries out matching processing based on those strings.
[0118] After performing the processing corresponding to the operations, keywords, and identifier strings with a high degree of match, the terminal constructs a query from the designated identifier strings and transmits those queries to the distribution base station to carry out the search, thereby realizing search with voice control using the markup language. The constructed query may also be used for search on the device alone.
[0119] In FIG. 16, the insertion and setting of identifier symbol strings such as phonemes and phoneme pieces for voice processing are performed through markup-language interpretation on the terminal side; however, the identifier symbol strings may instead be inserted by the distribution-side server, inserted manually in advance, constructed and inserted on the distribution-base-station-side apparatus or an apparatus cooperating with it, or handled by a stand-alone information processing apparatus. In any of these configurations, variables and attributes for realizing the operations and processing of the present invention may be added, tags added, the content of the markup language information changed, and dictionaries related to identifiers and feature quantities changed, added, or deleted.
[0120] For new combinations of identifiers generated by the present invention, a word based on an arbitrary name may be given to carry out a search; a symbol string of phonemes or phoneme pieces may be given to an arbitrary name to support voice control; such symbol strings may be given as keywords for operation; such pronunciation symbol strings may be associated with advertisements; or an advertise attribute may be added and the URL of the related advertisement written within the same tag as the phonetic symbol attribute, thereby associating the advertisement with the phonetic symbol string.
[0121] As for the identifiers of images displayed in the browser and the phonemes and phoneme pieces of keywords to be controlled, the identifier strings, or symbol strings and IDs obtained by compressing them, may be converted into a form easy for the terminal to use and transmitted along with the necessary information, making it straightforward to use voice on a device without interpreting the markup language. A highly convenient operation environment may then be configured by importing such dictionary information into a remote control or mobile phone via a communication line, obtaining it by e-mail, or downloading it from another device.
[0122] A recognition dictionary based on file names may also be configured by writing file names as phoneme strings, and file names may be set by phoneme strings and phoneme-piece strings so that information within the markup language can be selected by phoneme and phoneme-piece recognition. With recognition, a variety of services may be provided: searching not only for stock prices by securities code or company name and for products by JAN code, but also by product name, performer name, company name, or region name. The phoneme dictionary may be changed according to location or apparatus, changed per page, or changed per unit such as a content image, a page of text, a frame in a document structure, a frame as one picture of a moving image, or a scene spanning multiple frames of a moving image.
[0123] Further, if indexing according to the present invention is applied to an information format with chunk headers, such as the RIFF format shown in FIG. 17, a tag such as "PRON" may be arbitrarily provided as a chunk header to carry phoneme strings and phoneme-piece strings. For an ordinary file, its content may hold general metadata such as the file name, production date, and producer; for a 2D or 3D image, the phonemes and phoneme pieces for the names of displayed objects, persons, and parts; for an audio file, the phonemes and phoneme pieces of the speech that appears; and for a music file, the lyrics and title written as phonemes and phoneme pieces. Phonemes and phoneme pieces may also be written in a free-description area and used for search.
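A minimal sketch of carrying a phoneme string in such a "PRON" chunk. The "PRON" fourcc is the tag the text proposes; the chunk layout follows the standard RIFF convention (4-byte id, 4-byte little-endian size, payload, padding to an even byte boundary), and the sample payloads are illustrative:

```python
import io
import struct

def write_chunk(stream, fourcc: bytes, payload: bytes):
    """Append one RIFF-style chunk: id, little-endian size, payload, pad byte."""
    stream.write(fourcc)
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)
    if len(payload) % 2:          # RIFF chunks are word-aligned
        stream.write(b"\x00")

def find_chunk(stream, fourcc: bytes):
    """Scan a flat sequence of chunks for the first one matching fourcc."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return None
        cid, size = header[:4], struct.unpack("<I", header[4:])[0]
        payload = stream.read(size + (size % 2))[:size]
        if cid == fourcc:
            return payload

buf = io.BytesIO()
write_chunk(buf, b"INFO", b"producer=example")
write_chunk(buf, b"PRON", b"o/n/g/a/k/u")  # phoneme string for the title
buf.seek(0)
assert find_chunk(buf, b"PRON") == b"o/n/g/a/k/u"
```

Because unknown chunks are skipped by size, existing RIFF readers that do not understand "PRON" remain unaffected.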
[0124] <Modifications>
Although this embodiment describes examples centered on phonemes as phonetic symbols, the phoneme portions may be replaced with phoneme pieces, and the phoneme type may be changed to international phonetic symbols or to phoneme strings of other languages such as English or Chinese. The selection ranges and branches in markup-language processing may be configured according to whether a round or a triangular image was presented to the computer, based on identifiers derived from recognized images; a search may be carried out based on a presented photograph; and names associated with a photograph's feature quantities may be expanded into phoneme strings or phoneme-piece strings and transmitted and received in a markup language or a dedicated symbol string so that operation by voice is possible. Character-encoding methods other than ASCII, such as Unicode, JIS codes, and ISO codes, may be used, or a proprietary character code system assigning arbitrary numeric IDs based on phonemes and phoneme pieces may be used.
[0125] The identifier strings and identifiers used in the present invention may specify attributes, variable names, and identifiers based on the names used for identification, using one or a combination of identifier types such as musical scale, instrument, mechanical sound, environmental sound, image, face, facial expression, person, motion, landscape, display position, character symbol, sign, shape, graphic symbol, and broadcast program. An identifier string may be treated as identifiers written consecutively in accordance with time-series transitions, and identifiers may be converted into phoneme and phoneme-piece strings based on their names for use. These identifiers and identifier strings may be transmitted using the GET or POST methods in CGI to obtain search results.
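The CGI GET transmission mentioned above can be sketched as encoding the identifier string into a query parameter appended to the search URL. The endpoint and parameter name ("pron") are illustrative assumptions:

```python
from urllib.parse import urlencode, parse_qs, urlparse

def build_search_url(base: str, identifiers: list) -> str:
    """Encode a phoneme identifier string as a GET query parameter."""
    query = urlencode({"pron": "/".join(identifiers)})
    return f"{base}?{query}"

url = build_search_url("http://example.org/cgi-bin/search",
                       ["k", "i", "a", "n", "sh", "a"])

# The server side recovers the identifier string with parse_qs:
params = parse_qs(urlparse(url).query)
assert params["pron"] == ["k/i/a/n/sh/a"]
```

`urlencode` percent-escapes the "/" separators, so the identifier string survives the round trip through the URL intact.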
[0126] In this way, by giving a markup language attributes and variables through the names of voice-related feature quantities together with their identifiers and identification functions, and the names of features of still and moving images together with their identifiers and identification functions, a markup language operable by voice can be constructed. Because the user can realize voice control of a device through the phonetic symbol strings provided by an information processing apparatus that processes such a markup language, the invention can be applied not only to content search but also to public information, map information, product sales, reservation status, viewing status, questionnaires, surveillance camera video, satellite photographs, blogs, and the control of robots and equipment. Search and processing results for these requests may be returned from the server to the client using an arbitrary markup language.
[0127] <Example procedure for an information processing apparatus used in terminals and base stations>
The present invention is also applicable to a server-client processing system involving base stations and terminals. The apparatus and terminal are configured as shown in Fig. 18 and are connected via a communication line; by acquiring information from other devices and distributing information to them, information related to voice operation can be exchanged, improving user convenience. The shared line used here is not limited to the Internet: any wide-area or indoor network such as a LAN or telephone line may be used, whether wired or wireless. The target devices may be home appliances, remote controls, robots, mobile phones, or communication base stations, and the invention can be practiced with any device or service, including web services, telephone services, and EPG distribution.
[0128] The system may also be composed of a user terminal, a distribution base station, devices such as robots controlled by the terminal or base station, and a controlling remote control; the remote control or robot may serve as one form of terminal or one form of base station. The user speaks to the terminal, and the terminal or base station carries out any of the following processing procedures for recognition.
[0129] In the first method, feature quantities are extracted from the uttered speech or the captured video, and the feature quantities are transmitted to the target relay point or base station apparatus. The base station apparatus that receives them generates phoneme symbol strings, phoneme-piece symbol strings, or other image identifiers according to the feature quantities, and then selects and executes the matching control means based on the generated symbol string.
[0130] In the second method, feature quantities are extracted from the speech or captured video; identifiers accompanying recognition, such as phoneme symbol strings, phoneme-piece symbol strings, and other image identifiers, are generated within the terminal; and the generated symbol string is transmitted to the target relay point or base station apparatus. The controlled base station apparatus then selects and executes the matching control means based on the received symbol string.
[0131] In the third method, feature quantities are extracted from the speech or captured video; phoneme strings, phoneme-piece symbol strings, or other image identifiers are recognized in the terminal from the generated feature quantities; the control content is selected based on the recognized symbol string; and the control method is transmitted to the base station apparatus to be controlled or to an apparatus relaying information distribution.
[0132] In the fourth method, the speech waveform or image obtained through the terminal is transmitted as-is to the controlling base station apparatus; phoneme symbol strings, phoneme-piece symbol strings, or other image identifiers are recognized within the controlling apparatus; the control means is selected based on the recognized symbol string; and the selected control is executed by the controlled relay point or base station apparatus. The same applies to the features and identifiers of other sounds and video, such as environmental sounds.
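The four split points above can be sketched as one pipeline in which the terminal stops at a chosen stage and the base station completes the remaining stages. Every function body here is a trivial stand-in: real feature extraction, recognition, and dictionary lookup would replace these placeholders.

```python
def extract_features(waveform):
    # Stand-in for real acoustic feature extraction.
    return [round(sum(waveform) / len(waveform), 3)]

def recognize(features):
    # Stand-in for phoneme/identifier recognition (HMM, Bayes, etc.).
    return ["a", "k", "a"] if features else []

def select_control(symbols):
    # Dictionary lookup from a symbol string to a control action.
    return {"aka": "power_on"}.get("".join(symbols), "noop")

def terminal_send(waveform, method):
    if method == 4:                      # fourth method: raw waveform as-is
        return ("waveform", waveform)
    feats = extract_features(waveform)
    if method == 1:                      # first method: feature quantities
        return ("features", feats)
    syms = recognize(feats)
    if method == 2:                      # second method: recognized symbols
        return ("symbols", syms)
    return ("control", select_control(syms))  # third method: chosen control

def base_station_receive(kind, payload):
    # Complete whatever stages the terminal left undone.
    if kind == "waveform":
        kind, payload = "features", extract_features(payload)
    if kind == "features":
        kind, payload = "symbols", recognize(payload)
    if kind == "symbols":
        payload = select_control(payload)
    return payload
```

Whichever split is chosen, the base station arrives at the same control action, which matches the text's point that the conversion level can be selected to suit terminal CPU performance.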
[0133] At this point the terminal may simply transmit the waveform alone, transmit feature quantities, transmit recognized identifier strings, or transmit processing procedures such as commands and messages associated with an identifier string; the configuration of the distribution base station may be changed to match the transmitted information so as to implement a client-server model, and the transmitting and receiving sides may also exchange data in both directions. Feature quantities of images, sounds, actions, and the like related to the identifiers described above may be assigned to markup-language attributes so that the degree of match between features extracted from user-provided information and features extracted from distributed information can be evaluated; through such search and recognition, arbitrary control and information processing involving responses to the user may be realized. Personal authentication by password phrase may also be performed by associating image recognition dictionaries for faces, fingerprints, and the like with recognition dictionaries for proper nouns using phonetic symbol strings of phonemes or phoneme pieces, and with per-speaker acoustic models based on phonemes or phoneme pieces.
[0134] The command dictionary that converts an input phoneme string or phoneme-piece string into its associated processing procedure may reside on the terminal side or on the distribution base station side, and symbol strings such as phonetic symbol strings and image identifiers for new control commands, media types, format types, and device names may be transmitted, distributed, and exchanged using markup languages described later, such as XML and HTML, or using RSS and CGI.
[0135] A more specific procedure for distributing and exchanging dictionary information is described below. First, by extracting feature quantities and identifiers and constructing evaluation functions, information is exchanged with other terminals and devices in an environment connected to any communication line, whether infrared, wireless LAN, telephone line, or wired LAN.
[0136] Next, taking the use of phoneme pieces in terminal-side processing as an example: the user utters speech, giving a speech waveform to the terminal and apparatus. The terminal-side apparatus analyzes the given speech and converts it into feature quantities, which are then recognized and converted into identifiers by various recognition techniques such as HMMs and Bayesian methods.
[0137] The converted identifiers are information indicating phonemes, phoneme pieces, or various image identifiers; as described elsewhere, for audio they may be phonemes, environmental sounds, or musical scales, and for images they may be identifiers based on the image or on motion. Based on the obtained identifiers, a dictionary of phoneme and phoneme-piece symbol strings is consulted by DP matching to select an arbitrary processing procedure, and the selected procedure is transmitted to the target device for control. The present invention thus makes it possible to use a mobile terminal as a remote control or to control home appliances through a robot, and a dialogue apparatus for people with disabilities may also be configured by providing a display of spoken-sound notation or a Braille output unit for smooth communication with a remote party.
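As a concrete sketch of the DP-matching lookup just described, the classic Levenshtein edit distance (one standard form of DP matching) can pick the dictionary entry closest to a noisy recognition result; the phoneme strings and action names below are made up for illustration.

```python
def edit_distance(a, b):
    # Single-row DP computation of the Levenshtein distance
    # between two phoneme sequences.
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (ca != cb))  # substitution
    return d[len(b)]

def lookup(recognized, dictionary):
    # Select the entry whose phoneme string is closest to the input.
    return min(dictionary, key=lambda e: edit_distance(recognized, e[0]))[1]

commands = [("t/e/r/e/b/i".split("/"), "tv_on"),
            ("d/e/N/k/i".split("/"), "light_on")]
# One phoneme misrecognized ("b" -> "p") still matches the right command.
action = lookup("t/e/r/e/p/i".split("/"), commands)
```

A real implementation would weight insertions, deletions, and substitutions by confusion statistics rather than treating all edits equally.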
[0138] Depending on the CPU performance of the terminal, information processed by this procedure may be transmitted as the original natural information, such as video and audio, without conversion into feature quantities; transmitted after conversion only as far as feature quantities; transmitted after conversion only as far as identifiers; or transmitted after the control information has been selected. Any conversion level can be chosen. The receiving side is configured as an apparatus capable of processing the information from any of these states, and based on the acquired information it may transmit to a distribution station or control apparatus, or perform arbitrary processing such as search, recording, mail delivery, machine control, and device control.
[0139] For search processing, identifier strings, character strings, and feature quantities serving as queries are obtained through recognition as appropriate and transmitted to the distribution-side base station, and information matching the query is obtained. Advertisements may be displayed during communication or search wait times. When control is performed by voice, a control dictionary is constructed so that control items can be selected over the communication link, and dictionary information is exchanged and acquired mutually; this procedure may use P2P technology, and such information may also be sold or distributed.
[0140] Because this control command dictionary is composed of phonemes, phoneme pieces, any of the identifiers and feature quantities described above, and device control information, its contents can be freely updated and reused. Trending search keywords can be kept current by replacing or reconstructing the dictionary information for search that associates arbitrary identifiers with feature quantities, and the recognition dictionary information changed according to the location and composition of the content information may be a dictionary for face recognition, fingerprint recognition, character recognition, or figure recognition.
[0141] In the control command dictionary, infrared control information for transmission to products controllable by a conventional infrared remote control may be selected as the device control information; a series of operations may be executed continuously, like batch processing, by combining such control information; and, depending on the CPU performance of the apparatus, only the feature quantity information may be sent to the voice-controlled information processing apparatus without recognizing identifiers.
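A toy sketch of the batch idea above: a control command dictionary could map one recognized phonetic command to a sequence of infrared codes executed in order. The command string, the code names, and the `send` hook are all illustrative assumptions.

```python
# Hypothetical dictionary pairing a phonetic command string with a series of
# conventional infrared control codes, executed in order like a batch job.
IR_DICT = {
    "r/o/k/u/g/a": ["POWER_ON", "INPUT_TUNER", "REC_START"],  # "rokuga" (record)
}

def run_batch(command, send=lambda code: code):
    # 'send' stands in for the actual infrared transmitter driver;
    # unknown commands produce an empty batch.
    return [send(code) for code in IR_DICT.get(command, [])]
```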
[0142] By combining control via infrared remote control in this way, even conventional devices incapable of voice control can be supplied with infrared remote control signals derived from voice information via the conversion dictionary; for devices capable of voice control, commands can be recognized and executed based on feature quantities or speech waveforms. It is also possible to update the control dictionary as performance improves, to check the version information of the control dictionary, and to check the current state of the device.
[0143] By introducing a server-client model in this way, dividing the processing between server and client at any processing step, linking them by communication, and exchanging arbitrary information between them, equivalent services, infrastructure, search, and indexing can be realized.
[0144] Furthermore, to additionally perform personal authentication by recognizing faces, fingerprints, and voice characteristics, a phoneme recognition dictionary using acoustic models, standard parameters, and standard templates matched to the individual's voice characteristics can be used as the phoneme recognition dictionary information; recognition dictionaries involving images and sounds can then be changed per user, realizing highly versatile personal authentication. Accordingly, services involving various operations, such as billing, locking and unlocking keys, selecting services, granting usage permission, and using copyrighted works, can be realized using an information terminal that performs recognition according to the present invention.
[0145] Moreover, using a terminal that performs recognition according to the present invention, information acquired from a backbone server at the communication destination by client terminals such as DVD recorders, network TVs, STBs, HDD recorders, and music or video recording/playback devices can be provided to mobile terminals and mobile phones via infrared communication, FM or VHF band communication, or wireless communication such as 802.11b, Bluetooth (registered trademark), ZigBee, WiFi, WiMAX, UWB (Ultra Wide Band), and WUSB (Wireless USB), making EPG, BML, RSS, teletext data broadcasting, television video, and teletext available on mobile terminals and phones. Remote operation may also be performed by instructing the operation and control procedures of information terminals, home appliances, information equipment, and robots through voice input, character string input, or gestures such as shaking the mobile terminal or phone, or by using the mobile terminal or phone as an ordinary remote control that directs the client terminal to operate home appliances, information equipment, and robots.
[0146] When a phoneme dictionary based on attributes extracted in association with input items, such as HTML FORM tags in information composed in a markup language, has been registered in advance in the recognition dictionary information 206, the recognition priority may be shifted to the pre-registered phoneme dictionary, or the recognition targets may be limited using the pre-registered dictionary.
[0147] Also, for phonetic symbol strings such as phoneme strings and phoneme-piece strings based on the attribute variables of information composed in a markup language, the recognition dictionary information 206 may comprise multiple entries by listing side by side several phoneme strings or phoneme-piece phonetic symbol strings that could be recognized at the same time, and the same recognition dictionary information 206 may be used for input items having the same attribute variable.
[0148] Multiple words that might be recognized may likewise be written into an attribute variable using multiple phoneme strings, phoneme-piece strings, or phonetic symbol strings. For example, when a counter word for some unit is written as a phoneme string, phoneme-piece string, or phonetic symbol string, methods such as switching the recognition dictionary information 206 to a numerals-only dictionary, to a dictionary dedicated to the current menu items, or to a restricted proper-noun dictionary of place names or station names may be used.
[0149] Also, according to the character code selected for display based on the markup language, multiple per-language sets of values obtained as learning results, such as Bayesian discriminant functions, standard patterns and standard templates used for HMMs, eigenvalues and eigenvectors, and covariance matrices, may be prepared for the step (S506) of converting the speech waveform into identifiers of phonetic symbols such as phonemes and phoneme pieces. Multiple languages can then be supported by switching, for example, to a Russian standard template if the display is Russian or to a Chinese standard template if the display is Chinese, and the standard template used for recognition may also be selected from among the languages by acquiring information on the language environment specific to the user's information processing apparatus, operating system, or browser.
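The template switch just described reduces to a lookup keyed on the display language; the locale strings and template-set names here are assumptions for illustration.

```python
# Illustrative mapping from a display language/locale hint to the
# standard-template set used in step S506 (all names are assumptions).
TEMPLATES = {"ru": "russian_standard",
             "zh": "chinese_standard",
             "ja": "japanese_standard"}

def select_template(locale, default="japanese_standard"):
    # e.g. "ru_RU" -> "ru"; unknown languages fall back to a default set.
    return TEMPLATES.get(locale.split("_")[0].lower(), default)
```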
[0150] The system may also be configured so that, at the user's direction, a standard template can be selected to cope with accents and dialects arising from differences between the user's native language and the language recognized in the device environment in use, for example a standard template for a Russian speaker uttering Chinese, and templates may also be constructed by learning the user's accent or dialect from the user's utterances.
[0151] Methods such as converting the contents of cookies or sessions into phonetic symbol strings of phonemes or phoneme pieces according to attribute variables and switching the recognition dictionary information 206 accordingly may also be used. Phonetic symbol strings such as phoneme strings and phoneme-piece strings recognized from speech, or feature quantities extracted from speech, may be transmitted to the base station by any variable transmission means, including techniques using scripts such as AJAX, techniques passing status and environment variables as CGI (Common Gateway Interface) parameters, and socket communication by a program. Using the phonetic symbol strings of phonemes and phoneme pieces received on the base station side, or phoneme strings and phoneme-piece strings recognized from speech feature quantities received on the base station side, the user's utterance may be discriminated to perform arbitrary processing, or search conditions may be constructed to search content information, advertising information, and regional information.
[0152] Then, any combination of information that changes with the recognition processing of these phonetic symbol strings may be transmitted from the base station to update it, or arbitrary processing may be carried out autonomously within the terminal device: display information such as pictures, characters, icons, and CG (Computer Graphics) on the terminal device; output sound information such as music and warning tones; operation control information for devices such as robots, machinery, communication devices, electronic equipment, and electronic musical instruments; recognition dictionary information 206 for recognizing speech, still images, moving images, and the like; and processing procedure information such as programs, scripts, and function expressions for extracting features from video, audio, and images.
[0153] Also, when phonetic symbols such as phonemes and phoneme pieces obtained as recognition results are divided into multiple frames to obtain recognition results that are continuous in time series, dynamic speech recognition may be configured in combination with techniques used in conventional speech recognition: the distance information between the input speech and the phonetic symbols, obtained as the recognition result for multiple phonemes or phoneme pieces spanning multiple frames, may be used as feature quantities to construct the parameters of a Bayesian discriminant function; distance information obtained as recognition results spanning multiple frames may be used to construct HMM parameters for time-series reduction; or the identifiers ranked first by the recognition results over multiple frames may be evaluated by DP or the like.
[0154] More specifically, first, markup language information is acquired in the content information acquisition step (S401, S501); together with the tag attribute detection means, which detects tags from the markup language information and tag attributes from the tags, a phonetic symbol string extraction step (S402, S502) is performed to extract the phonetic symbol strings associated with the detected attributes, followed by a step (S403, S503) of registering them in the recognition dictionary information 206 as the phonetic symbol strings used for recognition. These steps (S401 to S403, S501 to S503) can be implemented with character string evaluation and detection processing, but they have not been used in conventional speech recognition systems, in search by phoneme recognition, or in the operations, searches, and browsing of content information performed with phoneme string recognition in web browsers and Internet environments.
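The attribute-extraction steps (S401–S403 / S501–S503) can be sketched with Python's standard HTML parser. The `data-phoneme` attribute name is an assumption standing in for whatever attribute actually carries the phonetic symbol string in the markup.

```python
from html.parser import HTMLParser

class PhoneticAttrExtractor(HTMLParser):
    """Detects tags, reads a hypothetical data-phoneme attribute, and
    registers (phonetic string, item name) pairs as dictionary entries."""

    def __init__(self):
        super().__init__()
        self.dictionary = []  # plays the role of recognition dictionary 206

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "data-phoneme" in a:
            self.dictionary.append((a["data-phoneme"], a.get("name", "")))

page = ('<form><input name="station" data-phoneme="e/k/i">'
        '<input name="qty" data-phoneme="s/a/ts/u"></form>')
extractor = PhoneticAttrExtractor()
extractor.feed(page)
```

After `feed`, `extractor.dictionary` holds the phonetic strings keyed to their input items, ready to serve as recognition targets in the matching step.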
[0155] Next, a step of waiting for the speaker's voice input (S504) is performed; upon the start of voice input, a step (S505) of extracting feature quantities is carried out by the arithmetic unit; and a step (S506) of converting the feature quantities into identifiers by recognizing phonetic symbols is carried out based on a phonetic symbol recognition program including phoneme recognition and/or phoneme-piece recognition. It is generally known that this step (S506) may use distance evaluation functions, statistical test methods, learning results from multivariate analysis, or algorithms such as HMMs. A phonetic symbol string is then formed as a time-series sequence of the recognized phonetic symbols.
[0156] Next, a step (S507) of evaluating the degree of match between phonetic symbol strings is performed by comparing the constructed phonetic symbol string with the recognition dictionary information 206 of phonetic symbol strings extracted from the attributes attached to the markup language tags, searching within the recognition dictionary information 206, and evaluating whether the input is valid as a recognition target. The comparison used to judge whether something is a recognition target may use any algorithm available for symbol string comparison and evaluation, such as DP, HMMs, or automata; these may also be multiplexed to realize hierarchical recognition, and a variety of such methods have been invented and devised.
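One way to realize the match-evaluation step (S507) with a standard-library tool is shown below; `SequenceMatcher` stands in for the DP/HMM/automaton comparison named in the text, and the 0.7 threshold is an arbitrary assumption for the validity test.

```python
from difflib import SequenceMatcher

def best_match(recognized, dictionary, threshold=0.7):
    # Score each registered phonetic string against the recognized string
    # and accept the best candidate only if it clears the validity threshold.
    score, cand = max((SequenceMatcher(None, recognized, c).ratio(), c)
                      for c in dictionary)
    return cand if score >= threshold else None
```

Returning `None` models the case where the utterance is rejected as not being a valid recognition target.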
[0157] As a result, based on identification information such as the character string or ID associated with the phonetic symbol string identified from the recognition dictionary information 206, a character string may be displayed, arbitrary processing executed, information exchanged, events generated, statuses changed, or arbitrary operations performed by a machine; recognition processing using phonetic symbols is thereby realized, and by carrying out the step (S508) of executing arbitrary processing, information processing using speech that differs from conventional grammar-dependent or statically registered-word-dependent approaches becomes feasible.
[0158] At this time, by holding multiple sets of the recognition dictionary information 206 of phonetic symbol strings and switching the recognition dictionary information 206 used in the match evaluation step (S507) based on the type information for discriminating the input items detected by the tag attribute detection means, the phonetic symbol strings that become recognition targets can be limited. When the degree of match is evaluated by symbol string comparison between the phonetic symbol strings registered in the recognition dictionary information 206 selected according to the attribute of the input item being recognized and the phonetic symbol recognition result obtained from the speech waveform, this limitation improves recognition efficiency.
[0159] When the information processing apparatus switches the recognition dictionary information 206 that evaluates the voice input according to the item to be entered, the recognition dictionary information 206 for the attribute name and for the words associated with the attribute is selected appropriately. If the information obtained from the attribute is "book", a recognition dictionary using phonetic symbol strings of counter words such as the unit "satsu" (冊; s/a/ts/u | v/o/ly/u/m), together with the numerals that accompany such counters, is selected as the search target for the recognized phonetic symbol string. If the information obtained from the attribute is "station name", a recognition dictionary using phonetic symbol strings associated with the suffix "eki" (駅; e/k/i | s/u/t/e/i/sh/o/N) and with the group of nouns used as station names is selected as the search target. If the information obtained from the attribute is a postal code or telephone number, a recognition dictionary simply using the phonetic symbol strings of numerals is selected as the search target. By restricting the recognition targets to the groups of nouns contained in a particular framework, the apparatus switches among the multiple sets of recognition dictionary information 206 according to the attribute associated with the item the user is to enter, and recognition performance can be improved by classifying, based on attributes, the recognition dictionary information 206 searched for the recognized phonetic symbol strings.
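The attribute-driven dictionary switching described above reduces to a small routing table; the attribute keys and dictionary names below are illustrative assumptions, not taken from the patent.

```python
# Hypothetical routing from an extracted attribute to the recognition
# dictionaries (206) searched for that input item.
ATTRIBUTE_DICTS = {
    "book":    ["counter_satsu", "numerals"],    # counter word + numbers
    "station": ["suffix_eki", "station_nouns"],  # "eki" suffix + station names
    "zipcode": ["numerals"],                     # digits only
}

def dictionaries_for(attribute):
    # Unknown attributes fall back to an unrestricted general dictionary.
    return ATTRIBUTE_DICTS.get(attribute, ["general"])
```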
[0160] Alternatively, the information processing device may output by voice the attribute names or the words associated with the attributes, following the order in which voice input is performed for the items to be entered or the selection of items not yet entered, thereby switching among the multiple sets of recognition dictionary information 206 while prompting the user for the item to be entered, and improving recognition performance on the basis of the classifying attributes.
[0161] Then, if speech continues, the step of repeating the processing so far (S509), the processing step accompanying recognition (S508), and the step of evaluating whether to acquire the next content or markup language in response to status changes within the device caused by other external operations (S510) are carried out, and the processing ends depending on the situation.
[0162] Note that status changes within the device are easiest to understand if one assumes use in a multi-threaded or event-driven program: when the device functions as part of a multi-threaded program or of another program, the status values are changed by other programs and processes. Likewise, the optional processing of the present invention may rewrite the status or raise events on behalf of other processes and programs.
[0163] Furthermore, using the method of the present invention, various character strings — character strings displayed by markup-language designation, character strings associated with displayed images or image features, and character strings associated with output audio such as speech or music, or with their acoustic features — can be converted into phonetic symbol strings and registered in a dictionary. Any displayed information can then be detected through a phonetic symbol string search against the user's input speech or text, and information can be provided, for example through user operations, on the basis of content-related information associated with the detected information, of information about the locations of arbitrary information such as advertisements, videos, and links, or of music and speech. These inputs need not be made by speech or text entry alone; they may also be made by selecting a character string from a list such as a menu, or by using the character string of a button label in a button operation.
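One way to read paragraph [0163] is as a reverse index from phonetic symbol strings to displayed information and its associated content. A minimal sketch follows; the conversion table, the content records, and the helper names are invented for illustration.

```python
# Hypothetical grapheme-to-phonetic conversion table.
PRONUNCIATIONS = {"lily": "l/i/l/i/y", "rose": "r/o/u/z"}

# Content records: displayed string -> associated information (link, ad, ...).
CONTENT = {"lily": {"image": "./flower_lily1.jpg", "link": "/flowers/lily"}}

def register(display_strings):
    """Convert each displayed string to a phonetic string and index it."""
    index = {}
    for text in display_strings:
        phonetic = PRONUNCIATIONS.get(text)
        if phonetic is not None:
            index[phonetic] = text
    return index

def lookup(index, recognized_phonetic):
    """Map a recognized phonetic string back to its content record."""
    text = index.get(recognized_phonetic)
    return CONTENT.get(text) if text else None

index = register(CONTENT.keys())
print(lookup(index, "l/i/l/i/y"))  # the lily content record
```

The same index serves both speech and non-speech entry: a menu selection or button label yields a character string, which is converted through the same pronunciation table before the lookup.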
[0164] Also, as in the example below:

(Example)
<img href="./flower_lily1.jpg"
     recog_dic_type="flower_name" recog_dic_url="./flower.prono"
     name="lily" prono="l/i/l/i/y">

When reading phonetic symbols according to the attributes of a tag that relate to a phonetic symbol dictionary, an attribute such as "recog_dic_url", which gives the position or location of the phonetic symbol recognition dictionary or phonetic symbol string dictionary through information such as a URL, URI, IP address, or directory path, may be used, and an attribute such as "recog_dic_type", which indicates the kind of object the dictionary recognizes, may be used to distinguish phonetic symbol dictionaries that are reused frequently. In this way, dictionary information consisting of phonetic symbol strings, and acoustic characteristic template dictionary information for recognition, can be provided in association with attributes acquired from the markup language. [0165] In addition, dictionary information read in the past may be retained to some extent by the method generally called caching, so that when the above attributes indicate a particular word range the priority of that dictionary is raised and rereading it is avoided. A phonetic symbol dictionary may be read in per page as a separate file, like a style sheet, and 
incorporated by association through an ID; it may be described in a header block and associated by ID; it may be incorporated as an attribute given to each tag; or it may be included in the header information when reading via a file or a communication line. It can also be used as a phonetic symbol string template dictionary.
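The tag attributes in the example above could be consumed roughly as follows. This is a sketch only: the attribute parsing is simplified to a regular expression, and the cache policy is an assumption rather than something the patent specifies.

```python
import re

# Very simplified attribute extraction for tags like the <img ...> example.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

_dictionary_cache = {}  # url -> loaded dictionary (the caching of [0165])

def load_dictionary(url: str) -> dict:
    """Load a phonetic symbol dictionary, reusing a cached copy if present."""
    if url not in _dictionary_cache:
        # A real implementation would fetch the file at `url` here.
        _dictionary_cache[url] = {"source": url, "entries": {}}
    return _dictionary_cache[url]

def process_tag(tag: str):
    """Read the dictionary-related attributes of one tag."""
    attrs = dict(ATTR_RE.findall(tag))
    dictionary = load_dictionary(attrs["recog_dic_url"])
    # Register the per-tag pronunciation attribute into the dictionary.
    dictionary["entries"][attrs["prono"]] = attrs["name"]
    return attrs["recog_dic_type"], dictionary

tag = ('<img href="./flower_lily1.jpg" recog_dic_type="flower_name" '
       'recog_dic_url="./flower.prono" name="lily" prono="l/i/l/i/y">')
kind, d = process_tag(tag)
print(kind, d["entries"])  # flower_name {'l/i/l/i/y': 'lily'}
```

Because `load_dictionary` keys the cache by URL, every tag that names the same `recog_dic_url` shares one dictionary object, which is the reuse the caching paragraph describes.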
[0166] Furthermore, a phonetic symbol string may be embedded in speech waveform information using "acoustic OFDM", which can embed text data in audio, and the embedded phonetic symbol string or related markup-language information may be restored and used, for example, to search for phonetic symbols within the audio data or to display related information as subtitles. Phonetic symbol strings demodulated from extremely common audio data such as radio and television broadcasts can therefore also be used for searching.
[0167] In addition, for a database indexed by the phonetic symbol strings searched using a phonetic symbol string obtained by phonetic symbol recognition, the query may be a combination based on the logical relations among phonetic symbol strings derived from multiple keywords, or a configuration whose logic can be expressed by a Boolean model; queries composed of such combinations can be submitted to the database to obtain search results.
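The Boolean-model retrieval of [0167] can be sketched with set operations over an inverted index keyed by phonetic symbol strings. The index contents here are invented for illustration.

```python
# Hypothetical inverted index: phonetic keyword string -> document ids.
INDEX = {
    "e/k/i":        {1, 2, 5},
    "t/o/u/ky/o/u": {2, 5, 7},
    "o/o/s/a/k/a":  {3, 5},
}

def query_and(*terms):
    """Documents containing every phonetic keyword (Boolean AND)."""
    sets = [INDEX.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def query_or(*terms):
    """Documents containing at least one phonetic keyword (Boolean OR)."""
    result = set()
    for t in terms:
        result |= INDEX.get(t, set())
    return result

print(query_and("e/k/i", "t/o/u/ky/o/u"))       # {2, 5}
print(query_or("t/o/u/ky/o/u", "o/o/s/a/k/a"))  # {2, 3, 5, 7}
```

Because the index keys are phonetic symbol strings rather than words, the same query machinery works whether the keywords came from recognized speech or from text converted through a pronunciation dictionary.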
[0168] In this way, unlike the conventional method of probabilistically linking words to groups of speech features with an HMM or the like, the present invention obtains a phonetic symbol string by associating phonetic symbols with speech features through a probability-based distance such as a Bayes discriminant function, and directly associates the obtained phonetic symbol string with word character strings via a markup language. This makes it possible to provide, in a markup language, the dynamic dictionary information needed to constrain the words to be recognized and thereby achieve more efficient recognition than conventional general-purpose recognition. Alternatively, instead of using words directly in queries, phonetic symbol strings, or phoneme and phoneme piece symbol strings, may be used to construct and search a database that uses HMM or DP matching.
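The two stages of [0168] — assigning each speech feature to the nearest phonetic symbol by a probability-based distance, then matching the resulting symbol string against dictionary entries with DP matching — can be sketched as below. The feature templates and dictionary are invented, and a squared Euclidean distance stands in for the Bayes discriminant function the patent names.

```python
# Hypothetical per-phoneme feature templates (e.g. cepstral means).
TEMPLATES = {"a": (1.0, 0.0), "i": (0.0, 1.0), "u": (1.0, 1.0)}

def nearest_phoneme(feature):
    """Assign a feature vector to the closest phoneme template."""
    return min(TEMPLATES, key=lambda p: sum(
        (f - t) ** 2 for f, t in zip(feature, TEMPLATES[p])))

def edit_distance(a, b):
    """DP matching between two phoneme strings (Levenshtein distance)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def recognize(features, dictionary):
    """Phoneme string from features, then closest dictionary word by DP."""
    phonemes = "".join(nearest_phoneme(f) for f in features)
    return min(dictionary, key=lambda w: edit_distance(phonemes, dictionary[w]))

DICTIONARY = {"ai": "ai", "ui": "ui"}  # word -> phoneme string
print(recognize([(0.9, 0.1), (0.1, 0.9)], DICTIONARY))  # ai
```

The DP matching step is what tolerates insertions, deletions, and substitutions in the recognized phonetic string, so the dictionary supplied through the markup language still constrains which words can be returned.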
[0169] Furthermore, not only attributes based on phonetic symbols such as uttered phonemes and uttered phoneme pieces, but also attributes based on image recognition may be used.

Claims

[1] An information processing apparatus comprising:
content information acquisition means for acquiring content information including character information and/or meta information;
recognized phonetic symbol string detection means for detecting, from the content information acquired by the content information acquisition means, a recognized phonetic symbol string consisting of phonetic symbols; and
recognition dictionary information generation means for generating recognition dictionary information using the recognized phonetic symbol string.
[2] An information processing apparatus comprising:
content information acquisition means for acquiring content information including character information and/or meta information;
expansion target character string detection means for detecting an expansion target character string from the content information acquired by the content information acquisition means on the basis of the character information and/or meta information;
phonetic symbol storage means for storing character strings and phonetic symbols in association with each other;
phonetic symbol conversion means for converting the expansion target character string into a recognized phonetic symbol string by referring to the phonetic symbol storage means; and
recognition dictionary information generation means for generating recognition dictionary information using the recognized phonetic symbol string.
[3] The information processing apparatus according to claim 2, further comprising content information storage means for storing the content information by appending thereto the phonetic symbols converted by the phonetic symbol conversion means.
[4] The information processing apparatus according to any one of claims 1 to 3, further comprising transmission means for transmitting, to another information processing terminal, the content information stored by the content information storage means and the recognition dictionary information generated based on that content information.
[5] The information processing apparatus according to any one of claims 1 to 4, further comprising:
voice input means for inputting voice;
feature quantity extraction means for extracting a feature quantity of the voice input by the voice input means;
feature quantity phonetic symbol conversion means for converting the feature quantity extracted by the feature quantity extraction means into phonetic symbols; and
processing execution means for evaluating the phonetic symbols converted by the feature quantity phonetic symbol conversion means against the phonetic symbols constituting the recognized phonetic symbol strings included in the recognition dictionary information, and executing predetermined processing corresponding to the most similar phonetic symbols.
[6] The information processing apparatus according to claim 5, wherein the content information includes phoneme information and/or phoneme piece information, and the processing execution means evaluates the phonetic symbols converted by the feature quantity phonetic symbol conversion means against the phonetic symbols constituting the recognized phonetic symbol strings included in the recognition dictionary information, and presents information to the user by voice utterance corresponding to the most similar phonetic symbols.
[7] The information processing apparatus according to any one of claims 1 to 6, wherein the phonetic symbol is a phoneme or a phoneme piece.
[8] The information processing apparatus according to any one of claims 5 to 7, wherein the executed processing is authentication processing accompanying phoneme recognition.
[9] A program for causing a computer to implement:
a markup language interpretation step of interpreting information described using a markup language, and an attribute acquisition step of acquiring an attribute designated by the interpretation;
a phonetic symbol extraction step of extracting a phonetic symbol string and/or a phoneme string and/or a phoneme piece string associated with the attribute acquired in the attribute acquisition step; and
a dictionary change step of changing, through the phonetic symbol extraction step, the phoneme string dictionary used by the phoneme recognition unit.
[10] A program for causing a computer to implement:
a markup language interpretation step of interpreting information described using a markup language, and an attribute acquisition step of acquiring an attribute designated by the interpretation;
a phonetic symbol extraction step of extracting a phonetic symbol string and/or a phoneme string and/or a phoneme piece string associated with the attribute acquired in the attribute acquisition step;
an information type evaluation step of evaluating the type of information entered by the user based on the attribute acquired in the attribute acquisition step; and
a dictionary change step of changing, through the information type evaluation step, the phoneme string dictionary used by the phoneme recognition unit.
PCT/JP2006/324348 2005-12-15 2006-12-06 Information processing device, and program WO2007069512A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2007550144A JPWO2007069512A1 (en) 2005-12-15 2006-12-06 Information processing apparatus and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2005361670 2005-12-15
JP2005-361670 2005-12-15

Publications (1)

Publication Number Publication Date
WO2007069512A1 true WO2007069512A1 (en) 2007-06-21

Family

ID=38162820

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2006/324348 WO2007069512A1 (en) 2005-12-15 2006-12-06 Information processing device, and program

Country Status (2)

Country Link
JP (1) JPWO2007069512A1 (en)
WO (1) WO2007069512A1 (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10222342A (en) * 1997-02-06 1998-08-21 Nippon Telegr & Teleph Corp <Ntt> Hypertext speech control method and device therefor
JP2001034151A (en) * 1999-07-23 2001-02-09 Matsushita Electric Ind Co Ltd Language learning teaching material preparing device and language learning system
JP2003202890A (en) * 2001-12-28 2003-07-18 Canon Inc Speech recognition device, and method and program thereof
JP2005258198A (en) * 2004-03-12 2005-09-22 Internatl Business Mach Corp <Ibm> Setting device, program, recording medium, and setting method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KATSUURA ET AL.: "Onsei Keyword ni yoru WWW no Browsing", TRANSACTIONS OF INFORMATION PROCESSING SOCIETY OF JAPAN, vol. 40, no. 2, 15 February 1999 (1999-02-15), pages 443 - 452, XP003014472 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009223172A (en) * 2008-03-18 2009-10-01 Advanced Telecommunication Research Institute International Article estimation system
JP2009244432A (en) * 2008-03-29 2009-10-22 Kddi Corp Voice recognition device, method and program for portable terminal
WO2016088241A1 (en) * 2014-12-05 2016-06-09 三菱電機株式会社 Speech processing system and speech processing method
JP2018156060A (en) * 2017-03-17 2018-10-04 株式会社リコー Information processing device, program, and information processing method
JP7035526B2 (en) 2017-03-17 2022-03-15 株式会社リコー Information processing equipment, programs and information processing methods
US11138506B2 (en) 2017-10-10 2021-10-05 International Business Machines Corporation Abstraction and portability to intent recognition
GB2581705A (en) * 2017-10-10 2020-08-26 Ibm Abstraction and portablity to intent recognition
CN111194401B (en) * 2017-10-10 2021-09-28 国际商业机器公司 Abstraction and portability of intent recognition
CN111194401A (en) * 2017-10-10 2020-05-22 国际商业机器公司 Abstraction and portability of intent recognition
WO2019073350A1 (en) * 2017-10-10 2019-04-18 International Business Machines Corporation Abstraction and portability to intent recognition
CN112236816A (en) * 2018-09-20 2021-01-15 海信视像科技股份有限公司 Information processing device, information processing system, and imaging device
CN112236816B (en) * 2018-09-20 2023-04-28 海信视像科技股份有限公司 Information processing apparatus, information processing system, and image apparatus
CN111489735A (en) * 2020-04-22 2020-08-04 北京声智科技有限公司 Speech recognition model training method and device
CN111489735B (en) * 2020-04-22 2023-05-16 北京声智科技有限公司 Voice recognition model training method and device
CN111639219A (en) * 2020-05-12 2020-09-08 广东小天才科技有限公司 Method for acquiring spoken language evaluation sticker, terminal device and storage medium
CN112201238A (en) * 2020-09-25 2021-01-08 平安科技(深圳)有限公司 Method and device for processing voice data in intelligent question answering and related equipment

Also Published As

Publication number Publication date
JPWO2007069512A1 (en) 2009-05-21

Similar Documents

Publication Publication Date Title
CN109086408B (en) Text generation method and device, electronic equipment and computer readable medium
WO2007043679A1 (en) Information processing device, and program
WO2007069512A1 (en) Information processing device, and program
KR102018295B1 (en) Apparatus, method and computer-readable medium for searching and providing sectional video
JP4689670B2 (en) Interactive manuals, systems and methods for vehicles and other complex devices
CN101309327B (en) Sound chat system, information processing device, speech recognition and key words detection
Freitas et al. Speech technologies for blind and low vision persons
CN105224581B (en) The method and apparatus of picture are presented when playing music
US20090083029A1 (en) Retrieving apparatus, retrieving method, and computer program product
US20160328206A1 (en) Speech retrieval device, speech retrieval method, and display device
US20200058288A1 (en) Timbre-selectable human voice playback system, playback method thereof and computer-readable recording medium
JP6122792B2 (en) Robot control apparatus, robot control method, and robot control program
CN103348338A (en) File format, server, view device for digital comic, digital comic generation device
WO2022184055A1 (en) Speech playing method and apparatus for article, and device, storage medium and program product
JP2008234431A (en) Comment accumulation device, comment creation browsing device, comment browsing system, and program
CN109920409B (en) Sound retrieval method, device, system and storage medium
CN112449253A (en) Interactive video generation
KR101410601B1 (en) Spoken dialogue system using humor utterance and method thereof
CN114946193A (en) Customized video production service providing system using cloud-based voice integration
CN112182196A (en) Service equipment applied to multi-turn conversation and multi-turn conversation method
CN116682411A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
US20210337274A1 (en) Artificial intelligence apparatus and method for providing visual information
US20200013409A1 (en) Speaker retrieval device, speaker retrieval method, and computer program product
CN209625781U (en) Bilingual switching device for child-parent education
Beskow et al. A model for multimodal dialogue system output applied to an animated talking head

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2007550144

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06834103

Country of ref document: EP

Kind code of ref document: A1