WO2016194740A1 - Speech recognition device, speech recognition system, terminal used in said speech recognition system, and method for generating speaker identification model - Google Patents
Speech recognition device, speech recognition system, terminal used in said speech recognition system, and method for generating speaker identification model
- Publication number
- WO2016194740A1 (PCT/JP2016/065500)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speaker
- terminal
- unit
- user
- speech recognition
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
Definitions
- This disclosure relates to speech recognition, and more specifically to technology for identifying a speaker.
- A technique for identifying a speaker in speech recognition is known. For example, Japanese Patent Laying-Open No. 2010-217319 (Patent Document 1) discloses a technique for “increasing accuracy for speaker identification in a speaker identification device that identifies a speaker from an audio signal” (see [Summary]). Japanese Patent Laid-Open No. 7-261781 (Patent Document 2) discloses a “learning method for creating a phoneme model for speaker recognition with high speaker recognition accuracy” (see [Summary]).
- Patent Document 1: JP 2010-217319 A. Patent Document 2: JP 7-261781 A.
- The conventional technology requires the user to speak as preprocessing. In speaker identification for voice communication, however, learning data must be acquired without making the user feel that his or her speech is being used for learning, so that a more natural dialogue can take place. It is therefore necessary to acquire the voice data needed to build a speaker identification model without imposing a load on the user while no such model has yet been built.
- The present disclosure has been made to solve the above-described problems, and an object in one aspect is to provide a speech recognition apparatus that can acquire the speech data necessary for building a speaker identification model.
- An object in another aspect is to provide a speech recognition system that can acquire speech data necessary to build a speaker identification model.
- An object in another aspect is to provide a terminal used in such a speech recognition system.
- Still another object is to provide a method for generating a speaker identification model.
- A speech recognition apparatus according to one embodiment includes a voice input unit for receiving both utterances that contain information identifying a speaker and utterances that do not, a voice recognition unit for performing speech recognition processing, a voice output unit for outputting voice, and a control unit for controlling the speech recognition apparatus based on the result of the speech recognition processing.
- the control unit generates a speaker identification model for identifying the speaker by associating the information for identifying the speaker with the utterance not including the information for identifying the speaker.
- In one aspect, the voice data necessary for learning can be collected merely through a normal voice dialogue, without the user being aware of any preprocessing for learning.
- According to the present disclosure, when the user is unknown, the voice data (for example, voiceprint information) necessary for building a model for speaker identification is collected by asking for the user name within the voice dialogue and thereby classifying the user. For example, in a game-like dialogue such as “shiritori” (a Japanese word-chain game) or tongue twisters, multiple user utterances directed at the game partner (for example, a terminal or a home appliance) can be expected.
- In such a case, the device acting as the game partner can turn a series of user utterances into learning data by asking for the user name in advance and then playing the game.
- Alternatively, by asking for the user name after an unknown user speaks, the identity of that unknown user for the immediately preceding utterance can be determined.
- In the present embodiment, morphological analysis, for example, is used for speech recognition. This analysis technique separates proper nouns from other words.
- For example, the speech recognition system may hold a dictionary of names as a database; speech recognition is then performed by matching the proper nouns extracted by morphological analysis against the dictionary.
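As a rough illustration of this dictionary-matching step (a sketch, not code from the publication), the following Python fragment assumes a morphological analyzer that tags each token with a part of speech; the token format and the name dictionary are hypothetical.

```python
# Hypothetical name dictionary held by the speech recognition system.
NAME_DICTIONARY = {"Taro", "Hanako", "Jiro"}

def extract_name(tokens):
    """Return the first proper noun that matches the name dictionary."""
    for surface, part_of_speech in tokens:
        if part_of_speech == "PROPER_NOUN" and surface in NAME_DICTIONARY:
            return surface
    return None

# Tokens as they might come from a (hypothetical) morphological analyzer.
tokens = [("I'm", "OTHER"), ("Taro", "PROPER_NOUN"), ("!", "SYMBOL")]
print(extract_name(tokens))  # -> Taro
```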
- A speech recognition apparatus according to one aspect includes a microphone for receiving both utterances that contain information identifying a speaker and utterances that do not, a processor for performing speech recognition processing, a speaker for outputting voice, and a processor for controlling the speech recognition apparatus based on the result of the speech recognition processing.
- the processor generates a speaker identification model for identifying a speaker by associating information identifying the speaker with an utterance that does not include information identifying the speaker.
- The speaker identification model may include, for example, a speaker identification ID, the name of the speaker (a user of the speech recognition apparatus), and voiceprint information extracted from the speaker's utterances.
- the information for identifying a speaker is, for example, a name, a nickname, a resident number, an identification number given by a government agency, or other information that can be included in an utterance.
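A minimal sketch of what such a speaker identification model could look like as a data structure follows; the field names are illustrative assumptions, not taken from the publication.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SpeakerModel:
    speaker_id: str                  # speaker identification ID
    name: Optional[str] = None       # name, nickname, or other identifying information
    voiceprints: List[bytes] = field(default_factory=list)  # features from utterances

model = SpeakerModel(speaker_id="user-001", name="Taro")
model.voiceprints.append(b"placeholder-voiceprint-feature")
print(model.name, len(model.voiceprints))  # -> Taro 1
```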
- In another aspect, the speaker outputs an inquiry asking for information identifying the speaker. Associating the information identifying the speaker with an utterance that does not contain such information then includes associating that information with an utterance, made after the inquiry, that does not contain it.
- In another aspect, the speaker outputs the inquiry asking for information identifying the speaker after an utterance that does not contain such information. Associating the two then includes associating the utterance made before the inquiry with the information identifying the speaker contained in the utterance given in response to the inquiry.
- the processor is configured to determine the content of the next utterance output from the speaker based on the content of the response to the utterance output from the speaker.
- In another aspect, the speech recognition apparatus holds a plurality of inquiries in advance, and the inquiries differ hierarchically in difficulty level. The processor first utters an inquiry with a low difficulty level (for example, an easy shiritori problem) and, as the next inquiry, utters an inquiry with a high difficulty level (a harder shiritori problem).
- the speech recognition apparatus further includes a memory for storing the generated speaker identification model.
- the processor is configured to update the generated speaker identification model based on the response to the query.
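One way to picture this hierarchical querying and model updating is sketched below, assuming a simple rule that raises the difficulty after an appropriate response; the query texts and the promotion rule are assumptions for illustration.

```python
# Inquiries held in advance at hierarchically different difficulty levels.
QUERIES = {
    1: "an easy shiritori problem",
    2: "a medium shiritori problem",
    3: "a hard shiritori problem",
}

def next_query(current_level, response_was_appropriate):
    """Choose the next inquiry, raising the difficulty after a good response."""
    if response_was_appropriate:
        current_level = min(current_level + 1, max(QUERIES))
    return current_level, QUERIES[current_level]

level = 1
level, query = next_query(level, response_was_appropriate=True)
print(level, query)  # -> 2 a medium shiritori problem
```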
- a speech recognition system includes a terminal and a device capable of communicating with the terminal.
- The terminal includes a microphone for receiving both utterances that contain information identifying a speaker and utterances that do not, a speaker for outputting sound, and a communication interface, electrically connected to the microphone and the speaker, for communicating with the device.
- the apparatus includes a communication interface for communicating with a terminal, a processor for performing voice recognition processing, and a processor for controlling the apparatus based on the result of the voice recognition processing.
- the processor generates a speaker identification model for identifying a speaker by associating information identifying the speaker with an utterance that does not include information identifying the speaker.
- FIG. 1 is a diagram illustrating an exchange between the user 1 and the terminal 2 when a shiritori game is performed.
- User 1 issues message 10 to terminal 2.
- When the terminal 2 recognizes the message 10, it issues a message 11 as a response.
- User 1 issues a message 12 to terminal 2.
- When the terminal 2 recognizes the message 12, it issues a message 13 synthesized from the name contained in the message 12 and a predefined message.
- When a predetermined time has elapsed, the terminal 2 issues a message 14. Recognizing the message 14, the user 1 thinks of a word that follows it and, within a predetermined time, issues a message 15 to the terminal 2 as a response. When the terminal 2 recognizes the message 15, it refers to a language dictionary prepared in advance, finds a word that follows the message 15, and issues it as a message 16 within a predetermined time. In this way, the user 1 and the terminal 2 continue the shiritori game.
- the user 1 issues a message 17 to the terminal 2.
- When the terminal 2 recognizes the message 17, it issues a message 18.
- At some point, the user 1 may be unable to return the next word. In this case, the user 1 keeps silent or issues a message 19 indicating that he or she does not know an answer.
- When the terminal 2 determines that there has been no response from the user 1 within a predetermined waiting time, or when it recognizes the message 19, it issues a message 20 defined in advance for that situation.
- Through this exchange of messages with the user 1, the terminal 2 recognizes that the user 1 is “Taro”, and associates “Taro” as user information with each piece of voice data.
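The retroactive labeling illustrated by this exchange can be pictured with the following sketch: utterances from a not-yet-identified speaker are buffered and, once a name is obtained through the dialogue, the buffered voice data are associated with that user. All names and structures here are illustrative assumptions.

```python
session_buffer = []   # voice data captured while the speaker is still unknown
user_database = {}    # user name -> list of voice data labeled with that name

def on_utterance(voice_data):
    """Buffer an utterance from a not-yet-identified speaker."""
    session_buffer.append(voice_data)

def on_name_identified(name):
    """Retroactively label everything heard in this session with the name."""
    user_database.setdefault(name, []).extend(session_buffer)
    session_buffer.clear()

on_utterance(b"audio-for-message-15")
on_utterance(b"audio-for-message-17")
on_name_identified("Taro")
print(len(user_database["Taro"]))  # -> 2
```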
- FIG. 2 is a diagram illustrating an outline of the configuration of the speech recognition system according to the first embodiment of the present disclosure.
- one terminal 200 functions as a voice recognition system.
- The terminal 200 includes a control unit 30, a voice input unit 31, a voice output unit 32, a speaker identification unit 33, a speaker identification learning unit 34, a user management unit 35, a voice recognition unit 36, and a dialog analysis/generation unit 37.
- the terminal 200 may be a terminal having a voice input / output function and a voice recognition function, for example.
- the terminal may include, for example, a smartphone, a television, a cleaning robot that can operate standalone, and other devices.
- the control unit 30 controls the operation of the terminal 200.
- the voice input unit 31 receives a voice input and outputs a signal to the control unit 30.
- the audio output unit 32 converts the signal output from the control unit 30 into audio and outputs the audio to the outside of the terminal 200.
- the audio output unit 32 includes, for example, a speaker, a terminal, and the like.
- the speaker identification unit 33 identifies a speaker who has made an utterance to the terminal 200 based on a signal sent from the control unit 30. In another aspect, the speaker identification unit 33 identifies a speaker based on the signal and data stored in the terminal 200.
- the data can include, for example, voiceprint information registered in advance as a user of the terminal 200.
- the speaker identification learning unit 34 creates data (user profile) for each speaker using the information (user ID or the like) of the speaker identified by the speaker identifying unit 33.
- the user management unit 35 stores user information of the terminal 200.
- the user information may include a user profile and the like.
- the voice recognition unit 36 performs voice recognition processing using the voice signal sent from the control unit 30. For example, the voice recognition unit 36 extracts characters included in the utterance.
- the dialog analysis / generation unit 37 analyzes a message to the terminal 200 based on the recognition result by the voice recognition unit 36. Furthermore, the dialog analysis / generation unit 37 generates a response according to the utterance according to the analysis result. In another aspect, the dialog analysis / generation unit 37 generates an utterance for encouraging the user of the terminal 200 based on the setting in the terminal 200.
- The setting may include, for example, the terminal 200 having detected the presence of a user nearby, or a preset time having arrived.
- FIG. 3 is a diagram illustrating an outline of the configuration of the speech recognition system according to the second embodiment of the present disclosure.
- the voice recognition system includes a terminal 300 and a server 350.
- the terminal 300 includes an audio input unit 31 and an audio output unit 32.
- Terminal 300 is controlled by a processor (not shown).
- the server 350 includes a control unit 30, a speaker identification unit 33, a speaker identification learning unit 34, a user management unit 35, a voice recognition unit 36, and a dialog analysis / generation unit 37.
- the terminal 300 is realized as a terminal having a voice input / output function and a communication function, for example.
- Such a terminal may include, for example, a mobile phone or other information communication terminal, a cleaning robot or the like having a voice recognition function and a communication function, and the like.
- When the terminal 300 accepts the user's utterance, it transmits an audio signal corresponding to the utterance to the server 350 via a communication interface (not shown). Upon receiving the voice signal, the server 350 executes processing such as speaker identification, voice recognition, dialog analysis, and response generation. Each process is the same as that realized by the configuration shown in FIG. 2, and thus detailed description will not be repeated.
- the server 350 transmits the generated response to the terminal 300 via a communication interface (not shown).
- the audio output unit 32 outputs audio corresponding to the response.
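A rough sketch of this terminal-server exchange follows, with the network transport and all server-side processing reduced to stub functions; none of the function names or message texts come from the publication.

```python
def recognize(audio):
    """Stub for the voice recognition unit 36."""
    return "shiritori"

def identify_speaker(audio):
    """Stub for the speaker identification unit 33; None means unknown."""
    return None

def generate_response(text, speaker):
    """Stub for the dialog analysis/generation unit 37."""
    if speaker is None:
        return "Let's start shiritori. What is your name?"
    return f"{speaker}, it is your turn."

def server_350(audio):
    """Server side: speaker identification, recognition, response generation."""
    return generate_response(recognize(audio), identify_speaker(audio))

def terminal_300(audio):
    """Terminal side: forward the audio, then voice the returned response."""
    reply = server_350(audio)
    print(reply)  # stands in for the audio output unit 32

terminal_300(b"utterance-audio")  # -> Let's start shiritori. What is your name?
```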
- FIG. 4 is a diagram illustrating an outline of the configuration of the speech recognition system according to the third embodiment of the present disclosure.
- the voice recognition system includes a terminal 300, a server 400, and a speaker identification server 410.
- the server 400 includes a control unit 30, a user management unit 35, a voice recognition unit 36, and a dialog analysis / generation unit 37.
- the speaker identification server 410 includes a speaker identification unit 33 and a speaker identification learning unit 34.
- the server 400 and the speaker identification server 410 are realized by a computer device having a known configuration.
- The computer includes, as main components, a CPU (Central Processing Unit) that executes programs, input devices such as a keyboard, a RAM (Random Access Memory), a hard disk, an optical disk drive, a monitor, and a communication IF (Interface).
- Processing in the computer is realized by the CPU executing software on this hardware.
- the software is stored in advance on a hard disk.
- the software is stored in a CD-ROM or other computer-readable non-volatile data recording medium and distributed as a program product.
- Alternatively, the software may be provided as a program product made available for download by an information provider connected to the Internet or another network.
- The hardware configuration of such a computer is common; therefore, the description of the hardware configuration of the server 400 and the speaker identification server 410 will not be repeated. It can be said that the essential part for realizing the technical idea according to the present embodiment is the program stored in the computer.
- When the server 400 receives the voice signal transmitted from the terminal 300, it transmits the voice signal to the speaker identification server 410 via the communication interface.
- the speaker identification server 410 recognizes the speaker and generates data for registering the speaker.
- the speaker identification server 410 transmits the generated data to the server 400.
- FIG. 5 is a diagram illustrating an outline of a configuration of a speech recognition system according to the fourth embodiment of the present disclosure.
- the voice recognition system includes a terminal 300, a server 500, a speaker identification server 410, and a voice recognition server 520.
- the server 500 includes a control unit 30, a user management unit 35, and a dialog analysis / generation unit 37.
- the voice recognition server 520 includes a voice recognition unit 36.
- When the server 500 receives the voice signal from the terminal 300, it transmits the voice signal to the speaker identification server 410 and the voice recognition server 520.
- the voice recognition server 520 executes voice recognition processing using the voice signal, and transmits the recognition result to the server 500.
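The fan-out in this embodiment might look like the sketch below, which simulates both servers and waits for their results concurrently; the endpoints, delays, and return values are assumptions for illustration.

```python
import asyncio

async def speaker_identification_server(audio):
    await asyncio.sleep(0.1)  # simulated network/processing delay
    return {"speaker": "Taro"}

async def speech_recognition_server(audio):
    await asyncio.sleep(0.1)
    return {"text": "gorilla"}

async def server_500(audio):
    """Forward the same voice signal to both servers concurrently."""
    identification, recognition = await asyncio.gather(
        speaker_identification_server(audio),
        speech_recognition_server(audio),
    )
    return {**identification, **recognition}

print(asyncio.run(server_500(b"voice-signal")))
# -> {'speaker': 'Taro', 'text': 'gorilla'}
```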
- FIG. 6 is a block diagram illustrating a configuration of functions that implement the speech recognition system according to the present disclosure.
- the voice recognition system includes a terminal module 600, a main module 610, a speaker identification module 620, and a voice recognition module 630.
- the terminal module 600 includes an audio input unit 31 and an audio output unit 32.
- the terminal module 600 is in the vicinity of the user, accepts an utterance, and transmits voice data and a terminal ID to the main module 610.
- the terminal module 600 receives the synthesized voice data sent from the main module 610 and outputs voice based on the synthesized voice data from the voice output unit 32.
- the control unit 30 transmits the voice data and the speaker model list to the speaker identification module 620.
- The speaker identification module 620 identifies the speaker and transmits the speaker identification result (for example, a message ID, a flag indicating that the speaker has been identified, etc.) to the main module 610.
- the control unit 30 transmits the terminal ID or voice data to the user management unit 35.
- the user management unit 35 stores the terminal ID or voice data.
- the control unit 30 reads the speaker model list from the user management unit 35. For example, the control unit 30 exchanges text data with the dialog analysis / generation unit 37.
- the control unit 30 transmits the voice data to the voice recognition module 630.
- the speech recognition module 630 executes speech recognition processing using speech data
- the speech recognition module 630 sends the result to the control unit 30 as text.
- FIG. 7 is a diagram conceptually showing one mode of storing data held in the voice recognition system.
- the voice recognition system includes a terminal management table, a home management table, and a user management table.
- the terminal management table includes a terminal ID and a belonging user ID.
- the terminal ID identifies a terminal registered in the voice recognition system.
- the terminal ID is uniquely assigned by a manager of the voice recognition system (for example, a manager of a computer including the control unit 30).
- In another aspect, the terminal ID may be an arbitrary character string (for example, alphanumeric characters or symbols) chosen by the user of the terminal.
- The control unit 30 checks whether an ID entered by the user is already in use, so that terminal IDs are not duplicated. If an ID already in use is entered, the terminal is notified to that effect, as sketched below.
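A minimal sketch of this duplicate-ID check, with the registry reduced to an in-memory set and the notification reduced to a return string; both are assumptions for illustration.

```python
registered_terminal_ids = {"robot01", "tv01"}  # hypothetical existing IDs

def register_terminal(terminal_id):
    """Reject a terminal ID that is already in use; otherwise register it."""
    if terminal_id in registered_terminal_ids:
        return "This terminal ID is already in use."  # notify the terminal
    registered_terminal_ids.add(terminal_id)
    return "Registered."

print(register_terminal("robot01"))  # -> This terminal ID is already in use.
print(register_terminal("robot02"))  # -> Registered.
```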
- the affiliated user ID identifies a user registered as a user of the terminal.
- the number of users of the terminal is not particularly limited.
- the home management table includes a home ID and a terminal ID of a terminal belonging to the home.
- the home ID identifies the home as a group of users who use the service of the voice recognition system.
- The unit of the user group is not limited to the home; any grouping may be used as long as a plurality of users can be associated with one group.
- the home ID is associated with each terminal ID of one or more terminals. The number of terminals associated with the home is not particularly limited.
- the user management table includes a user ID, a user name, speaker model data, and a voice data list.
- the user ID identifies the user who uses the terminal.
- the user name identifies the user to whom the user ID is assigned.
- the speaker model data is data for identifying the user.
- the speaker model data may include voiceprint information, for example.
- the audio data list includes audio data for identifying the user.
- the voice data may include an utterance from the user to the terminal, a user response to the terminal utterance, an utterance by the user of a character string displayed on the terminal, and the like.
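One way to picture the three tables of FIG. 7 as in-memory records is sketched below; the keys and sample values are illustrative assumptions, not data from the publication.

```python
terminal_table = {
    "robot01": {"belonging_user_ids": ["user-001", "user-002"]},
    "tv01": {"belonging_user_ids": ["user-001"]},
}
home_table = {
    "home-A": {"terminal_ids": ["robot01", "tv01"]},
}
user_table = {
    "user-001": {
        "user_name": "Taro",
        "speaker_model": None,   # filled in once learning completes
        "voice_data_list": [],   # utterances collected for this user
    },
}

# Example lookup: every user reachable through the terminals of home "home-A".
users = {
    user_id
    for terminal_id in home_table["home-A"]["terminal_ids"]
    for user_id in terminal_table[terminal_id]["belonging_user_ids"]
}
print(users)  # -> {'user-001', 'user-002'}
```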
- FIG. 8 is a diagram illustrating a state in which the speaker model 80 is generated by the dialogue between the user 1 and the terminal 2. The description of the same state as that in FIG. 1 will not be repeated.
- In the dialog between the user 1 and the terminal 2, when the user 1 is not registered, the terminal 2 first asks for the user name and then registers the user's utterances as voice data in the database for a certain period thereafter (for example, until the end of the game).
- the audio data can include voiceprint information.
- The speaker identification learning unit learns speaker identification using, as learning data, all the speech data in the target speech DB (database).
- An ID is assigned to each terminal.
- the speaker identification learning unit 34 reads the terminal ID and user name stored in the user management unit 35 and generates a speaker model 80.
- the speaker model 80 includes the user name and the terminal ID. Therefore, thereafter, when the user name is specified by the terminal 2 interacting with the user 1, the speaker model 80 associated with the user can be used.
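A sketch of this learning step is given below, with feature extraction reduced to a placeholder function; a real implementation would train on acoustic features such as voiceprints rather than hashes.

```python
def extract_voiceprint(voice_data):
    """Placeholder for real feature extraction from one utterance."""
    return hash(voice_data)

def build_speaker_model(user_name, terminal_id, voice_db):
    """Learn a speaker model from all stored voice data of the target user."""
    features = [extract_voiceprint(v) for v in voice_db]
    return {"user_name": user_name, "terminal_id": terminal_id,
            "features": features}

model_80 = build_speaker_model("Taro", "robot01",
                               [b"audio-for-gorilla", b"audio-for-camel"])
print(model_80["user_name"], len(model_80["features"]))  # -> Taro 2
```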
- FIGS. 9 to 11 are sequence charts showing the flow of processing when the user initiates the utterance.
- In step 910, the user makes an utterance that starts a sequence for speaker identification learning.
- For example, the user issues a message 911, “Let's play shiritori.”
- the voice input unit 31 transmits a voice signal corresponding to the message 911 to the control unit 30.
- step 915 when the control unit 30 detects that the voice signal has been received, the control unit 30 transmits a voice recognition request to the voice recognition unit 36.
- step 920 when detecting that the voice signal has been received, the control unit 30 transmits a speaker model list acquisition request to the user management unit 35.
- the speaker model list acquisition request requests access to the speaker model list associated with the user who gave the utterance.
- In step 925, the user management unit 35 transmits a speaker model list response to the control unit 30 in response to the speaker model list acquisition request.
- the speaker model list response includes the acquisition result of the speaker model list associated with the user.
- step 930 the control unit 30 transmits a speaker identification request to the speaker identification unit 33.
- When the speaker identification unit 33 detects the reception of the speaker identification request, it refers to the data stored in the user management unit 35 and tries to identify the user (speaker) who made the utterance in step 910.
- step 935 the voice recognition unit 36 transmits a voice recognition response to the control unit 30 in response to the voice recognition request in step 915.
- the voice recognition response includes whether or not the voice recognition is successful.
- The speaker identification unit 33 transmits a speaker identification failure response to the control unit 30. That is, since the user is not registered in the voice recognition system, the speaker identification unit 33 cannot identify the user (speaker) who made the utterance. Therefore, a speaker identification failure response notifying that speaker identification has failed is sent from the speaker identification unit 33 to the control unit 30.
- the control unit 30 transmits a dialog analysis / generation request to the dialog analysis / generation unit 37 in response to the reception of the speaker identification failure response.
- the dialogue analysis / generation request may include a voice identification result and a speaker identification result.
- Upon receiving the dialog analysis/generation request, the dialog analysis/generation unit 37 generates a message for acquiring the name of the user who made the utterance.
- For example, the dialogue analysis/generation unit 37 creates the message 946 (for example, “Let's start shiritori. What's your name?”) using a template prepared in advance in the speech recognition system and the term “shiritori” contained in the message 911.
- step 950 the dialog analysis / generation unit 37 transmits the generated message 946 to the control unit 30.
- When detecting the reception of the message, the control unit 30 generates a voice response including the message and the terminal ID of the terminal at which the utterance was made.
- step 955 the control unit 30 transmits the audio response to the audio output unit 32.
- When the audio output unit 32 receives the audio response signal, it outputs audio based on the signal. The user recognizes the voice and speaks in response; the utterance is received by the voice input unit 31.
- step 960 the voice input unit 31 transmits the content of the received message 961 (name registration utterance) to the control unit 30.
- The message 961 includes an answer (a name) to the message 946, for example, “I'm Taro!”
- When the control unit 30 detects the reception of the message 961, it generates a voice recognition request.
- step 965 the control unit 30 transmits a voice recognition request to the voice recognition unit 36.
- When the voice recognition unit 36 detects the reception of the voice recognition request, it executes voice recognition processing of the message 961.
- step 970 the control unit 30 transmits a speaker model list acquisition request to the user management unit 35.
- When the user management unit 35 detects the reception of the speaker model list acquisition request, it tries to acquire the speaker model list.
- the user management unit 35 generates a result of the acquisition attempt as a speaker model list response.
- step 975 the user management unit 35 transmits a speaker model list response to the control unit 30.
- step 980 in response to receiving the speaker model list response, the control unit 30 transmits a speaker identification request to the speaker identification unit 33.
- When the speaker identification unit 33 detects the reception of a speaker identification request, it starts speaker identification and generates an identification result.
- The voice recognition unit 36 transmits a voice recognition response to the control unit 30 as a response to the voice recognition request in step 965.
- the voice recognition response may include information indicating that the content of the message 961 has been recognized.
- step 1015 the speaker identification unit 33 transmits a speaker identification failure response to the control unit 30. That is, the speaker is not registered in the speech recognition system. Therefore, the speaker identification unit 33 generates a response indicating that the attempt to identify the speaker has failed.
- step 1020 the control unit 30 transmits a dialog analysis / generation request to the dialog analysis / generation unit 37.
- In response to receiving the dialog analysis/generation request, the dialog analysis/generation unit 37 generates a message 1031 for the dialog.
- The message 1031 is generated as a message that includes the content of the utterance and the information identifying the speaker, for example, “Taro-san, let's start.”
- step 1030 the dialog analysis / generation unit 37 transmits the message 1031 to the control unit 30.
- When detecting the reception of the message 1031, the control unit 30 generates an audio response including the message 1031 and the terminal ID in order to respond to the utterance at the terminal.
- step 1035 the control unit 30 transmits the voice response to the terminal.
- When receiving the audio response signal, the audio output unit 32 of the terminal outputs audio based on the signal.
- When the user recognizes the voice, the user speaks to the terminal, considering the next response.
- the voice input unit 31 receives the utterance, for example, “gorilla”.
- step 1040 the voice input unit 31 transmits the received message 1041 to the control unit 30.
- When the control unit 30 detects the reception of the message 1041, it generates a voice recognition request.
- step 1045 the control unit 30 transmits a voice recognition request to the voice recognition unit 36.
- When the voice recognition unit 36 receives the request, it starts voice recognition processing.
- step 1050 the control unit 30 transmits a speaker voice storage / list acquisition request to the user management unit 35.
- The user management unit 35 stores the identification ID of the speaker and the name of the speaker (“Taro”) in association with each other. Further, the user management unit 35 generates a response indicating that the speaker's voice has been successfully stored.
- step 1055 the user management unit 35 transmits a speaker voice storage / list acquisition response to the control unit 30 as the response.
- step 1060 the control unit 30 transmits a speaker identification model learning request to the speaker identification learning unit 34.
- When the speaker identification learning unit 34 detects the reception of the request, it generates a speaker identification model by associating the voice with the user who made the utterance, and updates the model as appropriate.
- step 1065 the voice recognition unit 36 transmits the processing result based on the voice recognition request to the control unit 30 as a voice recognition response.
- step 1070 the speaker identification learning unit 34 transmits a speaker identification learning response to the control unit 30 in response to the speaker identification model learning request.
- step 1075 the control unit 30 generates a dialog analysis / generation request, and transmits the generated request to the dialog analysis / generation unit 37. For example, if the control unit 30 determines that there is not enough data for the speaker's learning and the learning has failed, the control unit 30 generates the request.
- When the dialog analysis/generation unit 37 detects the reception of the request, it generates a message 1081 for further learning (for example, “Gorilla... then, ‘camel’.”).
- step 1080 the dialog analysis / generation unit 37 transmits the generated message 1081 to the control unit 30.
- When the control unit 30 receives the message 1081, it generates a voice response including the terminal ID and the message 1081.
- step 1085 the control unit 30 transmits the generated voice response to the terminal.
- the voice output unit 32 outputs voice based on the voice response.
- the voice input unit 31 accepts the user's utterance and generates a voice response corresponding to the utterance.
- voice input unit 31 transmits message 1111 (for example, “diamond”) to control unit 30.
- When detecting the reception of the message 1111, the control unit 30 generates a speech recognition request and a speaker voice storage/list acquisition request.
- step 1115 the control unit 30 transmits a voice recognition request to the voice recognition unit 36.
- When the voice recognition unit 36 detects the reception of the request, it starts voice recognition processing of the message 1111.
- step 1120 the control unit 30 transmits a message 1111 and a speaker voice storage / list acquisition request to the user management unit 35.
- the user management unit 35 stores the content (voice data) of the message 1111 in association with the identification ID of the user (speaker).
- the control unit 30 transmits a speaker identification model learning request to the speaker identification learning unit 34.
- When the speaker identification learning unit 34 detects the reception of the request, it learns a speaker identification model. More specifically, the speaker identification learning unit 34 stores the user identification ID and the voice information (for example, voiceprint information) contained in the message 1111 in association with each other.
- step 1135 in response to the completion of the voice recognition process, the voice recognition unit 36 generates a voice recognition response notifying the result of the voice recognition process, and transmits the response to the control unit 30.
- the speaker identification learning unit 34 transmits the generated response to the control unit 30.
- When the control unit 30 receives the response from the voice recognition unit 36 and the response from the speaker identification learning unit 34, it determines whether enough data for learning has been gathered and learning is complete. For example, when more than a predetermined number of pieces of voice data are associated with the user identification ID, the control unit 30 determines that sufficient data for learning is available and that learning is complete.
- The control unit 30 generates a dialog analysis/generation request based on the content of the response received from the voice recognition unit 36 and the response received from the speaker identification learning unit 34. For example, the control unit 30 generates the request when it determines, from the results of the responses, that the speech recognition succeeded and that sufficient data for learning is available and learning is complete. “Data sufficient for learning” means, for example, that the amount of information extracted from the speech data within a predetermined time (such as the number of voiceprint records of a fixed data size) exceeds the amount defined as required for learning.
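A minimal sketch of such a completion test, assuming the criterion is simply a count of stored voiceprint features; the threshold is an arbitrary placeholder.

```python
REQUIRED_FEATURE_COUNT = 10  # arbitrary placeholder threshold

def learning_complete(voiceprints):
    """Treat learning as done once enough features have been stored."""
    return len(voiceprints) >= REQUIRED_FEATURE_COUNT

print(learning_complete([b"feature"] * 3))   # -> False: keep collecting
print(learning_complete([b"feature"] * 12))  # -> True: enough data
```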
- step 1145 the control unit 30 transmits the generated request to the dialog analysis / generation unit 37.
- When the dialog analysis/generation unit 37 detects the reception of the request, it generates a message 1151 responding to the message 1111.
- step 1150 the dialog analysis / generation unit 37 transmits the generated message 1151 to the control unit 30.
- When detecting the reception of the message 1151, the control unit 30 generates a voice response including the terminal ID and the message 1151.
- In step 1155, the control unit 30 transmits the voice response to the terminal.
- When the terminal receives the voice response, it outputs the voice from the voice output unit 32.
- FIG. 12 is a diagram illustrating an exchange sequence between the user 1 and the terminal 2 when the user is known to the voice recognition system.
- the same operations as those described above are denoted by the same numbers. Therefore, description of the same operation will not be repeated.
- The speaker model is updated with each exchange. Therefore, speaker identification based on the user's latest voice data is always possible.
- User 1 issues message 10 to terminal 2.
- the terminal 2 executes voice recognition processing and speaker identification processing. If the terminal 2 determines that the speaker of the message 10 has been identified based on the result of the speaker identification process, the terminal 2 issues a message 1210 according to the determination result.
- Message 1210 includes a response to message 10 and an inquiry to identify the speaker of message 10.
- the terminal 2 performs voice recognition processing and speaker identification processing on the message 1220.
- When the terminal 2 determines from the content of the message 1220 that an answer to the inquiry has been obtained, it transmits data including the terminal ID of the terminal 2 and the user name (Taro) to the user management unit 35.
- the user management unit 35 accumulates the data. Further, the terminal 2 issues a message 1230 for the message 1220.
- the terminal 2 transmits data including the terminal ID and the user name to the user management unit 35.
- the user management unit 35 stores each data.
- the speaker identification learning unit 34 refers to the terminal ID and the user name from the user management unit 35, reads data associated with the user from the accumulated data, and creates a speaker model 80.
- FIGS. 13 and 14 are sequence charts showing the flow of processing performed when the user is known. The same steps as those described above are denoted by the same step numbers, and their description will not be repeated.
- the speaker identification unit 33 transmits a speaker identification response to the control unit 30 in order to notify that the speaker identification has been successful.
- When the control unit 30 detects the reception of this response and of the response from the voice recognition unit 36, it generates a dialog analysis/generation request.
- the request includes a voice identification result and a speaker identification result.
- step 1345 the control unit 30 transmits a dialog analysis / generation request to the dialog analysis / generation unit 37.
- When the dialog analysis/generation unit 37 detects the reception of the request, it generates a message 1351 for responding to the message 911.
- the message 1351 includes a response to the message 911 and an inquiry for confirming the speaker of the message 911.
- step 1350 the dialog analysis / generation unit 37 transmits the generated message 1351 to the control unit 30.
- When the control unit 30 transmits a voice response including the message 1351 and the terminal ID to the terminal, the voice output unit 32 of the terminal outputs the voice.
- the user recognizes the voice and determines that the voice is correct, for example, the user issues a message 1361 “Yes” (name registration utterance).
- step 1360 when the voice input unit 31 receives an input of the message 1361, the voice input unit 31 transmits a voice signal corresponding to the input to the control unit 30. Thereafter, the control unit 30 transmits a voice recognition request to the voice recognition unit 36 (step 965).
- The speaker identification unit 33 transmits a response to the speaker identification request (step 980) to the control unit 30 as a speaker identification response. If the user is known to the speech recognition system, the speaker identification response indicates that the speaker has been identified. When detecting the reception of the response, the control unit 30 generates a dialog analysis/generation request.
- step 1420 the control unit 30 transmits the generated dialog analysis / generation request to the dialog analysis / generation unit 37.
- When the dialog analysis/generation unit 37 detects the reception of the request, it generates a message 1431.
- Based on the result of the exchange so far, the message 1431 includes a response (for example, “I knew it!”) that reflects the correctness of the inquiry (“Taro-san?”) included in the message 1351.
- step 1430 the dialog analysis / generation unit 37 transmits a message 1431 to the control unit 30.
- When the control unit 30 detects the reception of the message 1431, it generates an audio response including the terminal ID and the message 1431.
- control unit 30 transmits a voice response to the terminal.
- the voice output unit 32 outputs the message 1431 by voice based on the voice response.
- The processing from step 1040 onward is performed in the same manner as described above.
- In this way, the voice data is stored, and the learning data (for example, voiceprint information) is continually updated with new voice data of the target user.
- the terminal does not make an utterance for confirming the user's name.
- FIG. 15 is a diagram illustrating a case in which talking to the user 1 from the terminal 2 triggers a conversation.
- When the terminal 2 detects the presence of the user 1, it talks to the user 1.
- the presence of the user 1 is detected based on, for example, an output from an infrared sensor, a human sensor, or the like.
- the terminal 2 issues a message 1510, for example.
- User 1 recognizes message 1510.
- User 1 issues message 1520 in response to message 1510.
- the terminal 2 executes voice recognition processing and speaker identification processing.
- The terminal 2 switches its utterance to the user 1 based on the results of these processes. For example, if it determines that the speaker is not known, the terminal 2 generates a message 1530 and outputs it by voice.
- User 1 issues message 1540 to terminal 2 in response to message 1530.
- The terminal 2 performs voice recognition processing and speaker identification processing on the message 1540. Further, the terminal 2 associates the recognized speaker “Taro” as a user name of the terminal 2 with the terminal ID, and accumulates, in the user management unit 35, the messages 1520 and 1540 received so far from the user 1 as voice data of that speaker.
- the terminal 2 generates a message 1550 as a response to the message 1540 and outputs the message 1550 by voice.
- the user management unit 35 stores voice data associated with the user “Taro” and identification information (for example, voiceprint information) acquired from the voice data.
- FIGS. 16 and 17 are sequence charts showing a part of the processing performed in the voice recognition system.
- step 1610 when detecting that a predetermined condition is satisfied, the control unit 30 transmits a dialog generation request to the dialog analysis / generation unit 37.
- The condition is, for example, that the presence of a user has been detected within the range of the voice recognition system, or that a predetermined time has arrived.
- the dialog generation request includes, for example, a request to generate a message 1510 for speaking to the detected user.
- When the dialog analysis/generation unit 37 detects the reception of the request, it generates a message 1510 based on a template prepared in advance.
- step 1615 the dialog analysis / generation unit 37 transmits the message 1510 generated in response to the request to the control unit 30.
- When the control unit 30 detects the reception of the message 1510, it transmits a voice utterance request including the message 1510 and the terminal ID to the terminal.
- When receiving the request, the voice output unit 32 of the terminal outputs the message 1510 by voice.
- When the user recognizes the message 1510, the user issues a message 1520 as a response to the message 1510.
- step 1625 the voice input unit 31 transmits the message 1520 as a voice signal to the control unit 30. Thereafter, from step 915 to step 1345, processing similar to that described above is executed.
- step 1350 the dialog analysis / generation unit 37 transmits a message 1530 to the control unit 30.
- the user issues message 1540.
- the message 1540 is sent from the control unit 30 to the voice recognition unit 36, and voice recognition processing is executed (step 1045).
- step 1740 the control unit 30 transmits a dialog analysis / generation request to the dialog analysis / generation unit 37.
- When the dialog analysis/generation unit 37 detects the reception of the request, it generates a message 1550 corresponding to the request.
- step 1742 the dialog analysis / generation unit 37 transmits a message 1550 to the control unit 30.
- When the control unit 30 detects the reception of the message 1550, it generates an audio response including the terminal ID and the message 1550.
- Next, the process of step 1750 is executed. More specifically, in step 1751, the control unit 30 transmits a dialog analysis/generation request to the dialog analysis/generation unit 37.
- When the dialog analysis/generation unit 37 detects the reception of the request, it generates a message 1560 for responding to the request.
- step 1752 the dialog analysis / generation unit 37 transmits a message 1560 to the control unit 30.
- When the control unit 30 detects the reception of the message 1560, it generates an audio response including the terminal ID and the message 1560.
- step 1760 the control unit 30 transmits the voice response to the terminal.
- the voice output unit 32 outputs the message 1560 by voice.
- Voice recognition and voice authentication are performed in parallel. Therefore, recognition of the user's utterance content and authentication of the user are performed simultaneously.
- a topic of interest of each user is estimated based on the log of the dialog content, and a dialog based on the estimated topic is generated.
- A voice dialogue system to which this technical idea is applied can identify the user (voice authentication) and acquire the content of the user's speech (voice recognition) without using information from a device such as a camera or a wireless tag.
- The user's daily conversations are stored in the voice dialogue system and analyzed as necessary. Based on the analysis results, the voice dialogue system can acquire topics that interest each user (sports, entertainment news, etc.) from another information providing device and provide topics suited to the user it is currently interacting with.
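A toy sketch of such topic estimation by keyword counting over stored utterances follows; the topics and keyword lists are illustrative assumptions, and a real system would use more robust text classification.

```python
from collections import Counter

TOPIC_KEYWORDS = {
    "sports": {"baseball", "soccer", "match"},
    "entertainment": {"movie", "singer", "drama"},
}

def estimate_topic(utterances):
    """Pick the topic whose keywords appear most often in the dialog log."""
    counts = Counter()
    for text in utterances:
        words = set(text.lower().split())
        for topic, keywords in TOPIC_KEYWORDS.items():
            counts[topic] += len(words & keywords)
    return counts.most_common(1)[0][0] if counts else None

log = ["I watched the soccer match", "that match was great"]
print(estimate_topic(log))  # -> sports
```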
- Furthermore, because the dialogue between the voice dialogue system and the user takes place periodically and over a long time, the expression of utterances from the voice dialogue system (wording, tone, and so on) can change according to the dialogue content.
- As a result, the user can become familiar with the voice dialogue system (or a voice input/output terminal, such as a robot, included in the voice dialogue system).
- As described above, according to the voice recognition system of the present embodiment, the voice data necessary for learning can be obtained while the user simply carries out a normal voice conversation, without being aware of any preprocessing for learning. Therefore, the functions provided by the system can be used easily.
- In addition, user authentication is performed without the user's awareness, and topics suited to the user are output. Therefore, the user can become familiar with the services and functions provided by the voice recognition system.
- 30 control unit, 31 voice input unit, 32 voice output unit, 33 speaker identification unit, 34 speaker identification learning unit, 35 user management unit, 36 voice recognition unit, 37 dialog analysis/generation unit, 80 speaker model, 350, 400, 500 server, 410 speaker identification server, 520 speech recognition server, 600 terminal module, 610 main module, 620 speaker identification module, 630 speech recognition module.
Abstract
Provided is a system capable of identifying speakers without a user being conscious of learning.
A speech recognition system is provided with a terminal (300) and a server (350). The terminal (300) is provided with a speech input unit (31) for receiving speech input and a speech output unit (32) for outputting speech. The server (350) is provided with: a control unit (30) for controlling the operations thereof; a speaker identification unit (33) for identifying speakers on the basis of signals and data stored in the terminal (300); a speaker identification learning unit (34) for creating data (user profile) for each user using information (user ID and the like) for speakers identified by the speaker identification unit (33); a user management unit (35) for storing user information in the terminal (300); a speech recognition unit (36) for performing speech recognition processing; and a conversation analysis/generation unit (37) for analyzing messages for the terminal (300) on the basis of the results of speech recognition and generating responses according to those messages in accordance with the results of the analysis.
Description
This disclosure relates to speech recognition, and more specifically to technology for identifying a speaker.
A technology for identifying a speaker in speech recognition is known. For example, Japanese Patent Laying-Open No. 2010-217319 (Patent Document 1) discloses a technique for “increasing accuracy for speaker identification in a speaker identification device that identifies a speaker from an audio signal”. (See [Summary]). Japanese Patent Laid-Open No. 7-261781 (Patent Document 2) discloses a “learning method for creating a phoneme model for speaker recognition with high speaker recognition accuracy” (see [Summary]).
In conventional speaker identification based on speech, a model for identifying the speaker is assumed to be given in advance, and the goal is to build an efficient model from shorter user utterances. To this end, the user is asked beforehand to give a short utterance of about one to two minutes, and a speaker identification model is established from the obtained speech data.
The conventional technology requires the user to speak as preprocessing. However, in speaker identification in voice communication, in order to perform a more natural dialogue, it is necessary to acquire learning data without making the user feel that the user's speech is used for learning. Therefore, it is necessary to acquire voice data necessary for building a speaker identification model without imposing a load on the user in a state where the speaker identification model is not built.
The present disclosure has been made to solve the above-described problems, and an object in one aspect is to provide a speech recognition apparatus that can acquire the speech data necessary for building a speaker identification model.
An object in another aspect is to provide a speech recognition system that can acquire speech data necessary to build a speaker identification model.
An object in another aspect is to provide a terminal used in such a speech recognition system.
Still another object is to provide a method for generating a speaker identification model.
A speech recognition apparatus according to an embodiment includes a voice input unit for receiving both utterances that contain information identifying a speaker and utterances that do not, a voice recognition unit for performing speech recognition processing, a voice output unit for outputting voice, and a control unit for controlling the speech recognition apparatus based on the result of the speech recognition processing. The control unit generates a speaker identification model for identifying the speaker by associating the information identifying the speaker with the utterances that do not contain such information.
In one aspect, the voice data necessary for learning can be collected merely through the user's normal voice dialogue, without the user being aware of any preprocessing for learning.
The above and other objects, features, aspects and advantages of the present invention will become apparent from the following detailed description of the present invention which is to be understood in connection with the accompanying drawings.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the following description, the same parts are given the same reference numerals, and their names and functions are also the same. Detailed description of them will therefore not be repeated.
<Technical concept>
According to the present disclosure, when the user is undetermined, the speech data (for example, voiceprint information) necessary for building a speaker identification model is collected by asking for the user's name within the voice dialogue and classifying the users accordingly. For example, in a game-like dialogue such as shiritori (a word-chain game) or tongue twisters, multiple user utterances directed at the game partner (for example, a terminal, a home appliance, or the like) can be expected. In such a case, the device acting as the game partner can ask for the user's name beforehand and then play the game, thereby turning the series of user utterances into training data. Alternatively, by asking for the user's name after an unknown user has spoken, the identity of the unknown user behind the immediately preceding utterance can be determined.
In the present embodiment, morphological analysis, for example, is used as one form of speech recognition. This analysis technique separates proper nouns from other words. For example, the speech recognition system may hold a dictionary of names as a database, and recognition is performed by matching the proper nouns extracted through morphological analysis against that dictionary.
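As a rough illustration of the dictionary-matching idea described above, the following minimal sketch finds a name in a recognized utterance. It is an assumption-laden illustration: the pre-tokenized input stands in for the output of a real morphological analyzer such as MeCab, and the name dictionary is invented for the example.

```python
# Minimal sketch of dictionary-based name extraction. The tokenized input
# stands in for the output of a real morphological analyzer (e.g. MeCab);
# the name dictionary is an illustrative stand-in for the system's database.

NAME_DICTIONARY = {"たろう", "はなこ", "じろう"}  # hypothetical name lexicon

def extract_name(tokens):
    """Return the first token found in the name dictionary, or None."""
    for token in tokens:
        if token in NAME_DICTIONARY:
            return token
    return None

# Tokens an analyzer might produce for the utterance "たろうだよ" ("It's Taro")
print(extract_name(["たろう", "だ", "よ"]))  # -> たろう
```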
<Outline of configuration>
(Configuration 1) A speech recognition apparatus according to one aspect includes a microphone for receiving both utterances that contain information identifying the speaker and utterances that do not, a processor for performing voice recognition processing, a speaker for outputting voice, and a processor for controlling the speech recognition apparatus based on the result of the voice recognition processing. The processor generates a speaker identification model for identifying the speaker by associating the information identifying the speaker with the utterances that do not contain such information. The speaker identification model may include, for example, the speaker's identification ID, the name of the speaker (the user of the speech recognition apparatus), voiceprint information extracted from the speaker's utterances, and the like.
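As a rough picture of the record such a speaker identification model might hold, the sketch below collects the three elements named above (identification ID, user name, voiceprint information). The field names and the feature representation are assumptions for illustration, not the patent's format.

```python
# Sketch of a speaker identification model record holding the three elements
# named above. Field names and the feature-vector representation are assumed.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SpeakerModel:
    speaker_id: str                  # identification ID of the speaker
    user_name: str                   # name of the user of the apparatus
    voiceprints: List[List[float]] = field(default_factory=list)

    def add_utterance(self, features: List[float]) -> None:
        """Associate features from an utterance containing no identifying info."""
        self.voiceprints.append(features)

model = SpeakerModel(speaker_id="user-0001", user_name="たろう")
model.add_utterance([0.12, 0.53, 0.98])  # illustrative voiceprint features
print(model)
```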
In the present embodiment, the information identifying a speaker is, for example, a name, a nickname, a resident registration number, an identification number issued by a government agency, or other such information, expressed as a word or phrase that can be included in an utterance.
(Configuration 2) Preferably, the speaker outputs an inquiry asking for the information identifying the speaker. Associating the information identifying the speaker with an utterance that does not contain such information includes associating that information with an utterance, made after the inquiry, that does not contain information identifying the speaker.
(Configuration 3) Preferably, the speaker outputs the inquiry asking for the information identifying the speaker after an utterance that does not contain such information. Associating the information identifying the speaker with an utterance that does not contain such information includes associating the utterance made before the inquiry with the identifying information contained in the utterance that responds to the inquiry.
(Configuration 4) The processor is configured to determine the content of the next utterance to be output from the speaker based on the content of the response to the utterance output from the speaker. For example, the speech recognition apparatus holds a plurality of prompts in advance, with hierarchically different difficulty levels. In one aspect, if no response is returned within a predetermined time to a prompt of medium difficulty, or if the response is incorrect, the processor utters a prompt of lower difficulty (an easier shiritori problem). In another aspect, if a response comes back early within the predetermined time, the processor utters a prompt of higher difficulty (a harder shiritori problem) as the next prompt.
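The rule described in Configuration 4 might be sketched as follows; the time limits and the integer difficulty scale are assumed values chosen purely for illustration.

```python
# Sketch of the difficulty-adjustment rule of Configuration 4. The thresholds
# and the integer difficulty scale (0 = easiest) are assumed values.

TIME_LIMIT = 10.0      # assumed response deadline, in seconds
FAST_RESPONSE = 3.0    # assumed threshold for an "early" response

def next_difficulty(current: int, answered: bool, correct: bool,
                    response_time: float) -> int:
    """Pick the difficulty of the next prompt from the last response."""
    if not answered or not correct or response_time > TIME_LIMIT:
        return max(0, current - 1)   # fall back to an easier prompt
    if response_time <= FAST_RESPONSE:
        return current + 1           # escalate after a quick correct answer
    return current                   # otherwise stay at the same level

print(next_difficulty(1, answered=True, correct=True, response_time=2.0))    # 2
print(next_difficulty(1, answered=False, correct=False, response_time=11.0)) # 0
```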
(Configuration 5) The speech recognition apparatus further includes a memory for storing the generated speaker identification model. The processor is configured to update the generated speaker identification model based on responses to the prompts.
(Configuration 6) According to another aspect, a speech recognition system is provided. The speech recognition system includes a terminal and a device capable of communicating with the terminal. The terminal includes a microphone for receiving both utterances that contain information identifying the speaker and utterances that do not, a speaker for outputting voice, and a communication interface, electrically connected to the microphone and the speaker, for communicating with the device. The device includes a communication interface for communicating with the terminal, a processor for performing voice recognition processing, and a processor for controlling the device based on the result of the voice recognition processing. The processor generates a speaker identification model for identifying the speaker by associating the information identifying the speaker with the utterances that do not contain such information.
<Background of the technical concept>
The background of the technical concept according to the present embodiment will be described with reference to FIG. 1. FIG. 1 is a diagram illustrating an exchange between user 1 and terminal 2 when a shiritori game is played. User 1 issues message 10 to terminal 2. When terminal 2 recognizes message 10, it issues message 11 as a response.
User 1 issues message 12 to terminal 2. When terminal 2 recognizes message 12, it issues message 13, synthesized from the name contained in message 12 and a predefined message.
When a predetermined time has elapsed, terminal 2 issues message 14. When user 1 recognizes message 14, user 1 thinks of a word that follows message 14 as a response, within a predefined time. User 1 issues message 15 to terminal 2. When terminal 2 recognizes message 15, it consults a Japanese dictionary prepared in advance and finds a word that follows message 15. Within the predefined time, terminal 2 issues message 16 as its word responding to message 15. In this way, user 1 and terminal 2 continue the shiritori game.
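A minimal sketch of how terminal 2 might pick its reply word is shown below. The word list stands in for the Japanese dictionary mentioned above, and the kana folding is one simple way to compare hiragana and katakana; both are assumptions for illustration.

```python
# Sketch of shiritori word selection: find an unused dictionary word whose
# first kana matches the last kana of the user's word. The word list is a
# stand-in for the Japanese dictionary prepared in advance by terminal 2.

from typing import Optional, Set

WORD_LIST = ["ゴリラ", "ラッパ", "パンダ", "りんご"]  # illustrative entries

def to_hiragana(ch: str) -> str:
    """Fold katakana onto hiragana so that ご and ゴ compare as equal."""
    code = ord(ch)
    return chr(code - 0x60) if 0x30A1 <= code <= 0x30F6 else ch

def next_word(user_word: str, used: Set[str]) -> Optional[str]:
    last = to_hiragana(user_word[-1])
    for candidate in WORD_LIST:
        if candidate not in used and to_hiragana(candidate[0]) == last:
            return candidate
    return None  # no continuation found; the terminal would concede

print(next_word("りんご", set()))  # -> ゴリラ
```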
As long as user 1 can return the next word within the predefined time in response to terminal 2's utterance, the shiritori game continues in the same way. For example, user 1 issues message 17 to terminal 2. When terminal 2 recognizes message 17, it issues message 18.
On the other hand, user 1 may be unable to return the next word. In that case, user 1 either remains silent or issues message 19 saying that he or she does not know. When terminal 2 determines that no response has come from user 1 within a predetermined waiting time, or when it recognizes message 19, terminal 2 issues message 20, whose content is predefined for that situation.
In such a case, terminal 2 recognizes through the exchange of messages with user 1 that user 1 is "Taro", and associates "Taro" with each piece of data as user information.
The configurations of the speech recognition systems according to the present disclosure will be described with reference to FIGS. 2 to 5.
[Terminal]
FIG. 2 is a diagram illustrating an outline of the configuration of the speech recognition system according to the first embodiment of the present disclosure. In this speech recognition system, a single terminal 200 functions as the speech recognition system.
The terminal 200 includes a control unit 30, a voice input unit 31, a voice output unit 32, a speaker identification unit 33, a speaker identification learning unit 34, a user management unit 35, a voice recognition unit 36, and a dialog analysis/generation unit 37. The terminal 200 may be any terminal having a voice input/output function and a voice recognition function, including, for example, a smartphone, a television, a cleaning robot capable of standalone operation, and other devices.
The control unit 30 controls the operation of the terminal 200. The voice input unit 31 receives voice input and outputs a signal to the control unit 30. The voice output unit 32 converts the signal output from the control unit 30 into voice and outputs the voice to the outside of the terminal 200; it includes, for example, a speaker, terminals, and the like. The speaker identification unit 33 identifies the speaker who uttered speech to the terminal 200 based on the signal sent from the control unit 30. In another aspect, the speaker identification unit 33 identifies the speaker based on that signal and data stored in the terminal 200; such data may include, for example, voiceprint information registered in advance for users of the terminal 200.
The speaker identification learning unit 34 creates data (a user profile) for each speaker, using the information on the speaker identified by the speaker identification unit 33 (a user ID or the like). The user management unit 35 stores user information of the terminal 200, which may include user profiles and the like. The voice recognition unit 36 performs voice recognition processing using the voice signal sent from the control unit 30; for example, it extracts the characters contained in an utterance.
The dialog analysis/generation unit 37 analyzes messages directed at the terminal 200 based on the recognition results from the voice recognition unit 36, and generates a response to the utterance according to the analysis result. In another aspect, the dialog analysis/generation unit 37 generates an utterance for engaging the user of the terminal 200 based on settings in the terminal 200. Such settings may include, for example, the terminal 200 detecting the presence of a user in its vicinity, the arrival of a preset time, and the like.
[Terminal + server]
FIG. 3 is a diagram illustrating an outline of the configuration of the speech recognition system according to the second embodiment of the present disclosure. The speech recognition system includes a terminal 300 and a server 350. The terminal 300 includes a voice input unit 31 and a voice output unit 32, and is controlled by a processor (not shown). The server 350 includes a control unit 30, a speaker identification unit 33, a speaker identification learning unit 34, a user management unit 35, a voice recognition unit 36, and a dialog analysis/generation unit 37. The terminal 300 is realized, for example, as a terminal having a voice input/output function and a communication function. Such a terminal may include, for example, a mobile phone or other information communication terminal, or a cleaning robot or other device having a voice recognition function and a communication function.
When the terminal 300 accepts a user's utterance, it transmits a voice signal corresponding to that utterance to the server 350 via a communication interface (not shown). Upon receiving the voice signal, the server 350 executes processing such as speaker identification, voice recognition, dialog analysis, and response generation. Each of these processes is the same as the corresponding process realized by the configuration shown in FIG. 2, and detailed description is therefore not repeated.
The server 350 transmits the generated response to the terminal 300 via the communication interface (not shown). When the terminal 300 receives the response, the voice output unit 32 outputs voice corresponding to that response.
[Terminal + server + speaker identification server]
FIG. 4 is a diagram illustrating an outline of the configuration of the speech recognition system according to the third embodiment of the present disclosure. The speech recognition system includes a terminal 300, a server 400, and a speaker identification server 410. The server 400 includes a control unit 30, a user management unit 35, a voice recognition unit 36, and a dialog analysis/generation unit 37. The speaker identification server 410 includes a speaker identification unit 33 and a speaker identification learning unit 34.
The server 400 and the speaker identification server 410 are realized by computer devices having a known configuration. Such a computer includes, as its main components, a CPU (Central Processing Unit) that executes programs, a keyboard and other input devices, a RAM (Random Access Memory), a hard disk, an optical disk drive, a monitor, and a communication IF (Interface).
Processing in the computer is realized by the hardware and by software executed by the CPU. In one aspect, the software is stored in advance on the hard disk. In another aspect, the software is stored on a CD-ROM or other computer-readable non-volatile data recording medium and distributed as a program product. In yet another aspect, the software may be provided as a program product that can be downloaded from an information provider connected to the Internet or another network.
The hardware configuration of such a computer is conventional. Therefore, the description of the hardware configurations of the server 400 and the speaker identification server 410 is not repeated. It can be said that the essential part realizing the technical concept according to the present embodiment is the program stored in the computer.
When the server 400 receives a voice signal sent from the terminal 300, it transmits the voice signal to the speaker identification server 410 via the communication interface.
The speaker identification server 410 recognizes the speaker and also generates data for registering the speaker. The speaker identification server 410 transmits the generated data to the server 400.
[Terminal + server + speaker identification server + voice recognition server]
FIG. 5 is a diagram illustrating an outline of the configuration of the speech recognition system according to the fourth embodiment of the present disclosure. The speech recognition system includes a terminal 300, a server 500, a speaker identification server 410, and a voice recognition server 520. The server 500 includes a control unit 30, a user management unit 35, and a dialog analysis/generation unit 37. The voice recognition server 520 includes a voice recognition unit 36.
When the server 500 receives a voice signal from the terminal 300, it transmits the voice signal to the speaker identification server 410 and the voice recognition server 520. The voice recognition server 520 executes voice recognition processing using the voice signal and transmits the recognition result to the server 500.
The other operations are the same as those in the configurations of the speech recognition systems according to the embodiments described above, and their description is therefore not repeated.
[Functional configuration]
FIG. 6 is a block diagram illustrating the configuration of the functions that implement the speech recognition system according to the present disclosure. The speech recognition system includes a terminal module 600, a main module 610, a speaker identification module 620, and a voice recognition module 630.
The terminal module 600 includes the voice input unit 31 and the voice output unit 32. Located near the user, the terminal module 600 accepts utterances and transmits the voice data together with its terminal ID to the main module 610. In another aspect, the terminal module 600 receives synthesized voice data sent from the main module 610 and outputs voice based on that data from the voice output unit 32.
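The terminal module's send-and-play role might look like the following sketch. The JSON message fields and the stubbed main module are assumptions made for illustration, not the patent's wire format.

```python
# Sketch of the terminal module's exchange with the main module: send the
# captured audio plus the terminal ID, then play back whatever synthesized
# voice data returns. The transport and field names are assumed.

import json

def main_module_stub(request_json: str) -> str:
    """Stand-in for the main module; returns canned synthesized audio."""
    request = json.loads(request_json)
    return json.dumps({"terminal_id": request["terminal_id"],
                       "synthesized_audio": "0001"})  # hex-encoded bytes

def handle_utterance(terminal_id: str, audio: bytes) -> bytes:
    request = json.dumps({"terminal_id": terminal_id, "audio": audio.hex()})
    response = json.loads(main_module_stub(request))
    return bytes.fromhex(response["synthesized_audio"])  # sent to the speaker

print(handle_utterance("terminal-42", b"\x10\x20"))  # -> b'\x00\x01'
```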
In the main module 610, the control unit 30 transmits the voice data and a speaker model list to the speaker identification module 620. When the speaker identification module 620 identifies a speaker, it transmits a speaker identification result (for example, a message ID, a flag indicating that the speaker could be identified, and the like) to the main module 610.
The control unit 30 transmits the terminal ID or the voice data to the user management unit 35. The user management unit 35 stores the terminal ID or the voice data.
The control unit 30 reads the speaker model list from the user management unit 35. The control unit 30 also exchanges, for example, text data with the dialog analysis/generation unit 37.
The control unit 30 transmits the voice data to the voice recognition module 630. When the voice recognition module 630 executes voice recognition processing on the voice data, it sends the result back to the control unit 30 as text.
The functions shown in FIG. 6 are realized by any of the configurations shown in FIGS. 2 to 5.
[Data structure]
The data structure of the speech recognition system according to the present embodiment will be described with reference to FIG. 7. FIG. 7 is a diagram conceptually showing one manner of storing the data held in the speech recognition system. In one aspect, the speech recognition system includes a terminal management table, a home management table, and a user management table.
(Terminal management table)
The terminal management table contains terminal IDs and affiliated user IDs. A terminal ID identifies a terminal registered in the speech recognition system. In one aspect, the terminal ID is assigned uniquely by an administrator of the speech recognition system (for example, the administrator of the computer that includes the control unit 30). In another aspect, the terminal ID is composed of an arbitrary character string desired by the user of the terminal (for example, alphanumeric characters or symbols). In that case, to avoid duplicate terminal IDs, the control unit 30, for example, checks whether the ID entered by the user is already in use and, if a used terminal ID is entered, notifies the terminal to that effect. An affiliated user ID identifies a user registered as a user of the terminal. The number of users of a terminal is not particularly limited.
(Home management table)
The home management table contains home IDs and the terminal IDs of the terminals belonging to each home. A home ID identifies a home as a group of users who use the services of the speech recognition system. The unit of a user group is not limited to a home; any grouping in which multiple users can be associated with a single group will do. Each home ID is associated with the terminal IDs of one or more terminals, and the number of terminals associated with a home is not particularly limited.
(User management table)
The user management table contains user IDs, user names, speaker model data, and voice data lists.
A user ID identifies a user who uses a terminal. A user name identifies the user to whom the user ID is assigned. The speaker model data is data for identifying that user and may include, for example, voiceprint information.
The voice data list contains voice data for identifying the user. Such voice data may include utterances from the user to the terminal, the user's responses to the terminal's utterances, the user's readings of character strings displayed on the terminal, and the like.
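The three tables might be pictured as the following in-memory records. The field names follow the description above, while the concrete types and example values are assumptions.

```python
# Sketch of the terminal, home, and user management tables as in-memory
# records. Field names mirror the description; the types are assumptions.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TerminalRecord:
    terminal_id: str
    member_user_ids: List[str] = field(default_factory=list)

@dataclass
class HomeRecord:
    home_id: str
    terminal_ids: List[str] = field(default_factory=list)  # one or more

@dataclass
class UserRecord:
    user_id: str
    user_name: str
    speaker_model: List[float] = field(default_factory=list)    # e.g. voiceprint
    voice_data_list: List[bytes] = field(default_factory=list)  # raw utterances

# Example: one home containing one terminal used by one registered user.
users: Dict[str, UserRecord] = {"user-0001": UserRecord("user-0001", "たろう")}
terminals: Dict[str, TerminalRecord] = {
    "terminal-42": TerminalRecord("terminal-42", ["user-0001"])}
homes: Dict[str, HomeRecord] = {"home-7": HomeRecord("home-7", ["terminal-42"])}
print(homes["home-7"])
```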
[Generating the speaker model]
The generation of a speaker model will be described with reference to FIG. 8. FIG. 8 is a diagram illustrating a state in which a speaker model 80 is generated through the dialogue between user 1 and terminal 2. Description of states that are the same as those in FIG. 1 is not repeated.
In the dialogue between user 1 and terminal 2, if user 1 is unregistered, terminal 2 first asks for the user's name and then registers the voice data in the database as that user's utterances up to a certain point (for example, the end of the game). The voice data may include voiceprint information.
For each user utterance, the speaker identification learning unit performs speaker identification learning using all the voice data accumulated so far in the target voice DB (database) as training data.
An ID is assigned to each terminal. By managing users through the combination of terminal and user name, the system can cope even when a user with the same name exists on another terminal.
When user 1 utters his own name (message 12), terminal 2 recognizes message 12. Having extracted the user name ("Taro") from message 12, terminal 2 transmits that user name and the terminal ID of terminal 2 to the user management unit 35. Thereafter, as user 1 continues speaking, messages 15 and 17 are accumulated in the user management unit 35 through terminal 2.
The speaker identification learning unit 34 reads the terminal ID and the user name stored in the user management unit 35 and generates a speaker model 80, which contains that user name and terminal ID. Thereafter, once the user name is determined through terminal 2's dialogue with user 1, the speaker model 80 associated with that user becomes available.
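One way to picture this learning step is the sketch below: the speaker model is keyed by the (terminal ID, user name) pair so that users who share a name on different terminals remain distinct, as described above. The storage layout is assumed for illustration.

```python
# Sketch of speaker model generation: collect the utterances stored under a
# (terminal ID, user name) pair into a model, so that same-named users on
# different terminals stay distinct. The storage layout is assumed.

def build_speaker_model(user_store: dict, terminal_id: str, user_name: str) -> dict:
    key = (terminal_id, user_name)
    utterances = user_store.get(key, [])
    return {"terminal_id": terminal_id,
            "user_name": user_name,
            "training_utterances": list(utterances)}

store = {("terminal-42", "たろう"): [b"audio-of-message-15", b"audio-of-message-17"]}
print(build_speaker_model(store, "terminal-42", "たろう"))
```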
[Control structure]
The control structure of the speech recognition system according to the present embodiment will be described with reference to FIGS. 9 to 11. FIGS. 9 to 11 are flowcharts showing the sequence in the case where the user initiates the utterance.
In step 910, the user makes an utterance that starts the sequence for speaker identification learning. For example, the user issues message 911, "Let's play shiritori." On receiving message 911, the voice input unit 31 transmits a voice signal corresponding to message 911 to the control unit 30.
In step 915, when the control unit 30 detects that the voice signal has been received, it transmits a voice recognition request to the voice recognition unit 36.
In step 920, when the control unit 30 detects that the voice signal has been received, it transmits a speaker model list acquisition request to the user management unit 35. The speaker model list acquisition request requests access to the speaker model list associated with the user who made the utterance.
In step 925, the user management unit 35 transmits a speaker model list response to the control unit 30 in response to the speaker model list acquisition request. The speaker model list response contains the result of retrieving the speaker model list associated with that user.
In step 930, the control unit 30 transmits a speaker identification request to the speaker identification unit 33. On detecting receipt of the speaker identification request, the speaker identification unit 33 refers to the data stored in the user management unit 35 and attempts to identify the user (speaker) who made the utterance in step 910.
In step 935, the voice recognition unit 36 transmits a voice recognition response to the control unit 30 in response to the voice recognition request of step 915. The voice recognition response indicates whether or not the voice recognition succeeded.
In step 940, the speaker identification unit 33 transmits a speaker identification failure response to the control unit 30. That is, because the user is not registered in the speech recognition system, the speaker identification unit 33 cannot identify the user (speaker) who made the utterance. A speaker identification failure response notifying that speaker identification has failed is therefore sent from the speaker identification unit 33 to the control unit 30.
In step 945, in response to receiving the speaker identification failure response, the control unit 30 transmits a dialog analysis/generation request to the dialog analysis/generation unit 37. The dialog analysis/generation request may include the voice recognition result and the speaker identification result. On receiving the dialog analysis/generation request, the dialog analysis/generation unit 37 generates a message for obtaining the name of the user who made the utterance. For example, using a template prepared in advance in the speech recognition system and the term "shiritori" contained in message 911, the dialog analysis/generation unit 37 creates message 946 ("Let's start shiritori. Now, tell me your name.").
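The template mechanism might be as simple as the following sketch, where the recognized game name is substituted into a prepared prompt. The template string mirrors message 946; the mechanism itself is an assumption for illustration.

```python
# Sketch of template-based prompt generation: substitute the game name
# recognized in the user's utterance into a prepared template. The template
# mirrors message 946; the mechanism itself is an assumption.

PROMPT_TEMPLATE = "{game}をはじめるよ。それじゃ、名前を教えてね。"

def generate_name_prompt(recognized_game: str) -> str:
    return PROMPT_TEMPLATE.format(game=recognized_game)

# "Let's start shiritori. Now, tell me your name."
print(generate_name_prompt("しりとり"))
```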
In step 950, the dialog analysis/generation unit 37 transmits the generated message 946 to the control unit 30. On detecting receipt of the message, the control unit 30 generates a voice response containing the message and the terminal ID of the terminal that provided the utterance.
In step 955, the control unit 30 transmits the voice response to the voice output unit 32. On receiving the voice response signal, the voice output unit 32 outputs voice based on that signal. When the user recognizes the voice, the user utters a reply to it, which is received by the voice input unit 31.
In step 960, the voice input unit 31 transmits the content of the received message 961 (the name registration utterance) to the control unit 30. Message 961 contains an answer (a name) to message 946, for example, "It's Taro." On detecting receipt of message 961, the control unit 30 generates a voice recognition request.
In step 965, the control unit 30 transmits the voice recognition request to the voice recognition unit 36. On detecting receipt of the voice recognition request, the voice recognition unit 36 executes voice recognition processing on message 961.
In step 970, the control unit 30 transmits a speaker model list acquisition request to the user management unit 35. On detecting receipt of the request, the user management unit 35 attempts to acquire the speaker model list and generates the result of that attempt as a speaker model list response.
In step 975, the user management unit 35 transmits the speaker model list response to the control unit 30.
In step 980, in response to receiving the speaker model list response, the control unit 30 transmits a speaker identification request to the speaker identification unit 33. On detecting receipt of the speaker identification request, the speaker identification unit 33 begins identifying the speaker and generates an identification result.
Referring to FIG. 10, in step 1010, the voice recognition unit 36 transmits a voice recognition response to the control unit 30 as the response to the voice recognition request of step 965. The voice recognition response may indicate that the content of message 961 was recognized.
In step 1015, the speaker identification unit 33 transmits a speaker identification failure response to the control unit 30. That is, the speaker ("Taro") is not yet registered in the speech recognition system, so the speaker identification unit 33 generates a response indicating that the attempt to identify the speaker failed.
In step 1020, the control unit 30 transmits a dialog analysis/generation request to the dialog analysis/generation unit 37. In response to receiving the request, the dialog analysis/generation unit 37 generates message 1031 for the dialogue. Message 1031 is generated as a message containing both the content of the utterance and information identifying the speaker, for example, "So you're Taro. Then let's begin. The first word is ringo (apple)."
In step 1030, the dialog analysis/generation unit 37 transmits message 1031 to the control unit 30. On detecting receipt of message 1031, the control unit 30 generates a voice response containing message 1031 and the terminal ID, in order to respond to the utterance directed at the terminal.
In step 1035, the control unit 30 transmits the voice response to the terminal. On receiving the voice response signal, the voice output unit 32 of the terminal outputs voice based on the signal. When the user recognizes the voice, the user thinks of the next response and speaks to the terminal. The voice input unit 31 receives that utterance, for example, "gorilla" (gorira).
Thereafter, several exchanges of the shiritori game take place (step 1040 onward).
In step 1040, the voice input unit 31 transmits the received message 1041 to the control unit 30. On detecting receipt of message 1041, the control unit 30 generates a voice recognition request.
In step 1045, the control unit 30 transmits the voice recognition request to the voice recognition unit 36. On receiving the request, the voice recognition unit 36 starts voice recognition processing.
In step 1050, the control unit 30 transmits a speaker voice storage/list acquisition request to the user management unit 35. On detecting receipt of the request, the user management unit 35 stores the speaker's ("Taro's") identification ID and name in association with each other. The user management unit 35 further generates a response indicating that the speaker's voice was stored successfully.
In step 1055, the user management unit 35 transmits a speaker voice storage/list acquisition response to the control unit 30 as that response.
In step 1060, the control unit 30 transmits a speaker identification model learning request to the speaker identification learning unit 34. On detecting receipt of the request, the speaker identification learning unit 34 generates a speaker identification model by associating the voice with the user who made the utterance, and updates the model as appropriate.
In step 1065, the voice recognition unit 36 transmits the result of the processing based on the voice recognition request to the control unit 30 as a voice recognition response.
In step 1070, the speaker identification learning unit 34 transmits a speaker identification learning response to the control unit 30 in response to the speaker identification model learning request.
In step 1075, the control unit 30 generates a dialog analysis/generation request and transmits the generated request to the dialog analysis/generation unit 37. For example, the control unit 30 generates this request when it determines that learning has failed because there is not yet enough data for learning the speaker. On detecting receipt of the request, the dialog analysis/generation unit 37 generates message 1081 for gathering further learning data (for example, "Gorilla... Then, rakuda (camel).").
In step 1080, the dialog analysis/generation unit 37 transmits the generated message 1081 to the control unit 30. On receiving message 1081, the control unit 30 generates a voice response containing the terminal ID and message 1081.
In step 1085, the control unit 30 transmits the generated voice response to the terminal. When the terminal receives the voice response, the voice output unit 32 outputs voice based on it. When the user recognizes the voice emitted from the terminal's voice output unit 32, the user thinks of the next response. When the user utters that next response within the predetermined time, the voice input unit 31 accepts the user's utterance, and a voice response corresponding to that utterance is generated.
Referring to FIG. 11, in step 1110, the voice input unit 31 transmits message 1111 (for example, "diamond" (daiyamondo)) to the control unit 30. On detecting receipt of message 1111, the control unit 30 generates a voice recognition request and a speaker voice storage/list acquisition request.
In step 1115, the control unit 30 transmits the voice recognition request to the voice recognition unit 36. On detecting receipt of the request, the voice recognition unit 36 starts voice recognition processing of message 1111.
In step 1120, the control unit 30 transmits message 1111 and the speaker voice storage/list acquisition request to the user management unit 35. On detecting receipt of the request, the user management unit 35 stores the content of message 1111 (the voice data) in association with the identification ID of the user (speaker).
In step 1130, the control unit 30 transmits a speaker identification model learning request to the speaker identification learning unit 34. On detecting receipt of the request, the speaker identification learning unit 34 learns the speaker identification model. More specifically, the speaker identification learning unit 34 stores the user's identification ID in association with the voice information contained in message 1111 (for example, voiceprint information). When the learning is complete, the speaker identification learning unit 34 generates a response indicating that learning of the speaker identification model has been completed.
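The association performed here might be pictured as in the sketch below. The voiceprint extraction is a fake stand-in, since no concrete feature representation is specified.

```python
# Sketch of the learning step of step 1130: accumulate voice features under
# the speaker's identification ID and report completion. The feature
# extraction is a fake stand-in; no real representation is specified.

from collections import defaultdict
from typing import DefaultDict, Dict, List

speaker_models: DefaultDict[str, List[List[float]]] = defaultdict(list)

def fake_voiceprint(audio: bytes) -> List[float]:
    """Stand-in for real voiceprint feature extraction from audio."""
    return [b / 255.0 for b in audio[:4]]

def learn(user_id: str, audio: bytes) -> Dict[str, object]:
    speaker_models[user_id].append(fake_voiceprint(audio))
    return {"user_id": user_id, "status": "learned",
            "samples": len(speaker_models[user_id])}

print(learn("user-0001", b"\x10\x80\xff\x00"))
```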
In step 1135, in response to the completion of the voice recognition processing, the voice recognition unit 36 generates a voice recognition response notifying the result of that processing and transmits the response to the control unit 30.
In step 1140, the speaker identification learning unit 34 transmits the generated response to the control unit 30. On receiving the response from the voice recognition unit 36 and the response from the speaker identification learning unit 34, the control unit 30 determines whether sufficient data for learning has been gathered and learning is complete. For example, when at least a predetermined number of pieces of voice data have been associated with the user's identification ID, the control unit 30 determines that sufficient data is available and learning is complete.
The control unit 30 generates a dialog analysis/generation request based on the content of the responses received from the voice recognition unit 36 and the speaker identification learning unit 34. For example, when it determines from those responses that voice recognition succeeded and that learning is complete with sufficient data, the control unit 30 generates the request. "Sufficient data for learning" means, for example, that the amount of information extracted from the voice data within a predetermined time (such as the number of voiceprint records of a given data size) exceeds the amount of information defined as necessary for learning.
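The completion test might reduce to a simple threshold check, as in this sketch; the threshold is an assumed value standing in for the defined amount of information.

```python
# Sketch of the learning-completion test: learning is judged complete once
# at least N voiceprint samples have accumulated for the user ID within the
# collection window. N is an assumed threshold.

REQUIRED_SAMPLES = 5  # assumed minimum number of voiceprint samples

def learning_complete(samples_for_user: list) -> bool:
    return len(samples_for_user) >= REQUIRED_SAMPLES

print(learning_complete([[0.1]] * 3))  # False: keep the dialogue going
print(learning_complete([[0.1]] * 6))  # True: enough data; finish learning
```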
In step 1145, the control unit 30 transmits the generated request to the dialog analysis/generation unit 37. On detecting receipt of the request, the dialog analysis/generation unit 37 generates message 1151 in reply to message 1111.
In step 1150, the dialog analysis/generation unit 37 transmits the generated message 1151 to the control unit 30. On detecting receipt of message 1151, the control unit 30 generates a voice response containing the terminal ID and message 1151.
In step 1155, the control unit 30 transmits the voice response to the terminal. On receiving the voice response, the terminal outputs voice from the voice output unit 32.
<Sequence initiated by a user utterance>
Another aspect will be described with reference to FIG. 12, a diagram illustrating the sequence of exchanges between user 1 and terminal 2 when the user is already known to the speech recognition system. Operations identical to those described above carry the same numbers, and their description is not repeated.
When the user is already registered, the speaker model is updated as appropriate, so speaker identification can always be based on the user's most recent voice data.
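One way to keep identification tied to the latest voice data is a fixed-size rolling window of samples per user, rebuilt into a model on demand. This is a sketch under that assumption; the window length and the averaged-vector "model" are illustrative placeholders, not the method claimed here.

```python
from collections import deque

class SpeakerModelStore:
    """Retains only the most recent voice samples per user, so the
    speaker model always reflects the user's latest voice data."""

    def __init__(self, window=20):
        self.window = window
        self.samples = {}  # user name -> deque of recent feature vectors

    def add_sample(self, user, features):
        self.samples.setdefault(user, deque(maxlen=self.window)).append(features)

    def rebuild_model(self, user):
        # Placeholder "model": the element-wise mean of recent vectors.
        recent = list(self.samples.get(user, []))
        if not recent:
            return None
        dim = len(recent[0])
        return [sum(v[i] for v in recent) / len(recent) for i in range(dim)]

store = SpeakerModelStore(window=3)
store.add_sample("Taro", [0.2, 0.9])
store.add_sample("Taro", [0.4, 0.7])
print(store.rebuild_model("Taro"))  # ~[0.3, 0.8]
```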
User 1 issues message 10 to terminal 2. Upon accepting message 10, terminal 2 executes speech recognition and speaker identification. When terminal 2 determines from the speaker identification result that the speaker of message 10 has been identified, it issues message 1210 according to that result. Message 1210 includes a response to message 10 and a question to confirm the speaker of message 10. When user 1 issues message 1220 in reply to message 1210, terminal 2 performs speech recognition and speaker identification on message 1220.
When terminal 2 determines from the content of message 1220 that an answer to the question has been obtained, it transmits data including the terminal ID of terminal 2 and the user name ("Taro") to the user management unit 35, which accumulates the data. Terminal 2 then issues message 1230 in reply to message 1220.
Thereafter, each time terminal 2 recognizes an utterance from user 1, it transmits data including the terminal ID and the user name to the user management unit 35, which stores each item.
The speaker identification learning unit 34 refers to the terminal ID and the user name in the user management unit 35, reads the data associated with that user from the accumulated data, and creates the speaker model 80.
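A minimal sketch of this lookup-and-build step, assuming the accumulated data is a list of records keyed by terminal ID and user name; the averaged feature vector merely stands in for speaker model 80, whose actual form the disclosure does not specify.

```python
def build_speaker_model(store, terminal_id, user_name):
    """Read the utterances accumulated for (terminal_id, user_name)
    and derive a simple averaged-feature model from them."""
    entries = [e for e in store
               if e["terminal_id"] == terminal_id and e["user"] == user_name]
    if not entries:
        raise LookupError(f"no data for {user_name!r} on {terminal_id!r}")
    vectors = [e["features"] for e in entries]
    dim = len(vectors[0])
    # The averaged vector stands in for "speaker model 80".
    return {"user": user_name,
            "model": [sum(v[i] for v in vectors) / len(vectors)
                      for i in range(dim)]}

store = [{"terminal_id": "t-01", "user": "Taro", "features": [0.2, 0.9]},
         {"terminal_id": "t-01", "user": "Taro", "features": [0.4, 0.7]}]
print(build_speaker_model(store, "t-01", "Taro"))
```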
A sequence in the speech recognition system according to one aspect will be described with reference to FIGS. 13 and 14, sequence charts showing the flow of processing performed when the user is known. Steps identical to those described above carry the same step numbers, and their description is not repeated.
In step 1340, the speaker identification unit 33 transmits a speaker identification response to the control unit 30 to report that speaker identification succeeded. Upon detecting this response and the response from the speech recognition unit 36, the control unit 30 generates a dialog analysis/generation request that includes the speech recognition result and the speaker identification result.
In step 1345, the control unit 30 transmits the dialog analysis/generation request to the dialog analysis/generation unit 37. Upon detecting the request, the dialog analysis/generation unit 37 generates a message 1351 in reply to message 911. Message 1351 includes a response to message 911 and a question to confirm the speaker of message 911.
In step 1350, the dialog analysis/generation unit 37 transmits the generated message 1351 to the control unit 30. When the control unit 30 transmits a voice response including message 1351 and the terminal ID to the terminal, the terminal's voice output unit 32 utters the speech. When the user recognizes the speech and judges it to be correct, the user issues, for example, the message 1361 "That's right" (a name registration utterance).
In step 1360, upon accepting the input of message 1361, the voice input unit 31 transmits a corresponding audio signal to the control unit 30. The control unit 30 then transmits a speech recognition request to the speech recognition unit 36 (step 965).
Referring to FIG. 14, in step 1410, the speaker identification unit 33 transmits its response to the speaker identification request (step 980) to the control unit 30 as a speaker identification response. When the user is known to the speech recognition system, this response indicates that the speaker has been identified. Upon detecting the response, the control unit 30 generates a dialog analysis/generation request.
In step 1420, the control unit 30 transmits the generated dialog analysis/generation request to the dialog analysis/generation unit 37. Upon detecting the request, the dialog analysis/generation unit 37 generates a message 1431. Based on the exchanges so far, message 1431 reflects the fact that the question included in message 1351 ("Is that Taro?") turned out to be correct (for example, "I knew it!").
In step 1430, the dialog analysis/generation unit 37 transmits message 1431 to the control unit 30. Upon detecting message 1431, the control unit 30 generates a voice response including the terminal ID and message 1431.
In step 1440, the control unit 30 transmits the voice response to the terminal. Based on the voice response, the voice output unit 32 outputs message 1431 as speech.
Thereafter, the processing from step 1040 onward is performed as described above. The voice data is stored, and the learning data (for example, voiceprint information) is continually updated with the target user's newest voice data. When the user is known, the terminal does not utter a confirmation of the user's name even after learning is complete.
<When the terminal initiates the utterance>
Still another aspect will be described with reference to FIGS. 15 to 17. FIG. 15 shows a case in which terminal 2 speaking to user 1 triggers the dialog.
Terminal 2 speaks to the user and elicits an utterance and the user's name; the voice data obtained in this way is linked to the user name and the terminal ID, and the speaker model is learned from it.
Upon detecting the presence of user 1, terminal 2 speaks to user 1. The presence of user 1 is detected based on, for example, the output of an infrared sensor, a motion sensor, or the like. Terminal 2 issues, for example, message 1510, which user 1 recognizes.
In response to message 1510, user 1 issues message 1520. Upon recognizing message 1520, terminal 2 executes speech recognition and speaker identification, and switches its utterance to user 1 according to the results. For example, when terminal 2 determines that the speaker is not known, it generates message 1530 and outputs it as speech.
In response to message 1530, user 1 issues message 1540 to terminal 2, which performs speech recognition and speaker identification on it. Terminal 2 further associates the speaker "Taro", recognized as the user name for terminal 2, with the terminal ID, and accumulates the messages 1520 and 1540 received from user 1 so far in the user management unit 35 as the speaker's voice data.
Terminal 2 then generates message 1550 as a response to message 1540 and outputs it as speech.
The user management unit 35 accumulates the voice data associated with the user "Taro" together with identification information (for example, voiceprint information) extracted from that voice data.
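The disclosure does not fix a storage format. As one illustration, the user management unit's store can be pictured as a single relational table linking terminal ID, user name, raw audio, and the voiceprint extracted from it; the schema and column names here are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE utterances (
        terminal_id TEXT NOT NULL,
        user_name   TEXT NOT NULL,
        audio       BLOB NOT NULL,   -- recorded voice data
        voiceprint  BLOB             -- identification info from the audio
    )
""")
conn.execute(
    "INSERT INTO utterances VALUES (?, ?, ?, ?)",
    ("t-01", "Taro", b"<pcm bytes>", b"<voiceprint bytes>"),
)
count, = conn.execute(
    "SELECT COUNT(*) FROM utterances WHERE user_name = ?", ("Taro",)
).fetchone()
print(count)  # 1
```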
The operation of the speech recognition system in one aspect will be described with reference to FIGS. 16 and 17, sequence charts showing part of the processing performed in the speech recognition system.
In step 1610, upon detecting that a predetermined condition is satisfied, the control unit 30 transmits a dialog generation request to the dialog analysis/generation unit 37. The condition is, for example, that a user's presence has been detected within range of the speech recognition system, or that a previously specified time has arrived. The dialog generation request includes, for example, a request to generate message 1510 for speaking to the detected user. Upon detecting the request, the dialog analysis/generation unit 37 generates message 1510 from a template prepared in advance.
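A sketch of this trigger test, assuming exactly the two example conditions named above (a user's presence sensed, or a specified time reached); the same-minute time match and the function shape are illustrative.

```python
import datetime
from typing import Optional

def should_start_dialog(presence_detected: bool,
                        now: datetime.datetime,
                        scheduled: Optional[datetime.time]) -> bool:
    """True when the terminal should speak first: presence was sensed,
    or the previously specified time has arrived (same minute)."""
    if presence_detected:
        return True
    if scheduled is not None:
        return (now.hour, now.minute) == (scheduled.hour, scheduled.minute)
    return False

# Example: speak at 08:00 even if no one has been sensed yet.
print(should_start_dialog(False,
                          datetime.datetime(2016, 5, 25, 8, 0),
                          datetime.time(8, 0)))  # True
```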
In step 1615, the dialog analysis/generation unit 37 transmits the generated message 1510 to the control unit 30. Upon detecting message 1510, the control unit 30 transmits a speech utterance request including message 1510 and the terminal ID to the terminal, whose voice output unit 32 outputs message 1510 as speech. When the user recognizes message 1510, the user issues message 1520 in response.
In step 1625, the voice input unit 31 transmits message 1520 to the control unit 30 as an audio signal. Thereafter, processing similar to that described above is executed from step 915 through step 1345.
In step 1350, the dialog analysis/generation unit 37 transmits message 1530 to the control unit 30. When speech based on message 1530 is output, the user issues message 1540, which is sent from the control unit 30 to the speech recognition unit 36 for speech recognition (step 1045).
Referring to FIG. 17, the processing from step 1050 through step 1070 is executed in the same way. Thereafter, when the control unit 30 determines that there is not enough data for learning and learning has failed, the processing of step 1740 is executed. More specifically, in step 1741, the control unit 30 transmits a dialog analysis/generation request to the dialog analysis/generation unit 37, which, upon detecting the request, generates a message 1550 corresponding to it.
In step 1742, the dialog analysis/generation unit 37 transmits message 1550 to the control unit 30. Upon detecting message 1550, the control unit 30 generates a voice response including the terminal ID and message 1550.
When the control unit 30 instead determines that sufficient data for learning has been collected and learning is complete, it executes the processing of step 1750. More specifically, in step 1751, the control unit 30 transmits a dialog analysis/generation request to the dialog analysis/generation unit 37, which, upon detecting the request, generates a message 1560 in reply.
In step 1752, the dialog analysis/generation unit 37 transmits message 1560 to the control unit 30. Upon detecting message 1560, the control unit 30 generates a voice response including the terminal ID and message 1560.
In step 1760, the control unit 30 transmits the voice response to the terminal. Upon receiving it, the voice output unit 32 outputs message 1560 as speech.
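The control unit's branch between steps 1740 and 1750 can be pictured as below; the request fields (`intent`, `template_id`) are hypothetical, since the disclosure states only which message (1550 or 1560) results from each branch.

```python
def make_learning_followup_request(recognition_ok: bool,
                                   data_count: int,
                                   required: int) -> dict:
    """Build the dialog analysis/generation request sent after learning:
    message 1560 when learning completed, message 1550 otherwise."""
    if recognition_ok and data_count >= required:
        return {"intent": "learning_complete", "template_id": 1560}
    return {"intent": "learning_incomplete", "template_id": 1550}

print(make_learning_followup_request(True, 6, 5))   # -> template 1560
print(make_learning_followup_request(True, 2, 5))   # -> template 1550
```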
<Other aspects>
Still other aspects will now be described. In other aspects, the following configurations may be used.
(1) Speech recognition and voice authentication are performed in parallel, so recognition of the content of a user's utterance and authentication of that user take place simultaneously.
(2) For each user, topics of interest are estimated from the log of dialog content, and dialogs are generated based on the estimated topics.
(3) The utterance content of the robot (the spoken dialog device or spoken dialog system) changes according to the number of dialogs and their frequency.
As a result of these elements, the user can develop a sense of familiarity with the robot (the spoken dialog system).
For example, with configuration (1), a spoken dialog system embodying this technical idea can identify the user (voice authentication) and acquire the content of the user's utterance (speech recognition) without using information from devices such as cameras or wireless tags.
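As an illustration of configuration (1), the two analyses can be dispatched concurrently on the same audio buffer. The `recognize` and `authenticate` stubs below merely stand in for the real recognition and voiceprint-matching engines.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize(audio: bytes) -> str:
    return "transcript of the utterance"   # stand-in for the ASR engine

def authenticate(audio: bytes) -> str:
    return "Taro"                          # stand-in for voiceprint matching

def handle_utterance(audio: bytes):
    # Run recognition and authentication in parallel on the same audio,
    # yielding the utterance content and the speaker together.
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(recognize, audio)
        user_future = pool.submit(authenticate, audio)
        return text_future.result(), user_future.result()

text, user = handle_utterance(b"<pcm bytes>")
print(user, ":", text)
```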
Next, with configuration (2), the user's daily conversations are stored in the spoken dialog system and analyzed as needed. Based on the analysis results, the system acquires topics that interest each user (sports, entertainment news, and so on) from other information providing devices, and offers topics suited to the user it is currently conversing with.
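Configuration (2) could be approximated by a keyword-frequency estimate over the stored dialog log. The keyword table and topic labels below are assumptions; the disclosure names only sports and entertainment news as example topics.

```python
from collections import Counter

TOPIC_KEYWORDS = {
    "sports": {"baseball", "soccer", "match", "team"},
    "entertainment": {"movie", "singer", "drama", "concert"},
}

def estimate_topic(dialog_log):
    """Pick the topic whose keywords occur most often in the user's log."""
    counts = Counter({topic: 0 for topic in TOPIC_KEYWORDS})
    for utterance in dialog_log:
        words = set(utterance.lower().split())
        for topic, keywords in TOPIC_KEYWORDS.items():
            counts[topic] += len(words & keywords)
    topic, hits = counts.most_common(1)[0]
    return topic if hits > 0 else "general"

print(estimate_topic(["Did you watch the soccer game today?"]))
```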
Furthermore, with configuration (3), as dialog between the spoken dialog system and the user continues regularly over a long period, the expression of the system's utterances (wording, tone, and so on) can change according to the dialog content. As a result, the user can feel close to the spoken dialog system (or to a voice input/output terminal, such as a robot, included in it). These configurations can be combined as appropriate.
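Configuration (3) might be realized by selecting a speech register from the accumulated dialog count; the thresholds and register labels here are illustrative only.

```python
def choose_register(dialog_count: int) -> str:
    """Shift the system's wording as interactions accumulate."""
    if dialog_count < 10:
        return "polite"    # formal wording for a new user
    if dialog_count < 100:
        return "friendly"  # softer tone once some rapport exists
    return "casual"        # familiar wording for a long-term user

print(choose_register(3), choose_register(42), choose_register(500))
```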
<Summary>
As described above, with the speech recognition system according to the present embodiment, the user can supply the system with the voice data needed for learning simply by holding ordinary spoken dialogs, without being aware of any preprocessing for learning. The functions provided by the system can therefore be used easily.
In yet another aspect, the user is authenticated without being aware of it, and topics suited to that user are output, so the user can feel close to the services and functions provided by the speech recognition system.
The embodiment disclosed here should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.
30 control unit, 31 voice input unit, 32 voice output unit, 33 speaker identification unit, 34 speaker identification learning unit, 35 user management unit, 36 speech recognition unit, 37 dialog analysis/generation unit, 80 speaker model, 350, 400, 500 server, 410 speaker identification server, 520 speech recognition server, 600 terminal module, 610 main module, 620 speaker identification module, 630 speech recognition module.
Claims (8)

1. A speech recognition device comprising:
a voice input unit for accepting an utterance including information identifying a speaker and an utterance not including information identifying a speaker;
a speech recognition unit for performing speech recognition processing;
a voice output unit for outputting speech; and
a control unit for controlling the speech recognition device based on the result of the speech recognition processing,
wherein the control unit generates a speaker identification model for identifying a speaker by associating the information identifying the speaker with the utterance not including information identifying the speaker.
2. The speech recognition device according to claim 1, wherein the voice output unit outputs an inquiry asking for information identifying a speaker, and associating the information identifying the speaker with the utterance not including information identifying the speaker includes associating the information identifying the speaker with an utterance, issued after the inquiry, that does not include information identifying the speaker.
3. The speech recognition device according to claim 1 or 2, wherein the voice output unit outputs, after an utterance not including information identifying a speaker, an inquiry asking for information identifying the speaker, and associating the information identifying the speaker with the utterance not including information identifying the speaker includes associating the utterance issued before the inquiry with the information identifying the speaker included in the utterance responding to the inquiry.
4. The speech recognition device according to any one of claims 1 to 3, wherein the control unit is configured to determine the content of the utterance to be output next from the voice output unit based on the content of a response to an utterance output from the voice output unit.
5. The speech recognition device according to any one of claims 1 to 4, further comprising a storage unit for storing the generated speaker identification model, wherein the control unit is configured to update the generated speaker identification model based on a response to the inquiry.
6. A speech recognition system comprising a terminal and a device capable of communicating with the terminal, wherein
the terminal includes a voice input unit for accepting an utterance including information identifying a speaker and an utterance not including information identifying a speaker, a voice output unit for outputting speech, and a communication unit electrically connected to the voice input unit and the voice output unit for communicating with the device; and
the device includes a communication unit for communicating with the terminal, a speech recognition processing unit for performing speech recognition processing, and a control unit for controlling the device based on the result of the speech recognition processing, the control unit generating a speaker identification model for identifying a speaker by associating the information identifying the speaker with the utterance not including information identifying the speaker.
7. A terminal used in the speech recognition system according to claim 6.
8. A method for generating a speaker identification model, the method comprising: accepting an utterance including information identifying a speaker and an utterance not including information identifying a speaker; performing speech recognition processing; outputting speech; and generating, based on the result of the speech recognition processing, a speaker identification model for identifying a speaker by associating the information identifying the speaker with the utterance not including information identifying the speaker.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2015113949A JP6084654B2 (en) | 2015-06-04 | 2015-06-04 | Speech recognition apparatus, speech recognition system, terminal used in the speech recognition system, and method for generating a speaker identification model |
JP2015-113949 | 2015-06-04 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016194740A1 true WO2016194740A1 (en) | 2016-12-08 |
Family
ID=57440499
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2016/065500 WO2016194740A1 (en) | 2015-06-04 | 2016-05-25 | Speech recognition device, speech recognition system, terminal used in said speech recognition system, and method for generating speaker identification model |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP6084654B2 (en) |
WO (1) | WO2016194740A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018173948A1 (en) * | 2017-03-24 | 2018-09-27 | 株式会社日立国際電気 | Service provision system |
WO2018230345A1 (en) * | 2017-06-15 | 2018-12-20 | 株式会社Caiメディア | Dialogue robot, dialogue system, and dialogue program |
CN109243468A (en) * | 2018-11-14 | 2019-01-18 | 北京羽扇智信息科技有限公司 | Audio recognition method, device, electronic equipment and storage medium |
CN110019747A (en) * | 2017-09-26 | 2019-07-16 | 株式会社日立制作所 | Information processing unit, dialog process method and conversational system |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101883301B1 (en) * | 2017-01-11 | 2018-07-30 | (주)파워보이스 | Method for Providing Personalized Voice Recognition Service Using Artificial Intellignent Speaker Recognizing Method, and Service Providing Server Used Therein |
JP7143591B2 (en) * | 2018-01-17 | 2022-09-29 | トヨタ自動車株式会社 | speaker estimation device |
US11992930B2 (en) | 2018-03-20 | 2024-05-28 | Sony Corporation | Information processing apparatus and information processing method, and robot apparatus |
KR20200000604A (en) | 2018-06-25 | 2020-01-03 | 현대자동차주식회사 | Dialogue system and dialogue processing method |
JP7187212B2 (en) * | 2018-08-20 | 2022-12-12 | ヤフー株式会社 | Information processing device, information processing method and information processing program |
JP7280999B2 (en) | 2018-09-12 | 2023-05-24 | マクセル株式会社 | Information processing equipment |
EP3851985A4 (en) * | 2018-09-12 | 2022-04-20 | Maxell, Ltd. | Information processing device, user authentication network system, and user authentication method |
JP7110057B2 (en) * | 2018-10-12 | 2022-08-01 | 浩之 三浦 | speech recognition system |
JP7252883B2 (en) * | 2019-11-21 | 2023-04-05 | Kddi株式会社 | GAME MANAGEMENT DEVICE, GAME MANAGEMENT METHOD AND PROGRAM |
KR20220095973A (en) * | 2020-12-30 | 2022-07-07 | 삼성전자주식회사 | Method for responding to voice input and electronic device supporting the same |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003255989A (en) * | 2002-03-06 | 2003-09-10 | Sony Corp | Learning system and learning method, and robot apparatus |
JP2004101901A (en) * | 2002-09-10 | 2004-04-02 | Matsushita Electric Works Ltd | Speech interaction system and speech interaction program |
JP2004184788A (en) * | 2002-12-05 | 2004-07-02 | Casio Comput Co Ltd | Voice interaction system and program |
- 2015-06-04 JP JP2015113949A patent/JP6084654B2/en not_active Expired - Fee Related
- 2016-05-25 WO PCT/JP2016/065500 patent/WO2016194740A1/en active Application Filing
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018173948A1 (en) * | 2017-03-24 | 2018-09-27 | 株式会社日立国際電気 | Service provision system |
JPWO2018173948A1 (en) * | 2017-03-24 | 2020-01-16 | 株式会社日立国際電気 | Service providing system |
JP7026105B2 (en) | 2017-03-24 | 2022-02-25 | 株式会社日立国際電気 | Service provision system |
WO2018230345A1 (en) * | 2017-06-15 | 2018-12-20 | 株式会社Caiメディア | Dialogue robot, dialogue system, and dialogue program |
CN109643550A (en) * | 2017-06-15 | 2019-04-16 | 株式会社Cai梅帝亚 | Talk with robot and conversational system and dialogue program |
JPWO2018230345A1 (en) * | 2017-06-15 | 2019-11-07 | 株式会社Caiメディア | Dialogue robot, dialogue system, and dialogue program |
CN110019747A (en) * | 2017-09-26 | 2019-07-16 | 株式会社日立制作所 | Information processing unit, dialog process method and conversational system |
CN109243468A (en) * | 2018-11-14 | 2019-01-18 | 北京羽扇智信息科技有限公司 | Audio recognition method, device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
JP2017003611A (en) | 2017-01-05 |
JP6084654B2 (en) | 2017-02-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6084654B2 (en) | Speech recognition apparatus, speech recognition system, terminal used in the speech recognition system, and method for generating a speaker identification model | |
US10832686B2 (en) | Method and apparatus for pushing information | |
US11875820B1 (en) | Context driven device arbitration | |
US11138977B1 (en) | Determining device groups | |
US10891952B2 (en) | Speech recognition | |
EP2717258B1 (en) | Phrase spotting systems and methods | |
US9633657B2 (en) | Systems and methods for supporting hearing impaired users | |
US10192550B2 (en) | Conversational software agent | |
US10140988B2 (en) | Speech recognition | |
US10706845B1 (en) | Communicating announcements | |
US20170256259A1 (en) | Speech Recognition | |
CN113327609A (en) | Method and apparatus for speech recognition | |
CN109378006A (en) | A kind of striding equipment method for recognizing sound-groove and system | |
US9454959B2 (en) | Method and apparatus for passive data acquisition in speech recognition and natural language understanding | |
JPWO2018043138A1 (en) | INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND PROGRAM | |
TW200304638A (en) | Network-accessible speaker-dependent voice models of multiple persons | |
CN108364638A (en) | A kind of voice data processing method, device, electronic equipment and storage medium | |
CN112435669B (en) | Robot multi-wheel dialogue voice interaction method, system and terminal equipment | |
JP2019074865A (en) | Conversation collection device, conversation collection system, and conversation collection method | |
WO2020017165A1 (en) | Information processing device, information processing system, information processing method, and program | |
WO2019138477A1 (en) | Smart speaker, smart speaker control method, and program | |
US11445056B1 (en) | Telephone system for the hearing impaired | |
CN110838211A (en) | Voice answering method, device and system | |
US12125483B1 (en) | Determining device groups | |
JP2010286943A (en) | Reception device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 16803175 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 16803175 Country of ref document: EP Kind code of ref document: A1 |