WO2019230090A1 - Voice processing device and voice processing method - Google Patents

Voice processing device and voice processing method Download PDF

Info

Publication number
WO2019230090A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
response
voice
input
sound
Prior art date
Application number
PCT/JP2019/007485
Other languages
French (fr)
Japanese (ja)
Inventor
寛 黒田
Original Assignee
Sony Corporation (ソニー株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Corporation (ソニー株式会社)
Publication of WO2019230090A1 publication Critical patent/WO2019230090A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • This disclosure relates to a voice processing device and a voice processing method.
  • The voice agent function analyzes the meaning of speech uttered by a user and executes processing according to the meaning obtained by the analysis. For example, a voice processing device having a voice agent function can answer a question from the user, add a schedule entry, set a timer, and the like.
  • Regarding the voice agent function, Patent Document 1 discloses a method in which a user registers an abbreviation in a voice processing device in advance and recalls a complex phrase using the abbreviation.
  • Patent Document 2 discloses a technique for understanding the meaning of a user's utterance using context information obtained when the user utters the voice.
  • In view of this, the present disclosure proposes a new and improved voice processing device and voice processing method capable of improving the accuracy of responses to the user.
  • According to the present disclosure, there is provided a voice processing device including a response control unit that refers to a storage unit storing input information obtained based on recognition of a voice input to a sound input unit, and controls a response using the input information based on recognition of a situation associated with the input information.
  • According to the present disclosure, there is also provided a voice processing method in which a processor refers to the storage unit storing the input information obtained based on recognition of the voice input to the sound input unit, and controls a response using the input information based on recognition of the situation associated with the input information.
  • In this specification and the drawings, a plurality of constituent elements having substantially the same functional configuration may be distinguished by appending different letters to the same reference numeral.
  • When there is no need to distinguish such elements from one another, only the same reference numeral is given to each of them.
  • FIG. 1 is an explanatory diagram showing an overview of a voice processing device 20 according to an embodiment of the present disclosure.
  • As shown in FIG. 1, the voice processing device 20 is placed in a house as an example.
  • The voice processing device 20 has a voice agent function that analyzes the meaning of speech uttered by the user of the voice processing device 20 and executes processing according to the meaning obtained by the analysis.
  • For example, the voice processing device 20 stores a keyword and response information in association with each other; when the user's voice has the meaning of requesting information and the stored keyword is recognized in the voice, a response to the user is executed using the response information stored in association with that keyword.
  • In the example shown in FIG. 1, the voice processing device 20 stores the keyword "Company A's pasta" and the response information "the boiling time is 5 minutes" in association with each other.
  • When the user utters "What about Company A's pasta?", the voice processing device 20 uses the response information "the boiling time is 5 minutes" to respond with the voice "I remember that the boiling time of Company A's pasta is 5 minutes."
  • In FIG. 1, a stationary device is shown as the voice processing device 20, but the voice processing device 20 is not limited to a stationary device.
  • For example, the voice processing device 20 may be a portable information processing device such as a smartphone, a mobile phone, a PHS (Personal Handyphone System), a portable music player, a portable video processing device, or a portable game machine.
  • It may also be an autonomously moving robot.
  • Some keywords are ambiguous, so different response information may be associated with the same keyword. For example, the keyword "Company A's pasta" can be associated with response information indicating the recipe described above, or with response information indicating a storage location. A scheme for extracting the response information that matches the user's intention from a keyword contained in a voice has therefore been desired.
  • The inventor arrived at the voice processing device 20 according to the embodiments of the present disclosure with the above circumstances in mind.
  • The voice processing device 20 according to the embodiments of the present disclosure can improve the accuracy of responses to voice input by extracting, from a keyword included in the voice, the response information that matches the user's intention.
  • The voice processing device 20 according to the first embodiment stores a keyword, a sound source direction, and response information in association with one another, and executes a response using the response information when a voice including the keyword arrives from a sound source direction corresponding to the stored sound source direction.
  • The keyword and the response information are examples of input information obtained based on recognition of the input voice, and the sound source direction is an example of situation information indicating a situation related to the generation of a sound or voice. According to the voice processing device 20 of the first embodiment, even when different response information is associated with the same keyword, the response information that matches the user's intention can be extracted based on the sound source direction of the voice that includes the keyword.
  • In the following, in order to distinguish the voice processing devices 20 of the respective embodiments, the voice processing device according to the first embodiment may be referred to as the voice processing device 21 and the voice processing device according to the second embodiment as the voice processing device 22.
  • The voice processing devices 21 and 22 of the respective embodiments may also simply be referred to collectively as the voice processing device 20.
  • FIG. 2 is an explanatory diagram showing the configuration of the voice processing device 21 according to the first embodiment.
  • As shown in FIG. 2, the voice processing device 21 according to the first embodiment includes a sound collection unit 232, a voice recognition unit 234, a sound source direction estimation unit 236, a semantic analysis unit 238, a storage unit 240, a response control unit 250, and a voice output unit 260.
  • The sound collection unit 232 functions as a sound input unit that obtains an electrical sound signal from airborne vibrations including environmental sounds and speech.
  • The sound collection unit 232 outputs the acquired sound signal to the voice recognition unit 234 and the sound source direction estimation unit 236.
  • The voice recognition unit 234 detects a speech signal in the sound signal input from the sound collection unit 232, recognizes the speech signal, and obtains a character string representing the speech uttered by the user.
  • The sound source direction estimation unit 236 is an example of a situation recognition unit and estimates the sound source direction of the sound that reached the sound collection unit 232.
  • When the sound collection unit 232 is composed of a plurality of sound collection elements, the sound source direction estimation unit 236 estimates the sound source direction based on the phase difference between the sound signals obtained by the individual elements.
  • When the sound signal is a speech signal, the direction of the user as seen from the voice processing device 21 is estimated as the sound source direction.
  • The semantic analysis unit 238 analyzes the meaning of the character string input from the voice recognition unit 234.
  • The semantic analysis may be realized by machine learning over a prepared utterance corpus, by rules, or by a combination of the two.
  • Morphological analysis, which is part of the semantic analysis processing, provides a mechanism for assigning attributes to individual words and maintains a dictionary internally. Using this mechanism and dictionary, the semantic analysis unit 238 can assign to each word in the utterance an attribute indicating what kind of word it is, for example a person's name, a place name, or a common noun.
  • The semantic analysis unit 238 performs different processing according to the meaning obtained by the analysis. For example, when the analyzed meaning is a request to store information, the semantic analysis unit 238 extracts a keyword from the character string input from the voice recognition unit 234 as a first part, and extracts response information corresponding to the keyword as a second part. The semantic analysis unit 238 then stores the extracted keyword, the response information, and the sound source direction input from the sound source direction estimation unit 236 in the storage unit 240 in association with one another. A specific example of storing such information is described below with reference to FIGS. 3 and 4.
  • FIG. 3 is an explanatory diagram showing a usage pattern of the voice processing device 21.
  • In the example shown in FIG. 3, at position P1 in the kitchen area A1, the user utters the voice "Remember that the boiling time of Company A's pasta is 5 minutes." In this case, from the character string input from the voice recognition unit 234, the semantic analysis unit 238 extracts "Company A's pasta" as the keyword and "the boiling time is 5 minutes" as the response information.
  • Meanwhile, the sound source direction estimation unit 236 estimates 30°, the angle between the direction of position P1 as seen from the voice processing device 21 and the reference direction d of the voice processing device 21, as the sound source direction. As shown in FIG. 4, the semantic analysis unit 238 therefore stores the keyword "Company A's pasta", the sound source direction "30°", and the response information "the boiling time is 5 minutes" in the storage unit 240 in association with one another.
  • FIG. 3 further shows an example in which the user utters the voice "Remember that Company A's pasta was placed on the top shelf of the storage" at position P2 in the storage peripheral area A2.
  • In this case, from the character string input from the voice recognition unit 234, the semantic analysis unit 238 extracts "Company A's pasta" as the keyword and "placed on the top shelf of the storage" as the response information.
  • Meanwhile, the sound source direction estimation unit 236 estimates -20°, the angle between the direction of position P2 as seen from the voice processing device 21 and the reference direction d of the voice processing device 21, as the sound source direction. As shown in FIG. 4, the semantic analysis unit 238 therefore stores the keyword "Company A's pasta", the sound source direction "-20°", and the response information "placed on the top shelf of the storage" in the storage unit 240 in association with one another.
  • When the meaning obtained by the analysis of the semantic analysis unit 238 is a request for information, the response control unit 250 refers to the storage unit 240 and controls a response using response information, based on recognizing that a voice has arrived from the sound source direction associated with that response information and that the voice includes the keyword associated with that response information.
  • Here, the response control unit 250 may treat recognition of a sound source direction whose difference from the sound source direction associated with the response information is less than a predetermined reference as arrival of a voice from the associated sound source direction.
  • The predetermined reference may be a value between 10° and 20°, for example.
  • The voice output unit 260 outputs a response voice to the user under the control of the response control unit 250.
  • The voice output unit 260 is merely one example of an output unit that outputs a response; the voice processing device 21 may instead include a display unit that outputs the response visually.
  • A specific example of the response voice output by the voice output unit 260 is described below with reference to FIG. 5.
  • FIG. 5 is an explanatory diagram showing a usage pattern of the voice processing device 21.
  • In the example shown in FIG. 5, at position P1 in the kitchen area A1, the user utters the voice "What about Company A's pasta?"
  • The sound source direction of that voice is "30°". The response control unit 250 therefore refers to the storage unit 240 and extracts the response information "the boiling time is 5 minutes" associated with the keyword "Company A's pasta" and the sound source direction "30°". Using the extracted response information, the response control unit 250 causes the voice output unit 260 to output the response voice "I remember that the boiling time of Company A's pasta is 5 minutes," as shown in FIG. 5.
  • When the user asks the same question from the direction of the storage instead, the response control unit 250 refers to the storage unit 240 and searches for the keyword "Company A's pasta".
  • The response information "placed on the top shelf of the storage" associated with the sound source direction "-20°" is then extracted, and the response control unit 250 uses it to cause the voice output unit 260 to output a response voice such as "I remember that Company A's pasta was placed on the top shelf of the storage."
  • FIG. 6 is a flowchart showing the operation of the voice processing device 21 according to the first embodiment.
  • When the voice recognition unit 234 recognizes a voice, the semantic analysis unit 238 analyzes the meaning of the character string input from the voice recognition unit 234 (S308).
  • When the semantic analysis unit 238 determines that the user's request is for a response of information (S312/Yes), it extracts a keyword from the character string input from the voice recognition unit 234 (S316). The sound source direction estimation unit 236 estimates the sound source direction of the voice (S340), and the response control unit 250 extracts response information from the storage unit 240 based on the keyword and the sound source direction (S344). The response control unit 250 then generates a response voice using the extracted response information and causes the voice output unit 260 to output it (S348).
  • When the semantic analysis unit 238 determines that the user's request is for storage of information (S312/No, S352/Yes), it extracts a keyword and response information from the character string input from the voice recognition unit 234 (S356). The sound source direction estimation unit 236 estimates the sound source direction of the voice (S360), and the semantic analysis unit 238 stores the keyword, the response information, and the sound source direction input from the sound source direction estimation unit 236 in the storage unit 240 in association with one another (S364). Thereafter, the voice output unit 260 outputs, for example, a registration completion notification indicating that the storage of information has been completed (S368).
  • In an application example of the first embodiment, the response control unit 250 may control whether or not to output a response voice to the voice spoken by a user depending on whether or not that user is a target user.
  • In this case, the storage unit 240 also stores target information indicating which user is the target user.
  • The target information may be identification information of the target user, or may be a feature amount of the target user's voice.
  • The response control unit 250 may output the response information when the user who uttered the voice is recognized as the target user indicated by the target information.
  • The target user and the user who requested the storage of the information may be different.
  • For example, the voice processing device 21 may prompt the user who requested the storage of the information to specify the user to be set as the target user, and set the specified user as the target user.
  • All users can be set as target users, or a specific user (user A in the example of FIG. 7) can be set as the target user.
  • When a plurality of users are set as target users, each user does not have to make a separate utterance to store the information, which reduces the users' effort.
  • For information that several users, or all users, should share, it is therefore useful to set a plurality of users as target users.
  • For information that concerns only one person, it is useful to set a specific user as the target user.
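The disclosure leaves open how the target user is checked (the target information may be identification information or a voice feature amount). The following is a minimal sketch of the identifier-based case; all names are illustrative assumptions, not from the disclosure.

```python
def should_respond(entry, requesting_user_id):
    """Respond only if the stored target information covers the requesting user.

    `entry["targets"]` is assumed to be either the string "all" (all users are
    target users) or a set of user identifiers (specific target users).
    """
    targets = entry.get("targets", "all")
    return targets == "all" or requesting_user_id in targets
```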
  • The voice processing device 22 according to the second embodiment can output an appropriate response voice to the user without an explicit request from the user.
  • The configuration and operation of the voice processing device 22 according to the second embodiment are described in detail below.
  • FIG. 8 is an explanatory diagram showing the configuration of the voice processing device 22 according to the second embodiment.
  • As shown in FIG. 8, the voice processing device 22 according to the second embodiment includes a sound collection unit 232, a voice recognition unit 234, a sound source direction estimation unit 236, a semantic analysis unit 238, a storage unit 242, a learning unit 246, a response control unit 252, and a voice output unit 260.
  • The configurations of the sound collection unit 232, the voice recognition unit 234, the sound source direction estimation unit 236, the semantic analysis unit 238, and the voice output unit 260 are the same as those described in the first embodiment, so detailed description of them is omitted here.
  • The storage unit 242 stores the keyword extracted by the semantic analysis unit 238, the response information, and the sound source direction input from the sound source direction estimation unit 236 in association with one another. For example, when the user utters the voice "Remember to brush my teeth when I take a bath" in the dressing room, the storage unit 242 stores, as shown in FIG. 9, the keyword "take a bath", the sound source direction "50°" (the direction of the dressing room as seen from the voice processing device 22), and the response information "brush teeth" in association with one another.
  • The learning unit 246 generates, by machine learning, a recognition model for recognizing a sound produced in the situation indicated by a keyword, and the storage unit 242 stores the recognition model as the result of the machine learning by the learning unit 246.
  • The recognition model can, for example, output the keyword "take a bath" when the sound produced by closing the bathroom door is input. As another example, it may output the keyword "go running" when the sound of running shoes striking the entrance floor as they are put on is input. A specific example of how such a recognition model is used is described with reference to FIG. 10.
  • FIG. 10 is an explanatory diagram showing a specific example of the usage form of the recognition model.
  • When the sound of the bathroom door closing is input, the response control unit 252 identifies the keyword "take a bath" corresponding to that sound based on the recognition model. The response control unit 252 then extracts the response information "brush teeth" associated with the keyword "take a bath" from the information stored in the storage unit 242 (see FIG. 9).
  • The voice output unit 260 then outputs a response voice such as "When you take a bath, you wanted to remember to brush your teeth."
  • Similarly, when the sound of running shoes striking the entrance floor is input, the response control unit 252 identifies the keyword "go running" corresponding to that sound based on the recognition model. The response control unit 252 then extracts the response information "check the weather" associated with the keyword "go running" from the information stored in the storage unit 242 (see FIG. 9), and causes the voice output unit 260 to output a response voice such as "When you go running, you wanted to remember to check the weather."
  • Alternatively, the response control unit 252 may obtain weather information via the network and cause the voice output unit 260 to output, as the result of executing the process or action "check the weather", a response voice such as "The upcoming weather is rain." Such a configuration saves the user effort and therefore improves convenience for the user.
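The disclosure says only that the response control unit 252 may obtain weather information via the network and output the execution result of the action "check the weather". The sketch below shows one way such an action lookup could be wired up; `fetch_weather_forecast` is a hypothetical placeholder, not an API from the disclosure.

```python
def fetch_weather_forecast():
    # Hypothetical placeholder for a network call returning a short forecast string.
    return "rain later today"


# Map a stored response phrase to an executable action (cf. outputting the
# execution result of a process indicated by the response information).
ACTIONS = {
    "check the weather": fetch_weather_forecast,
}


def respond_with_action(response_info):
    action = ACTIONS.get(response_info)
    if action is None:
        # Fall back to simply reading the stored response information back.
        return f"You wanted to remember: {response_info}."
    return f"You wanted to {response_info}. {action().capitalize()}."
```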
  • FIG. 11 is a flowchart showing a schematic operation of the voice processing device 22 according to the second embodiment.
  • First, the learning unit 246 generates, by machine learning, a recognition model that recognizes sounds produced in the situation indicated by a keyword (S410).
  • The semantic analysis unit 238 analyzes the meaning of the speech uttered by the user, and when the user's request is to store information, the storage unit 242 stores the keyword and response information extracted from the speech in association with the sound source direction of the speech estimated by the sound source direction estimation unit 236 (S430).
  • The processing of S430 corresponds to the processing of S352 to S368 described with reference to FIG. 6.
  • The response control unit 252 then recognizes sounds using the recognition model generated by the learning unit 246, and controls the output of a response voice using the response information associated with the keyword identified based on that sound recognition (S450).
  • FIG. 12 is a flowchart showing a recognition model generation method.
  • First, the voice processing device 22 instructs the user to produce, a certain number of times, the sound that occurs in the situation indicated by the keyword (S411). For example, when the keyword is "take a bath", the user repeatedly opens and closes the bathroom door a certain number of times.
  • The learning unit 246 saves the produced sounds as positive example data for machine learning (S412).
  • Next, the voice processing device 22 instructs the user not to produce the sound that occurs in the situation indicated by the keyword for a certain period of time (S413).
  • During that period, the learning unit 246 saves the surrounding sounds as negative example data for machine learning (S414).
  • The learning unit 246 then performs machine learning using the saved positive example data and negative example data, and generates a recognition model that recognizes the situation indicated by the keyword (S415). The storage unit 242 stores the recognition model generated by the learning unit 246 (S416).
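The disclosure does not specify the learning algorithm or the audio features used for the recognition model, so the following sketch assumes crude log band-energy features and a logistic-regression classifier from scikit-learn; any audio front end and binary classifier could be substituted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def band_energies(clip, n_bands=32):
    """Crude feature vector: log energy in equally sized frequency bands."""
    spectrum = np.abs(np.fft.rfft(clip)) ** 2
    bands = np.array_split(spectrum, n_bands)
    return np.log1p(np.array([band.sum() for band in bands]))


def train_recognition_model(positive_clips, negative_clips):
    """Fit a binary classifier on positive clips (sounds produced in the
    keyword's situation, S412) and negative clips (ordinary ambient sound, S414)."""
    features = [band_energies(c) for c in positive_clips + negative_clips]
    labels = [1] * len(positive_clips) + [0] * len(negative_clips)
    return LogisticRegression(max_iter=1000).fit(np.array(features), np.array(labels))
```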
  • FIG. 13 is a flowchart showing response control using a recognition model.
  • When a sound is input to the sound collection unit 232, the response control unit 252 determines whether the input sound is recognized using the recognition model stored in the storage unit 242 (S452).
  • If it is, the response control unit 252 identifies the keyword corresponding to the sound based on the recognition model (S453).
  • The sound source direction estimation unit 236 also estimates the sound source direction of the sound (S454).
  • The response control unit 252 extracts the response information that is associated with the keyword identified based on the recognition model and with a sound source direction whose difference from the sound source direction estimated by the sound source direction estimation unit 236 is less than a predetermined reference (S455).
  • The response control unit 252 then generates a response voice using the extracted response information and causes the voice output unit 260 to output it (S456).
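Tying steps S452 to S456 together, a runtime check might look like the sketch below. It assumes the classifier and feature function from the training sketch above, and stored entries shaped like the FIG. 4 / FIG. 9 records; the 0.5 probability threshold and all names are illustrative assumptions.

```python
def handle_sound(model, keyword, entries, clip, estimated_direction_deg,
                 tolerance_deg=15.0):
    """If the clip is recognized by the keyword's model (S452), look up the
    response information whose stored direction is close enough to the estimated
    direction (S454, S455) and return the response text to be spoken (S456)."""
    features = band_energies(clip).reshape(1, -1)
    if model.predict_proba(features)[0, 1] < 0.5:      # S452: not recognized
        return None
    for entry in entries:                              # S455: direction-aware lookup
        if (entry["keyword"] == keyword
                and abs(entry["direction_deg"] - estimated_direction_deg) < tolerance_deg):
            return f"When you {keyword}, you wanted to remember: {entry['response']}."
    return None
```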
  • The sound produced in a given situation can differ depending on which user the situation involves. For example, even for the sound of the bathroom door being opened or closed, the sound produced when the father opens or closes the door and the sound produced when a child does so can differ. Consequently, when a recognition model has been generated from sounds produced a certain number of times by one user in S411 described with reference to FIG. 12, sounds produced by another user may not be recognized by that model. The second embodiment can therefore limit the users for whom a response voice is output based on sound input to a specific user.
  • FIG. 14 is a flowchart showing an application example of the second embodiment, and shows processing applied to S430 in FIG. 11. The processing of S304 to S316 and S340 to S368 in FIG. 14 is as described with reference to FIG. 6, so detailed description of it is omitted here.
  • When a keyword is extracted from the user's speech, the learning unit 246 determines whether a recognition model corresponding to that keyword is stored in the storage unit 242 (S320). If it is (S320/Yes), the learning unit 246 extracts, from the sounds input to the sound collection unit 232, a sound that matches or is similar to the sounds recognized by the recognition model (S324).
  • The learning unit 246 adds the sound extracted in S324 as positive example data (S328), performs machine learning again using the positive example data group including the added data, and regenerates the recognition model (S332).
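A sketch of the retraining step in S324 to S332, building on the training sketch above: candidate sounds that the current model already recognizes with high confidence are added to the positive example set and the model is fit again. The 0.8 confidence threshold is an assumed value.

```python
def retrain_with_new_sounds(model, positive_clips, negative_clips,
                            candidate_clips, confidence=0.8):
    """Add candidate clips the current model confidently recognizes (S324, S328)
    to the positive examples and regenerate the recognition model (S332)."""
    for clip in candidate_clips:
        features = band_energies(clip).reshape(1, -1)
        if model.predict_proba(features)[0, 1] >= confidence:
            positive_clips.append(clip)
    return train_recognition_model(positive_clips, negative_clips)
```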
  • With this configuration, positive example data for machine learning is added even when the user does not utter speech specifically for machine learning, so the accuracy of the recognition model can be improved without burdening the user.
  • The learning unit 246 may also generate, by machine learning using captured images obtained by the imaging unit, a recognition model that recognizes the situation corresponding to a keyword.
  • For example, "go running" can correspond to the keyword, and the situation corresponding to the keyword includes the clothes the user wears when going running.
  • When the user's running clothes are recognized based on such a recognition model, the response control unit 252 identifies the keyword "go running" and may control output of a response using the response information "check the weather" associated with that keyword.
  • FIG. 15 is an explanatory diagram showing a hardware configuration of the voice processing device 20.
  • As shown in FIG. 15, the voice processing device 20 includes a CPU (Central Processing Unit) 201, a ROM (Read Only Memory) 202, a RAM (Random Access Memory) 203, an input device 208, an output device 210, and the like.
  • The CPU 201 functions as an arithmetic processing device and a control device, and controls the overall operation of the voice processing device 20 according to various programs. The CPU 201 may also be a microprocessor.
  • The ROM 202 stores programs used by the CPU 201, calculation parameters, and the like.
  • The RAM 203 temporarily stores programs used during execution by the CPU 201, parameters that change as appropriate during that execution, and the like. These components are connected to one another by a host bus including a CPU bus. The functions of the voice recognition unit 234, the sound source direction estimation unit 236, the semantic analysis unit 238, the learning unit 246, the response control unit 252, and the like can be realized by the CPU 201, the ROM 202, and the RAM 203 operating in cooperation with software.
  • The input device 208 includes input means for the user to input information, such as a mouse, a keyboard, a touch panel, buttons, a microphone, switches, and levers, and an input control circuit that generates an input signal based on the user's input and outputs it to the CPU 201. By operating the input device 208, the user of the voice processing device 20 can input various data to the voice processing device 20 and instruct it to perform processing operations.
  • The output device 210 includes a display device, such as a liquid crystal display (LCD) device, an OLED (Organic Light Emitting Diode) device, or a lamp, and an audio output device such as a speaker or headphones. The display device displays captured images, generated images, and the like, while the audio output device converts audio data and the like into sound and outputs it.
  • The storage device 211 is a data storage device configured as an example of the storage unit of the voice processing device 20 according to the present embodiment.
  • The storage device 211 may include a storage medium, a recording device that records data on the storage medium, a reading device that reads data from the storage medium, a deleting device that deletes data recorded on the storage medium, and the like.
  • The storage device 211 stores programs executed by the CPU 201 and various data.
  • The drive 212 is a reader/writer for storage media and is built into or externally attached to the voice processing device 20.
  • The drive 212 reads information recorded on a mounted removable storage medium 24 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, and outputs it to the RAM 203.
  • The drive 212 can also write information to the removable storage medium 24.
  • The imaging device 213 includes an imaging optical system, such as a photographing lens and a zoom lens that collect light, and a signal conversion element such as a CCD (Charge Coupled Device) or CMOS (Complementary Metal Oxide Semiconductor) sensor.
  • The imaging optical system collects light emitted from a subject and forms a subject image on the signal conversion element, and the signal conversion element converts the formed subject image into an electrical image signal.
  • The communication device 215 is a communication interface configured with, for example, a communication device for connecting to the network 12.
  • The communication device 215 may be a wireless LAN (Local Area Network) compatible communication device, an LTE (Long Term Evolution) compatible communication device, or a wired communication device that performs wired communication.
  • The network 12 is a wired or wireless transmission path for information transmitted from devices connected to it.
  • For example, the network 12 may include public networks such as the Internet, a telephone network, and a satellite communication network, various LANs (Local Area Networks) including Ethernet (registered trademark), and WANs (Wide Area Networks).
  • The network 12 may also include a dedicated network such as an IP-VPN (Internet Protocol-Virtual Private Network).
  • The steps in the processing of the voice processing device 20 in this specification do not necessarily have to be processed in time series in the order described in the flowcharts.
  • For example, the steps in the processing of the voice processing device 20 may be processed in an order different from the order described in the flowcharts, or may be processed in parallel.
  • Some or all of the functions of the voice processing device 20 described above may be implemented in a cloud server connected to the voice processing device 20 via the network 12.
  • The cloud server may have functions corresponding to the voice recognition unit 234, the sound source direction estimation unit 236, the semantic analysis unit 238, the storage unit 240, and the response control unit 250, that is, it may function as the voice processing device.
  • In that case, the voice processing device 20 transmits a sound signal to the cloud server, and the cloud server can store the information or transmit the response voice to the voice processing device 20.
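The disclosure says only that the voice processing device 20 may transmit the sound signal to a cloud server that hosts the recognition, analysis, storage, and response functions. The endpoint, payload format, and use of HTTP in the sketch below are assumptions for illustration, not part of the disclosure.

```python
import requests


def send_audio_to_cloud(audio_bytes, server_url="https://example.invalid/agent"):
    """Hypothetical client side: post raw audio to a cloud agent endpoint and
    return the response text it decides to speak, if any."""
    reply = requests.post(
        server_url,
        data=audio_bytes,
        headers={"Content-Type": "application/octet-stream"},
        timeout=10,
    )
    reply.raise_for_status()
    return reply.json().get("response_text")
```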
  • It is also possible to create a computer program for causing hardware such as a CPU, a ROM, and a RAM built into the voice processing device 20 to exhibit functions equivalent to those of the components of the voice processing device 20 described above.
  • A storage medium storing the computer program is also provided.
  • (1) A voice processing device including: a response control unit that refers to a storage unit storing input information obtained based on recognition of a voice input to a sound input unit, and controls a response using the input information based on recognition of a situation associated with the input information.
  • (2) The voice processing device according to (1), wherein the input information includes a first part and a second part, and the response control unit controls the response using the second part included in the input information, based on the first part included in the input information stored in the storage unit being obtained by voice recognition as the recognition of the situation associated with the input information.
  • (3) The voice processing device according to (1) or (2), wherein a sound source direction of the sound at the time the sound was input to the sound input unit is stored in the storage unit as situation information in association with the input information, and the response control unit controls the response using the input information stored in the storage unit in association with the sound source direction, based on recognition of a sound source direction whose difference from the sound source direction stored as the situation information is less than a predetermined reference.
  • (4) The voice processing device according to any one of (1) to (3), wherein target information indicating a target user is stored in the storage unit in association with the input information, and the response control unit controls the response using the input information associated with the target information, based on recognition of the situation associated with the input information with respect to the target user indicated by the target information stored in the storage unit.
  • (5) The voice processing device according to (1), wherein the input information includes a first part indicating a situation and a second part, the storage unit stores a recognition model obtained by machine learning using data obtained in the situation indicated by the first part, and the response control unit controls the response using the second part, based on the situation indicated by the first part being recognized using the recognition model stored in the storage unit as the recognition of the situation associated with the input information.
  • (6) The voice processing device according to (5), further including a learning unit that, when a sound indicating the situation recognized by the recognition model is input, performs machine learning again using additional data obtained when that sound is input, and regenerates the recognition model.
  • (7) The voice processing device according to (2), wherein the plurality of pieces of input information stored in the storage unit include input information in which the first part is ambiguous or in which the result of semantic recognition of the first part is unknown.
  • (8) The voice processing device according to any one of (1) to (7), wherein the response control unit performs, as the control of the response, control to output at least part of the input information to an output unit.
  • (9) The voice processing device according to any one of (1) to (7), wherein the response control unit performs, as the control of the response, control to output to the output unit an execution result of a process or action indicated by at least part of the input information.
  • (10) The voice processing device according to any one of (1) to (9), further including: a voice recognition unit that recognizes the voice input to the sound input unit; and a situation recognition unit that recognizes the situation.
  • (11) A voice processing method including: referring to a storage unit storing input information obtained based on recognition of a voice input to a sound input unit; and controlling, by a processor, a response using the input information based on recognition of a situation associated with the input information.
  • 20 Voice processing device
  • 232 Sound collection unit
  • 234 Voice recognition unit
  • 236 Sound source direction estimation unit
  • 238 Semantic analysis unit
  • 240 Storage unit
  • 242 Storage unit
  • 246 Learning unit
  • 250 Response control unit
  • 252 Response control unit
  • 260 Voice output unit

Abstract

[Problem] To provide a voice processing device which improves the accuracy of a response to a user. [Solution] This voice processing device is provided with a response control unit which refers to a storage unit that stores input information obtained on the basis of the recognition of a voice input to a sound input unit, and controls a response by using the input information on the basis of the recognition of the situation related to the input information.

Description

Voice processing device and voice processing method
The present disclosure relates to a voice processing device and a voice processing method.
In recent years, voice processing devices having a voice agent function have become widespread. The voice agent function analyzes the meaning of speech uttered by a user and executes processing according to the meaning obtained by the analysis. For example, a voice processing device having a voice agent function can answer a question from the user, add a schedule entry, set a timer, and the like.
Regarding the voice agent function, Patent Document 1 discloses a method in which a user registers an abbreviation in a voice processing device in advance and recalls a complex phrase using the abbreviation. Patent Document 2 discloses a technique for understanding the meaning of a user's utterance using context information obtained when the user utters the voice.
Patent Document 1: JP 2016-114395 A. Patent Document 2: JP 2015-122104 A.
However, in the method described in Patent Document 1, even if the user utters the same word as a word registered in advance, the meaning intended by that word may differ depending on the situation. In the technique described in Patent Document 2, context information is not registered in advance and only the context information obtained when the user speaks is used, so there is room for improvement in accuracy.
In view of this, the present disclosure proposes a new and improved voice processing device and voice processing method capable of improving the accuracy of responses to the user.
According to the present disclosure, there is provided a voice processing device including a response control unit that refers to a storage unit storing input information obtained based on recognition of a voice input to a sound input unit, and controls a response using the input information based on recognition of a situation associated with the input information.
According to the present disclosure, there is also provided a voice processing method in which a processor refers to the storage unit storing the input information obtained based on recognition of the voice input to the sound input unit, and controls a response using the input information based on recognition of the situation associated with the input information.
As described above, according to the present disclosure, it is possible to improve the accuracy of responses to the user. Note that the above effect is not necessarily limiting; together with or instead of the above effect, any of the effects described in this specification, or other effects that can be understood from this specification, may be achieved.
FIG. 1 is an explanatory diagram showing an overview of a voice processing device 20 according to an embodiment of the present disclosure.
FIG. 2 is an explanatory diagram showing the configuration of the voice processing device 21 according to the first embodiment.
FIG. 3 is an explanatory diagram showing a usage pattern of the voice processing device 21.
FIG. 4 is an explanatory diagram showing a specific example of stored information.
FIG. 5 is an explanatory diagram showing a usage pattern of the voice processing device 21.
FIG. 6 is a flowchart showing the operation of the voice processing device 21 according to the first embodiment.
FIG. 7 is an explanatory diagram showing an application example of the first embodiment.
FIG. 8 is an explanatory diagram showing the configuration of the voice processing device 22 according to the second embodiment.
FIG. 9 is an explanatory diagram showing a specific example of stored information.
FIG. 10 is an explanatory diagram showing a specific example of how the recognition model is used.
FIG. 11 is a flowchart showing the schematic operation of the voice processing device 22 according to the second embodiment.
FIG. 12 is a flowchart showing a method of generating the recognition model.
FIG. 13 is a flowchart showing response control using the recognition model.
FIG. 14 is a flowchart showing an application example of the second embodiment.
FIG. 15 is an explanatory diagram showing the hardware configuration of the voice processing device 20.
Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and redundant description is omitted.
In this specification and the drawings, a plurality of components having substantially the same functional configuration may also be distinguished by appending different letters to the same reference numeral. However, when there is no particular need to distinguish each of such components, only the same reference numeral is given to each of them.
The description proceeds in the following order.
 0. Overview of the voice processing device
 1. First embodiment
  1-1. Configuration of the voice processing device according to the first embodiment
  1-2. Operation of the voice processing device according to the first embodiment
  1-3. Effects
  1-4. Application example of the first embodiment
 2. Second embodiment
  2-1. Configuration of the voice processing device according to the second embodiment
  2-2. Operation of the voice processing device according to the second embodiment
  2-3. Effects
  2-4. Application example of the second embodiment
 3. Hardware configuration
 4. Conclusion
<0. Overview of the voice processing device>
First, an overview of the voice processing device according to an embodiment of the present disclosure will be described with reference to FIG. 1.
FIG. 1 is an explanatory diagram showing an overview of the voice processing device 20 according to an embodiment of the present disclosure. As shown in FIG. 1, the voice processing device 20 is placed in a house as an example. The voice processing device 20 has a voice agent function that analyzes the meaning of speech uttered by the user of the voice processing device 20 and executes processing according to the meaning obtained by the analysis.
For example, the voice processing device 20 stores a keyword and response information in association with each other; when the user's voice has the meaning of requesting information and the stored keyword is recognized in the voice, a response to the user is executed using the response information stored in association with that keyword. In the example illustrated in FIG. 1, the voice processing device 20 stores the keyword "Company A's pasta" and the response information "the boiling time is 5 minutes" in association with each other. When the user utters "What about Company A's pasta?", the voice processing device 20 uses the response information "the boiling time is 5 minutes" to respond with the voice "I remember that the boiling time of Company A's pasta is 5 minutes."
In FIG. 1, a stationary device is shown as the voice processing device 20, but the voice processing device 20 is not limited to a stationary device. For example, the voice processing device 20 may be a portable information processing device such as a smartphone, a mobile phone, a PHS (Personal Handyphone System), a portable music player, a portable video processing device, or a portable game machine, or it may be an autonomously moving robot.
Here, some keywords are ambiguous, so different response information may be associated with the same keyword. For example, the keyword "Company A's pasta" can be associated with response information indicating the recipe described above, or with response information indicating a storage location. A scheme for extracting the response information that matches the user's intention from a keyword contained in a voice has therefore been desired.
The inventor arrived at the voice processing device 20 according to the embodiments of the present disclosure with the above circumstances in mind. The voice processing device 20 according to the embodiments of the present disclosure can improve the accuracy of responses to voice input by extracting, from a keyword included in the voice, the response information that matches the user's intention. In the following, several embodiments of the present disclosure are described in detail in turn.
<1. First embodiment>
The voice processing device 20 according to the first embodiment stores a keyword, a sound source direction, and response information in association with one another, and executes a response using the response information when a voice including the keyword arrives from a sound source direction corresponding to the stored sound source direction. The keyword and the response information are examples of input information obtained based on recognition of the input voice, and the sound source direction is an example of situation information indicating a situation related to the generation of a sound or voice. According to the voice processing device 20 of the first embodiment, even when different response information is associated with the same keyword, the response information that matches the user's intention can be extracted based on the sound source direction of the voice that includes the keyword.
In the following, in order to distinguish the voice processing devices 20 of the respective embodiments, the voice processing device according to the first embodiment may be referred to as the voice processing device 21 and the voice processing device according to the second embodiment as the voice processing device 22. The voice processing devices 21 and 22 of the respective embodiments may also simply be referred to collectively as the voice processing device 20.
<<1-1. Configuration of the voice processing device according to the first embodiment>>
FIG. 2 is an explanatory diagram showing the configuration of the voice processing device 21 according to the first embodiment. As shown in FIG. 2, the voice processing device 21 according to the first embodiment includes a sound collection unit 232, a voice recognition unit 234, a sound source direction estimation unit 236, a semantic analysis unit 238, a storage unit 240, a response control unit 250, and a voice output unit 260.
(Sound collection unit)
The sound collection unit 232 functions as a sound input unit that obtains an electrical sound signal from airborne vibrations including environmental sounds and speech. The sound collection unit 232 outputs the acquired sound signal to the voice recognition unit 234 and the sound source direction estimation unit 236.
(Voice recognition unit)
The voice recognition unit 234 detects a speech signal in the sound signal input from the sound collection unit 232, recognizes the speech signal, and obtains a character string representing the speech uttered by the user.
(Sound source direction estimation unit)
The sound source direction estimation unit 236 is an example of a situation recognition unit and estimates the sound source direction of the sound that reached the sound collection unit 232. When the sound collection unit 232 is composed of a plurality of sound collection elements, the sound source direction estimation unit 236 estimates the sound source direction based on the phase difference between the sound signals obtained by the individual elements. When the sound signal is a speech signal, the direction of the user as seen from the voice processing device 21 is estimated as the sound source direction.
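The disclosure states only that the direction is estimated from the phase difference between the sound collection elements and does not give an algorithm. The following is a minimal sketch of one common approach, assuming a two-element microphone array with known spacing; the function and variable names are illustrative and not taken from the disclosure.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate speed of sound in air


def estimate_direction_deg(sig_a, sig_b, sample_rate, mic_spacing_m):
    """Estimate a sound source direction in degrees from two microphone signals.

    The lag that maximizes the cross-correlation gives the time difference of
    arrival (TDOA), which is then converted into an angle relative to the
    array's broadside (0 degrees).
    """
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = np.argmax(corr) - (len(sig_b) - 1)   # lag in samples
    tdoa = lag / sample_rate                   # lag in seconds

    # Clip to the valid range of arcsin to stay robust against noisy estimates.
    ratio = np.clip(SPEED_OF_SOUND * tdoa / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))
```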
(Semantic analysis unit, storage unit)
The semantic analysis unit 238 analyzes the meaning of the character string input from the voice recognition unit 234. The semantic analysis may be realized by machine learning over a prepared utterance corpus, by rules, or by a combination of the two. Morphological analysis, which is part of the semantic analysis processing, provides a mechanism for assigning attributes to individual words and maintains a dictionary internally. Using this mechanism and dictionary, the semantic analysis unit 238 can assign to each word in the utterance an attribute indicating what kind of word it is, for example a person's name, a place name, or a common noun.
The semantic analysis unit 238 performs different processing according to the meaning obtained by the analysis. For example, when the analyzed meaning is a request to store information, the semantic analysis unit 238 extracts a keyword from the character string input from the voice recognition unit 234 as a first part, and extracts response information corresponding to the keyword as a second part. The semantic analysis unit 238 then stores the extracted keyword, the response information, and the sound source direction input from the sound source direction estimation unit 236 in the storage unit 240 in association with one another. A specific example of storing such information is described below with reference to FIGS. 3 and 4.
 図3は、音声処理装置21の利用形態を示す説明図である。図3に示した例では、キッチンエリアA1に含まれる位置P1で、ユーザが「A社のパスタの茹で時間は5分と記憶しておいて。」という音声を発話している。この場合、意味解析部238は、音声認識部234から入力される「A社のパスタの茹で時間は5分と記憶しておいて。」という文字列から、キーワードとして「A社のパスタ」を抽出し、応答情報として「茹で時間は5分」を抽出する。一方、音源方向推定部236は、音声処理装置21から見た位置P1の方向と音声処理装置21の基準方向dとが成す角度である30°を、音源方向として推定する。このため、意味解析部238は、図4に示したように、キーワード「A社のパスタ」、音源方向「30°」、応答情報「茹で時間は5分」を関連付けて記憶部240に記憶させる。 FIG. 3 is an explanatory diagram showing a usage pattern of the voice processing device 21. In the example shown in FIG. 3, at a position P1 included in the kitchen area A1, the user utters a voice saying “Remember the time of 5 minutes for cooking with A company's pasta”. In this case, the semantic analysis unit 238 inputs “pastor of company A” as a keyword from the character string “store the pasting time of pasta of company A as 5 minutes” input from the speech recognition unit 234. Extract “boiled time is 5 minutes” as response information. On the other hand, the sound source direction estimation unit 236 estimates 30 °, which is an angle formed by the direction of the position P1 viewed from the sound processing device 21 and the reference direction d of the sound processing device 21, as the sound source direction. For this reason, as shown in FIG. 4, the semantic analysis unit 238 stores the keyword “A company pasta”, the sound source direction “30 °”, and the response information “boiled for 5 minutes” in the storage unit 240 in association with each other. .
FIG. 3 further shows an example in which, at a position P2 included in the storage peripheral area A2, the user utters the voice "Remember that I put Company A's pasta on the top shelf of the storage." In this case, the semantic analysis unit 238 extracts the keyword "Company A's pasta" and the response information "placed on the top shelf of the storage" from the character string "Remember that I put Company A's pasta on the top shelf of the storage" input from the speech recognition unit 234.
Meanwhile, the sound source direction estimation unit 236 estimates −20°, the angle formed by the direction of the position P2 as seen from the voice processing device 21 and the reference direction d of the voice processing device 21, as the sound source direction. The semantic analysis unit 238 therefore stores the keyword "Company A's pasta", the sound source direction "−20°", and the response information "placed on the top shelf of the storage" in the storage unit 240 in association with one another, as shown in FIG. 4.
(Response control unit)
When the meaning obtained by the analysis of the semantic analysis unit 238 requests a response of information, the response control unit 250 refers to the storage unit 240 and controls the response using response information, on the basis of the recognition that a voice has arrived from the sound source direction associated with that response information and that the voice includes the keyword associated with that response information. Here, the response control unit 250 may treat the recognition of a sound source direction whose difference from the sound source direction associated with the response information is less than a predetermined reference as the arrival of a voice from the sound source direction associated with the response information. The predetermined reference may be, for example, a value between 10° and 20°.
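As a non-authoritative sketch of the direction-tolerant matching described above (reusing the ResponseStore from the earlier sketch and assuming a 15° reference value purely for illustration):

```python
def find_response(store: ResponseStore, keyword: str, direction_deg: float,
                  tolerance_deg: float = 15.0) -> str | None:
    """Return response information whose keyword matches and whose stored sound
    source direction differs from the observed one by less than the reference."""
    for entry in store.entries:
        if entry.keyword == keyword and abs(entry.direction_deg - direction_deg) < tolerance_deg:
            return entry.response_info
    return None

# A query from the kitchen (around 30 degrees) picks the boiling-time entry:
print(find_response(store, "Company A's pasta", 28.0))
```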
(Voice output unit)
The voice output unit 260 outputs a response voice to the user in accordance with the control from the response control unit 250. However, the voice output unit 260 is merely an example of an output unit that outputs a response, and the voice processing device 21 may include a display unit that outputs the response by display. A specific example of the response voice output by the voice output unit 260 is described below with reference to FIG. 5.
FIG. 5 is an explanatory diagram showing a usage pattern of the voice processing device 21. In the example shown in FIG. 5, at the position P1 included in the kitchen area A1, the user utters the voice "What about Company A's pasta?" The sound source direction of this voice is "30°". The response control unit 250 therefore refers to the storage unit 240 and extracts the response information "the boiling time is five minutes" associated with the keyword "Company A's pasta" and the sound source direction "30°". Using the extracted response information, the response control unit 250 causes the voice output unit 260 to output the response voice "I remember that the boiling time for Company A's pasta is five minutes," as shown in FIG. 5.
On the other hand, when the user utters the voice "What about Company A's pasta?" at the position P2 included in the storage peripheral area A2, the response control unit 250 refers to the storage unit 240 and extracts the response information "placed on the top shelf of the storage" associated with the keyword "Company A's pasta" and the sound source direction "−20°". Using the extracted response information, the response control unit 250 causes the voice output unit 260 to output the response voice "I remember that Company A's pasta was placed on the top shelf of the storage," as shown in FIG. 5.
<<1-2. Operation of the voice processing device according to the first embodiment>>
The configuration of the voice processing device 21 according to the first embodiment has been described above. Next, the operation of the voice processing device 21 according to the first embodiment is summarized with reference to FIG. 6.
FIG. 6 is a flowchart showing the operation of the voice processing device 21 according to the first embodiment. First, when a voice uttered by the user is input to the sound collection unit 232 (S304), the speech recognition unit 234 recognizes the voice, and the semantic analysis unit 238 analyzes the meaning of the character string input from the speech recognition unit 234 (S308).
When the semantic analysis unit 238 understands that the user's request is a response of information (S312/Yes), it extracts a keyword from the character string input from the speech recognition unit 234 (S316). Further, the sound source direction estimation unit 236 estimates the sound source direction of the voice (S340), and the response control unit 250 extracts response information from the storage unit 240 on the basis of the keyword and the sound source direction of the voice (S344). The response control unit 250 then generates a response voice using the extracted response information, and the voice output unit 260 outputs the response voice (S348).
On the other hand, when the semantic analysis unit 238 understands that the user's request is the storage of information (S312/No, S352/Yes), it extracts a keyword and response information from the character string input from the speech recognition unit 234 (S356). Further, the sound source direction estimation unit 236 estimates the sound source direction of the voice (S360), and the semantic analysis unit 238 stores the keyword, the response information, and the sound source direction input from the sound source direction estimation unit 236 in the storage unit 240 in association with one another (S364). Thereafter, for example, the voice output unit 260 outputs a registration completion notification indicating that the storage of the information has been completed (S368).
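A rough sketch of the branching in FIG. 6 might look like the following, where analyze_meaning is a hypothetical stand-in for the semantic analysis step and the other helpers come from the earlier sketches:

```python
def handle_utterance(store: ResponseStore, text: str, direction_deg: float) -> str:
    """Illustrative sketch of the FIG. 6 flow: decide whether the utterance asks
    to store information or asks for a response built from stored information."""
    intent, keyword, response_info = analyze_meaning(text)  # assumed semantic-analysis helper
    if intent == "respond":
        answer = find_response(store, keyword, direction_deg)
        return answer if answer else "I have nothing stored for that."
    if intent == "store":
        store.remember(keyword, direction_deg, response_info)
        return "Registered."
    return ""
```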
<<1-3. Effects>>
According to the first embodiment described above, various effects can be obtained. For example, according to the first embodiment, different response voices can be output for voices that include the same keyword, on the basis of the recognition of the situation (in the above example, the estimation of the sound source direction). It is therefore possible to output an appropriate response voice even for an ambiguous voice such as "What about Company A's pasta?", which can have multiple meanings, such as asking for the boiling time or asking where the pasta is kept.
<<1-4. Application example>>
As an application example of the first embodiment described above, the response control unit 250 may control whether or not to output a response voice to the voice uttered by a user depending on whether or not that user is a target user. For example, when information is stored on the basis of a voice uttered by a certain user, the storage unit 240 also stores target information indicating that user as the target user. The target information may be identification information of the target user, or may be a feature amount of the target user's voice. The response control unit 250 may then output the response information when the user who uttered the voice requesting a response of information is the target user and the other conditions (such as the sound source direction) are satisfied.
Note that the target user and the user who requested the storage of information may be different. For example, when storing information, the voice processing device 21 may instruct the user who requested the storage to input the user to be set as the target user, and set the input user as the target user. For example, as shown in FIG. 7, it is possible to set all users as target users, or to set a specific user (user A in the example of FIG. 7) as the target user.
By setting a plurality of users, such as all users, as target users, each user does not have to make a separate utterance to store the information, so the users' effort can be reduced. For highly versatile information, setting a plurality of users, such as all users, as target users in this way is useful. On the other hand, for information specific to a certain user, it is useful to set that specific user as the target user.
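Purely as an illustration (the field and function names below are assumptions, reusing the definitions from the earlier sketches), the target-user check could be added to the stored entries like this:

```python
from dataclasses import dataclass

@dataclass
class StoredEntryWithTarget(StoredEntry):
    target_users: frozenset[str] = frozenset({"*"})  # "*" stands for "all users"

def may_respond(entry: StoredEntryWithTarget, speaker_id: str) -> bool:
    """Respond only when the speaker is (one of) the registered target user(s)."""
    return "*" in entry.target_users or speaker_id in entry.target_users
```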
<2. Second Embodiment>
Next, a voice processing device 22 according to a second embodiment of the present disclosure will be described. The voice processing device 22 according to the second embodiment can output an appropriate response voice to the user even without an explicit request from the user. The configuration and operation of the voice processing device 22 according to the second embodiment are described in detail below.
<<2-1. Configuration of the voice processing device according to the second embodiment>>
FIG. 8 is an explanatory diagram showing the configuration of the voice processing device 22 according to the second embodiment. As shown in FIG. 8, the voice processing device 22 according to the second embodiment includes a sound collection unit 232, a speech recognition unit 234, a sound source direction estimation unit 236, a semantic analysis unit 238, a storage unit 242, a learning unit 246, a response control unit 252, and a voice output unit 260. The configurations of the sound collection unit 232, the speech recognition unit 234, the sound source direction estimation unit 236, the semantic analysis unit 238, and the voice output unit 260 are as described in the first embodiment, and a detailed description of them is omitted here.
As in the first embodiment, the storage unit 242 stores the keyword extracted by the semantic analysis unit 238, the response information, and the sound source direction input from the sound source direction estimation unit 236 in association with one another. For example, when the user utters the voice "Remember that I brush my teeth when I get out of the bath" at the dressing room, the storage unit 242 stores the keyword "got out of the bath", the sound source direction "50°" (the direction of the dressing room as seen from the voice processing device 22), and the response information "brush teeth" in association with one another, as shown in FIG. 9. As a result, on the basis of the recognition that a voice has arrived from the "50°" direction and that the voice includes the keyword "got out of the bath", the response voice "I remember that you brush your teeth when you get out of the bath" can be output, as in the first embodiment.
Further, the learning unit 246 generates, by machine learning, a recognition model for recognizing a sound that occurs in the situation indicated by a certain keyword, and the storage unit 242 stores the recognition model, which is the learning result of the machine learning by the learning unit 246. According to such a recognition model, for example, when the sound that occurs when the bath door is closed is input, the keyword "got out of the bath" can be output. As another example, when the sound of running shoes tapping the entrance floor as they are put on is input, the keyword "going running" can be output. A specific example of how such a recognition model is used is described below with reference to FIG. 10.
FIG. 10 is an explanatory diagram showing a specific example of how the recognition model is used. As shown in FIG. 10, when the sound of the bathroom door being closed (a bang) occurs at the dressing room P3, the response control unit 252 identifies the keyword "got out of the bath" corresponding to that sound on the basis of the recognition model. The response control unit 252 then extracts the response information "brush teeth" associated with the keyword "got out of the bath" on the basis of the information stored in the storage unit 242 (see FIG. 9), and causes the voice output unit 260 to output the response voice "I remember that you brush your teeth when you get out of the bath."
Similarly, as shown in FIG. 10, when the sound of running shoes tapping the entrance floor (a tap-tap) occurs at the entrance P4, the response control unit 252 identifies the keyword "going running" corresponding to that sound on the basis of the recognition model. The response control unit 252 then extracts the response information "check the weather" associated with the keyword "going running" on the basis of the information stored in the storage unit 242 (see FIG. 9), and causes the voice output unit 260 to output the response voice "I remember that you check the weather when you go running."
Here, the response control unit 252 may obtain weather information via the network and cause the voice output unit 260 to output a response voice such as "The weather later will be rain" as the execution result of the process or action "check the weather". With such a configuration, the user's effort is reduced, so the user's convenience can be improved.
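As a small, assumed sketch of this behavior, the stored response information could either be read back or, for entries that name an action, executed (fetch_weather_forecast is a hypothetical helper, not an API named in the patent):

```python
def render_response(response_info: str) -> str:
    """Read the stored text back, or carry out the process it names instead."""
    if response_info == "check the weather":
        return fetch_weather_forecast()   # assumed network lookup, e.g. "The weather later will be rain."
    return f"I remember: {response_info}."
```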
<<2-2. Operation of the voice processing device according to the second embodiment>>
Next, the operation of the voice processing device 22 according to the second embodiment is summarized with reference to FIGS. 11 to 13.
FIG. 11 is a flowchart showing the schematic operation of the voice processing device 22 according to the second embodiment. As shown in FIG. 11, the learning unit 246 generates, by machine learning, a recognition model that recognizes a sound occurring in the situation indicated by a keyword (S410).
The semantic analysis unit 238 then analyzes the meaning of the voice uttered by the user, and when the user's request is the storage of information, the storage unit 242 stores the keyword and the response information extracted from the voice in association with the sound source direction of the voice estimated by the sound source direction estimation unit 236 (S430). The processing of S430 corresponds to the processing of S352 to S368 described with reference to FIG. 6.
Thereafter, the response control unit 252 performs sound recognition using the recognition model generated by the learning unit 246, and controls the output of a response voice using the response information associated with the keyword identified on the basis of the sound recognition (S450).
The processing of S410 is described more specifically below with reference to FIG. 12, and the processing of S450 with reference to FIG. 13.
FIG. 12 is a flowchart showing a method of generating the recognition model. First, when the user utters a certain keyword, the voice processing device 22 instructs the user to produce, a certain number of times, the sound that occurs in the situation indicated by that keyword (S411). For example, when the keyword is "got out of the bath", the user repeatedly opens and closes the bathroom door a certain number of times. The learning unit 246 then stores the produced sounds as positive example data for machine learning (S412).
Subsequently, the voice processing device 22 instructs the user not to produce the sound that occurs in the situation indicated by the keyword for a certain period of time (S413). The learning unit 246 then stores the surrounding sounds as negative example data for machine learning (S414).
Thereafter, the learning unit 246 performs machine learning using the saved positive example data and negative example data, and generates a recognition model that recognizes the situation indicated by the keyword (S415). The storage unit 242 then stores the recognition model generated by the learning unit 246 (S416).
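A minimal sketch of such a training step, assuming scikit-learn's SVC as the learner and a hypothetical extract_features helper (the patent does not specify a particular classifier or feature set):

```python
import numpy as np
from sklearn.svm import SVC

def train_recognition_model(positive_clips, negative_clips):
    """Train a binary sound recognizer for one keyword from positive examples
    (sounds produced on request, S411-S412) and negative examples
    (surrounding sounds recorded afterwards, S413-S414)."""
    X = np.array([extract_features(clip) for clip in positive_clips + negative_clips])
    y = np.array([1] * len(positive_clips) + [0] * len(negative_clips))
    model = SVC(probability=True)
    model.fit(X, y)
    return model
```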
FIG. 13 is a flowchart showing response control using the recognition model. First, when a sound is input to the sound collection unit 232 (S451), the response control unit 252 determines whether the input sound is a sound recognized using the recognition model stored in the storage unit 242 (S452). When the input sound is a sound recognized using the recognition model (S452/Yes), the response control unit 252 identifies the keyword corresponding to the sound on the basis of the recognition model (S453). The sound source direction estimation unit 236 also estimates the sound source direction of the sound (S454).
The response control unit 252 then extracts the response information that is associated with the keyword identified on the basis of the recognition model and that is associated with a sound source direction whose difference from the sound source direction estimated by the sound source direction estimation unit 236 is less than the predetermined reference (S455). Further, the response control unit 252 generates a response voice using the extracted response information, and causes the voice output unit 260 to output the response voice (S456).
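Tying the above together, the S451 to S456 flow could be sketched as follows, reusing the illustrative helpers from the earlier sketches and assuming keyword_models maps each keyword to its per-keyword sound recognizer (an assumption about data layout, not taken from the patent):

```python
def respond_to_sound(store: ResponseStore, keyword_models: dict, audio_clip,
                     direction_deg: float) -> str:
    """Sketch of FIG. 13: sound in (S451), recognition check (S452), keyword
    identification (S453), direction estimated by the caller (S454),
    direction-tolerant lookup (S455), and response generation (S456)."""
    features = extract_features(audio_clip)                   # assumed feature helper
    for keyword, model in keyword_models.items():
        if model.predict([features])[0] == 1:                 # S452 / S453
            response_info = find_response(store, keyword, direction_deg)  # S455
            if response_info:
                return render_response(response_info)         # S456
    return ""
```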
<<2-3. Effects>>
According to the second embodiment described above, various effects can be obtained. For example, according to the second embodiment, an appropriate response voice can be output to the user on the basis of the recognition of the situation (in the above example, the recognition of a sound and the estimation of the sound source direction), even if the user does not explicitly utter a voice.
The sound that occurs in a certain situation can also differ depending on which user the situation relates to. For example, even for the sound of the bathroom door being opened and closed, the sound produced when the father opens and closes the door can differ from the sound produced when a child does so. For this reason, when the recognition model has been generated from sounds that a certain user produced a certain number of times in S411 described with reference to FIG. 12, sounds produced by another user may not be recognized by that recognition model. According to the second embodiment, it is therefore possible to limit the output of a response voice based on a sound input to a specific user.
<<2-4. Application example>>
As an application example of the second embodiment described above, a method of improving the accuracy of the recognition model while the voice processing device 22 is in use will be described.
FIG. 14 is a flowchart showing the application example of the second embodiment, and shows processing applied to S430 in FIG. 11. The processing of S304 to S316 and S340 to S368 in FIG. 14 is as described with reference to FIG. 6, and a detailed description of it is omitted here.
When the request made by the user's voice is a response of information (S312/Yes) and a keyword is extracted from the voice (S316), the learning unit 246 determines whether a recognition model corresponding to the keyword is stored in the storage unit 242 (S320/Yes). The learning unit 246 then extracts, from the sounds input to the sound collection unit 232, a sound that matches or is similar to the sound recognized by the recognition model (S324).
Further, the learning unit 246 adds the sound extracted in S324 as positive example data (S328), performs machine learning again using the positive example data group including the added positive example data, and regenerates the recognition model (S332).
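As a rough sketch of S324 to S332 (the threshold value and helper names are assumptions), the self-improving retraining loop could look like this:

```python
def refresh_model(model, positive_clips, negative_clips, recent_clips, threshold: float = 0.8):
    """Add recently heard sounds that the current model already matches well as
    new positive examples (S324, S328), then retrain the model (S332)."""
    for clip in recent_clips:
        features = extract_features(clip)
        if model.predict_proba([features])[0][1] >= threshold:  # matching or similar sound
            positive_clips.append(clip)
    return train_recognition_model(positive_clips, negative_clips)
```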
According to this application example, positive example data for machine learning is added without the user having to utter a voice for the machine learning, so the accuracy of the recognition model can be improved without the user feeling burdened.
In the above description, sound recognition has been described as an example of recognizing the situation corresponding to a keyword, but the situation corresponding to a keyword can also be recognized by other methods. For example, when the voice processing device 22 includes an imaging unit, the learning unit 246 may generate, by machine learning using captured images obtained by the imaging unit, a recognition model that recognizes the situation corresponding to a keyword. Here, the keyword may be, for example, "going running", and the situation corresponding to the keyword may be the clothes the user wears when going running. When those clothes are recognized on the basis of the recognition model, the response control unit 252 may identify the keyword "going running" and control the response output using the response information "check the weather" associated with the keyword "going running".
<3. Hardware configuration>
The embodiments of the present disclosure have been described above. Information processing such as the voice recognition and the response control described above is realized by the cooperation of software and the hardware of the voice processing device 20 described below.
FIG. 15 is an explanatory diagram showing the hardware configuration of the voice processing device 20. As shown in FIG. 15, the voice processing device 20 includes a CPU (Central Processing Unit) 201, a ROM (Read Only Memory) 202, a RAM (Random Access Memory) 203, an input device 208, an output device 210, a storage device 211, a drive 212, an imaging device 213, and a communication device 215.
The CPU 201 functions as an arithmetic processing device and a control device, and controls the overall operation in the voice processing device 20 in accordance with various programs. The CPU 201 may also be a microprocessor. The ROM 202 stores programs, calculation parameters, and the like used by the CPU 201. The RAM 203 temporarily stores programs used in the execution by the CPU 201, parameters that change as appropriate during that execution, and the like. These components are connected to one another by a host bus including a CPU bus. Functions such as the speech recognition unit 234, the sound source direction estimation unit 236, the semantic analysis unit 238, the learning unit 246, and the response control unit 252 can be realized by the cooperation of the CPU 201, the ROM 202, the RAM 203, and software.
The input device 208 includes input means for the user to input information, such as a mouse, a keyboard, a touch panel, buttons, a microphone, switches, and levers, and an input control circuit that generates an input signal on the basis of the user's input and outputs the input signal to the CPU 201. By operating the input device 208, the user of the voice processing device 20 can input various data to the voice processing device 20 and instruct it to perform processing operations.
The output device 210 includes display devices such as a liquid crystal display (LCD) device, an OLED (Organic Light Emitting Diode) device, and lamps. The output device 210 further includes audio output devices such as a speaker and headphones. For example, the display device displays captured images, generated images, and the like, while the audio output device converts audio data and the like into sound and outputs it.
The storage device 211 is a device for data storage configured as an example of the storage unit of the voice processing device 20 according to the present embodiments. The storage device 211 may include a storage medium, a recording device that records data on the storage medium, a reading device that reads data from the storage medium, a deletion device that deletes data recorded on the storage medium, and the like. The storage device 211 stores the programs executed by the CPU 201 and various data.
The drive 212 is a reader/writer for storage media, and is built into or externally attached to the voice processing device 20. The drive 212 reads information recorded on a mounted removable storage medium 24, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, and outputs it to the RAM 203. The drive 212 can also write information to the removable storage medium 24.
The imaging device 213 includes an imaging optical system, such as a photographing lens and a zoom lens that collect light, and a signal conversion element such as a CCD (Charge Coupled Device) or CMOS (Complementary Metal Oxide Semiconductor) sensor. The imaging optical system collects light emitted from a subject to form a subject image on the signal conversion element, and the signal conversion element converts the formed subject image into an electrical image signal.
The communication device 215 is a communication interface configured with, for example, a communication device for connecting to the network 12. The communication device 215 may be a wireless LAN (Local Area Network) compatible communication device, an LTE (Long Term Evolution) compatible communication device, or a wired communication device that performs wired communication.
The network 12 is a wired or wireless transmission path for information transmitted from devices connected to the network 12. For example, the network 12 may include public networks such as the Internet, a telephone network, and a satellite communication network, various LANs (Local Area Networks) including Ethernet (registered trademark), WANs (Wide Area Networks), and the like. The network 12 may also include dedicated line networks such as an IP-VPN (Internet Protocol-Virtual Private Network).
<4. Supplement>
The preferred embodiments of the present disclosure have been described above in detail with reference to the accompanying drawings, but the technical scope of the present disclosure is not limited to these examples. It is obvious that a person having ordinary knowledge in the technical field of the present disclosure can conceive of various changes or modifications within the scope of the technical idea described in the claims, and it is understood that these naturally also belong to the technical scope of the present disclosure.
For example, the steps in the processing of the voice processing device 20 in this specification do not necessarily have to be processed in time series in the order described in the flowcharts. For example, the steps in the processing of the voice processing device 20 may be processed in an order different from the order described in the flowcharts, or may be processed in parallel.
Some of the functions of the voice processing device 20 described above may also be implemented in a cloud server connected to the voice processing device 20 via the network 12. For example, the cloud server may have functions corresponding to the speech recognition unit 234, the sound source direction estimation unit 236, the semantic analysis unit 238, the storage unit 240, and the response control unit 250, that is, may function as a voice processing device. In this case, the voice processing device 20 transmits an audio signal to the cloud server, and the cloud server can store the information or transmit the response voice to the voice processing device 20.
It is also possible to create a computer program for causing hardware such as the CPU, ROM, and RAM built into the voice processing device 20 to exhibit functions equivalent to those of the configurations of the voice processing device 20 described above. A storage medium storing the computer program is also provided.
Furthermore, the effects described in this specification are merely explanatory or illustrative and are not limiting. That is, the technology according to the present disclosure can exhibit other effects that are apparent to those skilled in the art from the description of this specification, in addition to or instead of the above effects.
The following configurations also belong to the technical scope of the present disclosure.
(1)
A voice processing device including:
a response control unit that refers to a storage unit storing input information obtained based on recognition of a voice input to a sound input unit, and controls a response using the input information based on recognition of a situation associated with the input information.
(2)
The voice processing device according to (1), in which
the input information includes a first part and a second part, and
the response control unit controls the response using the second part included in the input information, based on the first part included in the input information stored in the storage unit having been obtained by the recognition of a voice and on the situation associated with the input information having been recognized.
(3)
The voice processing device according to (1) or (2), in which
the storage unit stores, as situation information, the sound source direction of the voice when the voice was input to the sound input unit, in association with the input information, and
the response control unit controls the response using the input information stored in the storage unit in association with the sound source direction, based on recognition of a sound source direction whose difference from the sound source direction stored as the situation information in the storage unit is less than a predetermined reference.
(4)
The voice processing device according to any one of (1) to (3), in which
the storage unit stores target information indicating a target user in association with the input information, and
the response control unit controls the response using the input information associated with the target information, based on recognition of the situation associated with the input information with respect to the target user indicated by the target information stored in the storage unit.
(5)
The voice processing device according to (1), in which
the input information includes a first part indicating a situation and a second part,
the storage unit stores a recognition model obtained by machine learning using data obtained in the situation indicated by the first part, and
the response control unit controls the response using the second part, based on the situation indicated by the first part having been recognized as a situation related to the first part, using the recognition model stored in the storage unit.
(6)
The voice processing device according to (5), further including a learning unit that, when a voice indicating a situation recognized by the recognition model is input, performs the machine learning again additionally using data obtained when the voice was input, and regenerates the recognition model.
(7)
The voice processing device according to (2), in which the plurality of pieces of input information stored in the storage unit include input information in which the first part is ambiguous or in which the result of semantic recognition of the first part is unknown.
(8)
The voice processing device according to any one of (1) to (7), in which the response control unit performs, as the control of the response, control that causes an output unit to output at least part of the input information.
(9)
The voice processing device according to any one of (1) to (7), in which the response control unit performs, as the control of the response, control that causes an output unit to output an execution result of a process or action indicated by at least part of the input information.
(10)
The voice processing device according to any one of (1) to (9), further including:
a speech recognition unit that recognizes the voice input to the sound input unit; and
a situation recognition unit that recognizes the situation.
(11)
A voice processing method including:
referring to a storage unit storing input information obtained based on recognition of a voice input to a sound input unit, and controlling, by a processor, a response using the input information based on recognition of a situation associated with the input information.
20, 21, 22 Voice processing device
232 Sound collection unit
234 Speech recognition unit
236 Sound source direction estimation unit
238 Semantic analysis unit
240 Storage unit
242 Storage unit
246 Learning unit
250 Response control unit
252 Response control unit
260 Voice output unit

Claims (11)

1. A voice processing device including:
a response control unit that refers to a storage unit storing input information obtained based on recognition of a voice input to a sound input unit, and controls a response using the input information based on recognition of a situation associated with the input information.
2. The voice processing device according to claim 1, wherein
the input information includes a first part and a second part, and
the response control unit controls the response using the second part included in the input information, based on the first part included in the input information stored in the storage unit having been obtained by the recognition of a voice and on the situation associated with the input information having been recognized.
3. The voice processing device according to claim 1, wherein
the storage unit stores, as situation information, the sound source direction of the voice when the voice was input to the sound input unit, in association with the input information, and
the response control unit controls the response using the input information stored in the storage unit in association with the sound source direction, based on recognition of a sound source direction whose difference from the sound source direction stored as the situation information in the storage unit is less than a predetermined reference.
4. The voice processing device according to claim 1, wherein
the storage unit stores target information indicating a target user in association with the input information, and
the response control unit controls the response using the input information associated with the target information, based on recognition of the situation associated with the input information with respect to the target user indicated by the target information stored in the storage unit.
5. The voice processing device according to claim 1, wherein
the input information includes a first part indicating a situation and a second part,
the storage unit stores a recognition model obtained by machine learning using data obtained in the situation indicated by the first part, and
the response control unit controls the response using the second part, based on the situation indicated by the first part having been recognized as a situation related to the first part, using the recognition model stored in the storage unit.
6. The voice processing device according to claim 5, further comprising a learning unit that, when a voice indicating a situation recognized by the recognition model is input, performs the machine learning again additionally using data obtained when the voice was input, and regenerates the recognition model.
7. The voice processing device according to claim 2, wherein the plurality of pieces of input information stored in the storage unit include input information in which the first part is ambiguous or in which the result of semantic recognition of the first part is unknown.
8. The voice processing device according to claim 1, wherein the response control unit performs, as the control of the response, control that causes an output unit to output at least part of the input information.
9. The voice processing device according to claim 1, wherein the response control unit performs, as the control of the response, control that causes an output unit to output an execution result of a process or action indicated by at least part of the input information.
10. The voice processing device according to claim 1, further comprising:
a speech recognition unit that recognizes the voice input to the sound input unit; and
a situation recognition unit that recognizes the situation.
11. A voice processing method including:
referring to a storage unit storing input information obtained based on recognition of a voice input to a sound input unit, and controlling, by a processor, a response using the input information based on recognition of a situation associated with the input information.
PCT/JP2019/007485 2018-05-31 2019-02-27 Voice processing device and voice processing method WO2019230090A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018104629A JP2021139920A (en) 2018-05-31 2018-05-31 Voice processing device and voice processing method
JP2018-104629 2018-05-31

Publications (1)

Publication Number Publication Date
WO2019230090A1 true WO2019230090A1 (en) 2019-12-05

Family

ID=68698057

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/007485 WO2019230090A1 (en) 2018-05-31 2019-02-27 Voice processing device and voice processing method

Country Status (2)

Country Link
JP (1) JP2021139920A (en)
WO (1) WO2019230090A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014048522A (en) * 2012-08-31 2014-03-17 Nippon Telegr & Teleph Corp <Ntt> Situation generation model creation apparatus and situation estimation apparatus
JP2016114395A (en) * 2014-12-12 2016-06-23 クラリオン株式会社 Voice input auxiliary device, voice input auxiliary system, and voice input method
JP2016151928A (en) * 2015-02-18 2016-08-22 ソニー株式会社 Information processing device, information processing method, and program
JP2017509917A (en) * 2014-02-19 2017-04-06 ノキア テクノロジーズ オサケユイチア Determination of motion commands based at least in part on spatial acoustic characteristics
WO2018055898A1 (en) * 2016-09-23 2018-03-29 ソニー株式会社 Information processing device and information processing method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014048522A (en) * 2012-08-31 2014-03-17 Nippon Telegr & Teleph Corp <Ntt> Situation generation model creation apparatus and situation estimation apparatus
JP2017509917A (en) * 2014-02-19 2017-04-06 ノキア テクノロジーズ オサケユイチア Determination of motion commands based at least in part on spatial acoustic characteristics
JP2016114395A (en) * 2014-12-12 2016-06-23 クラリオン株式会社 Voice input auxiliary device, voice input auxiliary system, and voice input method
JP2016151928A (en) * 2015-02-18 2016-08-22 ソニー株式会社 Information processing device, information processing method, and program
WO2018055898A1 (en) * 2016-09-23 2018-03-29 ソニー株式会社 Information processing device and information processing method

Also Published As

Publication number Publication date
JP2021139920A (en) 2021-09-16

Similar Documents

Publication Publication Date Title
US11875820B1 (en) Context driven device arbitration
US9967382B2 (en) Enabling voice control of telephone device
US11138977B1 (en) Determining device groups
US10311863B2 (en) Classifying segments of speech based on acoustic features and context
EP3676828A1 (en) Context-based device arbitration
US11509525B1 (en) Device configuration by natural language processing system
US11508378B2 (en) Electronic device and method for controlling the same
KR20190046631A (en) System and method for natural language processing
US11367443B2 (en) Electronic device and method for controlling electronic device
WO2022206602A1 (en) Speech wakeup method and apparatus, and storage medium and system
US11830502B2 (en) Electronic device and method for controlling the same
KR20210042523A (en) An electronic apparatus and Method for controlling the electronic apparatus thereof
US11575758B1 (en) Session-based device grouping
US11348579B1 (en) Volume initiated communications
US11693622B1 (en) Context configurable keywords
US20230362026A1 (en) Output device selection
KR20210042520A (en) An electronic apparatus and Method for controlling the electronic apparatus thereof
WO2019230090A1 (en) Voice processing device and voice processing method
US20210166685A1 (en) Speech processing apparatus and speech processing method
WO2019235013A1 (en) Information processing device and information processing method
US11798538B1 (en) Answer prediction in a speech processing system
US11161038B2 (en) Systems and devices for controlling network applications
US11741969B1 (en) Controlled access to device data
US10847158B2 (en) Multi-modality presentation and execution engine
WO2023202635A1 (en) Voice interaction method, and electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19811994

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19811994

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP