WO2019230090A1

WO2019230090A1 - Voice processing device and voice processing method

Info

Publication number: WO2019230090A1
Application number: PCT/JP2019/007485
Authority: WO
Inventors: 寛黒田
Original assignee: ソニー株式会社
Priority date: 2018-05-31
Filing date: 2019-02-27
Publication date: 2019-12-05
Also published as: JP2021139920A

Abstract

[Problem] To provide a voice processing device which improves the accuracy of a response to a user. [Solution] This voice processing device is provided with a response control unit which refers to a storage unit that stores input information obtained on the basis of the recognition of a voice input to a sound input unit, and controls a response by using the input information on the basis of the recognition of the situation related to the input information.

Description

Audio processing apparatus and audio processing method

This disclosure relates to a voice processing device and a voice processing method.

In recent years, voice processing devices having a voice agent function have been widely used. The voice agent function is a function that analyzes the meaning of the voice spoken by the user and executes processing according to the meaning obtained by the analysis. For example, a voice processing device having a voice agent function can execute answering a question from a user, adding a schedule, setting a timer, and the like.

Regarding the voice agent function, Patent Document 1 discloses a method in which a user registers an abbreviation in a voice processing device in advance and calls a complex word using the abbreviation. Patent Document 2 discloses a technique for understanding the meaning of a user's utterance using context information when the user utters a voice.

JP 2016-114395 A JP2015-122104A

However, regarding the method described in Patent Document 1, even if the same word as a word registered in advance is uttered by the user, the meaning embedded in the word may differ depending on the situation. In the technique described in Patent Document 2, there is room for improvement in accuracy because context information is not registered in advance and only context information when a user speaks is used as context information.

Therefore, the present disclosure proposes a new and improved speech processing apparatus and speech processing method capable of improving the accuracy of response to the user.

According to the present disclosure, referring to the storage unit storing the input information obtained based on the recognition of the voice input to the sound input unit, based on the fact that the situation associated with the input information is recognized, There is provided a voice processing device including a response control unit that controls a response using the input information.

Further, according to the present disclosure, the processor recognizes the situation associated with the input information by referring to the storage unit that stores the input information obtained based on the recognition of the voice input to the sound input unit. On the basis of the above, there is provided a voice processing method including controlling a response using the input information.

As described above, according to the present disclosure, it is possible to improve the accuracy of response to the user. Note that the above effects are not necessarily limited, and any of the effects shown in the present specification, or other effects that can be grasped from the present specification, together with or in place of the above effects. May be played.

It is explanatory drawing which shows the outline | summary of the audio processing apparatus 20 by embodiment of this indication. It is explanatory drawing which shows the structure of the audio | voice processing apparatus 21 by 1st Embodiment. It is explanatory drawing which shows the utilization form of the audio processing apparatus. It is explanatory drawing which shows the specific example of the information memorize | stored. It is explanatory drawing which shows the utilization form of the audio processing apparatus. It is a flowchart which shows operation | movement of the audio | voice processing apparatus 21 by 1st Embodiment. It is explanatory drawing which shows the application example of 1st Embodiment. It is explanatory drawing which shows the structure of the audio | voice processing apparatus 22 by 2nd Embodiment. It is explanatory drawing which shows the specific example of the information memorize | stored. It is explanatory drawing which shows the specific example of the utilization form of a recognition model. It is a flowchart which shows schematic operation | movement of the audio | voice processing apparatus 22 by 2nd Embodiment. It is a flowchart which shows the production | generation method of a recognition model. It is a flowchart which shows the response control using a recognition model. It is a flowchart which shows the application example of 2nd Embodiment. 2 is an explanatory diagram showing a hardware configuration of a sound processing device 20. FIG.

Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In addition, in this specification and drawing, about the component which has the substantially same function structure, duplication description is abbreviate | omitted by attaching | subjecting the same code | symbol.

In the present specification and drawings, a plurality of constituent elements having substantially the same functional configuration may be distinguished by adding different alphabets after the same reference numeral. However, when it is not necessary to particularly distinguish each of a plurality of constituent elements having substantially the same functional configuration, only the same reference numeral is given to each of the plurality of constituent elements.

Moreover, this indication is demonstrated according to the item order shown below.
0. Outline of speech processing device 1. First embodiment 1-1. Configuration of speech processing apparatus according to first embodiment 1-2. Operation of the speech processing apparatus according to the first embodiment 1-3. Function and effect 1-4. 1. Application example of first embodiment Second embodiment 2-1. Configuration of speech processing apparatus according to second embodiment 2-2. Operation of the speech processing apparatus according to the second embodiment 2-3. Effect 2-4. 2. Application example of second embodiment Hardware configuration Conclusion

<0. Outline of speech processing device>
First, an overview of a speech processing apparatus according to an embodiment of the present disclosure will be described with reference to FIG.

FIG. 1 is an explanatory diagram showing an overview of a voice processing device 20 according to an embodiment of the present disclosure. As shown in FIG. 1, the audio processing device 20 is disposed in a house as an example. The speech processing device 20 has a speech agent function that analyzes the meaning of speech uttered by the user of the speech processing device 20 and executes processing according to the meaning obtained by the analysis.

For example, the voice processing device 20 stores a keyword and response information in association with each other, the user's voice has a meaning of requesting a response of information, and the stored keyword is recognized from the voice. A response to the user is executed using the response information stored in association with the keyword. In the example illustrated in FIG. 1, the voice processing device 20 stores the keyword “A company's pasta” and the response information “boiled for 5 minutes” in association with each other. , The voice processing device 20 responds to the user with a voice saying, “Boiled time of pasta from Company A is 5 minutes” using response information “Boiled time is 5 minutes”. .

In FIG. 1, a stationary device is shown as the speech processing device 20, but the speech processing device 20 is not limited to a stationary device. For example, the voice processing device 20 may be a portable information processing device such as a smartphone, a mobile phone, a PHS (Personal Handyphone System), a portable music playback device, a portable video processing device, or a portable game device. An autonomous mobile robot may also be used.

Here, there can be keywords with ambiguity. For this reason, it is assumed that different response information is associated with the same keyword. For example, the keyword “Pasta of Company A” can be associated with response information indicating the above-described recipe, or can be associated with response information indicating the storage location. For this reason, the device for extracting the response information according to a user's intention from the keyword contained in an audio | voice was desired.

The inventor of the present invention has created the speech processing apparatus 20 according to the embodiment of the present disclosure with the above circumstances taken into consideration. The voice processing device 20 according to the embodiment of the present disclosure can improve the accuracy of a response to voice input by extracting response information that matches a user's intention from a keyword included in the voice. Hereinafter, several embodiments of the present disclosure will be sequentially described in detail.

<1. First Embodiment>
The voice processing device 20 according to the first embodiment stores the keyword, the sound source direction, and the response information in association with each other, and based on the fact that the voice including the keyword has arrived from the sound source direction corresponding to the stored sound source direction, A response is executed using the response information. The keyword and the response information are examples of input information obtained based on the recognition of the input voice, and the sound source direction is an example of situation information indicating a situation related to generation of sound or voice. According to the voice processing device 20 according to the first embodiment, even when different response information is associated with the same keyword, the response information according to the user's intention is obtained based on the sound source direction of the voice including the keyword. It is possible to extract.

In the following, in order to distinguish the voice processing device 20 according to each embodiment, the voice processing device according to the first embodiment is referred to as a voice processing device 21, and the voice processing device according to the second embodiment is referred to as a voice processing device 22. May be called. The audio processing device 21 and the audio processing device 22 according to each embodiment may be simply referred to as the audio processing device 20 in some cases.

<< 1-1. Configuration of Speech Processing Device according to First Embodiment >>
FIG. 2 is an explanatory diagram showing the configuration of the audio processing device 21 according to the first embodiment. As shown in FIG. 2, the speech processing apparatus 21 according to the first embodiment includes a sound collection unit 232, a speech recognition unit 234, a sound source direction estimation unit 236, a semantic analysis unit 238, a storage unit 240, a response control unit 250, and An audio output unit 260 is included.

(Sound collector)
The sound collection unit 232 has a function of a sound input unit that acquires an electrical sound signal from aerial vibration including environmental sound and sound. The sound collection unit 232 outputs the acquired sound signal to the voice recognition unit 234 and the sound source direction estimation unit 236.

(Voice recognition unit)
The voice recognition unit 234 detects a voice signal from the sound signal input from the sound collection unit 232, recognizes the voice signal, and obtains a character string representing the voice uttered by the user.

(Sound source direction estimation unit)
The sound source direction estimation unit 236 is an example of a situation recognition unit, and estimates the sound source direction of the sound that has reached the sound collection unit 232. When the sound collection unit 232 includes a plurality of sound collection elements, the sound source direction estimation unit 236 estimates the sound source direction based on the phase difference between the sound signals obtained by the sound collection elements. When the sound signal is an audio signal, the direction of the user viewed from the audio processing device 21 is estimated as the sound source direction.

(Semantic analysis unit, storage unit)
The semantic analysis unit 238 analyzes the meaning of the character string input from the voice recognition unit 234. The semantic analysis method may be any one of a method of realizing semantic analysis using machine learning after creating an utterance corpus, a method of realizing semantic analysis with rules, or a combination thereof. In addition, morphological analysis, which is a part of semantic analysis processing, has a mechanism for assigning attributes in units of words and maintains a dictionary therein. The semantic analysis unit 238 uses the mechanism and the dictionary for assigning this attribute to determine what kind of word the word included in the utterance is, for example, a person name, a place name, or a general noun. It is possible to grant.

And the semantic analysis unit 238 performs different processing according to the meaning obtained by the analysis. For example, when the meaning obtained by the analysis has the meaning of requesting storage of information, the semantic analysis unit 238 extracts a keyword as the first part from the character string input from the speech recognition unit 234, and the second Response information corresponding to the keyword is extracted as a part of. Then, the semantic analysis unit 238 stores the extracted keyword, response information, and the sound source direction input from the sound source direction estimation unit 236 in association with each other in the storage unit 240. A specific example of storing such information will be described below with reference to FIGS.

FIG. 3 is an explanatory diagram showing a usage pattern of the voice processing device 21. In the example shown in FIG. 3, at a position P1 included in the kitchen area A1, the user utters a voice saying “Remember the time of 5 minutes for cooking with A company's pasta”. In this case, the semantic analysis unit 238 inputs “pastor of company A” as a keyword from the character string “store the pasting time of pasta of company A as 5 minutes” input from the speech recognition unit 234. Extract “boiled time is 5 minutes” as response information. On the other hand, the sound source direction estimation unit 236 estimates 30 °, which is an angle formed by the direction of the position P1 viewed from the sound processing device 21 and the reference direction d of the sound processing device 21, as the sound source direction. For this reason, as shown in FIG. 4, the semantic analysis unit 238 stores the keyword “A company pasta”, the sound source direction “30 °”, and the response information “boiled for 5 minutes” in the storage unit 240 in association with each other. .

FIG. 3 further illustrates an example in which the user utters a voice saying “Remember that company A's pasta has been placed on the top shelf of storage” at position P2 included in storage peripheral area A2. Show. In this case, the semantic analysis unit 238 inputs “A company A” as a keyword from the character string “store the pasta of company A placed on the top shelf of storage” input from the speech recognition unit 234. Of pasta ”and“ placed on the top shelf of storage ”are extracted as response information.

On the other hand, the sound source direction estimation unit 236 estimates −20 °, which is an angle formed by the direction of the position P2 viewed from the sound processing device 21 and the reference direction d of the sound processing device 21, as the sound source direction. For this reason, as shown in FIG. 4, the semantic analysis unit 238 associates the keyword “Pasta from Company A”, the sound source direction “−20 °”, and the response information “placed on the top shelf of storage”. The data is stored in the storage unit 240.

(Response control unit)
When the meaning obtained by the analysis of the semantic analysis unit 238 has a meaning of requesting an information response, the response control unit 250 refers to the storage unit 240 and receives the voice from the sound source direction associated with the response information. , And that the voice is recognized to include a keyword associated with the response information, the response is controlled using the response information. Here, the response control unit 250 recognizes that the sound source direction whose difference from the sound source direction associated with the response information is less than a predetermined reference is recognized as the sound from the sound source direction associated with the response information. May be treated as arrival. The predetermined reference may be a value between 10 ° and 20 °, for example.

(Audio output part)
The voice output unit 260 outputs a response voice to the user in accordance with the control from the response control unit 250. However, the audio output unit 260 is merely an example of an output unit that outputs a response, and the audio processing device 21 may include a display unit that outputs a response by display. Here, with reference to FIG. 5, a specific example of the response voice output by the voice output unit 260 will be described.

FIG. 5 is an explanatory diagram showing a usage pattern of the voice processing device 21. In the example shown in FIG. 5, the user utters the voice “What is A company's pasta?” At a position P1 included in the kitchen area A1. The sound source direction of the sound is “30 °”. Therefore, the response control unit 250 refers to the storage unit 240 and extracts response information “boil time is 5 minutes” associated with the keyword “pasta of company A” and the sound source direction “30 °”. Then, the response control unit 250 uses the extracted response information “boiled time is 5 minutes” and stores “boiled time of company A's pasta as 5 minutes” in the audio output unit 260 as shown in FIG. The response voice saying "

On the other hand, when the user utters a voice “What is Company A's pasta?” At position P2 included in the storage peripheral area A2, the response control unit 250 refers to the storage unit 240 and searches for the keyword “A company's pasta”. Response information “Placed on top shelf of storage” associated with the sound source direction “−20 °” is extracted. Then, the response control unit 250 uses the extracted response information “Placed on the top shelf for storage” to the audio output unit 260 as shown in FIG. I remember that I placed it on the top shelf. "

<< 1-2. Operation of the speech processing apparatus according to the first embodiment >>
The configuration of the voice processing device 21 according to the first embodiment has been described above. Subsequently, with reference to FIG. 6, the operation of the speech processing apparatus 21 according to the first embodiment is organized.

FIG. 6 is a flowchart showing the operation of the voice processing device 21 according to the first embodiment. First, when the voice uttered by the user is input to the sound collection unit 232 (S304), the voice recognition unit 234 recognizes the voice, and the semantic analysis unit 238 indicates the meaning of the character string input from the voice recognition unit 234. Analysis is performed (S308).

When the semantic analysis unit 238 understands that the user's request is an information response (S312 / Yes), the semantic analysis unit 238 extracts a keyword from the character string input from the voice recognition unit 234 (S316). Further, the sound source direction estimating unit 236 estimates the sound source direction of the sound (S340), and the response control unit 250 extracts response information from the storage unit 240 based on the keyword and the sound source direction of the sound (S344). And the response control part 250 produces | generates a response audio | voice using the extracted response information, and the audio | voice output part 260 outputs the said response audio | voice (S348).

On the other hand, when the semantic analysis unit 238 understands that the user's request is information storage (S312 / No, S352 / Yes), the semantic analysis unit 238 extracts keywords and response information from the character string input from the speech recognition unit 234 (S356). ). Further, the sound source direction estimation unit 236 estimates the sound source direction of the sound (S360), and the semantic analysis unit 238 associates the keyword, the response information, and the sound source direction input from the sound source direction estimation unit 236 and stores them in the storage unit 240. (S364). Thereafter, for example, the audio output unit 260 outputs a registration completion notification indicating that the storage of information is completed (S368).

<< 1-3. Effect >>
According to the first embodiment described above, various functions and effects can be obtained. For example, according to the first embodiment, different response sounds can be output based on situation recognition (in the above example, estimation of a sound source direction) for sounds including the same keyword. Therefore, appropriate response voices should be used for ambiguous voices that may have multiple meanings, such as the meaning of inquiring the time with a broom and the meaning of inquiring about the location, such as “What is the pasta of Company A?” It is possible to output.

<< 1-4. Application example >>
As an application example of the first embodiment described above, the response control unit 250 controls whether or not to output a response voice in response to the voice spoken by the user depending on whether or not the user is a target user. May be. For example, when information is stored based on voice uttered by a certain user, the storage unit 240 also stores target information indicating the user as a target user. The target information may be identification information of the target user, or may be a feature amount of the target user's voice. And the response control part 250 may output response information, when the user who uttered the audio | voice which requests | requires the response of information is a target user, and other conditions (sound source direction etc.) are satisfy | filled.

Note that the target user and the user who requested the storage of information may be different. For example, when storing information, the voice processing device 21 may instruct the user who has requested the storage of information to input the user to be set as the target user, and set the input user as the target user. For example, as shown in FIG. 7, all users can be set as target users, or a specific user (user A in the example of FIG. 7) can be set as target users.

By setting a plurality of users as the target users, such as all users, each user does not have to speak for storing information individually, so that it is possible to reduce the user's trouble. For highly versatile information, it is useful to set a plurality of users like all users as target users. On the other hand, for information specific to a certain user, it is useful to set a specific user as a target user.

<2. Second Embodiment>
Next, the audio processing device 22 according to the second embodiment of the present disclosure will be described. The voice processing device 22 according to the second embodiment can output an appropriate response voice to the user without an explicit request from the user. Hereinafter, the configuration and operation of the sound processing device 22 according to the second embodiment will be sequentially described in detail.

<< 2-1. Configuration of Speech Processing Device according to Second Embodiment >>
FIG. 8 is an explanatory diagram showing the configuration of the audio processing device 22 according to the second embodiment. As shown in FIG. 8, the speech processing apparatus 22 according to the second embodiment includes a sound collection unit 232, a speech recognition unit 234, a sound source direction estimation unit 236, a semantic analysis unit 238, a storage unit 242, a learning unit 246, a response A control unit 252 and an audio output unit 260 are included. The configurations of the sound collection unit 232, the voice recognition unit 234, the sound source direction estimation unit 236, the semantic analysis unit 238, and the voice output unit 260 are the same as those described in the first embodiment, and thus detailed description thereof is omitted here. .

Similarly to the first embodiment, the storage unit 242 stores the keyword extracted by the semantic analysis unit 238, the response information, and the sound source direction input from the sound source direction estimation unit 236 in association with each other. For example, when the user utters a voice at the dressing room saying “Remember to brush your teeth when you take a bath”, the storage unit 242 uses the keyword “Take a bath as shown in FIG. ”, The sound source direction“ 50 ° ”(the direction of the dressing room as viewed from the sound processing device 22), and the response information“ brush teeth ”are stored in association with each other. Thereby, based on the recognition of the arrival of the voice from the direction of “50 °” and the fact that the voice includes the keyword “Take a bath”, as in the first embodiment, It is possible to output a response voice saying “I remember that I brushed my teeth when I came out.”

Further, the learning unit 246 generates a recognition model for recognizing sound generated in a situation indicated by a certain keyword by machine learning, and the storage unit 242 uses the recognition model as a learning result of the machine learning by the learning unit 246. Remember. According to the recognition model, for example, when a sound generated when a bath door is closed is input, the keyword “Take a bath” can be output. As another example, the keyword “going to running” may be output when a running shoe hits the entrance floor when the running shoe is worn. With reference to FIG. 10, a specific example of the use form of such a recognition model will be described.

FIG. 10 is an explanatory diagram showing a specific example of the usage form of the recognition model. As shown in FIG. 10, when a sound of closing the bathroom door is generated at the dressing room P3, the response control unit 252 uses the keyword “Take a bath” corresponding to the sound as a recognition model. Identify based on. Then, the response control unit 252 extracts the response information “brush your teeth” associated with the keyword “Take a bath” based on the information stored in the storage unit 242 (see FIG. 9). The voice output unit 260 outputs a response voice saying “When you take a bath, you remember to brush your teeth.”

Also, as shown in FIG. 10, when the sound of running shoes hitting the floor of the entrance (ton ton) is generated at the entrance P4, the response control unit 252 uses the keyword “going to running” corresponding to the sound as a recognition model. Identify based on. Then, the response control unit 252 extracts the response information “check weather” associated with the keyword “going running” based on the information stored in the storage unit 242 (see FIG. 9), The voice output unit 260 is made to output a response voice saying “When you go running, you will remember if you check the weather”.

Here, the response control unit 252 obtains weather information via the network, and as a result of executing the process or action of “confirming the weather”, a response such as “the weather after this is rain”. Audio may be output to the audio output unit 260. According to such a configuration, since the user's trouble is reduced, the convenience for the user can be improved.

<< 2-2. Operation of the speech processing apparatus according to the second embodiment >>
Subsequently, with reference to FIGS. 11 to 13, the operation of the sound processing apparatus 22 according to the second embodiment is organized.

FIG. 11 is a flowchart showing a schematic operation of the voice processing device 22 according to the second embodiment. As illustrated in FIG. 11, the learning unit 246 generates a recognition model that recognizes sounds generated in the situation indicated by the keyword by machine learning (S410).

Then, the semantic analysis unit 238 analyzes the meaning of the speech uttered by the user, and when the user's request is storage of information, the storage unit 242 estimates the sound source direction to the keyword and response information extracted from the speech. The sound source direction of the sound estimated by the unit 236 is stored in association with each other (S430). The processing of S430 corresponds to the processing of S352 to S368 described with reference to FIG.

Thereafter, the response control unit 252 performs sound recognition using the recognition model generated by the learning unit 246, and controls output of the response sound using response information associated with the keyword specified based on the sound recognition. (S450).

Hereinafter, the process of S410 will be described more specifically with reference to FIG. 12, and the process of S450 will be described more specifically with reference to FIG.

FIG. 12 is a flowchart showing a recognition model generation method. First, when a user speaks a keyword, the voice processing device 22 instructs the user to generate a certain number of sounds generated in the situation indicated by the keyword (S411). For example, when the keyword is “I took a bath”, the user repeats opening and closing the door of the bathroom for a certain number of times. Then, the learning unit 246 stores the generated sound as positive example data for machine learning (S412).

Subsequently, the voice processing device 22 instructs the user not to generate the sound generated in the situation indicated by the keyword for a certain period of time (S413). Then, the learning unit 246 stores surrounding sounds as negative example data for machine learning (S414).

Thereafter, the learning unit 246 performs machine learning using the saved positive example data and negative example data, and generates a recognition model that recognizes the situation indicated by the keyword (S415). Then, the storage unit 242 stores the recognition model generated by the learning unit 246 (S416).

FIG. 13 is a flowchart showing response control using a recognition model. First, when a sound is input to the sound collection unit 232 (S451), the response control unit 252 determines whether or not the input sound is a sound recognized using the recognition model stored in the storage unit 242. Is determined (S452). When the input sound is a sound recognized using the recognition model (S452 / Yes), the response control unit 252 specifies a keyword corresponding to the sound based on the recognition model (S453). Further, the sound source direction estimating unit 236 estimates the sound source direction of the sound (S454).

Then, the response control unit 252 is associated with the keyword identified based on the recognition model, and is associated with the sound source direction whose difference from the sound source direction estimated by the sound source direction estimating unit 236 is less than a predetermined reference. Response information is extracted (S455). Further, the response control unit 252 generates a response sound using the extracted response information, and causes the sound output unit 260 to output the response sound (S456).

<< 2-3. Effect >>
According to the second embodiment described above, various functions and effects can be obtained. For example, according to the second embodiment, even if the user does not explicitly speak, it is appropriate for the user based on situation recognition (in the above example, sound recognition and sound source direction estimation). It is possible to output a response voice.

Also, the sound generated in a certain situation can vary depending on which user the situation is related to. For example, even when the sound of opening / closing a bathroom door is generated, the sound generated when the father opens / closes the bathroom door and the sound generated when the child opens / closes the bathroom door Can be different. For this reason, when a recognition model is generated by a certain user generating a sound a certain number of times in S411 described with reference to FIG. 12, the sound generated by another user may not be recognized by the recognition model. obtain. Therefore, according to the second embodiment, it is possible to limit the output target of response voice based on sound input to a specific user.

<< 2-4. Application example >>
As an application example of the above-described second embodiment, a method for improving the accuracy of the recognition model while using the speech processing device 22 will be described.

FIG. 14 is a flowchart showing an application example of the second embodiment, and is a flowchart of processing applied to S430 shown in FIG. Since the processing of S304 to S316 and S340 to S368 in FIG. 14 is as described with reference to FIG. 6, detailed description thereof is omitted here.

When the user's voice request is an information response (S312 / Yes) and a keyword is extracted from the voice (S316), the learning unit 246 stores the recognition model corresponding to the keyword in the storage unit 242. It is determined whether or not (S320 / Yes). Then, the learning unit 246 extracts, from the sound input to the sound collection unit 232, a sound that matches or is similar to the sound recognized by the recognition model (S324).

Further, the learning unit 246 adds the sound extracted in S324 as positive example data (S328), performs machine learning again using the positive example data group including the added positive example data, and regenerates the recognition model. (S332).

According to this application example, positive example data for machine learning is added even if the user does not utter speech for machine learning, so that the accuracy of the recognition model is improved without the user feeling troublesome. Is possible.

In the above description, an example in which sound is recognized as recognition of a situation corresponding to a keyword has been described. However, a situation corresponding to a keyword can be recognized by other methods. For example, when the voice processing device 22 includes an imaging unit, the learning unit 246 may generate a recognition model that recognizes a situation corresponding to a keyword by machine learning using a captured image obtained by the imaging unit. Here, “going to run” can correspond to the keyword, and the situation corresponding to the keyword includes the clothes of the user at the time of “going running”. The response control unit 252 identifies the keyword “going running” when the user ’s clothes at the time of “going running” are recognized based on the recognition model, and the response associated with the keyword “going running” The response output may be controlled using the information “Confirm the weather”.

<3. Hardware configuration>
The embodiments of the present disclosure have been described above. Information processing such as voice recognition and response control described above is realized by cooperation between software and hardware of the voice processing device 20 described below.

FIG. 15 is an explanatory diagram showing a hardware configuration of the voice processing device 20. As shown in FIG. 15, the voice processing device 20 includes a CPU (Central Processing Unit) 201, a ROM (Read Only Memory) 202, a RAM (Random Access Memory) 203, an input device 208, an output device 210, and the like. A storage device 211, a drive 212, an imaging device 213, and a communication device 215.

The CPU 201 functions as an arithmetic processing device and a control device, and controls the overall operation in the sound processing device 20 according to various programs. Further, the CPU 201 may be a microprocessor. The ROM 202 stores programs used by the CPU 201, calculation parameters, and the like. The RAM 203 temporarily stores programs used in the execution of the CPU 201, parameters that change as appropriate during the execution, and the like. These are connected to each other by a host bus including a CPU bus. Functions such as the speech recognition unit 234, the sound source direction estimation unit 236, the semantic analysis unit 238, the learning unit 246, and the response control unit 252 can be realized by the cooperation of the CPU 201, the ROM 202, the RAM 203, and the software.

The input device 208 includes input means for a user to input information, such as a mouse, keyboard, touch panel, button, microphone, switch, and lever, and an input control circuit that generates an input signal based on the input by the user and outputs the input signal to the CPU 201. Etc. A user of the voice processing device 20 can input various data and instruct a processing operation to the voice processing device 20 by operating the input device 208.

The output device 210 includes a display device such as a liquid crystal display (LCD) device, an OLED (Organic Light Emitting Diode) device, and a lamp. Furthermore, the output device 210 includes an audio output device such as a speaker and headphones. For example, the display device displays a captured image or a generated image. On the other hand, the audio output device converts audio data or the like into audio and outputs it.

The storage device 211 is a data storage device configured as an example of a storage unit of the audio processing device 20 according to the present embodiment. The storage device 211 may include a storage medium, a recording device that records data on the storage medium, a reading device that reads data from the storage medium, a deletion device that deletes data recorded on the storage medium, and the like. The storage device 211 stores programs executed by the CPU 201 and various data.

The drive 212 is a storage medium reader / writer, and is built in or externally attached to the audio processing device 20. The drive 212 reads information recorded on a removable storage medium 24 such as a mounted magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, and outputs it to the RAM 203. The drive 212 can also write information to the removable storage medium 24.

The imaging device 213 includes an imaging optical system such as a photographing lens and a zoom lens that collects light, and a signal conversion element such as a CCD (Charge Coupled Device) or a CMOS (Complementary Metal Oxide Semiconductor). The imaging optical system collects light emitted from the subject and forms a subject image in the signal conversion unit, and the signal conversion element converts the formed subject image into an electrical image signal.

The communication device 215 is a communication interface configured with, for example, a communication device for connecting to the network 12. The communication device 215 may be a wireless LAN (Local Area Network) compatible communication device, an LTE (Long Term Evolution) compatible communication device, or a wire communication device that performs wired communication.

The network 12 is a wired or wireless transmission path for information transmitted from a device connected to the network 12. For example, the network 12 may include a public line network such as the Internet, a telephone line network, and a satellite communication network, various LANs (Local Area Network) including the Ethernet (registered trademark), a WAN (Wide Area Network), and the like. Further, the network 12 may include a dedicated line network such as an IP-VPN (Internet Protocol-Virtual Private Network).

<4. Supplement>
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, but the technical scope of the present disclosure is not limited to such examples. It is obvious that a person having ordinary knowledge in the technical field of the present disclosure can come up with various changes or modifications within the scope of the technical idea described in the claims. Of course, it is understood that it belongs to the technical scope of the present disclosure.

For example, each step in the processing of the voice processing device 20 of the present specification does not necessarily have to be processed in time series in the order described as a flowchart. For example, each step in the processing of the voice processing device 20 may be processed in an order different from the order described as the flowchart, or may be processed in parallel.

Also, some of the functions of the voice processing device 20 described above may be implemented in a cloud server connected to the voice processing device 20 via the network 12. For example, the cloud server may have a function corresponding to the voice recognition unit 234, the sound source direction estimation unit 236, the semantic analysis unit 238, the storage unit 240, and the response control unit 250, that is, a function as a voice processing device. In this case, the audio processing device 20 transmits an audio signal to the cloud server, and the cloud server can store information or transmit response audio to the audio processing device 20.

Also, it is possible to create a computer program for causing hardware such as a CPU, ROM, and RAM incorporated in the voice processing device 20 to perform the same functions as the components of the voice processing device 20 described above. A storage medium storing the computer program is also provided.

In addition, the effects described in this specification are merely illustrative or illustrative, and are not limited. That is, the technology according to the present disclosure can exhibit other effects that are apparent to those skilled in the art from the description of the present specification in addition to or instead of the above effects.

The following configurations also belong to the technical scope of the present disclosure.
(1)
With reference to the storage unit storing the input information obtained based on the recognition of the voice input to the sound input unit, and using the input information based on the recognition of the situation associated with the input information A response control unit for controlling the response,
An audio processing apparatus comprising:
(2)
The input information includes a first part and a second part;
The response control unit is based on the fact that the first part included in the input information stored in the storage unit is obtained by speech recognition, and the situation associated with the input information is recognized. The speech processing apparatus according to (1), wherein the response is controlled using the second portion included in the input information.
(3)
In the storage unit, a sound source direction of the sound when the sound is input to the sound input unit is stored in association with the input information as situation information,
The response control unit associates the sound source direction with the sound source direction in the storage unit based on the recognition of the sound source direction whose difference from the sound source direction stored as the situation information in the storage unit is less than a predetermined reference. The speech processing apparatus according to (1) or (2), wherein a response is controlled using the stored input information.
(4)
In the storage unit, target information indicating a target user is stored in association with the input information,
The response control unit recognizes the input information associated with the target information based on the recognition of the situation associated with the input information regarding the target user indicated by the target information stored in the storage unit. The speech processing apparatus according to any one of (1) to (3), wherein the response is controlled by using.
(5)
The input information includes a first part indicating a situation, and a second part,
The storage unit stores a recognition model obtained by machine learning using data obtained in the situation indicated by the first part,
The response control unit uses the recognition model stored in the storage unit based on the fact that the situation indicated by the first part is recognized as the situation related to the first part. The speech processing apparatus according to (1), wherein the response is controlled using the part 2.
(6)
When a voice indicating a situation recognized by the recognition model is input, the voice processing device performs machine learning again using additional data obtained when the voice is input, and sets the recognition model. The speech processing apparatus according to (5), further including a learning unit that generates again.
(7)
The plurality of input information stored in the storage unit includes input information in which the first part has ambiguity or the result of the semantic recognition of the first part is unknown (2) The voice processing apparatus according to 1.
(8)
The speech processing apparatus according to any one of (1) to (7), wherein the response control unit performs control to output at least part of the input information to an output unit as control of the response.
(9)
The response control unit performs control to output an execution result of a process or action indicated by at least a part of the input information to the output unit as control of the response, any one of (1) to (7) The voice processing apparatus according to 1.
(10)
The voice processing device
A voice recognition unit for recognizing a voice input to the sound input unit;
A situation recognition unit for recognizing the situation;
The speech processing apparatus according to any one of (1) to (9), further including:
(11)
The input information is referred to by referring to a storage unit storing input information obtained based on recognition of the voice input to the sound input unit, and by a processor recognizing a situation associated with the input information. To control the response using
Including a voice processing method.

20, 21, 22 Speech processing device 232 Sound collection unit 234 Speech recognition unit 236 Sound source direction estimation unit 238 Semantic analysis unit 240 Storage unit 242 Storage unit 246 Learning unit 250 Response control unit 252 Response control unit 260 Voice output unit

Claims

With reference to the storage unit storing the input information obtained based on the recognition of the voice input to the sound input unit, and using the input information based on the recognition of the situation associated with the input information A response control unit for controlling the response,
An audio processing apparatus comprising:
The input information includes a first part and a second part;
The response control unit is based on the fact that the first part included in the input information stored in the storage unit is obtained by speech recognition, and the situation associated with the input information is recognized. The speech processing apparatus according to claim 1, wherein the response is controlled using a second portion included in the input information.
In the storage unit, a sound source direction of the sound when the sound is input to the sound input unit is stored in association with the input information as situation information,
The response control unit associates the sound source direction with the sound source direction in the storage unit based on the recognition of the sound source direction whose difference from the sound source direction stored as the situation information in the storage unit is less than a predetermined reference. The speech processing apparatus according to claim 1, wherein a response is controlled using the stored input information.
In the storage unit, target information indicating a target user is stored in association with the input information,
The response control unit recognizes the input information associated with the target information based on the recognition of the situation associated with the input information regarding the target user indicated by the target information stored in the storage unit. The voice processing apparatus according to claim 1, wherein the voice processing apparatus is used to control a response.
The input information includes a first part indicating a situation, and a second part,
The storage unit stores a recognition model obtained by machine learning using data obtained in the situation indicated by the first part,
The response control unit uses the recognition model stored in the storage unit based on the fact that the situation indicated by the first part is recognized as the situation related to the first part. The speech processing apparatus according to claim 1, wherein the response is controlled using the part 2.
When a voice indicating a situation recognized by the recognition model is input, the voice processing device performs machine learning again using additional data obtained when the voice is input, and sets the recognition model. The speech processing apparatus according to claim 5, further comprising a learning unit that generates again.
The plurality of pieces of input information stored in the storage unit include input information in which the first part has ambiguity or the result of semantic recognition of the first part is unknown. The speech processing apparatus according to the description.
The speech processing apparatus according to claim 1, wherein the response control unit performs control to output at least part of the input information to an output unit as control of the response.
The speech processing apparatus according to claim 1, wherein the response control unit performs control to output an execution result of processing or action indicated by at least part of the input information to the output unit as control of the response.
The voice processing device
A voice recognition unit for recognizing a voice input to the sound input unit;
A situation recognition unit for recognizing the situation;
The speech processing apparatus according to claim 1, further comprising:
The input information is referred to by referring to a storage unit storing input information obtained based on recognition of the voice input to the sound input unit, and by a processor recognizing a situation associated with the input information. To control the response using
Including a voice processing method.