WO2024057381A1

WO2024057381A1 - Information processing device, information processing method, program, and recording medium

Info

Publication number: WO2024057381A1
Application number: PCT/JP2022/034147
Authority: WO
Inventors: 哲也三ツ井; 宣昭田上; 洋人河内
Original assignee: パイオニア株式会社
Priority date: 2022-09-13
Filing date: 2022-09-13
Publication date: 2024-03-21

Abstract

The purpose of the present invention is to provide an information processing device, an information processing method, a program, and a recording medium with which, on the basis of a user's movement information, determination is made as to whether the user is in a situation in which the user is likely to speak a wake word, so as to enable easier detection of the wake word even if the user is moving. This invention comprises: a movement situation detection unit 120 for determining, on the basis of a user's movement information, whether the user is in a situation in which the user is likely to speak the wake word; and a wake word detection unit 140 for making it easier to detect the wake word by modifying a threshold value for detecting the wake word if the movement situation detection unit 120 has determined that the user is in a situation in which the user is likely to speak the wake word, and detecting the wake word from audio uttered by the user.

Description

Information processing device, information processing method, program and recording medium

The present invention relates to an information processing device, an information processing method, a program, and a recording medium.

In recent years, devices such as smartphones and smart speakers that have a function of activating a voice assistant by a user speaking a wake word (also referred to as a wake-up word or hot word) have become popular.
This type of device may fail to detect the wake word correctly when it is operated in a noisy environment. To address this, a technique has been disclosed in which, for example, multiple devices are linked to properly estimate the noise level and detect the wake word (see, for example, Patent Document 1).

Patent Document 1: JP 2021-15202 Publication

However, in the technology described in Patent Document 1 mentioned above, noise in a predetermined fixed space is estimated. An example of this problem is that the wake word may not be detected if the

SUMMARY OF THE INVENTION

The present invention has been made in view of problems such as the one described above, and its purpose is to provide an information processing device, an information processing method, a program, and a recording medium that facilitate detection of a wake word even when a user is moving.

In order to solve the above problem, the invention according to claim 1 is an information processing device characterized by comprising: a movement situation detection unit that determines, based on movement information of a user, whether the user is in a situation where it is easy to utter a wake word; and a wake word detection unit that, when the movement situation detection unit determines that the user is in a situation where it is easy to utter the wake word, makes the wake word easier to detect and detects the wake word from the voice uttered by the user.

Further, the invention according to claim 5 is an information processing method for an information processing device that includes a movement situation detection unit and a wake word detection unit, the method comprising: a first step in which the movement situation detection unit determines, based on movement information of a user, whether the user is in a situation where it is easy to utter a wake word; and a second step in which, when the movement situation detection unit determines that the user is in a situation where it is easy to utter the wake word, the wake word detection unit makes the wake word easier to detect and detects the wake word from the voice uttered by the user.

Further, the invention according to claim 6 is a program for causing a computer to execute an information processing method of an information processing device that includes a movement situation detection unit and a wake word detection unit, the method comprising: a first step in which the movement situation detection unit determines, based on movement information of a user, whether the user is in a situation where it is easy to utter a wake word; and a second step in which, when the movement situation detection unit determines that the user is in a situation where it is easy to utter the wake word, the wake word detection unit makes the wake word easier to detect and detects the wake word from the voice uttered by the user.

In addition, the invention according to claim 7 is a computer-readable non-transitory recording medium on which is recorded a program for causing a computer to execute an information processing method of an information processing device that includes a movement situation detection unit and a wake word detection unit, the method comprising: a first step in which the movement situation detection unit determines, based on movement information of a user, whether the user is in a situation where it is easy to utter a wake word; and a second step in which, when the movement situation detection unit determines that the user is in a situation where it is easy to utter the wake word, the wake word detection unit makes the wake word easier to detect and detects the wake word from the voice uttered by the user.

FIG. 1 is a diagram showing the configuration of an information processing device according to a first embodiment.
FIG. 2 is a diagram showing a processing flow of the information processing device according to the first embodiment.
FIG. 3 is a diagram showing a speed determination processing flow of the information processing device according to the first embodiment.
FIG. 4 is a diagram showing the configuration of an information processing device according to a second embodiment.
FIG. 5 is a diagram showing a processing flow of the information processing device according to the second embodiment.
FIG. 6 is a diagram showing a speed determination processing flow of the information processing device according to the second embodiment.

The information processing device according to the embodiment includes a movement situation detection unit and a wake word detection unit.
The movement situation detection unit determines, based on the user's movement information, whether the user is in a situation where it is easy to utter a wake word.
For example, if there is little change in the user's movement speed, the movement situation detection unit determines that the user is moving stably and is therefore in a situation where it is easy to utter the wake word.
Further, when the movement situation detection unit determines that the user is in a situation where it is easy to utter a wake word, the wake word detection unit makes the wake word easier to detect and detects the wake word from the voice uttered by the user.
For example, the wake word detection unit makes the wake word easier to detect by changing the threshold for detecting the wake word, and then detects the wake word from the voice uttered by the user.
In other words, when the movement situation detection unit determines that the user is moving stably, the wake word detection unit lowers the threshold for detecting the wake word, which makes the wake word easier to detect.
Therefore, the wake word detection unit can easily detect the wake word even when the user is moving.

<Example 1>
An information processing device 1 according to this embodiment will be explained using FIGS. 1 to 3.

<Configuration of information processing device 1>
As shown in FIG. 1, the information processing device 1 includes a movement information acquisition unit 110, a movement situation detection unit 120, a voice acquisition unit 130, and a wake word detection unit 140.

The movement information acquisition unit 110 acquires movement information of the user.
Specifically, the movement information acquisition unit 110 is configured with, for example, a speed sensor, and acquires the user's movement speed.

Alternatively, the movement information acquisition unit 110 may acquire the current position from a navigation device, a GPS (Global Positioning System) receiver, or the like, and calculate the movement speed from it.
The movement information acquisition unit 110 acquires speed information at regular time intervals and transmits it to the movement status detection unit 120, which will be described later.
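
As an illustration of calculating the movement speed from position information, the following is a minimal sketch that assumes two timestamped latitude/longitude fixes and uses the haversine great-circle formula. The publication does not specify how the speed is derived from positions; the formula and the names `PositionFix`, `haversine_m`, and `speed_mps` are assumptions made for this example.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius used by the haversine formula


@dataclass(frozen=True)
class PositionFix:
    """One position sample (hypothetical structure, not from the publication)."""
    timestamp_s: float  # seconds since an arbitrary epoch
    lat_deg: float
    lon_deg: float


def haversine_m(a: PositionFix, b: PositionFix) -> float:
    """Great-circle distance in metres between two fixes."""
    phi1, phi2 = math.radians(a.lat_deg), math.radians(b.lat_deg)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon_deg - a.lon_deg)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def speed_mps(prev: PositionFix, curr: PositionFix) -> float:
    """Average speed in metres per second between two consecutive fixes."""
    dt = curr.timestamp_s - prev.timestamp_s
    if dt <= 0:
        raise ValueError("fixes must be in increasing time order")
    return haversine_m(prev, curr) / dt
```

In practice, raw GPS fixes are noisy, so a real device would likely smooth the positions or the resulting speeds before passing them on.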

The movement situation detection unit 120 determines, based on the user's movement information, whether the user is in a situation where it is easy to utter a wake word.
Specifically, the movement situation detection unit 120 refers to the user's movement information received from the movement information acquisition unit 110 and, when there is little change in the user's movement speed, determines that the user is in a situation where it is easy to utter a wake word.

More specifically, the movement situation detection unit 120 continuously acquires speed information from the movement information acquisition unit 110 and, for example, when it confirms that the change in movement speed is within 5%, determines that there is little change in speed and that the user is in a situation where it is easy to utter a wake word.
The movement status detection unit 120 transmits the determined result to a wake word detection unit 140, which will be described later.
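
The speed-change determination described above might be sketched as follows. The publication gives only the example figure of 5%; how the change is measured is not specified, so this sketch assumes it is the spread (fastest minus slowest) of the most recent samples relative to their mean. The window length of five samples and the class name `SpeedStabilityChecker` are likewise assumptions.

```python
from collections import deque


class SpeedStabilityChecker:
    """Judges whether recent speed samples vary by no more than a relative limit."""

    def __init__(self, window: int = 5, max_relative_change: float = 0.05) -> None:
        if window < 2:
            raise ValueError("window must hold at least two samples")
        self._window = window
        self._max_relative_change = max_relative_change
        self._speeds: deque = deque(maxlen=window)  # oldest samples drop out automatically

    def add_speed(self, speed: float) -> None:
        """Record one non-negative speed sample taken at a regular interval."""
        self._speeds.append(speed)

    def is_easy_to_utter(self) -> bool:
        """True when the speed change over the window is within the limit."""
        if len(self._speeds) < self._window:
            return False  # not enough history yet: fall back to the stricter case
        slowest, fastest = min(self._speeds), max(self._speeds)
        mean = sum(self._speeds) / len(self._speeds)
        if mean == 0:
            return True  # all samples are zero, so the user is stationary
        return (fastest - slowest) / mean <= self._max_relative_change
```

Returning `False` until the window is full is a design choice of this sketch: with too little history, the stricter threshold is the safer default.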

The voice acquisition unit 130 acquires the voice uttered by the user using, for example, a microphone connected to the voice acquisition unit 130, and stores the voice in a storage unit (not shown).
The voice acquisition unit 130 transmits information that the voice uttered by the user has been stored in the storage unit to the wake word detection unit 140, which will be described later.

When the movement situation detection unit 120 determines that the user is in a situation where it is easy to utter a wake word, the wake word detection unit 140 makes the wake word easier to detect and detects the wake word from the voice uttered by the user.

Here, the wake word recognition rate is used, for example, to detect the wake word.
The wake word recognition rate is, for example, a value indicating how much a word acquired from the voice uttered by the user matches a comparative wake word stored in advance in a storage unit (not shown).
The wake word detection unit 140 calculates a recognition rate from the voice uttered by the user, and compares the calculated recognition rate with a predetermined threshold set for detecting a wake word.
Then, if the calculated recognition rate is greater than or equal to a predetermined threshold, the wake word detection unit 140 detects the voice uttered by the user as a wake word.

Further, the wake word detection unit 140 changes the threshold value based on whether the user is in a situation where it is easy to utter a wake word, thereby making it easier to detect the wake word.
Specifically, when the wake word detection unit 140 receives information indicating that the user is in a situation where it is difficult to utter a wake word, it sets the threshold to, for example, 80%.
On the other hand, when the wake word detection unit 140 receives information indicating that the user is in a situation where it is easy to utter a wake word, it sets the threshold to, for example, 50%.
In the latter case, the wake word detection unit 140 can detect the wake word more easily than when it has received information indicating that it is difficult for the user to utter the wake word.
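
The threshold switching described above can be illustrated with a short sketch. The 80% and 50% values are the examples given in the text; the function and constant names are hypothetical, and the recognition rate is assumed to be expressed as a fraction between 0 and 1.

```python
THRESHOLD_DIFFICULT = 0.80  # example value from the text: hard-to-utter situation
THRESHOLD_EASY = 0.50       # example value from the text: easy-to-utter situation


def select_threshold(easy_to_utter: bool) -> float:
    """Pick the detection threshold from the movement situation result."""
    return THRESHOLD_EASY if easy_to_utter else THRESHOLD_DIFFICULT


def is_wake_word(recognition_rate: float, easy_to_utter: bool) -> bool:
    """Detect the wake word when the rate is at or above the selected threshold."""
    return recognition_rate >= select_threshold(easy_to_utter)
```

With these values, an utterance scoring 0.65 would be accepted while the user is moving stably and rejected otherwise.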

<Processing of information processing device 1>
The processing of the information processing device 1 will be explained using FIGS. 2 and 3.

As shown in FIG. 2, the wake word detection unit 140 starts determination processing A (step S110).

As shown in FIG. 3, the movement information acquisition unit 110 acquires speed information from a speed sensor or the like, and transmits the speed information to the movement status detection unit 120 (step S510).

Based on the speed information received from the movement information acquisition unit 110, the movement status detection unit 120 determines whether or not the speed change is small (step S520).

If the movement situation detection unit 120 determines that the speed change is large (“NO” in step S520), it determines that the user is in a situation where it is difficult to utter a wake word, transmits the determination result to the wake word detection unit 140 (step S530), and moves the process to step S540.

The wake word detection unit 140 sets the threshold TH to, for example, 80%, and ends the determination process A (step S540).
The threshold value TH is used for comparison with the recognition rate in step S140 described later.

On the other hand, if the movement situation detection unit 120 determines that the speed change is small (“YES” in step S520), it determines that the user is in a situation where it is easy to utter a wake word, and transmits the determination result to the wake word detection unit 140 (step S550).

The wake word detection unit 140 sets the threshold value TH to, for example, 50%, and ends the determination process A (step S560).
That is, in determination process A (step S110), a process is executed to determine whether the user is in a situation where it is easy to utter a wake word, and to determine the value of the threshold value TH based on the determination result.

As shown in FIG. 2, the wake word detection unit 140 determines whether the user has spoken (step S120).
When the wake word detection unit 140 receives information from the voice acquisition unit 130 that the voice uttered by the user has been stored in the storage unit, it determines that the voice has been uttered by the user.

If the wake word detection unit 140 determines that the user is not speaking (“NO” in step S120), the process returns to step S110.

On the other hand, if the wake word detection unit 140 determines that the user is speaking (“YES” in step S120), it calculates the wake word recognition rate by comparing the voice uttered by the user with the comparison wake word stored in the storage unit (step S130).

The wake word detection unit 140 determines whether the calculated recognition rate is greater than or equal to the threshold TH set in step S110 (step S140).

If the calculated recognition rate is less than the threshold TH (“NO” in step S140), the wake word detection unit 140 returns the process to step S110 (step S140).

On the other hand, if the calculated recognition rate is equal to or higher than the threshold TH ("YES" in step S140), the wake word detection unit 140 detects the voice uttered by the user as a wake word and ends the process (step S150).
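
The flow of steps S110 to S150 described above might be sketched as a loop over processing cycles. To keep the example self-contained, each cycle is modeled as a pair of already-computed inputs (whether the speed change is small, and the recognition rate or `None` when the user has not spoken); this modeling, the type alias `Cycle`, and the function names are assumptions, not part of the publication.

```python
from typing import Iterable, Optional, Tuple

# (speed change is small?, recognition rate, or None when there is no utterance)
Cycle = Tuple[bool, Optional[float]]


def determination_process_a(speed_change_is_small: bool) -> float:
    """Steps S520 to S560: choose threshold TH (example values from the text)."""
    return 0.50 if speed_change_is_small else 0.80


def detect_wake_word(cycles: Iterable[Cycle]) -> Optional[int]:
    """Return the index of the cycle in which the wake word is detected, else None."""
    for index, (speed_change_is_small, recognition_rate) in enumerate(cycles):
        th = determination_process_a(speed_change_is_small)  # step S110
        if recognition_rate is None:  # "NO" in step S120: back to S110
            continue
        if recognition_rate >= th:  # "YES" in step S140
            return index  # step S150: wake word detected
        # "NO" in step S140: back to S110 on the next cycle
    return None
```

Note that TH is re-evaluated on every cycle, so the same utterance score can be rejected in one cycle and accepted in a later one once the movement has stabilized.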

The information processing device 1 according to this embodiment includes a movement situation detection unit 120 and a wake word detection unit 140.
The movement situation detection unit 120 determines, based on the user's movement information, whether the user is in a situation where it is easy to utter a wake word; if the change in the user's movement speed is small, it determines that the user is in such a situation.
Further, if the movement situation detection unit 120 determines that the user is in a situation where it is easy to utter a wake word, the wake word detection unit 140 makes the wake word easier to detect and detects the wake word from the voice uttered by the user.
That is, when the movement situation detection unit 120 determines that the user is moving stably, the wake word detection unit 140 makes the wake word easier to detect before detecting it.
Therefore, the wake word can be easily detected even when the user is moving.

Further, in the information processing device 1 according to the present embodiment, when the movement situation detection unit 120 determines that the user is in a situation where it is easy to utter a wake word, the wake word detection unit 140 makes the wake word easier to detect by changing the threshold for detecting the wake word.
In other words, if the movement situation detection unit 120 determines that the user is in a situation where it is easy to utter the wake word, the wake word detection unit 140 makes the wake word easier to detect by lowering the threshold with which the calculated recognition rate is compared.
Therefore, the wake word can be easily detected even when the user is moving.

<Example 2>
The information processing device 1A according to this embodiment will be explained using FIGS. 4 to 6.

<Configuration of information processing device 1A>
As shown in FIG. 4, the information processing device 1A includes a movement information acquisition unit 110, a movement situation detection unit 120, a voice acquisition unit 130, a wake word detection unit 140A, and a communication unit 210.
It should be noted that the constituent elements denoted by the same reference numerals as those in the first embodiment have the same functions, and therefore detailed explanations thereof will be omitted.

When the movement situation detection unit 120 determines that the user is in a situation where it is easy to utter a wake word, the wake word detection unit 140A transmits the audio data uttered by the user to the server 900 (described later) via the communication unit 210 (described later), and detects the wake word based on the determination made by the server 900.

The wake word detection unit 140A calculates a recognition rate from the voice uttered by the user, and compares the calculated recognition rate with a predetermined threshold set for detecting a wake word.
If the calculated recognition rate is equal to or higher than the set threshold, the wake word detection unit 140A detects the voice uttered by the user as a wake word.
Specifically, the wake word detection unit 140A sets the threshold to 80%, for example, and detects the voice uttered by the user as the wake word when the calculated recognition rate is equal to or higher than the threshold.
On the other hand, if the calculated recognition rate is less than the set threshold, the wake word detection unit 140A transmits the voice uttered by the user to the server 900, which will be described later.
When the wake word detection unit 140A receives information from the server 900 that the voice uttered by the user has been detected as the wake word, the wake word detection unit 140A detects the voice uttered by the user as the wake word.

Note that communication between the wake word detection unit 140A and the server 900 is performed via a communication unit 210 configured with, for example, a communication module that can be connected to the Internet.

The server 900 is a server on the cloud, and is equipped with high-performance speech recognition processing that has higher wake word detection accuracy than the wake word detection unit 140A.
Therefore, the server 900 has the ability to calculate a higher recognition rate than the recognition rate calculated by the wake word detection unit 140A.

The server 900 calculates the recognition rate from the user's voice transmitted from the wake word detection unit 140A, and detects the wake word based on a threshold for detecting the wake word (server threshold).
When the calculated recognition rate is equal to or higher than the server threshold, the server 900 transmits information indicating that the voice uttered by the user is a wake word to the wake word detection unit 140A.

<Processing of information processing device 1A>
The processing of the information processing device 1A will be explained using FIGS. 5 and 6.

As shown in FIG. 5, the wake word detection unit 140A starts determination processing B (step S210).

As shown in FIG. 6, the movement information acquisition unit 110 acquires speed information from a speed sensor or the like, and transmits the speed information to the movement status detection unit 120 (step S610).

The movement status detection unit 120 determines whether the speed change is small based on the speed information received from the movement information acquisition unit 110 (step S620).

If the movement situation detection unit 120 determines that the speed change is large (“NO” in step S620), it determines that the user is in a situation where it is difficult to utter a wake word, transmits the determination result to the wake word detection unit 140A, and ends determination process B (step S630).

On the other hand, if the movement situation detection unit 120 determines that the speed change is small (“YES” in step S620), it determines that the user is in a situation where it is easy to utter a wake word, transmits the determination result to the wake word detection unit 140A, and ends determination process B (step S640).
That is, in determination process B (step S210), a process is executed to determine whether the user is in a situation where it is easy to utter a wake word.

As shown in FIG. 5, the wake word detection unit 140A determines whether the user has spoken (step S220).

When the wake word detection unit 140A determines that the user is not speaking ("NO" in step S220), the process returns to step S210.

On the other hand, if the wake word detection unit 140A determines that the user is speaking ("YES" in step S220), it compares the voice uttered by the user with the comparison wake word and calculates the wake word recognition rate (step S230).

The wake word detection unit 140A determines whether the calculated recognition rate is greater than or equal to a threshold of, for example, 80% (step S240).

If the calculated recognition rate is equal to or higher than the threshold of 80% (“YES” in step S240), the wake word detection unit 140A detects the voice uttered by the user as a wake word and ends the process (step S290).

On the other hand, if the calculated recognition rate is less than the threshold of 80% ("NO" in step S240), the wake word detection unit 140A moves the process to step S250.

The wake word detection unit 140A determines, based on the result of determination process B, whether the user is in a situation where it is easy to utter a wake word (step S250).

If the determination result in step S210 is that it is difficult to utter the wake word ("NO" in step S250), the wake word detection unit 140A returns the process to step S210.

On the other hand, if the determination result in step S210 is that it is easy to utter the wake word ("YES" in step S250), the wake word detection unit 140A moves the process to step S260.

The wake word detection unit 140A checks whether the calculated recognition rate is greater than or equal to a threshold of, for example, 50% (step S260).
If the calculated recognition rate is less than the threshold of 50% (“NO” in step S260), the wake word detection unit 140A returns the process to step S210.
If the recognition rate is less than the threshold of 50%, it is difficult even for the server 900 to recognize the wake word, so the voice uttered by the user is not transmitted to the server 900.

On the other hand, if the calculated recognition rate is equal to or higher than the threshold of 50% (“YES” in step S260), the wake word detection unit 140A transmits the voice uttered by the user to the server 900 via the communication unit 210 and moves the process to step S270.

The server 900 calculates the recognition rate of the voice uttered by the user transmitted from the wake word detection unit 140A (step S270).

The server 900 compares the calculated recognition rate with the server threshold, and if the calculated recognition rate is less than the server threshold, the server 900 transmits information to that effect to the wake word detection unit 140A.
On the other hand, if the calculated recognition rate is greater than or equal to the server threshold, the server 900 transmits information to the effect that the recognition rate is greater than or equal to the server threshold to the wake word detection unit 140A (step S280).

When the wake word detection unit 140A receives information from the server 900 that the calculated recognition rate is less than the server threshold (“NO” in step S280), the process returns to step S210.
On the other hand, if the wake word detection unit 140A receives information from the server 900 that the calculated recognition rate is equal to or higher than the server threshold (“YES” in step S280), the wake word detection unit 140A detects the voice uttered by the user as a wake word and ends the process (step S290).
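
The decision path of steps S240 to S290 for a single utterance might be sketched as follows. The 80% and 50% values are the examples from the text. The server is represented by a caller-supplied function that returns whether its recognition rate met the server threshold; that callback, its name `ask_server`, and the other identifiers are assumptions, and no real network API is implied.

```python
from typing import Callable

LOCAL_THRESHOLD = 0.80    # example value from the text (step S240)
FORWARD_THRESHOLD = 0.50  # example value from the text (step S260)


def detect_with_server_fallback(
    local_rate: float,
    easy_to_utter: bool,
    ask_server: Callable[[], bool],
) -> bool:
    """Return True when the utterance is detected as the wake word."""
    if local_rate >= LOCAL_THRESHOLD:   # "YES" in step S240
        return True                     # step S290: detected locally
    if not easy_to_utter:               # "NO" in step S250
        return False                    # back to S210 without contacting the server
    if local_rate < FORWARD_THRESHOLD:  # "NO" in step S260
        return False                    # too unlikely: the audio is not transmitted
    return ask_server()                 # steps S270 and S280: server decides
```

Passing the server check in as a function keeps the sketch testable offline and makes it visible that the server is contacted only on the one branch where the local result is borderline and the user is moving stably.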

The information processing device 1A according to the present embodiment includes a movement status detection section 120, a wake word detection section 140A, and a communication section 210.
When the moving situation detection unit 120 determines that the user is likely to utter a wake word, the wake word detection unit 140A transmits the voice uttered by the user to the server 900 via the communication unit 210.
The server 900 detects the wake word using high-performance speech recognition processing with high detection accuracy provided in the server 900, and transmits the detection result to the wake word detection unit 140A.
Then, the wake word detection unit 140A detects a wake word based on the detection result sent from the server 900.
In other words, the wake word detection unit 140A detects the wake word based on the detection result of the high performance speech recognition processing with high detection accuracy performed by the server 900.
Therefore, the wake word can be easily detected even when the user is moving.

<Other Examples>
In the information processing devices 1 and 1A described above, when the user is walking and speed information cannot be acquired from a vehicle, the speed may be calculated from current position information acquired from a smartphone or the like, and it may then be determined whether the user is in a situation where it is easy to utter a wake word.
Thereby, the wake word can be easily detected even when the user is walking.

Furthermore, in the information processing devices 1 and 1A described above, the movement situation detection unit 120 determines whether the user is in a situation where it is easy to utter a wake word based on the speed information, but the determination may instead be based on acceleration.
That is, the movement situation detection unit 120 determines the change in the user's speed based on the acceleration information, and if the change in speed is small, it determines that the user is in a situation where it is easy to utter the wake word.
Therefore, the wake word can be easily detected even when the user is moving.
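
One way the acceleration-based determination might look is sketched below. The publication does not say how the speed change is derived from acceleration, so this example assumes evenly spaced acceleration samples, accumulates them into a running speed change, and compares the largest excursion with a limit. The limit of 0.5 m/s and all names are hypothetical.

```python
from itertools import accumulate
from typing import Sequence


def speed_excursion_mps(accel_mps2: Sequence[float], dt_s: float) -> float:
    """Largest spread of the running speed change over the window, in m/s."""
    # Running sum of a*dt, starting from 0 (the speed at the start of the window).
    running = [0.0, *accumulate(a * dt_s for a in accel_mps2)]
    return max(running) - min(running)


def is_easy_to_utter(
    accel_mps2: Sequence[float],
    dt_s: float,
    max_excursion_mps: float = 0.5,  # hypothetical limit, not from the publication
) -> bool:
    """True when the speed stayed within the allowed excursion over the window."""
    return speed_excursion_mps(accel_mps2, dt_s) <= max_excursion_mps
```

Using the excursion rather than the net change means that accelerating and then braking back to the original speed still counts as a large change.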

Further, in the information processing devices 1 and 1A described above, the movement situation detection unit 120 determines whether the user is in a situation where it is easy to utter a wake word based on the speed information, but the determination may instead be based on biological information such as heart rate or respiratory rate.
Specifically, biometric information such as heart rate, heart rate variability, and breathing rate is acquired from a smartwatch worn by the user.
If "heart rate is not fast", "there is no heart rate fluctuation", or "breathing is not fast", it may be determined that the user is not in a nervous state and is therefore in a situation where it is easy to utter the wake word.
That is, when the movement situation detection unit 120 determines that the user is not nervous based on the biological information, it determines that the user is in a situation where it is easy to utter the wake word.
Therefore, if the user is not nervous even when moving, the wake word can be easily detected.
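
The biometric determination might be sketched as follows. The publication names the three conditions but gives no numeric limits, so every limit below is a hypothetical placeholder, as are the type and function names. The conditions are combined with "or", following the wording of the text; requiring all three would be a stricter alternative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiometricSample:
    """Values assumed to come from a wearable such as a smartwatch."""
    heart_rate_bpm: float
    heart_rate_fluctuation_bpm: float  # spread of heart rate over a recent window
    breathing_rate_per_min: float


@dataclass(frozen=True)
class CalmLimits:
    """Hypothetical limits; the publication does not specify any values."""
    max_heart_rate_bpm: float = 100.0
    max_fluctuation_bpm: float = 10.0
    max_breathing_rate_per_min: float = 20.0


def is_easy_to_utter(sample: BiometricSample, limits: CalmLimits = CalmLimits()) -> bool:
    """True when the user is judged not to be nervous."""
    checks = (
        sample.heart_rate_bpm <= limits.max_heart_rate_bpm,                  # heart rate is not fast
        sample.heart_rate_fluctuation_bpm <= limits.max_fluctuation_bpm,     # no heart rate fluctuation
        sample.breathing_rate_per_min <= limits.max_breathing_rate_per_min,  # breathing is not fast
    )
    return any(checks)  # "or" combination, as worded in the text
```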

Although the embodiments of the present invention have been described above in detail with reference to the drawings, the specific configuration is not limited to these embodiments, and includes designs within the scope of the gist of the present invention.

1: Information processing device
1A: Information processing device
110: Movement information acquisition unit
120: Movement situation detection unit
140: Wake word detection unit
140A: Wake word detection unit
210: Communication unit
900: Server

Claims

1. An information processing device comprising:
a movement situation detection unit that determines, based on movement information of a user, whether the user is in a situation where it is easy to utter a wake word; and
a wake word detection unit that, when the movement situation detection unit determines that the user is in a situation where it is easy to utter the wake word, makes the wake word easier to detect and detects the wake word from the voice uttered by the user.
2. The information processing device according to claim 1, wherein the movement situation detection unit determines that the user is in a situation where it is easy to utter the wake word when there is little change in the movement speed of the user.
3. The information processing device according to claim 1 or 2, wherein the wake word detection unit makes the wake word easier to detect by changing a threshold value for detecting the wake word.
4. The information processing device according to any one of claims 1 to 3, further comprising a communication unit that transmits and receives data to and from a server on the cloud,
wherein, when the movement situation detection unit determines that the user is in a situation where it is easy to utter the wake word, the wake word detection unit transmits audio data uttered by the user to the server via the communication unit and detects the wake word based on a determination made by the server.
5. An information processing method for an information processing device that includes a movement situation detection unit and a wake word detection unit, the method comprising:
a first step in which the movement situation detection unit determines, based on movement information of a user, whether the user is in a situation where it is easy to utter a wake word; and
a second step in which, when the movement situation detection unit determines that the user is in a situation where it is easy to utter the wake word, the wake word detection unit makes the wake word easier to detect and detects the wake word from the voice uttered by the user.
6. A program for causing a computer to execute an information processing method of an information processing device that includes a movement situation detection unit and a wake word detection unit, the method comprising:
a first step in which the movement situation detection unit determines, based on movement information of a user, whether the user is in a situation where it is easy to utter a wake word; and
a second step in which, when the movement situation detection unit determines that the user is in a situation where it is easy to utter the wake word, the wake word detection unit makes the wake word easier to detect and detects the wake word from the voice uttered by the user.
7. A computer-readable non-transitory recording medium on which is recorded a program for causing a computer to execute an information processing method of an information processing device that includes a movement situation detection unit and a wake word detection unit, the method comprising:
a first step in which the movement situation detection unit determines, based on movement information of a user, whether the user is in a situation where it is easy to utter a wake word; and
a second step in which, when the movement situation detection unit determines that the user is in a situation where it is easy to utter the wake word, the wake word detection unit makes the wake word easier to detect and detects the wake word from the voice uttered by the user.