CN111757217A - Voice input method, recording medium, and voice input device - Google Patents

Voice input method, recording medium, and voice input device

Info

Publication number
CN111757217A
CN111757217A CN202010211028.5A
Authority
CN
China
Prior art keywords
voice input
input device
user
face
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010211028.5A
Other languages
Chinese (zh)
Inventor
野村和也
古川博基
金森丈郎
杠慎一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd
Publication of CN111757217A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 Details of transducers, loudspeakers or microphones
    • H04R 1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R 1/32 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R 1/40 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R 1/406 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers microphones

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Provided are a voice input method, a recording medium, and a voice input device capable of suppressing a reduction in voice recognition performance caused by the approach of a user's face to the voice input device. The voice input method includes: a detection step (S11) of detecting whether the face of a user is close to a voice input device having at least one microphone; and a correction step (S12) of performing correction processing on the voice signal picked up by the at least one microphone when it is detected that the face of the user is close to the voice input device (YES at S11).

Description

Voice input method, recording medium, and voice input device
Technical Field
The present application relates to a voice input method, a recording medium, and a voice input device.
Background
A technique has been disclosed in the related art for picking up a speaker's voice in a voice pickup apparatus while reducing the influence of noise (for example, Patent Document 1).
(Prior art documents)
(Patent documents)
Patent Document 1: Japanese laid-open patent application No. 2010-50571
In recent years, communication systems have been developed for use on the street and the like, which translate a user's voice picked up by a voice pickup apparatus (referred to herein as a voice input device) and display the translation result on a display provided in the voice input device, or output the translation result from a speaker provided in the voice input device, so that the user can communicate with the other party. However, when voice recognition cannot be performed accurately due to surrounding noise or the like, the user brings the voice input device close to the face (specifically, the mouth), utters the voice again, and has voice recognition performed again. In this case, the face of the user is close to the voice input device, which may reduce the voice recognition performance.
Disclosure of Invention
Accordingly, the present application provides a voice input method and the like capable of suppressing a decrease in voice recognition performance due to the proximity of the face of the user to the voice input device.
The voice input method includes: a detection step of detecting whether a face of a user is close to a voice input device having at least one microphone; and a correction step of performing correction processing on the voice signal picked up by the at least one microphone in a case where it is detected that the face of the user is close to the voice input device.
The general or specific aspects can be realized by a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or by any combination of the system, the method, the integrated circuit, the computer program, and the recording medium.
With the voice input method and the like according to one aspect of the present application, it is possible to suppress a decrease in voice recognition performance caused by the proximity of the face of the user to the voice input device.
Drawings
Fig. 1 is a diagram for explaining a decrease in voice recognition performance caused by the proximity of the face of a user to a voice input device.
Fig. 2 is a block diagram showing an example of the configuration of the voice input device according to the embodiment.
Fig. 3 is a flowchart showing an example of the voice input method according to the embodiment.
Fig. 4A is a diagram for explaining a force applied to the voice input device when the voice input device according to the embodiment is brought close to the face of the user.
Fig. 4B shows an example of output signals of the triaxial acceleration sensor provided in the voice input device according to the embodiment when the voice input device is brought close to the face of the user.
Fig. 5A is a diagram for explaining changes in the position and orientation of a camera included in the voice input device when the voice input device according to the embodiment is brought close to the face of the user.
Fig. 5B shows a change in the size of the face of the user reflected in the image captured by the camera included in the voice input device when the voice input device according to the embodiment is brought close to the face of the user.
Description of the symbols
10 microphone
20 detection unit
30 triaxial acceleration sensor
31 comparison unit
32 pattern data
40 camera
41 face detection unit
42 face size measurement unit
50 ADC
60 correction unit
61 amplifier circuit
62 directivity synthesis unit
63 proximity effect correction unit
100 voice input device
200 user
Detailed Description
(Circumstances leading to one aspect of the present application)
First, the circumstances leading to one aspect of the present application will be described with reference to Fig. 1.
Fig. 1 is a diagram for explaining a decrease in voice recognition performance due to the proximity of the face of a user 200 to the voice input device 100.
For example, the user 200 uses the voice input device 100 when communicating with a person who does not speak the user's language. In general, the user 200 holds the voice input device 100 in front of his or her chest and utters the speech to be translated. The voice input device 100 then picks up the voice, has voice recognition performed by, for example, a server device, and has the voice translated into the desired language.
However, on the street or the like, voice recognition sometimes cannot be performed accurately due to surrounding noise or the like, and as shown in Fig. 1, the user 200 may move the voice input device 100 close to his or her face, utter the speech again, and have voice recognition performed again. When the face of the user 200 and the voice input device 100 are close in this way, the problems described below occur. Note that the face of the user 200 may also come close to the voice input device 100 because the user 200 brings his or her face toward the device rather than moving the device.
For example, the voice input device 100 may include at least two microphones, and the voice signal picked up by the at least two microphones may be a voice signal having a unidirectional directivity. That is, the voice input device 100 may have a high pickup sensitivity in a specific direction, or in other words, a low pickup sensitivity in directions other than the specific direction. For example, when the voice input device 100 is positioned in front of the chest of the user 200, the unidirectional directivity is set so that the pickup sensitivity is high in the direction toward the face of the user 200. When the picked-up voice signal is unidirectional in this way and the face of the user 200 approaches the voice input device 100, the mouth of the user 200 may deviate from the direction of high pickup sensitivity, so that voice recognition may not be performed normally.
Further, for example, when the face of the user 200 is close to the voice input device 100, the input signal level of the voice picked up by a microphone provided in the voice input device 100 rises and, depending on the situation, saturates, so that voice recognition may become impossible.
Further, for example, when the face of the user 200 is close to the voice input device 100, the low-frequency (bass) region of the voice picked up by a microphone provided in the voice input device 100 is emphasized by the proximity effect, so that voice recognition may not be performed normally.
Accordingly, a voice input method according to an aspect of the present application includes: a detection step of detecting whether a face of a user is close to a voice input device having at least one microphone; and a correction step of performing correction processing on the voice signal picked up by the at least one microphone in a case where it is detected that the face of the user is close to the voice input device.
Accordingly, since whether or not the face of the user is close to the voice input device is detected, when the face of the user is detected to be close to the voice input device, it is possible to perform the correction process of suppressing the deterioration of the voice recognition performance due to the proximity of the face of the user to the voice input device. Therefore, it is possible to suppress a decrease in the voice recognition performance due to the proximity of the face of the user to the voice input device. Since the degradation of the voice recognition performance is suppressed, for example, the picked-up voice can be correctly interpreted.
In addition, the at least one microphone may be at least two microphones, the voice signal may be a voice signal having a unidirectional directivity picked up by the at least two microphones, and the correction process may include a process of converting the unidirectional directivity into an omnidirectional directivity.
When the face of the user is close to the voice input device, the pickup sensitivity easily reaches a sufficiently high level even if the picked-up voice signal is omnidirectional. Therefore, by converting the unidirectional directivity into an omnidirectional directivity when the face of the user is close to the voice input device, a decrease in the voice recognition performance can be suppressed without being affected by the orientation of the face of the user with respect to the microphones.
And it is also possible that the correction processing includes processing of reducing a gain.
By performing the processing of reducing the gain, saturation of the input signal level of the voice picked up by a microphone of the voice input device can be suppressed when the face of the user is close to the voice input device, and a reduction in the voice recognition performance can be suppressed.
The correction process may include a process of reducing a gain of a component having a predetermined frequency or less.
Accordingly, by performing processing for reducing the gain of components at or below a predetermined frequency (for example, components of the bass region), emphasis of the bass region due to the proximity effect can be suppressed when the face of the user is close to the voice input device, and a reduction in voice recognition performance can be suppressed.
The voice input device may include a three-axis acceleration sensor, and the detection step may detect whether the face of the user is close to the voice input device based on a result of comparing a pattern in which the output of the three-axis acceleration sensor changes with time against a pattern measured in advance.
Accordingly, the operation state of the voice input device can be recognized by the three-axis acceleration sensor provided in the voice input device. In particular, by measuring in advance a pattern in which the output of the triaxial acceleration sensor changes with time when the voice input device approaches the user's face, it is possible to detect the approach of the user's face to the voice input device when a pattern similar to this pattern is output from the triaxial acceleration sensor.
The voice input device may include a camera, and the detection step may detect whether the face of the user is close to the voice input device in accordance with a change in the size of the face of the user included in an image captured by the camera.
In the case where the face of the user is close to the voice input device, the size of the face of the user included in the image captured by the camera becomes larger than that in the case where there is no proximity. Therefore, when the size of the face of the user in the image becomes large, it can be detected that the face of the user is close to the voice input device.
In the detecting step, it may be detected whether or not the face of the user is close to the voice input device in accordance with a change in the gain of the picked-up voice signal.
When the face of the user is close to the voice input device, the gain of the picked-up voice signal increases. Therefore, when the gain of picking up the voice signal increases, it can be detected that the face of the user is close to the voice input device.
In the detecting step, whether the face of the user is close to the voice input device may be detected based on a change of the average value of the gain of the voice signal picked up in a 2nd period following a 1st period, relative to the average value of the gain of the voice signal picked up in the 1st period.
Even when the face of the user is not close to the voice input device, the gain of the picked-up voice signal may be instantaneously increased. Therefore, whether or not the face of the user is close to the voice input device can be detected in accordance with the change in the average value of the gain of the picked-up voice signal in a predetermined period, and thus the detection can be performed accurately.
In the detecting step, it may be detected whether or not the face of the user and the voice input device are close to each other in accordance with a change in gain of a component of the picked-up voice signal having a frequency equal to or lower than a predetermined frequency.
When the face of the user is close to the voice input device, the gain of a component (e.g., a component in a bass region) of the picked-up voice signal having a predetermined frequency or less is increased by the proximity effect. Therefore, when the gain of the component of the picked-up voice signal having the predetermined frequency or less is increased, it can be detected that the face of the user is close to the voice input device.
In the detecting step, whether the face of the user is close to the voice input device may be detected in accordance with a change of the average value of the gain of the components at or below the predetermined frequency of the voice signal picked up in a 4th period following a 3rd period, relative to the average value of the gain of those components of the voice signal picked up in the 3rd period.
Even when the face of the user is not close to the voice input device, the gain of a component of the picked-up voice signal having a frequency equal to or lower than a predetermined frequency may be increased instantaneously. Therefore, it is possible to detect whether or not the face of the user is close to the voice input device in accordance with a change in the average value of the gain of the component of the picked-up voice signal having a frequency equal to or lower than a predetermined frequency in a predetermined period of time, thereby enabling accurate detection.
A recording medium according to an aspect of the present application is a computer-readable recording medium on which is recorded a program for causing a computer to execute the above-described voice input method.
Further, a voice input device according to an aspect of the present application includes at least one microphone, and the voice input device includes: a detection unit that detects whether or not the face of the user is close to the voice input device; and a correction unit configured to perform correction processing on the voice signal picked up by the at least one microphone when the face of the user is detected to be close to the voice input device.
Accordingly, it is possible to provide a voice input device capable of suppressing a decrease in voice recognition performance caused by the proximity of the face of the user to the voice input device.
The embodiments are specifically described below with reference to the drawings.
In addition, the embodiments to be described below are general or specific examples. The numerical values, shapes, materials, constituent elements, arrangement positions and connection forms of the constituent elements, steps, order of the steps, and the like shown in the following embodiments are merely examples, and the present application is not limited thereto.
(Embodiment)
The following describes an embodiment with reference to fig. 2 to 5B.
Fig. 2 is a block diagram showing an example of the configuration of the voice input device 100 according to the embodiment.
The voice input device 100 is a device into which a user inputs uttered speech so that the speech can be recognized and, for example, translated. For example, a voice signal representing the input speech is transmitted to a server device capable of communicating with the voice input device 100, voice recognition and translation are performed in the server device, and information representing the translated speech is transmitted back to the voice input device 100. The voice input device 100 then outputs the translated speech from a speaker provided in the voice input device 100, or displays the text of the translated speech on a display provided in the voice input device 100. The voice input device 100 is, for example, a smartphone, a tablet terminal, or a dedicated translation device.
The voice input device 100 includes: at least one microphone, a detection unit 20, a triaxial acceleration sensor 30, a comparison unit 31, mode data 32, a camera 40, a face detection unit 41, a face size measurement unit 42, an ADC (Analog to digital converter) 50, and a correction unit 60.
For example, the at least one microphone may be at least two microphones; here, the voice input device 100 includes two microphones 10. Since the voice uttered by the user reaches the two microphones 10 with a time difference, a voice signal having a unidirectional directivity can be obtained by exploiting the positional relationship of the microphones 10 and the time difference with which the voice reaches them.
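As a concrete illustration, the following is a minimal Python sketch of the general delay-and-subtract technique for obtaining a unidirectional (cardioid) pickup from two omnidirectional microphones. It is an example of the technique only, not the implementation of the present application; the 16 kHz sample rate and the microphone spacing of about 21.4 mm (one sample of acoustic travel time) are assumptions.

    import numpy as np

    FS = 16000    # assumed sample rate [Hz]
    C = 343.0     # speed of sound [m/s]
    D = C / FS    # assumed mic spacing (~21.4 mm), i.e., one sample of travel time

    def cardioid(front: np.ndarray, rear: np.ndarray) -> np.ndarray:
        # Delay-and-subtract differential pair: a wave arriving from behind
        # reaches the front capsule one sample after the rear capsule, so it
        # cancels; sensitivity is highest toward the front (the user's face).
        delayed_rear = np.concatenate(([0.0], rear[:-1]))
        return front - delayed_rear

    def omni(front: np.ndarray, rear: np.ndarray) -> np.ndarray:
        # A plain average of the two capsules has no preferred direction.
        return 0.5 * (front + rear)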
The detection unit 20 detects whether or not the face of the user is close to the voice input device 100. The detection unit 20 will be described in detail later.
The triaxial acceleration sensor 30 is a sensor that detects acceleration in three mutually orthogonal directions. As shown in Fig. 4A, when the voice input device 100 has a plate shape like a smartphone, the triaxial acceleration sensor 30 detects acceleration in the lateral direction on the plate surface (x-axis direction), in the longitudinal direction on the plate surface (y-axis direction), and in the direction perpendicular to the plate surface (z-axis direction).
The pattern data 32 is data of a pattern, measured in advance, in which the output of the triaxial acceleration sensor 30 changes with time when the voice input device 100 is brought close to the face of the user. Details of the pattern data 32 will be described later.
The comparison unit 31 compares a pattern in which the output of the triaxial acceleration sensor 30 changes with time with a pattern measured in advance. Specifically, it is determined whether or not the pattern of the change over time of the output of the triaxial acceleration sensor 30 is similar to the pattern measured in advance.
The camera 40 is a device that obtains an image by shooting. The camera 40 is provided at a position such that, when the user holds the voice input device 100 in his or her hand and looks at it, the face of the user appears in the image captured by the camera 40. For example, when the voice input device 100 is a smartphone or the like, the camera 40 is provided beside the display of the voice input device 100 and captures the user holding the device.
The face detection unit 41 detects the face of the user appearing in the image captured by the camera 40. The method of detecting the face in the image is not particularly limited, and a general face detection technique can be employed.
The face size measurement unit 42 measures the size of the face of the user appearing in the image captured by the camera 40.
The ADC 50 is a circuit that converts an analog signal into a digital signal, and the voice input device 100 includes two ADCs 50 corresponding to the two microphones 10. Each ADC 50 converts the analog voice signal picked up by the corresponding microphone 10 into a digital voice signal; more precisely, as described later, it converts the analog voice signal amplified by the amplifier circuit 61 into a digital voice signal.
The correction unit 60 includes: an amplification circuit 61, a directivity synthesis unit 62, and a proximity effect correction unit 63. The details of the correction unit 60 (the amplification circuit 61, the directivity synthesis unit 62, and the proximity effect correction unit 63) will be described later.
The voice input device 100 is a computer including a processor (microprocessor), a user interface, a communication interface (a communication circuit and the like, not shown), a memory, and the like. The user interface includes, for example, a display such as an LCD (Liquid Crystal Display), and an input device such as a keyboard or a touch panel. The memory is a ROM (Read-Only Memory), a RAM (Random Access Memory), or the like, and can store a program executed by the processor. The voice input device 100 may have one memory or a plurality of memories, and the pattern data 32 is stored in one or more of them. The processor operates according to the program, thereby realizing the detection unit 20, the comparison unit 31, the face detection unit 41, the face size measurement unit 42, and the correction unit 60.
The operation of the detection unit 20 and the correction unit 60 will be described in detail with reference to fig. 3.
Fig. 3 is a flowchart showing an example of the voice input method according to the embodiment.
The voice input method comprises the following steps: a detection step (step S11) of detecting whether or not the face of the user is close to the voice input device 100; and a correction step (step S12) of performing correction processing on the voice signal picked up by the at least one microphone in a case where it is detected that the face of the user is close to the voice input device 100.
For example, the voice input method according to the embodiment is a method executed by the voice input device 100. That is, fig. 3 is a flowchart showing the operations of the detection unit 20 and the correction unit 60, and the detection step corresponds to the detection unit 20 and the correction step corresponds to the correction unit 60.
The detection unit 20 determines whether or not the face of the user is close to the voice input device 100 (step S11).
For example, the detection unit 20 detects whether or not the face of the user is close to the voice input device 100, based on a comparison result between a pattern in which the output of the triaxial acceleration sensor 30 changes with time and a pattern measured in advance. This will be explained with reference to fig. 4A and 4B.
Fig. 4A is a diagram for explaining a force applied to the voice input device 100 when the voice input device 100 according to the embodiment is brought close to the face of the user. Fig. 4B shows an example of output signals of the triaxial acceleration sensor 30 included in the voice input device 100 according to the embodiment when the voice input device 100 is brought close to the face of the user.
As shown in Fig. 4A, the motion of bringing the voice input device 100 close to the face of the user is, for example, a motion in which the user, holding the voice input device 100 near his or her chest, moves it up to the vicinity of the mouth on his or her face. In other words, it is a motion of raising the voice input device 100 from a substantially horizontal attitude toward the face of the user. The state in which the voice input device 100 lies substantially horizontally near the chest of the user is referred to as state 1, and the state in which the voice input device 100 is raised by about 45° to about 90° from the horizontal near the face of the user (specifically, near the mouth) is referred to as state 2.
In the case where the voice input device 100 moves from the state 1 to the state 2, the triaxial acceleration sensor 30 outputs a signal shown in fig. 4B. As described above, when the voice input device 100 has a plate shape like a smartphone, the lateral direction on the plane of the plate shape is defined as the x-axis direction, the longitudinal direction is defined as the y-axis direction, and the direction perpendicular to the plane of the plate shape is defined as the z-axis direction, and the three-axis acceleration sensor 30 detects the acceleration of three axes in the x-axis direction, the y-axis direction, and the z-axis direction.
In state 1, gravity acts in the z-axis direction of the voice input device 100, and almost no force acts in the x-axis and y-axis directions. The triaxial acceleration sensor 30 would therefore output a signal corresponding to the gravitational acceleration g in the z-axis direction and almost 0 in the x-axis and y-axis directions. In Fig. 4B, however, an offset that cancels the gravitational acceleration in the z-axis direction is applied to the triaxial acceleration sensor 30, so that the outputs in the x-axis, y-axis, and z-axis directions in state 1 are all substantially 0.
Then, when the voice input device 100 is moved toward the face of the user as shown in Fig. 4A, hand jitter appears in the x-axis direction, gravity shifts onto the y-axis direction, and the force of raising the voice input device 100 appears in the z-axis direction, as shown in Fig. 4B, and the voice input device 100 enters state 2.
In this way, when the voice input device 100 is moved close to the face of the user, the output of the triaxial acceleration sensor 30 changes with time in the pattern shown in Fig. 4B. Therefore, for example, the pattern shown in Fig. 4B is measured and stored in advance as the pattern data 32; thereafter, when a pattern similar to that of Fig. 4B is detected in the time-varying output of the triaxial acceleration sensor 30, it can be determined that the motion of bringing the voice input device 100 close to the face of the user has been performed.
Since the motion of bringing the voice input device 100 close to the face is considered to vary from person to person, patterns of various approach motions may be measured in advance and stored as the pattern data 32.
In this way, the detection unit 20 can detect that the face of the user is close to the voice input device 100 when the pattern in which the output of the triaxial acceleration sensor 30 changes with time is similar to a pattern measured in advance.
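Purely as an illustration, the comparison described above can be sketched as a normalized correlation between the recent accelerometer trace and the pre-measured pattern. The function names and the 0.8 similarity threshold are assumptions, not values from the present application.

    import numpy as np

    def _normalize(trace: np.ndarray) -> np.ndarray:
        # trace has shape (samples, 3): x-, y-, and z-axis acceleration over time.
        trace = trace - trace.mean(axis=0)
        return trace / (np.linalg.norm(trace) + 1e-9)

    def matches_stored_pattern(recent: np.ndarray, stored: np.ndarray,
                               threshold: float = 0.8) -> bool:
        # True when the recent output of the triaxial acceleration sensor has
        # changed over time similarly to the pattern measured in advance.
        n = min(len(recent), len(stored))
        a, b = _normalize(recent[-n:]), _normalize(stored[:n])
        return float(np.sum(a * b)) >= threshold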
For example, the detection unit 20 detects whether or not the face of the user is close to the voice input device 100 according to a change in the size of the face of the user included in the image captured by the camera 40. This will be explained using fig. 5A and 5B.
Fig. 5A is a diagram for explaining changes in the position and orientation of the camera 40 included in the voice input device 100 according to the embodiment when the voice input device 100 is close to the face of the user. Fig. 5B shows a change in the size of the face of the user reflected in the image captured by the camera 40 included in the voice input device 100 when the voice input device 100 according to the embodiment is close to the face of the user.
As shown in Fig. 5A, when the voice input device 100 is in state 1, the camera 40 faces upward (for example, vertically upward) near the chest of the user. When the voice input device 100 is in state 2, the camera 40 faces the user near the user's mouth. In state 1, as shown by the broken-line frame on the left side of Fig. 5B, the face of the user in the image is small and compressed in the up-down direction. This is because in state 1 the camera 40 is farther from the user than in state 2, and the face of the user lies at the edge of the camera's field of view. In contrast, in state 2 the face of the user in the image is large, as shown by the broken-line frame on the right side of Fig. 5B.
In this way, the detection unit 20 can detect that the face of the user is close to the voice input device 100 when the size of the face of the user in the image captured by the camera 40 becomes large.
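For illustration only, such a size-based check might compare the area of the detected face bounding box between frames; the 1.5x ratio is an assumed threshold, and detecting the box itself is left to the face detection unit 41 (any off-the-shelf face detector).

    def face_close(prev_box, cur_box, ratio: float = 1.5) -> bool:
        # Boxes are (x, y, width, height) for the face detected in the image.
        # The face is judged close when its area grows by at least `ratio`
        # relative to the earlier frame (state 1 versus state 2 in Fig. 5B).
        prev_area = prev_box[2] * prev_box[3]
        cur_area = cur_box[2] * cur_box[3]
        return prev_area > 0 and cur_area / prev_area >= ratio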
The detection unit 20 may detect whether or not the face of the user is close to the voice input device 100 according to a change in the gain of the picked-up voice signal. This is because when the face of the user is close to the voice input device 100, the gain of the voice signal becomes larger than when the face is not close to the voice input device. For example, when the gain of the picked-up voice signal becomes greater than or equal to a predetermined value (e.g., 10 dB), the detection unit 20 detects that the face of the user is close to the voice input device 100. However, even when the face of the user is not close to the voice input device 100, the gain of the picked-up voice signal may increase instantaneously due to a difference in the way the user utters a voice or the like.
The detection unit 20 may therefore detect whether the face of the user is close to the voice input device 100 based on a change of the average value of the gain of the voice signal picked up in a 2nd period (for example, 3 seconds) following a 1st period (for example, 3 seconds), relative to the average value of the gain of the voice signal picked up in the 1st period. For example, when the time-averaged gain of the picked-up voice signal rises by a predetermined value (for example, 10 dB) or more, the detection unit 20 detects that the face of the user is close to the voice input device 100. Detecting proximity from the change of the time-averaged gain over fixed periods in this way allows the detection to be performed reliably.
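A minimal sketch of this period-averaged comparison follows, using the example values from the text (3-second periods, a 10 dB rise) and an assumed 16 kHz sample rate; it is an illustration, not the implementation of the present application.

    import numpy as np

    FS = 16000  # assumed sample rate [Hz]

    def mean_level_db(x: np.ndarray) -> float:
        # Average level of one period, in dB (RMS relative to full scale).
        rms = np.sqrt(np.mean(x.astype(np.float64) ** 2)) + 1e-12
        return 20.0 * np.log10(rms)

    def gain_rise_detected(signal: np.ndarray, period_s: float = 3.0,
                           rise_db: float = 10.0) -> bool:
        # Compare the 1st period against the 2nd period that follows it.
        n = int(period_s * FS)
        first, second = signal[:n], signal[n:2 * n]
        return mean_level_db(second) - mean_level_db(first) >= rise_db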
The detection unit 20 may also detect whether the face of the user is close to the voice input device 100 in accordance with a change in the gain of components of the picked-up voice signal at or below a predetermined frequency. This is because, when the face of the user is close to the voice input device 100, the gain of components at or below a predetermined frequency (for example, components of the bass region) is increased by the proximity effect compared with the case where the face is not close. Here, the gain of the components at or below a predetermined frequency is, for example, the frequency average of the gains of the components from 0 Hz up to the predetermined frequency. For example, when the gain of components of the picked-up voice signal at or below a predetermined frequency (for example, 200 Hz) rises by a predetermined value (for example, 5 dB) or more, the detection unit 20 detects that the face of the user is close to the voice input device 100. However, even when the face of the user is not close to the voice input device 100, the gain of these components may increase momentarily owing to, for example, a difference in the way the user utters the voice.
The detection unit 20 may therefore detect whether the face of the user is close to the voice input device 100 based on a change of the average value of the gain of the components at or below the predetermined frequency of the voice signal picked up in a 4th period (for example, 3 seconds) following a 3rd period (for example, 3 seconds), relative to the average value of the gain of those components of the voice signal picked up in the 3rd period. For example, when the time-averaged gain of the components of the picked-up voice signal at or below the predetermined frequency rises by a predetermined value (for example, 5 dB) or more, the detection unit 20 detects that the face of the user is close to the voice input device 100. Detecting proximity from the change of the time-averaged low-frequency gain over fixed periods in this way allows the detection to be performed reliably.
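The same comparison restricted to the low band can be sketched as follows. The 200 Hz cutoff and the 5 dB rise are the example values from the text, and the FFT-based band average is merely one way the frequency average from 0 Hz up to the cutoff could be computed.

    import numpy as np

    FS = 16000  # assumed sample rate [Hz]

    def low_band_level_db(x: np.ndarray, cutoff_hz: float = 200.0) -> float:
        # Frequency-averaged energy of the components from 0 Hz to cutoff_hz.
        spectrum = np.abs(np.fft.rfft(x.astype(np.float64)))
        freqs = np.fft.rfftfreq(len(x), 1.0 / FS)
        energy = np.mean(spectrum[freqs <= cutoff_hz] ** 2) + 1e-12
        return 10.0 * np.log10(energy)

    def low_band_rise_detected(signal: np.ndarray, period_s: float = 3.0,
                               rise_db: float = 5.0) -> bool:
        # Compare the 3rd period against the 4th period that follows it.
        n = int(period_s * FS)
        third, fourth = signal[:n], signal[n:2 * n]
        return low_band_level_db(fourth) - low_band_level_db(third) >= rise_db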
The detection unit 20 may also detect whether the face of the user is close to the voice input device 100 according to whether the picked-up voice contains an echo. This is because, when the face of the user is close to the voice input device 100, the picked-up voice is less likely to contain an echo (reverberation) than when the face is not close. Whether the picked-up voice contains an echo can be judged, for example, by using autocorrelation: the autocorrelation components at non-zero lags grow as the reverberation increases, so these components are large when the face of the user is not close to the voice input device 100 and, conversely, small when the face of the user is close. In this way, whether the face of the user is close to the voice input device 100 can be detected by judging, with autocorrelation, whether the picked-up voice contains an echo.
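As an illustration of this autocorrelation check, the sketch below measures the strongest normalized autocorrelation peak in a late-lag window; the 20-200 ms window and the 0.2 threshold are assumptions, not values from the present application.

    import numpy as np

    def late_autocorr_peak(x: np.ndarray, fs: int = 16000,
                           min_lag_s: float = 0.02, max_lag_s: float = 0.2) -> float:
        # Normalized autocorrelation at non-zero lags; a large late peak
        # indicates reverberation (echo), i.e., the face is not close.
        x = x.astype(np.float64) - np.mean(x)
        ac = np.correlate(x, x, mode="full")[len(x) - 1:]
        ac = ac / (ac[0] + 1e-12)  # normalize by the zero-lag energy
        lo, hi = int(min_lag_s * fs), int(max_lag_s * fs)
        return float(np.max(np.abs(ac[lo:hi])))

    def face_close_by_echo(x: np.ndarray, threshold: float = 0.2) -> bool:
        return late_autocorr_peak(x) < threshold  # little echo -> close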
Returning to the description of fig. 3, when it is detected that the face of the user is close to the voice input device 100 (yes in step S11), the correction unit 60 performs a correction process on the voice signal picked up by at least one microphone (step S12). As described above, the correction unit 60 includes: the amplification circuit 61, the directivity synthesis unit 62, and the proximity effect correction unit 63, in other words, the correction unit 60 is realized by the amplification circuit 61, the directivity synthesis unit 62, and the proximity effect correction unit 63.
The amplifier circuit 61 is a circuit that amplifies the input voice signal (here, an analog voice signal) and has a function of adjusting the gain of the voice signal. Here, the amplifier circuit 61 performs the processing of reducing the gain.
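Although the amplifier circuit 61 adjusts the gain of the analog signal, the effect of the gain-reduction processing can be sketched digitally as a simple attenuation; the 12 dB figure is an assumed example, not a value from the present application.

    def reduce_gain(x, attenuation_db: float = 12.0):
        # Attenuate so that close-talking speech does not saturate the input.
        return x * (10.0 ** (-attenuation_db / 20.0))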
The directivity synthesis unit 62 adjusts the phases of the input voice signals (here, the two digital voice signals output from the two ADCs 50) to control the directivity. Here, the directivity synthesis unit 62 performs the processing of converting the unidirectional directivity into an omnidirectional directivity.
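Reusing cardioid() and omni() from the earlier two-microphone sketch, the conversion can be illustrated as a simple switch; again, this is a hedged example rather than the actual directivity synthesis of the present application.

    def synthesize_directivity(front, rear, face_close: bool):
        # Fall back to the direction-independent (omnidirectional) combination
        # when proximity is detected; otherwise keep the unidirectional pickup.
        return omni(front, rear) if face_close else cardioid(front, rear)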
The proximity effect correction unit 63 is an equalizer that changes the frequency characteristics of the input voice signal (here, the voice signal whose directivity has been adjusted by the directivity synthesis unit 62). Here, the proximity effect correction unit 63 performs the processing of reducing the gain of components at or below a predetermined frequency (for example, the bass region of 200 Hz or less).
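One hedged way to realize such an equalizer is a gentle high-pass (or low-shelf cut) at the 200 Hz corner mentioned above; the second-order Butterworth design is an assumption.

    import numpy as np
    from scipy.signal import butter, lfilter

    def cut_proximity_bass(x: np.ndarray, fs: int = 16000,
                           cutoff_hz: float = 200.0, order: int = 2) -> np.ndarray:
        # High-pass filtering reduces the gain of the components at or below
        # the cutoff that the proximity effect emphasizes.
        b, a = butter(order, cutoff_hz / (fs / 2), btype="highpass")
        return lfilter(b, a, x)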
The correction processing performed by the correction unit 60 thus includes: the processing of reducing the gain by the amplifier circuit 61, the processing of converting the unidirectional directivity into an omnidirectional directivity by the directivity synthesis unit 62, and the processing of reducing the gain of components at or below a predetermined frequency by the proximity effect correction unit 63.
When detecting that the face of the user is close to the voice input device 100, the correction unit 60 may perform processing for reducing the gain with respect to the voice signal, may perform processing for converting the unidirectional directivity into the omnidirectional directivity, and may perform processing for reducing the gain of a component of a predetermined frequency or less.
The correction unit 60 need not perform all of the processing of reducing the gain, the processing of converting the unidirectional directivity into an omnidirectional directivity, and the processing of reducing the gain of components at or below a predetermined frequency. For example, the content of the correction processing may be changed according to what the detection unit 20 detected. For example, when the face of the user is detected to be close to the voice input device 100 because the gain of the picked-up voice signal has risen by a predetermined value or more, the correction unit 60 may perform only the processing of reducing the gain as the correction processing. Likewise, when the face of the user is detected to be close to the voice input device 100 because the gain of components of the picked-up voice signal at or below a predetermined frequency has risen by a predetermined value or more, the correction unit 60 may perform only the processing of reducing the gain of those components as the correction processing.
Then, the voice input device 100 outputs the voice signal on which the correction processing is performed to a server device or the like for voice recognition or the like.
When detecting that the face of the user is not close to the voice input device 100 (no in step S11), the correction unit 60 does not perform the correction processing on the voice signal picked up by at least one microphone, and the voice input device 100 outputs the voice signal that has not been subjected to the correction processing to a server device or the like for voice recognition or the like.
As described above, since whether or not the face of the user is close to the voice input device 100 is detected, when it is detected that the face of the user is close to the voice input device 100, it is possible to perform the correction processing in which the degradation of the voice recognition performance due to the proximity of the face of the user to the voice input device 100 is suppressed. Therefore, it is possible to suppress a decrease in the voice recognition performance caused by the proximity of the face of the user to the voice input device 100. Since the deterioration of the speech recognition performance is suppressed, for example, the picked-up speech can be correctly interpreted.
(other embodiments)
Although the voice input method and the voice input device 100 according to one or more aspects of the present application have been described above based on the embodiment, the present application is not limited to this embodiment. Embodiments obtained by applying various modifications conceivable to a person skilled in the art, and embodiments constructed by combining constituent elements of different embodiments, are also included in the scope of one or more aspects of the present application, as long as they do not depart from the gist of the present application.
For example, in the above-described embodiment, the voice input device 100 has been described as including two microphones 10, but the present application is not limited thereto. For example, the voice input device 100 may include one microphone, or three or more microphones. The voice input device 100 includes amplifier circuits 61 and ADCs 50 corresponding to the number of microphones. When the voice input device 100 includes one microphone, the directivity synthesis unit 62 need not be provided.
For example, in the above-described embodiment, the correction unit 60 has been described as including the amplifier circuit 61, the directivity synthesis unit 62, and the proximity effect correction unit 63, but the present application is not limited thereto. For example, the correction unit 60 may include at least one of the amplifier circuit 61, the directivity synthesis unit 62, and the proximity effect correction unit 63.
For example, in the above-described embodiment, the voice input device 100 has been described as including the triaxial acceleration sensor 30, the comparison unit 31, and the pattern data 32, but it need not include them. That is, the detection unit 20 does not have to detect whether the face of the user is close to the voice input device 100 based on the result of comparing the time-varying output pattern of the triaxial acceleration sensor 30 with a pattern measured in advance.
For example, in the above-described embodiment, the voice input device 100 has been described as including the camera 40, the face detection unit 41, and the face size measurement unit 42, but it need not include them. That is, the detection unit 20 does not have to detect whether the face of the user is close to the voice input device 100 according to a change in the size of the face of the user in the image captured by the camera 40.
For example, the present application may be realized as a server device that executes the voice input method. In that case, the server device may include, for example, the detection unit 20, the comparison unit 31, the pattern data 32, the face detection unit 41, the face size measurement unit 42, the directivity synthesis unit 62, and the proximity effect correction unit 63. That is, the server device may provide the functions other than the hardware, such as the microphones 10, the triaxial acceleration sensor 30, and the camera 40, that remains in the voice input device 100.
The present application can be implemented as a program for causing a processor to execute the steps included in the voice input method. The present application can be realized as a non-transitory computer-readable recording medium such as a CD-ROM on which the program is recorded.
For example, when the present application is implemented as a program (software), each step is executed by executing the program using hardware resources such as a CPU, a memory, and an input/output circuit of a computer. That is, the CPU obtains data from the memory, the input/output circuit, or the like, and performs an operation, or outputs an operation result to the memory, the input/output circuit, or the like, whereby each step is executed.
In the above-described embodiment, each component included in the voice input device 100 may be configured by dedicated hardware, or may be realized by executing a software program suitable for each component. Each component can be realized by a program execution unit such as a CPU or a processor, reading out and executing a software program recorded in a recording medium such as a hard disk or a semiconductor memory.
A part or all of the functions of the voice input device 100 according to the above-described embodiment are typically realized as an LSI, which is an integrated circuit. These functions may be made into individual chips, or a part or all of them may be integrated into one chip. Circuit integration is not limited to LSI and may be realized with a dedicated circuit or a general-purpose processor. An FPGA (Field Programmable Gate Array) that can be programmed after LSI manufacture, or a reconfigurable processor in which the connections and settings of circuit cells inside the LSI can be reconfigured, may also be used.
Various modifications of the embodiments of the present application, which are made within the scope of the present application as can be conceived by those skilled in the art, are included in the present application.
The voice input method and the like of the present application can be applied to, for example, a mobile device such as a smartphone, a tablet terminal, or a translator used for translating voices.

Claims (12)

1. A voice input method comprising:
a detection step of detecting whether a face of a user is close to a voice input device having at least one microphone; and
a correction step of performing correction processing on the voice signal picked up by the at least one microphone in a case where it is detected that the face of the user is close to the voice input device.
2. The voice input method of claim 1,
the number of the at least one microphone is at least two,
the voice signal is a voice signal having a unidirectional directivity picked up by the at least two microphones,
the correction process includes a process of converting a unidirectional directivity into an omni-directional directivity.
3. The voice input method of claim 1,
the correction process includes a process of reducing a gain.
4. The voice input method of claim 1,
the correction processing includes processing for reducing the gain of a component having a predetermined frequency or less.
5. The voice input method of claim 1,
the voice input device is provided with a three-axis acceleration sensor,
in the detecting step, whether the face of the user is close to the voice input device is detected according to a comparison result of a pattern of the output of the three-axis acceleration sensor changing with time and a pattern measured in advance.
6. The voice input method of claim 1,
the voice input device is provided with a camera,
in the detecting step, whether the face of the user and the voice input device are close to each other is detected in accordance with a change in size of the face of the user included in an image captured by the camera.
7. The voice input method of claim 1,
in the detecting step, whether the face of the user is close to the voice input device is detected in accordance with a change in the gain of the picked-up voice signal.
8. The voice input method of claim 7,
in the detecting step, whether the face of the user is close to the voice input device is detected in accordance with a change in an average value of gains of the voice signal picked up during a 2nd period after a 1st period with respect to an average value of gains of the voice signal picked up during the 1st period.
9. The speech input method of any one of claims 1 to 8,
in the detecting step, whether the face of the user is close to the voice input device is detected in accordance with a change in gain of a component of the picked-up voice signal having a predetermined frequency or less.
10. The voice input method of claim 9,
in the detecting step, whether the face of the user is close to the voice input device is detected in accordance with a change in an average value of gains of the components at or below the predetermined frequency of the voice signal picked up in a 4th period after a 3rd period with respect to an average value of gains of those components of the voice signal picked up in the 3rd period.
11. A computer-readable recording medium recording a program for causing a computer to execute the voice input method according to any one of claims 1 to 10.
12. A voice input device is provided with at least one microphone,
the voice input device is provided with:
a detection unit that detects whether or not the face of the user is close to the voice input device; and
and a correction unit configured to perform correction processing on the voice signal picked up by the at least one microphone when the face of the user is detected to be close to the voice input device.
CN202010211028.5A 2019-03-27 2020-03-24 Voice input method, recording medium, and voice input device Pending CN111757217A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962824608P 2019-03-27 2019-03-27
US62/824608 2019-03-27
JP2020006980A JP7411422B2 (en) 2019-03-27 2020-01-20 Voice input method, program and voice input device
JP2020-006980 2020-01-20

Publications (1)

Publication Number Publication Date
CN111757217A (zh) 2020-10-09

Family

ID=72640081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010211028.5A Pending CN111757217A (en) 2019-03-27 2020-03-24 Voice input method, recording medium, and voice input device

Country Status (2)

Country Link
JP (1) JP7411422B2 (en)
CN (1) CN111757217A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002218583A (en) * 2001-01-17 2002-08-02 Sony Corp Sound field synthesis arithmetic method and device
KR100580758B1 (en) * 2004-12-23 2006-05-15 주식회사 팬택앤큐리텔 Microphone control apparatus for mobile communication terminal
JP2008060902A (en) * 2006-08-31 2008-03-13 Nippon Hoso Kyokai <Nhk> Unidirectional microphone
JP2009164747A (en) * 2007-12-28 2009-07-23 Yamaha Corp Microphone device, telephone set, voice signal processing device, and voice signal processing method
US20120062729A1 (en) * 2010-09-10 2012-03-15 Amazon Technologies, Inc. Relative position-inclusive device interfaces
CN102547533A (en) * 2010-11-05 2012-07-04 索尼公司 Acoustic control apparatus and acoustic control method
CN104519212A (en) * 2013-09-27 2015-04-15 华为技术有限公司 An echo cancellation method and apparatus
WO2016093834A1 (en) * 2014-12-11 2016-06-16 Nuance Communications, Inc. Speech enhancement using a portable electronic device
US20160336913A1 (en) * 2015-05-14 2016-11-17 Voyetra Turtle Beach, Inc. Headset With Programmable Microphone Modes
JP2017034519A (en) * 2015-08-03 2017-02-09 独立行政法人国立高等専門学校機構 Voice processing unit, voice processing system, and voice processing method
CN107577449A (en) * 2017-09-04 2018-01-12 百度在线网络技术(北京)有限公司 Wake up pick-up method, device, equipment and the storage medium of voice
WO2018217194A1 (en) * 2017-05-24 2018-11-29 Rovi Guides, Inc. Methods and systems for correcting, based on speech, input generated using automatic speech recognition

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002111801A (en) 2000-09-28 2002-04-12 Casio Comput Co Ltd Mobile telephone device
JP2010206451A (en) 2009-03-03 2010-09-16 Panasonic Corp Speaker with camera, signal processing apparatus, and av system
JP7240989B2 (en) 2019-08-19 2023-03-16 日本化薬株式会社 Curable resin composition and its cured product


Also Published As

Publication number Publication date
JP7411422B2 (en) 2024-01-11
JP2020162112A (en) 2020-10-01

Similar Documents

Publication Publication Date Title
US9913022B2 (en) System and method of improving voice quality in a wireless headset with untethered earbuds of a mobile device
US9997173B2 (en) System and method for performing automatic gain control using an accelerometer in a headset
US9438985B2 (en) 2016-09-06 System and method of detecting a user's voice activity using an accelerometer
US9313572B2 (en) 2016-04-12 System and method of detecting a user's voice activity using an accelerometer
US20080175408A1 (en) Proximity filter
US11277686B2 (en) Electronic device with audio zoom and operating method thereof
EP2863392B1 (en) Noise reduction in multi-microphone systems
US20100098266A1 (en) Multi-channel audio device
KR20160099640A (en) Systems and methods for feedback detection
JP2008131474A (en) Voice input device, its manufacturing method and information processing system
US10638217B2 (en) Pressure-responsive sensors and related systems and methods
CN112233689A (en) Audio noise reduction method, device, equipment and medium
EP3716269A1 (en) Speech input method, program, and speech input device
CN111757217A (en) Voice input method, recording medium, and voice input device
JPWO2014192235A1 (en) Control device, control method and program
US11562763B2 (en) Method for improving sound quality and electronic device using same
US20220328057A1 (en) Method to Remove Talker Interference to Noise Estimator
US10360922B2 (en) Noise reduction device and method for reducing noise
US11363374B2 (en) Signal processing apparatus, method of controlling signal processing apparatus, and non-transitory computer-readable storage medium
US11849291B2 (en) Spatially informed acoustic echo cancelation
US11710475B2 (en) Methods and apparatus for obtaining biometric data
US10785367B2 (en) Audio processing method and terminal device
KR20210125846A (en) Speech processing apparatus and method using a plurality of microphones
JP2024518261A (en) Electronic device and method of operation thereof
JP2009118503A (en) Voice input device, manufacturing method therefor, and information processing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Osaka, Japan

Applicant after: Panasonic Holding Co.,Ltd.

Address before: Osaka, Japan

Applicant before: Matsushita Electric Industrial Co.,Ltd.