WO2020153158A1

WO2020153158A1 - Determination device, method therefor, and program

Info

Publication number: WO2020153158A1
Application number: PCT/JP2020/000695
Authority: WO
Inventors: 弘章伊藤; 小林　和則
Original assignee: 日本電信電話株式会社
Priority date: 2019-01-23
Filing date: 2020-01-10
Publication date: 2020-07-30
Also published as: JP2020118838A

Abstract

Provided is a determination device and the like for changing a threshold value in accordance with ambient noise and reducing erroneous detection. This determination device determines whether an input acoustic signal includes a speech signal issued from a user. The determination device has: a threshold value determination unit that determines a threshold value on the basis of reference information; and a determination unit that determines whether the input acoustic signal includes the speech signal on the basis of the threshold value. The reference information is information relating to the magnitude of ambient noise that is an acoustic signal which arrives at a microphone for collecting the input acoustic signal and which excludes speech from the user. The threshold value determination unit determines the threshold value such that the larger the magnitude of the ambient noise indicated by the reference information is, the more difficult it becomes to determine that the input acoustic signal includes the speech signal. The threshold value determination unit determines the threshold value such that the lower the magnitude of the ambient noise indicated by the reference information is, the easier it becomes to determine that the input acoustic signal includes the speech signal, but the determination that the input acoustic signal includes the speech signal cannot be made more easily than when a prescribed reference is used.

Description

Determination device, method therefor, and program

The present invention relates to a technique for determining whether an input audio signal includes a voice signal emitted by a user.

Voice activity detection (VAD) is known as a technique for determining whether an input acoustic signal includes a voice signal uttered by a user. VAD uses some method to judge whether the observed signal is voice or non-voice. For example, the method of Non-Patent Document 1 is known as a deterministic method, and that of Non-Patent Document 2 as a statistical method. In the deterministic method, the observed signal is determined to be voice when it exceeds a preset threshold value. In the statistical method, a discrimination model of voice-likeness and non-voice-likeness is learned, and the model determines whether the observed signal is voice.

However, when conventional VAD technology is applied to a terminal equipped with a microphone and a speaker (a smart speaker, robot, in-vehicle terminal, etc.), ambient noise such as the sound reproduced from the terminal's speaker may be erroneously detected as voice when that noise becomes loud (see FIG. 1).

An object of the present invention is to provide a determination device, a method therefor, and a program that change a threshold value according to ambient noise to reduce erroneous detection.

In order to solve the above problem, according to one aspect of the present invention, a determination device determines whether an input acoustic signal includes a voice signal emitted by a user. The determination device includes a threshold value determination unit that determines a threshold value based on reference information, and a determination unit that determines, based on the threshold value, whether the input acoustic signal includes the voice signal. The reference information is information related to the magnitude of ambient noise, which is an acoustic signal that reaches the microphone picking up the input acoustic signal and that excludes the voice uttered by the user. The threshold value determination unit determines the threshold value such that the larger the magnitude of the ambient noise indicated by the reference information, the more difficult it is to determine that the input acoustic signal includes the voice signal. The threshold value determination unit also determines the threshold value such that the smaller the magnitude of the ambient noise indicated by the reference information, the easier it is to determine that the input acoustic signal includes the voice signal, but never easier than under a predetermined criterion.

According to the present invention, it is possible to change the threshold value according to the ambient noise and reduce the false detection.

FIG. 1 is a diagram for explaining conventional VAD.
FIG. 2 is a diagram for explaining the determination device according to the first embodiment.
FIG. 3 is a functional block diagram of the determination device according to the first embodiment.
FIG. 4 is a diagram showing an example of the processing flow of the determination device according to the first embodiment.
FIG. 5 is a diagram for explaining how the threshold value Th is determined when the information related to the magnitude of ambient noise is a continuous value.

The embodiments of the present invention will be described below. In the drawings used for the following description, components having the same function and steps for performing the same process are denoted by the same reference numerals, and duplicate description will be omitted. Unless otherwise specified, the processing performed for each element of the vector or matrix is applied to all the elements of the vector or matrix.

<Points of the first embodiment>
In the present embodiment, the VAD threshold is dynamically changed (see FIG. 2). At this time, the timing and amount of dynamic change are specified from the reference information. The reference information is information related to the magnitude of ambient noise.

<First embodiment>
FIG. 3 is a functional block diagram of the determination device according to the first embodiment, and FIG. 4 shows its processing flow.

The determination device includes a threshold value determination unit 110 and a VAD processing unit 120.

The determination device receives the reference information and the observation signal (input acoustic signal) as input, determines whether the observation signal includes a voice signal emitted by the user, and outputs the determination result. A section that includes a voice signal emitted by the user is referred to as a voice section, so the determination device may also be described as determining voice sections. For example, the determination result is information indicating that the section is a voice section or information indicating that it is not. Only the voice uttered by the user who is the target of VAD is treated as a voice signal; a voice uttered by another person or sound from a speaker is treated as noise. The input acoustic signal may be an observation signal picked up in real time, or may be a signal that was picked up in advance and stored in some storage medium.

The determination device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. The determination device executes each process under the control of the central processing unit, for example. The data input to the determination device and the data obtained by each process are stored in, for example, the main storage device, and the data stored in the main storage device is read out to the central processing unit as necessary and used for other processing. At least a part of each processing unit of the determination device may be configured by hardware such as an integrated circuit. Each storage unit included in the determination device can be configured by, for example, a main storage device such as a RAM, or by middleware such as a relational database or a key-value store. However, each storage unit does not necessarily have to be provided inside the determination device; it may instead be configured by an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as a flash memory, and be provided outside the determination device.

Hereinafter, each part will be described.
<Threshold decision unit 110>
The threshold value determination unit 110 receives the reference information, determines the threshold value based on the reference information (S110), and outputs the determined threshold value. The threshold value may be output (i) at a predetermined cycle, regardless of whether reference information has been input or the threshold value has changed, (ii) every time reference information is input and the threshold value is determined, or (iii) only when the threshold value changes as a result of the determination.
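As an illustration of output timing (iii), the following minimal sketch emits a threshold value only when it differs from the one emitted last. The class name `ChangeOnlyThresholdOutput` and its `update` method are hypothetical and do not appear in the original; they only show one way such a policy could be arranged.

```python
from typing import Optional


class ChangeOnlyThresholdOutput:
    """Hypothetical helper for timing option (iii): output only on change."""

    def __init__(self) -> None:
        # No threshold has been emitted yet.
        self._last: Optional[float] = None

    def update(self, threshold: float) -> Optional[float]:
        """Return the threshold if it changed since the last output, else None."""
        if self._last is None or threshold != self._last:
            self._last = threshold
            return threshold
        return None
```

A downstream VAD stage could keep using its stored threshold whenever `update` returns `None`.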

The reference information is information related to the magnitude of ambient noise, which is an acoustic signal that arrives at the microphone that collects the observation signal and excludes the voice uttered by the user.

The threshold value determination unit 110 determines the threshold value so that it becomes more difficult to determine that the observed signal includes a voice signal as the ambient noise level indicated by the reference information increases.

Further, the threshold value determination unit 110 determines the threshold value so that it becomes easier to determine that the observation signal includes a voice signal as the ambient noise indicated by the reference information decreases, but so that this determination never becomes easier than under a predetermined criterion.

For example, the reference information may be the presence or absence of a speaker playback signal (binary), whether a car engine is on or off (binary), the presence or absence of an approaching speaker (binary), or a measured ambient noise level (continuous value). The ambient noise is judged to be large when there is a speaker playback signal, when the car engine is on, or when a speaker is approaching. Such reference information can also be regarded as information about the causes of an increase or decrease in ambient noise. As ambient noise increases, noise is more likely to be erroneously detected as a voice signal. Therefore, in the present embodiment, the threshold value is changed so that the larger the ambient noise, the more difficult it is to determine that a voice signal is included. For example, as shown in FIG. 1, when the determination is based on the power of the observed signal (that is, the signal is determined to be a voice signal when its power is larger than the threshold value), the threshold value is changed to a larger value when the ambient noise is determined to be large.

For example, as a deterministic rule, a binary threshold value (0.3 or 1.0) may be determined from a binary input (0 or 1) such as the presence or absence of a speaker playback signal, whether a car engine is on or off, or the presence or absence of an approaching speaker. In this case, 0.3 corresponds to the predetermined criterion described above. Providing the predetermined criterion prevents the threshold value from being lowered more than necessary, and thus prevents the observed signal from being erroneously determined to include a voice signal in a state close to silence.
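The binary rule described above can be sketched as follows. The values 0.3 and 1.0 are the ones given in the text; the function name `threshold_from_flag` and its parameter names are hypothetical, and the scale of the threshold relative to the signal power is an assumption.

```python
def threshold_from_flag(noise_cause_active: bool,
                        quiet_threshold: float = 0.3,
                        noisy_threshold: float = 1.0) -> float:
    """Map one binary piece of reference information to a binary threshold.

    noise_cause_active is True when the cause of ambient noise is present
    (e.g. a speaker playback signal exists or the car engine is on).
    quiet_threshold plays the role of the predetermined criterion.
    """
    return noisy_threshold if noise_cause_active else quiet_threshold
```

For example, `threshold_from_flag(True)` selects the higher threshold while playback is active, and `threshold_from_flag(False)` falls back to the predetermined criterion.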

The threshold value Th may be determined as follows by combining M binary values.
$$\mathrm{Th} = b_1 a_1 + b_2 a_2 + \dots + b_M a_M + c$$
Here, a_m is a binary value of 0 or 1: it is 1 when the m-th cause is in the state that increases ambient noise, and 0 when it is not (when it does not cause an increase in ambient noise). m = 1, 2, ..., M, where M is any positive integer. b_m is the weight for the m-th cause of increased ambient noise and is a positive real number. c is the predetermined criterion described above.
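The weighted combination above can be sketched in a few lines. The function name `threshold_from_binary_causes` and the example weights are hypothetical; only the formula Th = b_1 a_1 + ... + b_M a_M + c and the constraints on a_m and b_m come from the text.

```python
from typing import Sequence


def threshold_from_binary_causes(a: Sequence[int],
                                 b: Sequence[float],
                                 c: float) -> float:
    """Compute Th = sum_m b_m * a_m + c.

    a: binary flags (1 = the m-th cause is increasing ambient noise).
    b: positive weights, one per cause.
    c: the predetermined criterion (value of Th when every flag is 0).
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length M")
    if any(flag not in (0, 1) for flag in a):
        raise ValueError("each a_m must be 0 or 1")
    if any(weight <= 0 for weight in b):
        raise ValueError("each b_m must be a positive real number")
    return sum(weight * flag for weight, flag in zip(b, a)) + c
```

Because every b_m is positive and every a_m is 0 or 1, the result can never fall below c, which is how the predetermined criterion acts as a lower bound in this form.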

As the information related to the magnitude of ambient noise, the magnitude of ambient noise itself (for example, ambient noise level) may be measured and used. When the information related to the magnitude of ambient noise is a continuous value, the threshold Th may be determined as follows.
$$\mathrm{Th} = aL + d$$
However, when Th < c, Th = c (see FIG. 5). Here, L is the continuous value (for example, the measured ambient noise level), and a and d are parameters obtained in advance by experiments or simulations.
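The continuous-value rule can be sketched as a linear function with a lower bound at c. The function name `threshold_from_noise_level` and the numbers in the usage note are hypothetical; L is assumed here to be the measured ambient noise level.

```python
def threshold_from_noise_level(level: float, a: float, d: float, c: float) -> float:
    """Compute Th = a * L + d, replaced by c whenever Th < c.

    level: L, the continuous-valued reference information (assumed to be
           the measured ambient noise level).
    a, d:  parameters obtained in advance by experiment or simulation.
    c:     the predetermined criterion (lower bound on the threshold).
    """
    th = a * level + d
    return c if th < c else th
```

With hypothetical parameters a = 0.01, d = -0.2, and c = 0.3, a quiet level of 40 is held at the bound c, while a louder level of 80 yields a threshold of about 0.6.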

<VAD processing unit 120>
The VAD processing unit 120 receives the threshold value and the observation signal, determines based on the threshold value whether the observation signal includes a voice signal emitted by the user (S120), and outputs the determination result. More specifically, the VAD processing unit 120 makes this determination based on the magnitude relationship between the threshold value and the observation signal. As described above, the threshold value determination unit 110 may output the threshold value at various timings; the VAD processing unit 120 may make the determination using the most recently received threshold value.

In this example, when the power of the observation signal is greater than the threshold value (or equal to or greater than the threshold value), the observation signal is determined to be a voice section that includes a voice signal emitted by the user. When the power of the observation signal is equal to or less than the threshold value (or smaller than the threshold value), it is determined to be a non-voice section in which the observation signal does not include a voice signal emitted by the user.
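A minimal sketch of this power-based decision follows. The text does not specify how power is computed or how the signal is framed, so the mean-square power over a frame of samples is an assumption, and the names `frame_power` and `is_voice_section` are hypothetical.

```python
from typing import Sequence


def frame_power(frame: Sequence[float]) -> float:
    """Mean-square power of one frame (an assumed definition of 'power')."""
    if not frame:
        raise ValueError("frame must contain at least one sample")
    return sum(x * x for x in frame) / len(frame)


def is_voice_section(frame: Sequence[float],
                     threshold: float,
                     inclusive: bool = False) -> bool:
    """Return True when the frame is judged to be a voice section.

    inclusive=False uses power > threshold; inclusive=True uses
    power >= threshold, matching the two variants named in the text.
    """
    power = frame_power(frame)
    return power >= threshold if inclusive else power > threshold
```

The same frame can therefore be accepted under a low threshold in quiet conditions and rejected once the threshold has been raised for loud ambient noise.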

<Effect>
With the above configuration, erroneous detection of VAD can be suppressed even if the noise level included in the observed signal changes.

<Modification>
In the present embodiment, the VAD processing unit 120 determines whether or not the signal is a voice signal based on the magnitude relationship between the power of the observed signal (a value that increases as ambient noise increases) and the threshold value. Alternatively, the determination may be based on the magnitude relationship between the threshold value and a value that decreases as ambient noise increases (for example, the reciprocal of the power of the observed signal). In that case, the signal is determined to be a voice signal when the value that decreases as ambient noise increases is smaller than the threshold value. Therefore, the threshold value is changed to a smaller value as the ambient noise increases, so that it becomes more difficult to determine that the observed signal includes a voice signal. For example, M binary values may be combined to determine the threshold value Th as follows.
$$\mathrm{Th} = c - b_1 a_1 - b_2 a_2 - \dots - b_M a_M$$
Further, when the information related to the magnitude of the ambient noise is a continuous value, the threshold value Th may be determined by Th = -aL + d. However, when Th ≥ c, Th = c.
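The modification can be sketched under the same assumptions as the main embodiment, with the signs reversed. All function names are hypothetical, the binary form is read as Th = c - b_1 a_1 - ... - b_M a_M, and the reciprocal of the power is used as the value that decreases as ambient noise increases.

```python
from typing import Sequence


def inverted_threshold_from_binary_causes(a: Sequence[int],
                                          b: Sequence[float],
                                          c: float) -> float:
    """Compute Th = c - sum_m b_m * a_m (the threshold shrinks as causes switch on)."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length M")
    return c - sum(weight * flag for weight, flag in zip(b, a))


def inverted_threshold_from_noise_level(level: float, a: float,
                                        d: float, c: float) -> float:
    """Compute Th = -a * L + d, replaced by c whenever Th >= c."""
    th = -a * level + d
    return c if th >= c else th


def is_voice_by_reciprocal_power(power: float, threshold: float) -> bool:
    """Judge voice when 1 / power is smaller than the threshold."""
    if power <= 0:
        # A silent frame has no finite reciprocal; treat it as non-voice.
        return False
    return (1.0 / power) < threshold
```

Here c acts as an upper bound rather than a lower bound: the threshold can only move down from c, which makes the voice decision harder as noise grows.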

In this embodiment, a deterministic rule has been described, but a statistical method can be applied in the same way. For example, suppose that a discrimination model of voice-likeness and non-voice-likeness is used, and that the output of the model, which takes a value based on the observed signal as input, is a value indicating voice-likeness (e.g., a likelihood) that becomes larger the more voice-like the signal is. In that case, the threshold value is changed to a larger value when the ambient noise is determined to be large, and whether or not the signal is a voice signal is determined based on the magnitude relationship between the value indicating voice-likeness and the threshold value.
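The statistical variant can be sketched as the same comparison applied to a model score instead of the signal power. The discrimination model itself is outside the scope of this sketch: the score is taken as an input, and the function name `is_voice_by_score` and the two threshold values 0.5 and 0.8 are hypothetical.

```python
def is_voice_by_score(voice_likeness: float,
                      noise_is_large: bool,
                      base_threshold: float = 0.5,
                      raised_threshold: float = 0.8) -> bool:
    """Compare a voice-likeness score against a noise-dependent threshold.

    voice_likeness: output of some discrimination model; larger means more
                    voice-like (e.g. a likelihood).
    noise_is_large: True when the reference information indicates large
                    ambient noise, in which case the threshold is raised.
    """
    threshold = raised_threshold if noise_is_large else base_threshold
    return voice_likeness > threshold
```

A score of 0.6 would be accepted as voice in quiet conditions but rejected while the ambient noise is judged to be large.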

<Other modifications>
The present invention is not limited to the above embodiments and modifications. For example, the above-described various processes may be executed not only in time series according to the description but also in parallel or individually according to the processing capability of the device that executes the process or the need. Other changes can be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
Further, various processing functions in each device described in the above-described embodiments and modifications may be realized by a computer. In that case, the processing content of the function that each device should have is described by the program. Then, by executing this program on a computer, various processing functions of the above devices are realized on the computer.

The program describing this processing content can be recorded in a computer-readable recording medium. The computer-readable recording medium may be, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, a semiconductor memory, or the like.

Distribution of this program is carried out by selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM in which the program is recorded. Further, this program may be stored in a storage device of a server computer, and the program may be distributed by transferring the program from the server computer to another computer via a network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from the server computer in its own storage unit. Then, when executing a process, the computer reads the program stored in its own storage unit and executes the process according to the read program. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to the program. Further, each time a program is transferred from the server computer to this computer, processing according to the received program may be executed successively. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to this computer. The program here includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

Also, although each device is configured by executing a predetermined program on a computer, at least a part of the processing contents may be realized by hardware.

Claims

A determination device for determining whether an input acoustic signal includes a voice signal emitted by a user, the determination device comprising:
a threshold value determination unit that determines a threshold value based on reference information; and
a determination unit that determines whether the input acoustic signal includes the voice signal based on the threshold value,
wherein the reference information is information related to the magnitude of ambient noise, which is an acoustic signal that arrives at the microphone picking up the input acoustic signal and that excludes the voice uttered by the user,
the threshold value determination unit determines the threshold value such that the larger the magnitude of the ambient noise indicated by the reference information, the more difficult it is to determine that the input acoustic signal includes the voice signal, and
the threshold value determination unit determines the threshold value such that the smaller the magnitude of the ambient noise indicated by the reference information, the easier it is to determine that the input acoustic signal includes the voice signal, but such that this determination does not become easier than under a predetermined criterion.
A determination method for determining whether an input acoustic signal includes a voice signal emitted by a user, the determination method comprising:
a threshold value determination step of determining a threshold value based on reference information; and
a determination step of determining whether the input acoustic signal includes the voice signal based on the threshold value,
wherein the reference information is information related to the magnitude of ambient noise, which is an acoustic signal that arrives at the microphone picking up the input acoustic signal and that excludes the voice uttered by the user,
in the threshold value determination step, the threshold value is determined such that the larger the magnitude of the ambient noise indicated by the reference information, the more difficult it is to determine that the input acoustic signal includes the voice signal, and
in the threshold value determination step, the threshold value is determined such that the smaller the magnitude of the ambient noise indicated by the reference information, the easier it is to determine that the input acoustic signal includes the voice signal, but such that this determination does not become easier than under a predetermined criterion.
A program for causing a computer to function as the determination device according to claim 1.