WO2020153158A1 - Determination device, method therefor, and program - Google Patents

Info

Publication number
WO2020153158A1
Authority
WO
WIPO (PCT)
Prior art keywords
threshold value
ambient noise
reference information
audio signal
acoustic signal
Prior art date
Application number
PCT/JP2020/000695
Other languages
French (fr)
Japanese (ja)
Inventor
弘章 伊藤
小林 和則
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Publication of WO2020153158A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/04: Segmentation; Word boundary detection
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00: Circuits for transducers, loudspeakers or microphones

Definitions

  • The present invention relates to a technique for determining whether an input acoustic signal includes a voice signal uttered by a user.
  • Voice activity detection (VAD)
  • Non-Patent Document 1 is known as a deterministic method.
  • Non-Patent Document 2 is known as a statistical method.
  • In the deterministic method, the observed signal is determined to be voice when it exceeds a preset threshold value.
  • In the statistical method, a discrimination model of voice-likeness and non-voice-likeness is learned, and the model determines whether the observed signal is voice.
  • When conventional VAD technology is applied to a terminal equipped with a microphone and a speaker (smart speaker, robot, in-vehicle terminal, etc.), noise may be erroneously detected as voice when ambient noise, such as sound reproduced from the terminal's speaker, increases (see FIG. 1).
  • An object of the present invention is to provide a determination device, a method therefor, and a program that change a threshold value according to ambient noise to reduce erroneous detection.
  • A determination device determines whether an input acoustic signal includes a voice signal uttered by a user.
  • The determination device includes a threshold determination unit that determines a threshold value based on reference information, and a determination unit that determines, based on the threshold value, whether the input acoustic signal includes a voice signal.
  • The reference information is information related to the magnitude of ambient noise, that is, the acoustic signal arriving at the microphone that picks up the input acoustic signal, excluding the voice uttered by the user.
  • The threshold determination unit determines the threshold value so that, as the magnitude of the ambient noise indicated by the reference information increases, it becomes harder to determine that the input acoustic signal includes a voice signal.
  • The threshold determination unit also determines the threshold value so that, as the magnitude of the ambient noise indicated by the reference information decreases, it becomes easier to determine that the input acoustic signal includes a voice signal, but never easier than under a predetermined reference.
  • According to the present invention, the threshold value is changed according to the ambient noise, so that erroneous detection can be reduced.
  • In the present embodiment, the VAD threshold is changed dynamically (see FIG. 2).
  • The timing and amount of the dynamic change are specified from the reference information.
  • The reference information is information related to the magnitude of ambient noise.
  • FIG. 3 is a functional block diagram of the determination device according to the first embodiment, and FIG. 4 shows its processing flow.
  • The determination device includes a threshold determination unit 110 and a VAD processing unit 120.
  • The determination device receives the reference information and the observation signal (input acoustic signal) as input, determines whether the observation signal includes a voice signal uttered by the user, and outputs the determination result.
  • A section that includes a voice signal uttered by the user is called a voice section, and the determination device may be said to determine voice sections.
  • The determination result is information indicating that the section is a voice section or information indicating that it is not.
  • The input acoustic signal may be an observation signal picked up in real time, or a previously picked-up signal stored in some storage medium.
  • The determination device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: random access memory), and the like.
  • The determination device executes each process under the control of the central processing unit, for example.
  • The data input to the determination device and the data obtained by each process are stored in, for example, the main storage device, and the stored data is read out to the central processing unit as needed and used for other processing.
  • At least a part of each processing unit of the determination device may be configured by hardware such as an integrated circuit.
  • Each storage unit included in the determination device can be configured by, for example, a main storage device such as a RAM, or by middleware such as a relational database or a key-value store.
  • Each storage unit does not necessarily have to be provided inside the determination device; it may be configured as an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as a flash memory, and provided outside the determination device.
  • The threshold determination unit 110 receives the reference information, determines the threshold based on the reference information (S110), and outputs the determined threshold.
  • The threshold may be output (i) at a predetermined cycle, regardless of whether reference information has been input or the threshold has changed, (ii) every time reference information is input and a threshold is determined, or (iii) only when the threshold changes as a result of the determination.
  • The reference information is information related to the magnitude of ambient noise, that is, the acoustic signal arriving at the microphone that picks up the observation signal, excluding the voice uttered by the user.
  • The threshold determination unit 110 determines the threshold so that, as the magnitude of the ambient noise indicated by the reference information increases, it becomes harder to determine that the observation signal includes a voice signal.
  • The threshold determination unit 110 also determines the threshold so that, as the magnitude of the ambient noise indicated by the reference information decreases, it becomes easier to determine that the observation signal includes a voice signal, but never easier than under a predetermined reference.
  • Examples of reference information include the presence or absence of a speaker playback signal (binary), the ON/OFF state of a car engine (binary), the presence or absence of an approaching talker (binary), and a measured ambient noise level (continuous value).
  • Ambient noise is judged to increase when a speaker playback signal is present, when the car engine is ON, when a talker is approaching, and so on.
  • Such reference information can also be regarded as information about what causes ambient noise to increase or decrease. The larger the ambient noise, the more likely it is to be erroneously detected as a voice signal. Therefore, in the present embodiment, the threshold is changed so that it becomes harder to determine that a voice signal is included as the ambient noise increases. For example, when the power of the observed signal is compared with the threshold as in FIG. 1, the threshold is raised when the ambient noise is judged to be large.
  • As a deterministic rule, a binary threshold value (0.3 or 1.0) may be determined for a binary input (0 or 1) such as the presence or absence of a speaker playback signal, car engine ON/OFF, or the presence or absence of an approaching talker. In this case, 0.3 corresponds to the predetermined reference described above.
  • The magnitude of the ambient noise itself (for example, the ambient noise level) may be measured and used.
  • The VAD processing unit 120 receives a threshold value and an observation signal, determines, based on the threshold value, whether the observation signal includes a voice signal uttered by the user (S120), and outputs the determination result. More specifically, the VAD processing unit 120 makes this determination based on the magnitude relationship between the threshold value and the observation signal. As described above, the timing at which the threshold determination unit 110 outputs the threshold value varies, but the VAD processing unit 120 may simply use the threshold value it received most recently.
  • In this example, when the power of the observed signal is greater than the threshold value (or greater than or equal to it), the section is determined to be a voice section in which the observed signal includes a voice signal uttered by the user. When the power of the observed signal is less than or equal to the threshold value (or less than it), the section is determined to be a non-voice section in which the observed signal does not include such a voice signal.
  • In the present embodiment, the VAD processing unit 120 determines whether the signal is a voice signal based on the magnitude relationship between the power of the observed signal (a value that increases as ambient noise increases) and the threshold value. Alternatively, the determination may be based on the magnitude relationship between a value that decreases as ambient noise increases (for example, the reciprocal of the power of the observed signal) and the threshold value. In that case, the signal is determined to be a voice signal when that value is smaller than the threshold value, so the threshold value is made smaller as the ambient noise increases, making it harder to determine that the observed signal includes a voice signal. For example, M binary values may be combined to determine the threshold value Th as follows.
  • $Th = c - b_1 a_1 - b_2 a_2 - \dots - b_M a_M$
  • The deterministic rule has been described, but a statistical method can be applied similarly.
  • The output of a discrimination model that takes a value based on the observed signal as input is a value indicating voice-likeness (e.g., likelihood), which is larger the more voice-like the signal is.
  • The threshold value is raised when the ambient noise is judged to be large, and whether the signal is a voice signal is determined from the magnitude relationship between the voice-likeness value and the threshold value.
  • the program describing this processing content can be recorded in a computer-readable recording medium.
  • the computer-readable recording medium may be, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, a semiconductor memory, or the like.
  • Distribution of this program is carried out by selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM in which the program is recorded. Further, this program may be stored in a storage device of a server computer, and the program may be distributed by transferring the program from the server computer to another computer via a network.
  • a computer that executes such a program first stores, for example, the program recorded in a portable recording medium or the program transferred from the server computer in its own storage unit. Then, when executing the process, this computer reads the program stored in its own storage unit and executes the process according to the read program. Further, as another embodiment of this program, a computer may directly read the program from a portable recording medium and execute processing according to the program. Further, each time a program is transferred from the server computer to this computer, the processing according to the received program may be executed successively.
  • ASP (application service provider)
  • The program includes information that is used for processing by an electronic computer and is equivalent to a program (data that is not a direct command to a computer but has the property of defining the processing of the computer).
  • Although each device is configured by executing a predetermined program on a computer, at least a part of the processing may be realized by hardware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Provided is a determination device and the like for changing a threshold value in accordance with ambient noise and reducing erroneous detection. This determination device determines whether an input acoustic signal includes a speech signal issued from a user. The determination device has: a threshold value determination unit that determines a threshold value on the basis of reference information; and a determination unit that determines whether the input acoustic signal includes the speech signal on the basis of the threshold value. The reference information is information relating to the magnitude of ambient noise that is an acoustic signal which arrives at a microphone for collecting the input acoustic signal and which excludes speech from the user. The threshold value determination unit determines the threshold value such that the larger the magnitude of the ambient noise indicated by the reference information is, the more difficult it becomes to determine that the input acoustic signal includes the speech signal. The threshold value determination unit determines the threshold value such that the lower the magnitude of the ambient noise indicated by the reference information is, the easier it becomes to determine that the input acoustic signal includes the speech signal, but the determination that the input acoustic signal includes the speech signal cannot be made more easily than when a prescribed reference is used.

Description

Determination device, method therefor, and program
The present invention relates to a technique for determining whether an input acoustic signal includes a voice signal uttered by a user.
Voice activity detection (VAD) is a known technique for determining whether an input acoustic signal includes a voice signal uttered by a user. VAD determines, by some method, whether an observed signal is voice or non-voice. For example, Non-Patent Document 1 is known as a deterministic method, and Non-Patent Document 2 as a statistical method. In the deterministic method, the observed signal is determined to be voice when it exceeds a preset threshold value. In the statistical method, a discrimination model of voice-likeness and non-voice-likeness is learned, and the model determines whether the observed signal is voice.
However, when conventional VAD technology is applied to a terminal equipped with a microphone and a speaker (smart speaker, robot, in-vehicle terminal, etc.), noise may be erroneously detected as voice when ambient noise, such as sound reproduced from the terminal's speaker, increases (see FIG. 1).
An object of the present invention is to provide a determination device, a method therefor, and a program that change a threshold value according to ambient noise and thereby reduce erroneous detection.
To solve the above problem, according to one aspect of the present invention, a determination device determines whether an input acoustic signal includes a voice signal uttered by a user. The determination device includes a threshold determination unit that determines a threshold value based on reference information, and a determination unit that determines, based on the threshold value, whether the input acoustic signal includes a voice signal. The reference information is information related to the magnitude of ambient noise, that is, the acoustic signal arriving at the microphone that picks up the input acoustic signal, excluding the voice uttered by the user. The threshold determination unit determines the threshold value so that, as the magnitude of the ambient noise indicated by the reference information increases, it becomes harder to determine that the input acoustic signal includes a voice signal. The threshold determination unit also determines the threshold value so that, as the magnitude of the ambient noise indicated by the reference information decreases, it becomes easier to determine that the input acoustic signal includes a voice signal, but never easier than under a predetermined reference.
According to the present invention, the threshold value is changed according to the ambient noise, so that erroneous detection can be reduced.
FIG. 1 is a diagram for explaining conventional VAD.
FIG. 2 is a diagram for explaining the determination device according to the first embodiment.
FIG. 3 is a functional block diagram of the determination device according to the first embodiment.
FIG. 4 is a diagram showing an example of the processing flow of the determination device according to the first embodiment.
FIG. 5 is a diagram for explaining how the threshold Th is determined when the information related to the magnitude of ambient noise is a continuous value.
Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are denoted by the same reference numerals, and duplicate description is omitted. Unless otherwise specified, processing performed on each element of a vector or matrix is applied to all elements of that vector or matrix.
<Points of the first embodiment>
In the present embodiment, the VAD threshold is changed dynamically (see FIG. 2). The timing and amount of the dynamic change are specified from reference information, which is information related to the magnitude of ambient noise.
<First embodiment>
FIG. 3 is a functional block diagram of the determination device according to the first embodiment, and FIG. 4 shows its processing flow.
The determination device includes a threshold determination unit 110 and a VAD processing unit 120.
The determination device receives the reference information and the observation signal (input acoustic signal) as input, determines whether the observation signal includes a voice signal uttered by the user, and outputs the determination result. A section that includes a voice signal uttered by the user is called a voice section, and the determination device may be said to determine voice sections. For example, the determination result is information indicating that the section is a voice section or information indicating that it is not. Only the voice uttered by the user who is the target of VAD is treated as a voice signal; voice emitted by other persons or by loudspeakers is treated as noise. The input acoustic signal may be an observation signal picked up in real time, or a previously picked-up signal stored in some storage medium.
The determination device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: random access memory), and the like. The determination device executes each process under the control of the central processing unit, for example. The data input to the determination device and the data obtained by each process are stored in, for example, the main storage device, and the stored data is read out to the central processing unit as needed and used for other processing. At least a part of each processing unit of the determination device may be configured by hardware such as an integrated circuit. Each storage unit included in the determination device can be configured by, for example, a main storage device such as a RAM, or by middleware such as a relational database or a key-value store. However, each storage unit does not necessarily have to be provided inside the determination device; it may be configured as an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as a flash memory, and provided outside the determination device.
Each unit is described below.
<Threshold determination unit 110>
The threshold determination unit 110 receives the reference information as input, determines a threshold based on the reference information (S110), and outputs the determined threshold. The threshold may be output (i) at a predetermined cycle, regardless of whether reference information has been input or the threshold has changed, (ii) every time reference information is received and a threshold is determined, or (iii) only when the threshold changes as a result of the determination.
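Output option (iii) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation described in the patent: the function name `thresholds_on_change` and the `determine` callback are hypothetical names introduced here.

```python
from typing import Callable, Iterable, Iterator, Optional


def thresholds_on_change(
    reference_inputs: Iterable[int],
    determine: Callable[[int], float],
) -> Iterator[float]:
    """Yield a threshold only when it differs from the previously emitted one.

    `determine` is a hypothetical callback that maps one piece of reference
    information to a threshold (the role of step S110).
    """
    last: Optional[float] = None
    for ref in reference_inputs:
        threshold = determine(ref)
        if last is None or threshold != last:
            yield threshold
            last = threshold
```

A consumer such as the VAD stage would then keep using the most recently received threshold until a new one arrives.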
The reference information is information related to the magnitude of ambient noise, that is, the acoustic signal arriving at the microphone that picks up the observation signal, excluding the voice uttered by the user.
The threshold determination unit 110 determines the threshold so that, as the magnitude of the ambient noise indicated by the reference information increases, it becomes harder to determine that the observation signal includes a voice signal.
The threshold determination unit 110 also determines the threshold so that, as the magnitude of the ambient noise indicated by the reference information decreases, it becomes easier to determine that the observation signal includes a voice signal, but never easier than under a predetermined reference.
For example, the reference information may be the presence or absence of a speaker playback signal (binary), the ON/OFF state of a car engine (binary), the presence or absence of an approaching talker (binary), or a measured ambient noise level (continuous value). Ambient noise is judged to increase when a speaker playback signal is present, when the car engine is ON, when a talker is approaching, and so on. Such reference information can also be regarded as information about what causes ambient noise to increase or decrease. The larger the ambient noise, the more likely it is to be erroneously detected as a voice signal. Therefore, in the present embodiment, the threshold is changed so that it becomes harder to determine that a voice signal is included as the ambient noise increases. For example, when whether a signal is a voice signal is judged from the power of the observed signal as in FIG. 1 (that is, the signal is judged to be a voice signal when its power is greater than the threshold), the threshold is raised when the ambient noise is judged to be large. As a deterministic rule, a binary threshold (0.3 or 1.0) may be determined for a binary input (0 or 1) such as the presence or absence of a speaker playback signal, car engine ON/OFF, or the presence or absence of an approaching talker. In this case, 0.3 corresponds to the predetermined reference described above. Providing the predetermined reference prevents the threshold from being lowered more than necessary, which would cause a near-silent state, for example, to be erroneously determined as including a voice signal.
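The binary rule can be sketched in Python as follows. The values 0.3 and 1.0 come from the text above; the function name `binary_threshold` and its parameter names are assumptions made for illustration only.

```python
def binary_threshold(noise_source_active: bool,
                     base: float = 0.3,
                     raised: float = 1.0) -> float:
    """Map one binary reference input to a VAD threshold.

    `base` plays the role of the predetermined reference (0.3 in the text):
    the threshold never drops below it. `raised` is used while the noise
    source (speaker playback, engine, approaching talker) is active.
    """
    return raised if noise_source_active else base
```

For instance, `binary_threshold(True)` gives the raised value used while the speaker is playing, and `binary_threshold(False)` falls back to the predetermined reference.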
M binary values may also be combined to determine the threshold Th as follows.
$$Th = b_1 a_1 + b_2 a_2 + \dots + b_M a_M + c$$
Here $a_m$ is a binary value of 0 or 1: it is 1 when the m-th factor makes the ambient noise larger (that is, when it is a cause of increased ambient noise) and 0 when it makes the ambient noise smaller (that is, when it is not such a cause). The index is $m = 1, 2, \dots, M$, where $M$ is a positive integer. $b_m$ is the weight for the m-th cause of increased ambient noise and is a positive real number. $c$ is the predetermined reference described above.
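The weighted combination can be sketched in Python as follows. This is a minimal sketch, assuming the flags and weights are passed as equal-length sequences; the function name `combined_threshold` and the input validation are additions for illustration and are not part of the source.

```python
from typing import Sequence


def combined_threshold(a: Sequence[int], b: Sequence[float], c: float) -> float:
    """Compute Th = b_1*a_1 + ... + b_M*a_M + c.

    a: binary flags (1 if the m-th factor is increasing ambient noise, else 0)
    b: positive weights, one per factor
    c: the predetermined reference (the lowest threshold that can result)
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length M")
    if any(flag not in (0, 1) for flag in a):
        raise ValueError("each a_m must be 0 or 1")
    if any(weight <= 0 for weight in b):
        raise ValueError("each b_m must be a positive real number")
    return c + sum(weight * flag for weight, flag in zip(b, a))
```

Because every weight is positive and every flag is 0 or 1, the result is never below `c`, which matches the role of the predetermined reference.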
The magnitude of the ambient noise itself (for example, the ambient noise level) may also be measured and used as the information related to the magnitude of ambient noise. When this information is a continuous value, the threshold Th may be determined as follows.
$$Th = aL + d$$
However, when $Th < c$,
$$Th = c$$
(see FIG. 5). Here $L$ denotes the measured magnitude of ambient noise, and $a$ and $d$ are parameters obtained in advance by experiment, simulation, or the like.
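The continuous-valued rule amounts to a linear function of the noise level clamped from below at c. A minimal Python sketch, assuming L is the measured ambient noise level and using the hypothetical function name `continuous_threshold`:

```python
def continuous_threshold(noise_level: float, a: float, d: float, c: float) -> float:
    """Compute Th = a*L + d, replaced by c whenever the result falls below c.

    noise_level: measured ambient noise magnitude L (assumed interpretation of L)
    a, d: parameters obtained in advance by experiment or simulation
    c: the predetermined reference, acting as a floor on the threshold
    """
    return max(a * noise_level + d, c)
```

The parameter values themselves are not given in the source, so any numbers passed to this function are purely illustrative.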
<VAD processing unit 120>
The VAD processing unit 120 receives the threshold and the observation signal as input, determines, based on the threshold, whether the observation signal includes a voice signal uttered by the user (S120), and outputs the determination result. More specifically, the VAD processing unit 120 makes this determination based on the magnitude relationship between the threshold and the observation signal. As described above, the timing at which the threshold determination unit 110 outputs the threshold varies, but the VAD processing unit 120 may simply use the threshold it received most recently.
In this example, when the power of the observation signal is greater than the threshold (or, alternatively, greater than or equal to the threshold), the observation signal is determined to be a voice section that includes a voice signal emitted by the user. When the power of the observation signal is less than or equal to the threshold (or, alternatively, less than the threshold), the observation signal is determined to be a non-voice section that does not include a voice signal emitted by the user.
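The power comparison can be sketched in Python as follows. The specification does not say how the power of the observation signal is computed, so the mean squared amplitude of a frame is used here purely as an assumption; the names `frame_power` and `is_voice_section` and the `inclusive` flag are likewise hypothetical.

```python
from typing import Sequence


def frame_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (an assumed definition of power)."""
    if len(frame) == 0:
        raise ValueError("frame must contain at least one sample")
    return sum(x * x for x in frame) / len(frame)


def is_voice_section(frame: Sequence[float], threshold: float, inclusive: bool = False) -> bool:
    """Return True for a voice section and False for a non-voice section.

    inclusive=False: voice if power >  threshold (non-voice if power <= threshold)
    inclusive=True:  voice if power >= threshold (non-voice if power <  threshold)
    """
    power = frame_power(frame)
    return power >= threshold if inclusive else power > threshold


frame = [0.5, -0.5, 0.5, -0.5]           # power is 0.25
print(is_voice_section(frame, 0.2))      # True
print(is_voice_section(frame, 0.25))     # False
```

The `inclusive` flag covers the two equivalent conventions described in the text; whichever one is chosen, every frame falls into exactly one of the two classes.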
<Effect>
With the above configuration, false detections by the VAD can be suppressed even when the noise level contained in the observation signal changes.
<Modification>
In the present embodiment, the VAD processing unit 120 determines whether a signal is a voice signal based on the magnitude relationship between the threshold and the power of the observation signal (a value that increases as ambient noise increases). Alternatively, the determination may be based on the magnitude relationship between the threshold and a value that decreases as ambient noise increases (for example, the reciprocal of the power of the observation signal). In that case, the signal is determined to be a voice signal when that value is smaller than the threshold, so the threshold is made smaller as the ambient noise becomes larger, which makes it harder to determine that the observation signal includes a voice signal. For example, M binary values may be combined to determine the threshold Th as follows.
$$\mathrm{Th} = c - b_1 a_1 - b_2 a_2 - \dots - b_M a_M$$
When the information related to the magnitude of ambient noise is a continuous value, the threshold Th may be determined by $\mathrm{Th} = -aL + d$. However, when $\mathrm{Th} \geq c$, $\mathrm{Th} = c$.
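The inverted variant can be sketched in Python in the same way. The function names and numeric values are hypothetical, and the reciprocal of the power is used as the compared value only because the text offers it as an example.

```python
from typing import Sequence


def inverted_threshold_binary(a: Sequence[int], b: Sequence[float], c: float) -> float:
    """Return Th = c - b_1*a_1 - ... - b_M*a_M (lower when more noise factors are active)."""
    if len(a) != len(b):
        raise ValueError("a and b must both have length M")
    return c - sum(weight * flag for weight, flag in zip(b, a))


def inverted_threshold_continuous(L: float, a: float, d: float, c: float) -> float:
    """Return Th = -a*L + d, capped at c ("when Th >= c, Th = c")."""
    return min(-a * L + d, c)


def is_voice_by_reciprocal(power: float, threshold: float) -> bool:
    """Voice if the reciprocal of the power is smaller than the threshold."""
    if power <= 0:
        raise ValueError("power must be positive to take its reciprocal")
    return (1.0 / power) < threshold


print(inverted_threshold_binary([1, 0, 1], [2.0, 1.5, 0.5], 20.0))  # 17.5
print(inverted_threshold_continuous(60.0, 0.5, 40.0, 20.0))         # 10.0
print(is_voice_by_reciprocal(0.25, 10.0))                           # True
```

Here the cap at c plays the role that the floor plays in the non-inverted rule: in quiet conditions the threshold cannot rise above the reference, so detection does not become easier than the predetermined reference.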
This embodiment has described deterministic rules, but statistical methods can be applied in the same way. For example, suppose a discriminative model of speech-likeness and non-speech-likeness is used, and the output of the model, which takes a value based on the observation signal as input, is a value indicating speech-likeness (for example, a likelihood) that becomes larger the more the input resembles a voice signal. In that case, the threshold is made larger when the ambient noise is judged to be large, and whether the signal is a voice signal is determined based on the magnitude relationship between the speech-likeness value and the threshold.
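As a rough Python sketch of the statistical case, the decision can be written as a comparison between a speech-likeness score and a threshold that is raised when ambient noise is judged to be large. The discriminative model itself is not shown; `score` stands for its output, and the function name, the `increment` parameter, and the numbers are all hypothetical.

```python
def is_voice_by_score(score: float, base_threshold: float,
                      noise_is_large: bool, increment: float) -> bool:
    """Compare a speech-likeness score with a noise-dependent threshold.

    score: output of some discriminative model (larger = more speech-like)
    base_threshold: threshold used when ambient noise is not judged large
    increment: positive amount added to the threshold under large noise (assumed)
    """
    if increment <= 0:
        raise ValueError("increment must be positive")
    threshold = base_threshold + (increment if noise_is_large else 0.0)
    return score > threshold


print(is_voice_by_score(0.6, 0.5, noise_is_large=False, increment=0.2))  # True
print(is_voice_by_score(0.6, 0.5, noise_is_large=True, increment=0.2))   # False
```

The same score is accepted in quiet conditions and rejected in noisy ones, which is the behavior the text describes for a threshold that grows with ambient noise.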
<Other modifications>
The present invention is not limited to the above embodiment and modifications. For example, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.
<Program and recording medium>
The various processing functions of each device described in the above embodiment and modifications may be realized by a computer. In that case, the processing content of the functions that each device should have is described by a program. Executing this program on a computer realizes the various processing functions of each device on that computer.
The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium, such as a DVD or CD-ROM, on which the program is recorded. The program may also be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.
A computer that executes such a program first stores, for example, the program recorded on the portable recording medium, or the program transferred from the server computer, in its own storage unit. When executing processing, the computer reads the program stored in its own storage unit and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. Furthermore, each time a program is transferred from the server computer to the computer, the computer may successively execute processing according to the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program here includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to a computer but has the property of defining the computer's processing).
Although each device is configured here by executing a predetermined program on a computer, at least part of the processing content may instead be realized in hardware.

Claims (3)

  1.  A determination device that determines whether an input acoustic signal includes a voice signal emitted by a user, the determination device comprising:
    a threshold determination unit that determines a threshold based on reference information; and
    a determination unit that determines, based on the threshold, whether the input acoustic signal includes the voice signal,
    wherein the reference information is information related to the magnitude of ambient noise, the ambient noise being an acoustic signal, other than the voice emitted by the user, that arrives at the microphone that picked up the input acoustic signal,
    the threshold determination unit determines the threshold such that the larger the magnitude of the ambient noise indicated by the reference information, the harder it is to determine that the input acoustic signal includes the voice signal, and
    the threshold determination unit determines the threshold such that the smaller the magnitude of the ambient noise indicated by the reference information, the easier it is to determine that the input acoustic signal includes the voice signal, but not easier than a predetermined reference.
  2.  A determination method for determining whether an input acoustic signal includes a voice signal emitted by a user, the determination method comprising:
    a threshold determination step of determining a threshold based on reference information; and
    a determination step of determining, based on the threshold, whether the input acoustic signal includes the voice signal,
    wherein the reference information is information related to the magnitude of ambient noise, the ambient noise being an acoustic signal, other than the voice emitted by the user, that arrives at the microphone that picked up the input acoustic signal,
    in the threshold determination step, the threshold is determined such that the larger the magnitude of the ambient noise indicated by the reference information, the harder it is to determine that the input acoustic signal includes the voice signal, and
    in the threshold determination step, the threshold is determined such that the smaller the magnitude of the ambient noise indicated by the reference information, the easier it is to determine that the input acoustic signal includes the voice signal, but not easier than a predetermined reference.
  3.  A program for causing a computer to function as the determination device according to claim 1.
PCT/JP2020/000695 2019-01-23 2020-01-10 Determination device, method therefor, and program WO2020153158A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2019009449A JP2020118838A (en) 2019-01-23 2019-01-23 Determination device, method thereof and program
JP2019-009449 2019-01-23

Publications (1)

Publication Number Publication Date
WO2020153158A1 (en) 2020-07-30

Family

ID=71736033

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/000695 WO2020153158A1 (en) 2019-01-23 2020-01-10 Determination device, method therefor, and program

Country Status (2)

Country Link
JP (1) JP2020118838A (en)
WO (1) WO2020153158A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5730913A (en) * 1980-08-01 1982-02-19 Nissan Motor Co Ltd Speech recognition response device for automobile
JPH0627986A (en) * 1992-07-13 1994-02-04 Toshiba Corp Equipment control system utilizing speech recognizing device
JP2004341339A (en) * 2003-05-16 2004-12-02 Mitsubishi Electric Corp Noise restriction device
JP2005215204A (en) * 2004-01-28 2005-08-11 Ntt Docomo Inc Device and method for judging voiced or unvoiced
JP2005300958A (en) * 2004-04-13 2005-10-27 Mitsubishi Electric Corp Talker check system
JP2009109536A (en) * 2007-10-26 2009-05-21 Panasonic Electric Works Co Ltd Voice recognition system and voice recognizer
JP2013160938A (en) * 2012-02-06 2013-08-19 Mitsubishi Electric Corp Voice section detection device
JP2018040982A (en) * 2016-09-08 2018-03-15 富士通株式会社 Speech production interval detection device, speech production interval detection method, and computer program for speech production interval detection

Also Published As

Publication number Publication date
JP2020118838A (en) 2020-08-06

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20744807

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20744807

Country of ref document: EP

Kind code of ref document: A1