WO2020153158A1 - Determination device, method therefor, and program - Google Patents
Determination device, method therefor, and program
- Publication number: WO2020153158A1
- Application number: PCT/JP2020/000695
- Authority: WO (WIPO, PCT)
- Prior art keywords
- threshold value
- ambient noise
- reference information
- audio signal
- acoustic signal
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
Definitions
- When the power of the observed signal is greater than the threshold value (or equal to or greater than it), the observed signal is determined to be a voice section that includes a voice signal emitted by the user. When the power of the observed signal is equal to or less than the threshold value (or less than it), the observed signal is determined to be a non-voice section that does not include a voice signal emitted by the user.
- In the embodiment, the VAD processing unit 120 determines whether or not the signal is a voice signal from the magnitude relationship between the power of the observed signal (a value that increases as ambient noise increases) and the threshold value. Alternatively, the determination may be based on the magnitude relationship between the threshold value and a value that becomes smaller as ambient noise increases (for example, the reciprocal of the power of the observed signal). In that case, the signal is determined to be a voice signal when that value is smaller than the threshold value. Therefore, the larger the ambient noise, the smaller the threshold value is made, so that it becomes harder to determine that the observed signal includes a voice signal. For example, M binary values may be combined to determine the threshold value Th as follows.
- $Th = c - b_1 a_1 - b_2 a_2 - \dots - b_M a_M$
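The modification above can be sketched as follows. This is only an illustrative sketch, not the implementation of the publication: the function names `inverted_threshold` and `is_voice_inverse`, the example weights, and the handling of non-positive power are assumptions. The flags `a`, the positive weights `b`, and the predetermined criterion `c` follow the symbols of the formula, and the sketch assumes the weights are chosen so that Th stays positive.

```python
from typing import Sequence


def inverted_threshold(a: Sequence[int], b: Sequence[float], c: float) -> float:
    """Th = c - sum(b_m * a_m): the threshold shrinks as more noise causes are active."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    return c - sum(weight * flag for weight, flag in zip(b, a))


def is_voice_inverse(power: float, threshold: float) -> bool:
    """Voice if the reciprocal of the power is smaller than the threshold."""
    if power <= 0.0:
        # Assumption: a frame with no energy is never treated as voice.
        return False
    return (1.0 / power) < threshold


# Hypothetical example values: two noise causes with weights 0.3 and 0.2.
quiet_th = inverted_threshold([0, 0], [0.3, 0.2], 1.0)
noisy_th = inverted_threshold([1, 1], [0.3, 0.2], 1.0)
```

With these example values, the same observed power can be accepted as voice under `quiet_th` and rejected under `noisy_th`, which is the intended effect of lowering the threshold when the compared value is the reciprocal of the power.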
- The deterministic rule has been described, but a statistical method can be applied in the same way.
- The output value of a discrimination model that takes a value based on the observed signal as input is a value indicating voice-likeness (e.g., a likelihood), and it becomes larger the more voice-like the signal is.
- The threshold value is increased when the ambient noise is determined to be large, and whether or not the signal is a voice signal is determined from the magnitude relationship between the value indicating voice-likeness and the threshold value.
- The program describing this processing content can be recorded on a computer-readable recording medium.
- The computer-readable recording medium may be, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, a semiconductor memory, or the like.
- This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. The program may also be stored in a storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.
- A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from the server computer in its own storage unit. Then, when executing the processing, the computer reads the program stored in its own storage unit and executes the processing according to the read program. As another embodiment of this program, a computer may read the program directly from a portable recording medium and execute processing according to the program. Further, each time a program is transferred from the server computer to this computer, the computer may successively execute processing according to the received program.
- ASP: Application Service Provider
- The program here includes information that is used for processing by an electronic computer and is equivalent to a program (data that is not a direct command to a computer but has the property of defining the processing of the computer).
- Although each device is configured by executing a predetermined program on a computer, at least a part of the processing contents may be realized by hardware.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
Provided is a determination device and the like for changing a threshold value in accordance with ambient noise and reducing erroneous detection. This determination device determines whether an input acoustic signal includes a speech signal issued from a user. The determination device has: a threshold value determination unit that determines a threshold value on the basis of reference information; and a determination unit that determines whether the input acoustic signal includes the speech signal on the basis of the threshold value. The reference information is information relating to the magnitude of ambient noise that is an acoustic signal which arrives at a microphone for collecting the input acoustic signal and which excludes speech from the user. The threshold value determination unit determines the threshold value such that the larger the magnitude of the ambient noise indicated by the reference information is, the more difficult it becomes to determine that the input acoustic signal includes the speech signal. The threshold value determination unit determines the threshold value such that the lower the magnitude of the ambient noise indicated by the reference information is, the easier it becomes to determine that the input acoustic signal includes the speech signal, but the determination that the input acoustic signal includes the speech signal cannot be made more easily than when a prescribed reference is used.
Description
The present invention relates to a technique for determining whether an input acoustic signal includes a voice signal uttered by a user.
Voice activity detection (VAD) is known as a technique for determining whether an input acoustic signal includes a voice signal uttered by a user. VAD classifies an observed signal as voice or non-voice by some method. For example, Non-Patent Document 1 is known as a deterministic method, and Non-Patent Document 2 is known as a statistical method. In the deterministic method, the observed signal is determined to be voice when it exceeds a preset threshold value. In the statistical method, a discrimination model of voice-likeness and non-voice-likeness is learned, and the model determines whether or not the observed signal is voice.
However, when conventional VAD technology is applied to a terminal equipped with a microphone and a speaker (a smart speaker, a robot, an in-vehicle terminal, etc.), noise may be erroneously detected as voice when ambient noise, such as sound reproduced from the terminal's speaker, increases (see FIG. 1).
An object of the present invention is to provide a determination device, a method therefor, and a program that change a threshold value according to ambient noise and thereby reduce erroneous detection.
To solve the above problem, according to one aspect of the present invention, a determination device determines whether an input acoustic signal includes a voice signal emitted by a user. The determination device includes a threshold value determination unit that determines a threshold value based on reference information, and a determination unit that determines, based on the threshold value, whether the input acoustic signal includes a voice signal. The reference information is information related to the magnitude of ambient noise, which is the acoustic signal, excluding the voice uttered by the user, that arrives at the microphone that picks up the input acoustic signal. The threshold value determination unit determines the threshold value so that the larger the ambient noise indicated by the reference information, the harder it is to determine that the input acoustic signal includes a voice signal. It also determines the threshold value so that the smaller the ambient noise indicated by the reference information, the easier it is to determine that the input acoustic signal includes a voice signal, but never easier than under a predetermined criterion.
According to the present invention, the threshold value can be changed according to the ambient noise, and erroneous detection can be reduced.
Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are denoted by the same reference numerals, and duplicate description is omitted. Unless otherwise specified, processing performed on individual elements of a vector or matrix is applied to all elements of that vector or matrix.
<Points of the first embodiment>
In the present embodiment, the VAD threshold value is changed dynamically (see FIG. 2). The timing and amount of the change are determined from the reference information, which is information related to the magnitude of ambient noise.
<First embodiment>
FIG. 3 is a functional block diagram of the determination device according to the first embodiment, and FIG. 4 shows its processing flow.
The determination device includes a threshold value determination unit 110 and a VAD processing unit 120.
The determination device receives the reference information and the observation signal (input acoustic signal) as input, determines whether the observation signal includes a voice signal emitted by the user, and outputs the determination result. A section that includes a voice signal emitted by the user is called a voice section, so the determination device can also be said to determine voice sections. For example, the determination result is information indicating that the section is a voice section or information indicating that it is not. Only the voice uttered by the user who is the target of VAD is treated as a voice signal; voices uttered by other people or emitted from a speaker are treated as noise. The input acoustic signal may be an observation signal picked up in real time, or a signal that was picked up in advance and stored in some storage medium.
The determination device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. The determination device executes each process under the control of the central processing unit, for example. The data input to the determination device and the data obtained by each process are stored in, for example, the main storage device, and the data stored in the main storage device is read out to the central processing unit as necessary and used for other processing. At least a part of each processing unit of the determination device may be configured by hardware such as an integrated circuit. Each storage unit included in the determination device can be configured by, for example, a main storage device such as a RAM, or by middleware such as a relational database or a key-value store. However, each storage unit does not necessarily have to be provided inside the determination device; it may be configured by an auxiliary storage device such as a hard disk, an optical disc, or a semiconductor memory element such as a flash memory, and be provided outside the determination device.
Each unit is described below.
<Threshold value determination unit 110>
The threshold value determination unit 110 receives the reference information as input, determines the threshold value based on the reference information (S110), and outputs the determined threshold value. The threshold value may be output (i) at a predetermined cycle regardless of whether reference information has been input or the threshold value has changed, (ii) every time reference information is received and the threshold value is determined, or (iii) only when the threshold value changes as a result of the determination.
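Output timing (iii) can be sketched as a small filter over the sequence of determined threshold values. This is a minimal sketch under stated assumptions: the generator name `emit_on_change` is hypothetical, and the publication does not prescribe how the timing is implemented.

```python
from typing import Iterable, Iterator, Optional


def emit_on_change(thresholds: Iterable[float]) -> Iterator[float]:
    """Yield a determined threshold only when it differs from the last one output."""
    previous: Optional[float] = None
    for threshold in thresholds:
        if previous is None or threshold != previous:
            yield threshold
            previous = threshold


# Thresholds determined on successive reference inputs (hypothetical values).
determined = [0.3, 0.3, 1.0, 1.0, 0.3]
emitted = list(emit_on_change(determined))
```

Under timing (ii), all five values in `determined` would be output; under timing (iii), only the values at which the threshold changes are passed on.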
The reference information is information related to the magnitude of ambient noise, which is the acoustic signal, excluding the voice uttered by the user, that arrives at the microphone that picks up the observation signal.
The threshold value determination unit 110 determines the threshold value so that the larger the ambient noise indicated by the reference information, the harder it is to determine that the observation signal includes a voice signal.
The threshold value determination unit 110 also determines the threshold value so that the smaller the ambient noise indicated by the reference information, the easier it is to determine that the observation signal includes a voice signal, but never easier than under a predetermined criterion.
For example, the reference information may be the presence or absence of a speaker playback signal (binary), whether a car engine is on or off (binary), whether or not a talker is approaching (binary), or a measured ambient noise level (continuous value). The ambient noise is judged to become large when there is a speaker playback signal, when the car engine is on, when a talker is approaching, and so on. Such reference information can also be regarded as information about the causes of increases or decreases in ambient noise. The larger the ambient noise, the more likely it is to be erroneously detected as a voice signal, so in the present embodiment the threshold value is changed so that it becomes harder to determine that a voice signal is included as the ambient noise increases. For example, when whether or not the signal is a voice signal is determined from the power of the observation signal as in FIG. 1 (that is, the signal is determined to be a voice signal when its power is larger than the threshold value), the threshold value is increased when the ambient noise is judged to be large. As a deterministic rule, for example, a binary threshold value (0.3 or 1.0) may be determined for a binary input (0 or 1) such as the presence or absence of a speaker playback signal, the car engine being on or off, or the presence or absence of an approaching talker. In this case, 0.3 corresponds to the predetermined criterion described above. Providing the predetermined criterion prevents the threshold value from being lowered more than necessary, which could cause a near-silent signal, for example, to be erroneously determined to include a voice signal.
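The binary rule described above can be sketched as follows. The values 0.3 and 1.0 come from the text; the names `BASE_THRESHOLD`, `NOISY_THRESHOLD`, `binary_threshold`, and `is_voice` are hypothetical and chosen only for illustration.

```python
BASE_THRESHOLD = 0.3   # the predetermined criterion in the text
NOISY_THRESHOLD = 1.0  # threshold used while a noise cause is present


def binary_threshold(noise_cause_present: bool) -> float:
    """Map a binary reference input (0 or 1) to one of two threshold values."""
    return NOISY_THRESHOLD if noise_cause_present else BASE_THRESHOLD


def is_voice(power: float, threshold: float) -> bool:
    """Deterministic rule of FIG. 1: voice if the power exceeds the threshold."""
    return power > threshold


# The same observed power is judged differently depending on the reference input.
power = 0.5
quiet_result = is_voice(power, binary_threshold(False))
noisy_result = is_voice(power, binary_threshold(True))
```

Because the threshold never drops below 0.3, a near-silent observation is not accepted as voice even when no noise cause is present.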
The threshold value Th may also be determined by combining M binary values as follows.
$Th = b_1 a_1 + b_2 a_2 + \dots + b_M a_M + c$
Here, $a_m$ is a binary value of 0 or 1: it is 1 when the m-th factor increases the ambient noise (that is, when it is a cause of increased ambient noise) and 0 when it does not. m = 1, 2, ..., M, where M is a positive integer. $b_m$ is the weight for the m-th cause of increased ambient noise and is a positive real number. c is the predetermined criterion described above.
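The weighted combination can be written directly from the formula. The following is a minimal sketch: the function name `combined_threshold`, the input validation, and the example weights are assumptions, while `a`, `b`, and `c` follow the symbols defined in the text.

```python
from typing import Sequence


def combined_threshold(a: Sequence[int], b: Sequence[float], c: float) -> float:
    """Th = b_1*a_1 + ... + b_M*a_M + c for binary flags a and positive weights b."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length M")
    if any(flag not in (0, 1) for flag in a):
        raise ValueError("each a_m must be 0 or 1")
    if any(weight <= 0 for weight in b):
        raise ValueError("each b_m must be a positive real number")
    return sum(weight * flag for weight, flag in zip(b, a)) + c


# Hypothetical example: speaker playback on, engine off, talker approaching.
flags = [1, 0, 1]
weights = [0.4, 0.2, 0.1]
th = combined_threshold(flags, weights, c=0.3)
```

When every flag is 0, the result equals c, so the predetermined criterion acts as the lower bound of the threshold.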
As the information related to the magnitude of ambient noise, the magnitude of the ambient noise itself (for example, the ambient noise level) may be measured and used. When the information related to the magnitude of ambient noise is a continuous value, the threshold Th may be determined as follows.
$$Th = aL + d$$
However, when $Th < c$,
$$Th = c$$
(see FIG. 5). Here, $L$ denotes the continuous-valued information (for example, the measured ambient noise level), and $a$ and $d$ are parameters obtained in advance through experiments, simulations, or the like.
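The continuous-valued rule amounts to a linear function of the noise level with a floor at c. The sketch below assumes hypothetical names and makes no claim about units or about realistic values of a and d, which the description says are found by experiment or simulation.

```python
def level_threshold(level: float, a: float, d: float, c: float) -> float:
    """Compute Th = a*L + d, replaced by c whenever the result is below c."""
    return max(c, a * level + d)
```

For example, with the purely illustrative parameters a = 0.5, d = -2.0 and c = 1.0, a level of 10.0 gives a threshold of 3.0, while a level of 2.0 would give -1.0 and is therefore clamped to 1.0.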
<VAD processing unit 120>
The VAD processing unit 120 receives the threshold and the observed signal as inputs, determines, based on the threshold, whether the observed signal includes a voice signal emitted by the user (S120), and outputs the determination result. More specifically, the VAD processing unit 120 makes this determination based on the magnitude relationship between the threshold and the observed signal. As described above, the timing at which the threshold determination unit 110 outputs the threshold varies, but the VAD processing unit 120 need only make the determination based on the most recently received threshold.
In this example, when the power of the observed signal is greater than the threshold (or, alternatively, greater than or equal to the threshold), the observed signal is determined to be a voice section that includes a voice signal emitted by the user. When the power of the observed signal is less than or equal to the threshold (or, correspondingly, less than the threshold), the observed signal is determined to be a non-voice section that does not include a voice signal emitted by the user.
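The power-based decision, together with the use of the most recently received threshold, can be sketched as below. The class and function names are hypothetical, and mean-square power over one frame is an assumed definition; the description does not specify how power is computed or how frames are formed.

```python
from typing import Sequence


def frame_power(samples: Sequence[float]) -> float:
    """Mean-square power of one frame (an assumed definition of 'power')."""
    if len(samples) == 0:
        raise ValueError("frame must contain at least one sample")
    return sum(s * s for s in samples) / len(samples)


class VadProcessor:
    """Hypothetical stand-in for the VAD processing unit 120."""

    def __init__(self, initial_threshold: float, inclusive: bool = False) -> None:
        self.threshold = initial_threshold
        # inclusive=True uses "power >= threshold"; False uses "power > threshold".
        self.inclusive = inclusive

    def update_threshold(self, threshold: float) -> None:
        """Store the latest threshold received from the threshold determiner."""
        self.threshold = threshold

    def is_voice(self, samples: Sequence[float]) -> bool:
        """Return True for a voice section, False for a non-voice section."""
        power = frame_power(samples)
        if self.inclusive:
            return power >= self.threshold
        return power > self.threshold


vad = VadProcessor(initial_threshold=0.3)
frame = [0.8, -0.8, 0.8, -0.8]          # mean-square power is about 0.64
quiet_decision = vad.is_voice(frame)    # compared against 0.3
vad.update_threshold(1.0)               # ambient noise judged large
noisy_decision = vad.is_voice(frame)    # same frame, compared against 1.0
```

The same frame is accepted as voice under the low threshold and rejected once the threshold has been raised, which is the behavior the embodiment relies on to suppress false detections in noise.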
<Effect>
With the above configuration, erroneous VAD detection can be suppressed even if the noise level contained in the observed signal changes.
<Modification>
In the present embodiment, the VAD processing unit 120 determines whether or not the signal is a voice signal based on the magnitude relationship between the power of the observed signal (a value that increases as the ambient noise increases) and the threshold. Alternatively, the determination may be based on the magnitude relationship between the threshold and a value that decreases as the ambient noise increases (for example, the reciprocal of the power of the observed signal). In that case, the signal is determined to be a voice signal when that value is smaller than the threshold, so the threshold is changed so as to become smaller as the ambient noise becomes larger, which makes it harder to determine that the observed signal includes a voice signal. For example, the threshold Th may be determined by combining M binary values as follows.
$$Th = c - b_1 a_1 - b_2 a_2 - \dots - b_M a_M$$
Further, when the information related to the magnitude of the ambient noise is a continuous value, the threshold Th may be determined by $Th = -aL + d$. However, when $Th \geq c$, $Th = c$.
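The modification mirrors the main embodiment with the comparison direction reversed. The following sketch uses hypothetical function names and illustrative parameter values; only the two formulas and the "smaller than the threshold means voice" rule are taken from the description.

```python
from typing import Sequence


def inverted_weighted_threshold(a: Sequence[int], b: Sequence[float], c: float) -> float:
    """Compute Th = c - b_1*a_1 - ... - b_M*a_M (never larger than c)."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length M")
    return c - sum(weight * flag for weight, flag in zip(b, a))


def inverted_level_threshold(level: float, a: float, d: float, c: float) -> float:
    """Compute Th = -a*L + d, replaced by c whenever the result is c or more."""
    return min(c, -a * level + d)


def is_voice_by_inverse_power(power: float, threshold: float) -> bool:
    """Voice is detected when the reciprocal of the power is below the threshold."""
    if power <= 0:
        raise ValueError("power must be positive to take its reciprocal")
    return (1.0 / power) < threshold
```

Here c acts as a ceiling rather than a floor: louder ambient noise pushes the threshold down, so a frame needs a smaller reciprocal (a larger power) to be accepted as voice.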
Although the present embodiment describes a deterministic rule, a statistical method can be applied in the same way. For example, suppose a discriminative model of voice-likeness versus non-voice-likeness is used, and the output of the model, which takes a value based on the observed signal as input, is a value indicating voice-likeness (for example, a likelihood) that becomes larger the more the signal resembles voice. In that case, the threshold is changed so as to become larger when the ambient noise is determined to be large, and whether or not the signal is a voice signal is determined based on the magnitude relationship between the value indicating voice-likeness and the threshold.
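The statistical variant only changes what is compared against the threshold. In the sketch below the voice-likeness score is assumed to come from some discriminative model that is not shown, and the function names and the values 0.5 and 0.8 are hypothetical placeholders, not figures from the description.

```python
def score_threshold(noise_is_large: bool, base: float = 0.5, raised: float = 0.8) -> float:
    """Return a larger threshold when ambient noise is judged to be large."""
    return raised if noise_is_large else base


def is_voice_by_score(voice_likeness: float, threshold: float) -> bool:
    """Voice is detected when the model's voice-likeness exceeds the threshold."""
    return voice_likeness > threshold
```

A frame scored 0.7 would be accepted as voice in quiet conditions but rejected once the threshold has been raised for noisy conditions.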
<Other modifications>
The present invention is not limited to the above embodiment and modifications. For example, the various processes described above need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. Other changes can be made as appropriate without departing from the spirit of the present invention.
<Program and recording medium>
The various processing functions of each device described in the above embodiment and modifications may be realized by a computer. In that case, the processing content of the functions that each device should have is described by a program. By executing this program on a computer, the various processing functions of each device are realized on the computer.
The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.
A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage unit. When executing the processing, the computer reads the program stored in its own storage unit and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. Further, each time a program is transferred from the server computer to this computer, the computer may successively execute processing according to the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized only through execution instructions and result acquisition. Note that the program includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).
Although each device is described as being configured by executing a predetermined program on a computer, at least a part of the processing content may be realized in hardware.
Claims (3)
- A determination device that determines whether an input acoustic signal includes a voice signal emitted by a user, the determination device comprising:
a threshold determination unit that determines a threshold based on reference information; and
a determination unit that determines, based on the threshold, whether the input acoustic signal includes the voice signal,
wherein the reference information is information related to the magnitude of ambient noise, the ambient noise being an acoustic signal, other than the voice emitted by the user, that arrives at the microphone that picked up the input acoustic signal,
the threshold determination unit determines the threshold such that, as the magnitude of the ambient noise indicated by the reference information increases, it becomes harder to determine that the input acoustic signal includes the voice signal, and
the threshold determination unit determines the threshold such that, as the magnitude of the ambient noise indicated by the reference information decreases, it becomes easier to determine that the input acoustic signal includes the voice signal, but not easier than a predetermined criterion.
- A determination method for determining whether an input acoustic signal includes a voice signal emitted by a user, the determination method comprising:
a threshold determination step of determining a threshold based on reference information; and
a determination step of determining, based on the threshold, whether the input acoustic signal includes the voice signal,
wherein the reference information is information related to the magnitude of ambient noise, the ambient noise being an acoustic signal, other than the voice emitted by the user, that arrives at the microphone that picked up the input acoustic signal,
in the threshold determination step, the threshold is determined such that, as the magnitude of the ambient noise indicated by the reference information increases, it becomes harder to determine that the input acoustic signal includes the voice signal, and
in the threshold determination step, the threshold is determined such that, as the magnitude of the ambient noise indicated by the reference information decreases, it becomes easier to determine that the input acoustic signal includes the voice signal, but not easier than a predetermined criterion.
- A program for causing a computer to function as the determination device according to claim 1.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2019009449A JP2020118838A (en) | 2019-01-23 | 2019-01-23 | Determination device, method thereof and program |
JP2019-009449 | 2019-01-23 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020153158A1 (en) | 2020-07-30 |
Family
ID=71736033
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2020/000695 WO2020153158A1 (en) | 2019-01-23 | 2020-01-10 | Determination device, method therefor, and program |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP2020118838A (en) |
WO (1) | WO2020153158A1 (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS5730913A (en) * | 1980-08-01 | 1982-02-19 | Nissan Motor Co Ltd | Speech recognition response device for automobile |
JPH0627986A (en) * | 1992-07-13 | 1994-02-04 | Toshiba Corp | Equipment control system utilizing speech recognizing device |
JP2004341339A (en) * | 2003-05-16 | 2004-12-02 | Mitsubishi Electric Corp | Noise restriction device |
JP2005215204A (en) * | 2004-01-28 | 2005-08-11 | Ntt Docomo Inc | Device and method for judging voiced or unvoiced |
JP2005300958A (en) * | 2004-04-13 | 2005-10-27 | Mitsubishi Electric Corp | Talker check system |
JP2009109536A (en) * | 2007-10-26 | 2009-05-21 | Panasonic Electric Works Co Ltd | Voice recognition system and voice recognizer |
JP2013160938A (en) * | 2012-02-06 | 2013-08-19 | Mitsubishi Electric Corp | Voice section detection device |
JP2018040982A (en) * | 2016-09-08 | 2018-03-15 | 富士通株式会社 | Speech production interval detection device, speech production interval detection method, and computer program for speech production interval detection |
-
2019
- 2019-01-23 JP JP2019009449A patent/JP2020118838A/en active Pending
-
2020
- 2020-01-10 WO PCT/JP2020/000695 patent/WO2020153158A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
JP2020118838A (en) | 2020-08-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20744807 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20744807 Country of ref document: EP Kind code of ref document: A1 |