WO2023148955A1 - Time window generation device, method, and program - Google Patents

Time window generation device, method, and program

Info

Publication number
WO2023148955A1
Authority
WO
WIPO (PCT)
Prior art keywords: signal, window, frequency domain, unit, analysis window
Application number: PCT/JP2022/004616
Other languages: English (en), Japanese (ja)
Inventor
伸 村田
洋平 脇阪
記良 鎌土
翔一郎 齊藤
Original Assignee
日本電信電話株式会社
Application filed by 日本電信電話株式会社
Priority to JP2023578328A (publication JPWO2023148955A1)
Priority to PCT/JP2022/004616 (publication WO2023148955A1)
Publication of WO2023148955A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/45: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Definitions

  • The present invention relates to technology for processing sound signals such as speech.
  • Methods based on the short-time Fourier transform are used to analyze acoustic signals in real time in both time and frequency.
  • In such analysis, a time window is used to clip a signal of constant length and treat it as a periodic signal.
  • The frequency resolution and dynamic range of the analysis are determined by the shape of the time window.
  • Frequency resolution and dynamic range are in a trade-off relationship: if the separation of adjacent frequency components is improved, minute components may be overlooked.
  • Conventionally, a specific window function is fixed and used, or a plurality of window functions prepared in advance are switched between (see, for example, Non-Patent Document 1).
  • An object of the present invention is to provide a time window generation device, method, and program for generating an appropriate time window.
  • To achieve this object, a time window generation device includes: a signal clipping unit that generates a clipped signal by clipping a sound signal to a predetermined length; an analysis window generation unit that generates an analysis window using the clipped signal and an analysis window model determined by analysis window model parameters; a synthesis window generation unit that generates a synthesis window using at least the clipped signal or the analysis window; a frequency domain transformation unit that generates a frequency domain signal by transforming the clipped signal into the frequency domain using the analysis window; a signal processing unit that performs predetermined processing on the frequency domain signal to generate a post-processing frequency domain signal; a time domain transformation unit that generates a time domain signal by transforming the post-processing frequency domain signal into the time domain using the synthesis window; and a learning unit that learns at least the analysis window model parameters using the time domain signal and correct data corresponding to the time domain signal.
  • An appropriate time window can be generated.
  • FIG. 1 is a diagram showing an example of the functional configuration of a time window generation device.
  • FIG. 2 is a diagram showing an example of the processing procedure of the time window generation method.
  • FIG. 3 is a diagram illustrating a functional configuration example of a computer.
  • The time window generation device includes, for example, a signal clipping unit 1, an analysis window generation unit 2, a synthesis window generation unit 3, a frequency domain transformation unit 4, a signal processing unit 5, a time domain transformation unit 6, and a learning unit 7.
  • The time window generation method is realized, for example, by each component of the time window generation device performing the processing of steps S1 to S7 described below and shown in FIG. 2.
  • Here, the time window means at least one of the analysis window and the synthesis window.
  • <Signal clipping unit 1> The signal clipping unit 1 generates a clipped signal by clipping the sound signal to a predetermined length (step S1).
  • The generated clipped signal is output to the analysis window generation unit 2 and the synthesis window generation unit 3.
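  • As a concrete illustration only (the publication contains no code), a minimal Python sketch of such frame clipping might look as follows; the frame length, hop size, and all names are assumptions:

```python
import numpy as np

def clip_frames(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Clip a 1-D sound signal into overlapping frames of a predetermined length."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

x = np.random.randn(16000)                       # e.g. 1 s of audio at 16 kHz
frames = clip_frames(x, frame_len=512, hop=256)
print(frames.shape)                              # (61, 512)
```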
  • <Analysis window generation unit 2> The clipped signal is input to the analysis window generation unit 2.
  • Analysis window model parameters generated by the learning unit 7, described later, are also input to the analysis window generation unit 2.
  • The analysis window generation unit 2 generates an analysis window using the clipped signal and the analysis window model determined by the analysis window model parameters (step S2).
  • The generated analysis window is output to the frequency domain transformation unit 4.
  • The analysis window model parameters are, for example, those generated by the learning unit 7, described later.
  • If the analysis window model parameters generated by the learning unit 7 are not yet present at the first processing of the analysis window generation unit 2, the analysis window generation unit 2 uses predetermined analysis window model parameters.
  • In this case, the analysis window generation unit 2 may generate and output a predetermined analysis window.
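  • The publication leaves the form of the analysis window model open. Purely as a hedged sketch, one plausible realization is a small neural network mapping each clipped frame to a non-negative per-frame window; the architecture, sizes, and names below are assumptions, not the patented design:

```python
import torch
import torch.nn as nn

class AnalysisWindowModel(nn.Module):
    """Hypothetical analysis window model: clipped frame in, analysis window out."""
    def __init__(self, frame_len: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_len, hidden), nn.ReLU(),
            nn.Linear(hidden, frame_len), nn.Softplus(),  # non-negative window
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.net(frame)

frame = torch.randn(1, 512)                 # one clipped frame
window = AnalysisWindowModel(512)(frame)    # per-frame analysis window, shape (1, 512)
```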
  • <Synthesis window generation unit 3> The clipped signal is input to the synthesis window generation unit 3.
  • Synthesis window model parameters generated by the learning unit 7, described later, are also input to the synthesis window generation unit 3.
  • The synthesis window generation unit 3 generates a synthesis window using the clipped signal and the synthesis window model determined by the synthesis window model parameters (step S3).
  • The generated synthesis window is output to the time domain transformation unit 6.
  • The synthesis window model parameters are, for example, those generated by the learning unit 7, described later. If the synthesis window model parameters generated by the learning unit 7 are not yet present at the first processing of the synthesis window generation unit 3, the synthesis window generation unit 3 uses predetermined synthesis window model parameters. In this case, the synthesis window generation unit 3 may generate and output a predetermined synthesis window.
  • <Frequency domain transformation unit 4> The clipped signal and the analysis window are input to the frequency domain transformation unit 4.
  • The frequency domain transformation unit 4 generates a frequency domain signal by transforming the clipped signal into the frequency domain using the analysis window (step S4).
  • The generated frequency domain signal is output to the signal processing unit 5.
  • The frequency domain transformation unit 4 performs the transformation into the frequency domain using a technique such as the short-time Fourier transform; an illustrative sketch follows.
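  • As an illustrative sketch of step S4 (not taken from the publication), one short-time Fourier transform column can be computed by applying the analysis window to a clipped frame and taking the DFT of the result; the Hann window below merely stands in for a generated analysis window:

```python
import numpy as np

def to_frequency_domain(frame: np.ndarray, analysis_window: np.ndarray) -> np.ndarray:
    """Apply the analysis window, then transform the frame into the frequency domain."""
    return np.fft.rfft(frame * analysis_window)

frame = np.random.randn(512)                 # one clipped frame
win = np.hanning(512)                        # stand-in for a generated analysis window
spectrum = to_frequency_domain(frame, win)   # 257 complex frequency bins
```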
  • <Signal processing unit 5> The frequency domain signal is input to the signal processing unit 5.
  • The signal processing unit 5 performs predetermined processing on the frequency domain signal to generate a post-processing frequency domain signal (step S5).
  • The generated post-processing frequency domain signal is output to the time domain transformation unit 6.
  • Examples of the predetermined processing are processing for enhancing a predetermined signal such as speech (in other words, speech signal enhancement processing, i.e., processing for suppressing noise) and classification processing for classifying noise and the like.
  • For speech enhancement, the signal processing unit 5 estimates a speech enhancement filter using, for example, the frequency domain signal. The signal processing unit 5 then multiplies the estimated speech enhancement filter with the frequency domain signal to generate a frequency domain signal with enhanced speech; a sketch of this step follows below.
  • This frequency domain signal is an example of a post-processing frequency domain signal.
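  • A minimal sketch of the filter-and-multiply step, assuming for illustration a Wiener-like mask with a known noise power (the publication leaves the filter estimation method open; it may, for example, be estimated by deep learning):

```python
import numpy as np

# One frequency domain frame, as in the STFT sketch above
spectrum = np.fft.rfft(np.random.randn(512) * np.hanning(512))

# Hypothetical Wiener-like speech enhancement filter; noise_power is assumed known
noise_power = 0.1
mask = np.abs(spectrum) ** 2 / (np.abs(spectrum) ** 2 + noise_power)

# Multiplying the estimated filter with the frequency domain signal yields
# the post-processing frequency domain signal
enhanced = mask * spectrum
```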
  • The predetermined processing may also include noise classification processing.
  • In this case, the signal processing unit 5 uses the frequency domain signal to estimate the noise included in the frequency domain signal.
  • The noise label that is the estimation result is output to the learning unit 7.
  • <Time domain transformation unit 6> The post-processing frequency domain signal and the synthesis window are input to the time domain transformation unit 6.
  • The time domain transformation unit 6 generates a time domain signal by transforming the post-processing frequency domain signal into the time domain using the synthesis window (step S6).
  • The generated time domain signal is output to the learning unit 7.
  • The generated time domain signal may also be output from the time window generation device as the result of the predetermined processing by the signal processing unit 5.
  • The time domain transformation unit 6 performs the transformation into the time domain using a technique such as the inverse short-time Fourier transform; an illustrative sketch follows.
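  • An illustrative overlap-add sketch of step S6, reusing the frame length and hop assumed above (all names and parameters are assumptions):

```python
import numpy as np

def to_time_domain(spectra: np.ndarray, synthesis_window: np.ndarray, hop: int) -> np.ndarray:
    """Inverse DFT of each post-processing spectrum, application of the
    synthesis window, and overlap-add back into a time domain signal."""
    n_frames = spectra.shape[0]
    frame_len = len(synthesis_window)
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, spec in enumerate(spectra):
        out[i * hop : i * hop + frame_len] += np.fft.irfft(spec, frame_len) * synthesis_window
    return out

# 61 post-processing frames of 257 complex bins (frame length 512)
spectra = np.fft.rfft(np.random.randn(61, 512) * np.hanning(512), axis=-1)
y = to_time_domain(spectra, np.hanning(512), hop=256)
print(y.shape)  # (15872,)
```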
  • <Learning unit 7> The time domain signal is input to the learning unit 7.
  • Correct data corresponding to the time domain signal is also input to the learning unit 7.
  • The learning unit 7 uses the time domain signal and the correct data corresponding to the time domain signal to learn the analysis window model parameters and the synthesis window model parameters (step S7).
  • The analysis window model parameters and the synthesis window model parameters are learned by, for example, gradient descent.
  • When the predetermined processing in the signal processing unit 5 is speech signal enhancement processing, the original sound signal without noise serves as the correct data.
  • In this case, a gradient method or the like is used to learn the analysis window model parameters and the synthesis window model parameters so that the mean squared error, obtained by averaging the squared differences between the time domain signal and the original sound signal, becomes small; a training-step sketch follows below.
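  • A hedged sketch of one such end-to-end gradient-descent step; only the mean-squared-error objective on the time domain signal follows the description above, while the stand-in window model and enhancement filter are assumptions:

```python
import torch

# Stand-in analysis window model (the publication does not fix an architecture)
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.Softplus(),          # keeps the generated window non-negative
)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

frames = torch.randn(8, 512)      # clipped noisy frames
clean = torch.randn(8, 512)       # correct data: the clean original signal

window = model(frames)                      # generated analysis windows
spec = torch.fft.rfft(frames * window)      # frequency domain signal
mask = torch.sigmoid(spec.abs())            # stand-in enhancement filter
y = torch.fft.irfft(mask * spec, n=512)     # time domain signal

loss = torch.mean((y - clean) ** 2)         # mean squared error vs. correct data
opt.zero_grad()
loss.backward()                             # gradients reach the window model
opt.step()                                  # one gradient-descent update
```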
  • When the predetermined processing includes noise classification processing, the true noise label is input to the learning unit 7 as correct data.
  • The noise label estimated by the predetermined processing in the signal processing unit 5 is also input to the learning unit 7.
  • In this case, the learning unit 7 uses the true noise label and the estimated noise label, in addition to the time domain signal and the correct data corresponding to the time domain signal, to learn the analysis window model parameters and the synthesis window model parameters.
  • Steps S1 to S7 described so far may be repeated as appropriate.
  • The learning unit 7 learns the analysis window model parameters and the synthesis window model parameters based on the time domain signal, that is, the signal that has undergone the predetermined processing in the signal processing unit 5 and has been transformed back into the time domain. It can therefore be said that the window parameters are learned in consideration of the predetermined processing of the signal processing unit 5, which follows the analysis window generation unit 2 and the synthesis window generation unit 3. By learning the window parameters in consideration of this subsequent processing, a more appropriate time window can be generated than before.
  • Data exchange between the components of the time window generation device may be performed directly or via a storage unit (not shown).
  • In general, a synthesis window can be generated from an analysis window. The synthesis window generation unit 3 may therefore use the analysis window generated by the analysis window generation unit 2 to generate the synthesis window. That is, the synthesis window generation unit 3 may generate the synthesis window using at least the clipped signal or the analysis window.
  • The learning unit 7 does not need to learn the synthesis window model parameters. That is, the learning unit 7 may learn at least the analysis window model parameters using the time domain signal and the correct data corresponding to the time domain signal.
  • The predetermined processing in the signal processing unit 5 may be implemented by deep learning.
  • The predetermined processing in the signal processing unit 5 may be processing for generating the post-processing frequency domain signal using the frequency domain signal and a model determined by model parameters.
  • In this case, the learning unit 7 may further learn the model parameters using at least the time domain signal and the correct data corresponding to the time domain signal.
  • The predetermined processing in the signal processing unit 5 may include processing for estimating a noise label using the frequency domain signal and a model determined by noise estimation model parameters.
  • In this case, the learning unit 7 may further learn the noise estimation model parameters using the true noise label input to the learning unit 7 and the noise label estimated by the signal processing unit 5, in addition to the time domain signal and the correct data corresponding to the time domain signal.
  • A program describing this processing can be recorded on a computer-readable recording medium.
  • The computer-readable recording medium is, for example, a non-transitory recording medium, specifically a magnetic recording device, an optical disc, or the like.
  • The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded.
  • The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.
  • A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in the auxiliary recording unit 1050, its own non-transitory storage device. When executing the processing, the computer reads the program stored in the auxiliary recording unit 1050 into the storage unit 1020 and executes the processing according to the read program. As another execution form, the computer may read the program directly from the portable recording medium into the storage unit 1020 and execute the processing according to the program, or it may execute processing according to the received program each time a program is transferred to it from the server computer.
  • The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition.
  • The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (such as data that is not a direct instruction to the computer but has the property of prescribing the processing of the computer).
  • In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may be implemented by hardware.
  • The signal clipping unit 1, the analysis window generation unit 2, the synthesis window generation unit 3, the frequency domain transformation unit 4, the signal processing unit 5, the time domain transformation unit 6, and the learning unit 7 may be configured by processing circuits.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a time window generation device comprising: a signal clipping unit 1 for generating a clipped signal by clipping a sound signal to a predetermined length; an analysis window generation unit 2 for generating an analysis window using the clipped signal and an analysis window model determined by an analysis window model parameter; a synthesis window generation unit 3 for generating a synthesis window using at least the clipped signal or using the analysis window; a frequency domain transformation unit 4 for generating a frequency domain signal by transforming the clipped signal into the frequency domain using the analysis window; a signal processing unit 5 for generating a post-processing frequency domain signal by performing predetermined processing on the frequency domain signal; a time domain transformation unit 6 for generating a time domain signal by transforming the post-processing frequency domain signal into the time domain using the synthesis window; and a learning unit 7 for learning at least the analysis window model parameter using the time domain signal and correct data corresponding to the time domain signal.
PCT/JP2022/004616 2022-02-07 2022-02-07 Time window generation device, method, and program WO2023148955A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023578328A JPWO2023148955A1 (fr) 2022-02-07 2022-02-07
PCT/JP2022/004616 WO2023148955A1 (fr) 2022-02-07 2022-02-07 Time window generation device, method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/004616 WO2023148955A1 (fr) 2022-02-07 2022-02-07 Time window generation device, method, and program

Publications (1)

Publication Number Publication Date
WO2023148955A1 (fr) 2023-08-10

Family

ID=87551996

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/004616 WO2023148955A1 (fr) 2022-02-07 2022-02-07 Time window generation device, method, and program

Country Status (2)

Country Link
JP (1) JPWO2023148955A1 (fr)
WO (1) WO2023148955A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH103297A (ja) * 1996-06-14 1998-01-06 Oki Electric Ind Co Ltd 背景雑音消去装置
JP2015049354A (ja) * 2013-08-30 2015-03-16 富士通株式会社 音声処理装置、音声処理方法及び音声処理用コンピュータプログラム
JP2016537891A (ja) * 2013-11-28 2016-12-01 ヴェーデクス・アクティーセルスカプ 補聴器システムの動作方法および補聴器システム
JP2017102488A (ja) * 2012-05-04 2017-06-08 カオニックス ラブス リミテッド ライアビリティ カンパニー 源信号分離のためのシステム及び方法
JP2020030373A (ja) * 2018-08-24 2020-02-27 日本電信電話株式会社 音源強調装置、音源強調学習装置、音源強調方法、プログラム


Also Published As

Publication number Publication date
JPWO2023148955A1 (fr) 2023-08-10

Similar Documents

Publication Publication Date Title
JP6903611B2 (ja) Signal generation device, signal generation system, signal generation method, and program
JP4774100B2 (ja) Dereverberation device, dereverberation method, dereverberation program, and recording medium
CN113436643B (zh) Training and application method, apparatus, device, and storage medium for a speech enhancement model
KR101224755B1 (ko) Multi-sensory speech enhancement using a speech-state model
JP2019078864A (ja) Musical sound enhancement device, convolutional autoencoder learning device, musical sound enhancement method, and program
JP4964259B2 (ja) Parameter estimation device, sound source separation device, direction estimation device, methods therefor, and program
JP2008203474A (ja) Multi-signal enhancement device, method, program, and recording medium therefor
Bayram et al. Primal-dual algorithms for audio decomposition using mixed norms
WO2023148955A1 (fr) Time window generation device, method, and program
Panda et al. Sliding mode singular spectrum analysis for the elimination of cross-terms in Wigner–Ville distribution
Bilen et al. Joint audio inpainting and source separation
WO2023152895A1 (fr) Waveform signal generation system, waveform signal generation method, and program
JP7120573B2 (ja) Estimation device, method therefor, and program
JP7375904B2 (ja) Filter coefficient optimization device, latent variable optimization device, filter coefficient optimization method, latent variable optimization method, and program
JP5583181B2 (ja) Cascade-connected transfer system parameter estimation method, cascade-connected transfer system parameter estimation device, and program
WO2021255925A1 (fr) Target sound signal generation device, target sound signal generation method, and program
WO2021205494A1 (fr) Signal processing device, signal processing method, and program
JP7156064B2 (ja) Latent variable optimization device, filter coefficient optimization device, latent variable optimization method, filter coefficient optimization method, and program
JP2022102319A (ja) Vector estimation program, vector estimation device, and vector estimation method
WO2019208137A1 (fr) Sound source separation device, method therefor, and program
JP2019090930A (ja) Sound source enhancement device, sound source enhancement learning device, sound source enhancement method, and program
JP2020030373A (ja) Sound source enhancement device, sound source enhancement learning device, sound source enhancement method, and program
WO2022172348A1 (fr) Scene estimation method, scene estimation device, and program
WO2021144934A1 (fr) Speech enhancement device, learning device, methods therefor, and program
JP6445417B2 (ja) Signal waveform estimation device, signal waveform estimation method, and program

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22924865

Country of ref document: EP

Kind code of ref document: A1

WWE WIPO information: entry into national phase

Ref document number: 2023578328

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE